So we have seen the importance of retention rates, and how they affect expected CLV. So the next question is: how do we estimate the retention rates? So far we’ve been assuming that \(r\) is already known.

Warning: This gets real math, real fast, but it has a nice elegant result at the end. Also remember that we’re still operating with the assumptions that customers pay for a subscription, and if they stop paying (i.e. leave) they’re never allowed back again. So a fantastic business model is what I’m saying. These are the basics remember, we’ll build up to more general and more useful models.

What kind of data do we need?

Model Parameters

Then equations (3) and (4) are used to derive the maximum likelihood estimate for the retention rate (\(r\)).

\[L(r)= \prod_{i=1}^{n_1}P(T={t_i})\prod_{i=1}^{n_0}S(c_i + 1) = \prod_{i=1}^{n_1}(1-r)r^{t_i - 1}\prod_{i=1}^{n_0}r^{c_i} \tag{10}\]

Which is maximized by waving your magic math wand and taking logs, differentiating with respect to r, equating the derivative to 0, and solving for r to get:

\[\hat{r} =\frac{ \sum t_t + \sum(c_t - n_1) }{ \sum t_t + \sum c_t } = 1 - \frac{ n_i }{ \sum t_t + \sum c_t } \tag{11}\]

So the denominator in the fractional term of Equation (11) gives the total number of periods in which customers can terminate. That is, the total number of opportunities that a customer has had to terminate their relationship, for the entire history of their relationship so far.

Then the numerator gives the number of customers who have terminated during the chosen time window.

So the fractional term gives the number of customers who have terminated their relationship during the time window, divided by the total number of opportunities that these customers have had to leave. Taking 1 - this proportion then gives the proportion of customers who have been retained.

This is a super important point that I found extremely cool and didn’t think of myself. You don’t just take the number of customers who left, and the number who stayed and divide them. You have to consider all the possible opportunities that a person had to leave, and see how many of those opportunities were actually acted upon.

This is our estimate of the retention rate.

Confusing. I’m with you. This didn’t make sense to me until I worked through an example:

EX 6: A company exists that acquires customers, and then has them pay a monthly subscription fee until they choose to leave. The table below shows one year of transaction history for eight customers who were acquired during the year. An R indicates that the customer was retained and a C indicates that the customer canceled. The rows in the table below represent a customer, while the columns represent a month in a year. Since we’re just looking at cancellation behaviour, for these examples we don’t care how or when they’re paying.

Remember that the variable \(t_i\) (shown in the table) is number of time periods that customer \(i\) was active for, before cancelling at time \(t\). And \(c_i\) is the number of periods that customer \(i\) has been active for if they haven’t terminated by the final month.

  #    Month number
  # Id  1 2 3 4 5 6 7 8 9 10 11 12 ti ci
  # 1             R R C            3
  # 2     R R R R R R R R R  R C   11
  # 3   R R R R R R R R R R  R R      12
  # 4             R R R R C        5
  # 5   R R R R R R R R R R  R R      12
  # 6           R R R R R R  R R      8
  # 7           R R R R R R  R C   8
  # 8   R R C                      3
  #

Estimate the retention rate for the educational service provider above.

So here we have

t = c(3,11,5,8,3)
c = c(12, 12, 8)
r_hat = retention_rate_mle(5, t, c)
r_hat
## [1] 0.9193548

So this example is useful for simple cases where you have a number of customers that you can still see if you print them into a table. But these concepts extend to much larger organisations as well. The next example shows you how.

EX 7: This problem examines 671 customers who were acquired over a year. The table below shows the number of customers that either cancel or are censored in each time period. For example, four customers canceled during their second month, 16 canceled during their third month, and three customers were censored during month 1. There were \(n_1\) = 245 customers who canceled the service during this year (and thus have observed cancelation times) and \(n_0\) = 426 censored cases.

I’ll give definitions of each column to hopefully reduce the amount of time you spend trying to figure out what they mean:

## # A tibble: 12 x 4
##    Active_Months Cancelled Censored Total
##            <int>     <dbl>    <dbl> <dbl>
##  1             1        0.       3.    3.
##  2             2        4.       0.    4.
##  3             3       16.       2.   18.
##  4             4       20.       1.   21.
##  5             5       37.       7.   44.
##  6             6       28.      33.   61.
##  7             7       61.      49.  110.
##  8             8       24.      63.   87.
##  9             9       19.      30.   49.
## 10            10       13.      16.   29.
## 11            11       10.      34.   44.
## 12            12       13.     188.  201.

Now that we have the data in a tibble, we can estimate the retention rate. The same formula for \(\hat{r}\) is used in this case. I’ve written an R function that does this for me. Both functions used in this post are included at the bottom of the blog post (retention_rate_mle and retention_rate_mle_tibble).

education %>% retention_rate_mle_tibble(Active_Months, Cancelled, Censored, Total)
## [1] 0.9579616

So we have a retention rate of ~95.8%. We must be a pretty great company for people to love us this much.


So this post has covered a method to estimate the retention rates from transaction data. Once you know these retention rates, you can use this estimated value in the CLV functions from the previous blogs.

The next blog will discuss some considerations when formulating these models. Most notably, which monetary costs it is appropriate to include and which should be ignored. I can tell you’re excited.

Here are the two functions that I defined and used above. The second function uses concepts around quosures which is… not a thing a beginner (or really anyone who doesn’t work for Hadley Wickham) should be worried about.

retention_rate_mle <- function(n1, t, c){
  1 - n1 / (sum(t) + sum(c))
}

retention_rate_mle_tibble <- function(tib, status, cancelled, censored, total){
  status <- enquo(status)
  cancelled <- enquo(cancelled)
  censored <- enquo(censored)
  total <- enquo(total)

  n1 <- tib %>% select(!!cancelled) %>% summarise(sum = sum(!!cancelled)) %>% pull()

  sumt_sumc <- tib %>%
    transmute(result = !!status * !!total) %>%
    summarise(sum = sum(result)) %>%
    pull()

  1 - n1 / sumt_sumc
}