Customer segmentation: An analytical approach to finding your kin

6 min readNov 16, 2022

Backstory:

As you know from my previous posts, I’ve been analyzing data from my blog to provide my readers with a better experience. In the last post in this series, we looked at how a user evolves from an anonymous click to someone I know by building the user journey table, you can find this here. This time we’ll be dividing our users into groups based on their behavior to distinguish my fans from skeptics, and to convert the latter into the former. What we’ll be doing today will feel like sitting on the Hogwarts seat with the “Sorting Hat” waiting to hear which house you belong to. So let’s dive in to see what our future beholds.

Problem:

Not every user that interacts with my blog does so in an identical way, let’s consider three types of users:

#1 Fans: Users who read all my blogs thoroughly (like my parents 😍)
#2 Friends: Users who occasionally will read some blogs if they happen to have the time (like my friends 😏)
#3 Skeptics: Users will read my blog once and never do so again (like my internet friends☹️)

Now I need to be able to look at a user’s user_id and know which bucket they fall into. This will help me in many ways, like evaluating a blog’s performance across groups, if a blog doesn’t do well with my fans then I know something’s wrong. It can also help me analyze user behavior across groups to design ways to graduate skeptics to friends or potential fans.

Data:

This time we’re revisiting our old friend the user_clickstream table which adds user info to the vanilla clickstream data.

+---------+------+-------------+------------------+-----+
| user_id | date |     ip      | duration(in sec) | ... |
+---------+------+-------------+------------------+-----+
|     123 | 1/31 | 101.102.1.5 |               10 | ... |
|     456 | 1/31 | 101.102.6.6 |              153 | ... |
|     789 | 1/31 | 101.102.7.7 |              450 | ... |
|     789 | 1/25 | 101.102.7.7 |              515 | ... |
|     456 | 1/7  | 101.102.6.6 |              121 | ... |
|     ... | ...  |         ... |              ... | ... |
+---------+------+-------------+------------------+-----+

STOP! Before reading ahead here’s where I want you to take some time and think about the problem yourself before reading my approach.

My approach:

We can divide our users in two ways, one approach would be to use user attributes to build a clustering model that learns these groups from the data itself. This approach doesn’t tell me much about the groups, like differences between #2 and #3 will be unclear, and also requires a lot of data for the model to learn meaningful insights from which isn’t true in my case. An alternative approach that we’ll be exploring today would perform this segmentation analytically using user behavior metrics.

Let’s think about metrics that help us capture how a user interacts with my blog, a straightforward one is how frequently the user interacts with my blogs captured by the total number of visits over a given period. However, this fails to capture the behavior of newer users. For example, let’s consider two users, user #1 who previously interacted with many blogs but is no longer doing so, and user #2 who recently discovered my blogs and is a true fan now. Over a given period the frequency of user #1 may be higher than user #2, however, going forward user #2 will be more valuable. Hence we need another metric that rewards recent interactions over the number of interactions. A user’s last visit date can be used to capture this.

We’re still missing something! The prior two metrics give us a good picture of a user’s interaction but they fail to the capture depth of the interaction. For example, a user may visit both older and recent blogs yielding a high value for both metrics but their visits may be shallow, where they click on the post and immediately bounce. To account for this we need a way to capture the depth of interaction. There are many ways to do this like page visits count, time duration of a visit, and responsiveness to the post(likes). For our use case let’s stick to the duration of the visit.

Now that we have our metrics figured out, let’s put pen to paper and think about how we can quantify these metrics to develop scores for our users. Firstly, the frequency score can be the total number of visits over the past year. Similarly, the depth score would be the total time spent on these visits. Finally, the recency score based on the most recent visit is a little tricky given its a date. However, rather than using the date as is, we can use the difference(delta) between the most recent visit date and the first date in our interval, this way we value recent visits over past visits. Let’s see how we can calculate these scores for our users using SQL:

-- Frequency score: Total visits by user in the past year
WITH frequency AS (
 SELECT user_id, COUNT(1) AS total_visits
 FROM user_clickstream
 WHERE date BETWEEN CURRENT_DATE() 
  AND DATE_SUB(CURRENT_DATE(),   INTERVAL 1 YEAR)
 GROUP BY user_id
),

-- Recency score: Delta between latest visit and first date of interval
recency AS (
 SELECT user_id, 
  DATE_DIFF(
   MAX(date), 
   DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR),
   DAY
  ) AS last_visit_delta
 FROM user_clickstream
 WHERE date BETWEEN CURRENT_DATE() 
  AND DATE_SUB(CURRENT_DATE(),   INTERVAL 1 YEAR)
 GROUP BY user_id
),

-- Depth score: Total time spent on the blog by user in the past year
depth AS (
 SELECT user_id, SUM(duration) AS total_duration
 FROM user_clickstream
 WHERE date BETWEEN CURRENT_DATE() 
  AND DATE_SUB(CURRENT_DATE(),   INTERVAL 1 YEAR)
 GROUP BY user_id
)

-- Raw scores for users:
SELECT *
FROM frequency 
JOIN recency
 USING(user_id)
JOIN depth 
 USING(user_id)

The above query gets us the scores for every user, however, our main aim is to use these scores to segment our users into groups. This is a little difficult to do with these “raw” scores as it’s difficult to determine a threshold for a user to be called a “fan” or a “skeptic”. Would this be 10, 100, or 1000? Wouldn’t it also change over time as the blog grows? Hence a better way is to leverage quantiles for these scores which range from 0–100. This helps us determine how a user’s score compares to the rest and how it changes with the data. We can further convert these to deciles(groups of 10s) to make them easy to work with by narrowing them down to a 1–10 range.

To do this we need to switch gears to pandas or if you’re dealing with a lot of data it can be done with quantilediscretizer in pyspark. In pandas this is done with the qcut as follows:

# results from query read into a pandas DataFrame
df = pd.read_gbq(raw_scores_query) 

# Scores converted into deciles using qcut
# 1 is added to make the range 1-10
df['frequency'] = pd.qcut(df['total_visits'].values, 
  q=10, labels=False) + 1 

df['recency'] = pd.qcut(df['last_visit_delta'].values,
  q=10, labels=False) + 1

df['depth'] = pd.qcut(df['total_duration'].values,
  q=10, labels=False) + 1

This finally yields scores for each user as a band of 10. Now I can come up with rules to distinguish the three groups. For example, users with 6+ scores in all three metrics can be considered “fans”. What we did today is a common technique in marketing called RFM.

Food for thought:

What are some other use cases for segmenting my users?
How frequently would you evaluate these scores?
How would you use the scores to define the three groups?

Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data. If you don’t want to miss my next post, consider subscribing here. It’s free and you’ll be notified when I post my next blog.