Machine Learning for Email Lead Scoring

A Case Study of Predictive Analytics in Marketing

Lucas O
15 min read · Dec 2, 2023

Note: This project was completed as part of the Python for Machine Learning and APIs course by Business Science University. For a detailed write-up, check out my personal portfolio website.

Introduction

Businesses face challenges in identifying and prioritizing potential customers, and in gauging the future purchase potential of current customers, based on their email interactions. The result is suboptimal allocation of resources and missed opportunities.

Email lead scoring plays a crucial role in determining the quality and conversion potential of leads generated through email marketing campaigns. It is a method used by marketers and sales teams to evaluate and prioritize leads based on their potential to become customers.

Machine learning-based solutions can effectively estimate the probability of leads converting into customers based on various data points extracted from email interactions. Such a solution should take into account factors like email open rates, click-through rates, response times, engagement patterns, and historical customer data to produce a comprehensive lead score.

Problem Statement & Objective

This analysis provides a lead scoring solution for an online educational company. The company offers training courses (its main product) in data science and has a large email list of 100,000 subscribers (or leads), with a monthly growth rate of 6,000 new subscribers. The marketing team sends out 5 emails per month, and the business's sales cycle generates approximately $250,000 in revenue per month.

However, the email list also experiences a significant number of unsubscribes, about 500 per email, resulting in a total of 2,500 unsubscribes per month.

This high unsubscribe rate indicates potential inefficiencies in the email marketing strategy. In addition, high unsubscribe rates can reduce revenue, especially if the business relies heavily on email marketing as a primary channel for generating leads and driving conversions. To sustain and increase revenue, it is crucial to optimize the email marketing approach and maximize customer conversion rates. The business also believes that nurturing lost customers could potentially convert about 5% of them back into active customers.

Objective

Given these key insights, the problem at hand is to develop an effective email list scoring and segmentation strategy. The goal is to identify and prioritize the most valuable leads (in terms of probability of making a purchase) to target with sales emails, while also identifying leads with a low probability of purchase to nurture, increasing their likelihood of purchasing.

Uncovering Hidden Costs

Given the values highlighted above, we can estimate the lost revenue (we'll refer to this as cost going forward) due to unsubscribes at around $250K per month (or $3M annually), not factoring in email list growth.
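As a back-of-the-envelope check, this figure falls out of the unsubscribe count and conversion rate. A minimal sketch, assuming roughly $2,000 of revenue per converted customer (my assumption for illustration), paired with the ~5% conversion rate noted earlier:

```python
# Back-of-envelope estimate of monthly lost revenue from unsubscribes.
# The ~5% conversion rate comes from the business context above; the
# $2,000 revenue per converted customer is an assumed figure.
unsubs_per_month = 2_500
conversion_rate = 0.05
revenue_per_customer = 2_000

lost_customers = unsubs_per_month * conversion_rate   # 125 would-be buyers lost
monthly_cost = lost_customers * revenue_per_customer  # $250,000
annual_cost = monthly_cost * 12                       # $3,000,000
print(f"${monthly_cost:,.0f}/month, ${annual_cost:,.0f}/year")
```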

After factoring in a 3.5% monthly email list growth rate, we can expect the lost revenue due to unsubscribes to rise to around $364K per month (or $4.3M per year), an increase of 46% in lost revenue. The table below shows this scenario -

Further analysis of these hidden costs factors in uncertainty/variability in the email list growth rate and conversion rate. The heatmap below shows a cost simulation with such variability —

We can see that regardless of how the drivers (email list growth rate and conversion rate) vary, we can still expect annual costs ranging from $2.61M to $4.38M. This is definitely a problem worth solving.
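A sketch of how such a simulation might be set up; the grid ranges and the simple compounding cost model below are my assumptions, not the exact figures used in the project:

```python
import numpy as np
import pandas as pd

def annual_cost(growth_rate, conversion_rate,
                unsubs_per_month=2_500, revenue_per_customer=2_000):
    # Unsubscribes scale with the list, which compounds month over month.
    months = np.arange(12)
    monthly_unsubs = unsubs_per_month * (1 + growth_rate) ** months
    return (monthly_unsubs * conversion_rate * revenue_per_customer).sum()

growth_rates = [0.00, 0.01, 0.02, 0.035, 0.05]  # assumed range
conversion_rates = [0.04, 0.05, 0.06]           # assumed range

grid = pd.DataFrame(
    [[annual_cost(g, c) for c in conversion_rates] for g in growth_rates],
    index=pd.Index(growth_rates, name="growth_rate"),
    columns=pd.Index(conversion_rates, name="conversion_rate"),
)
print(grid.round(0))  # this table is what gets rendered as the heatmap
```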

Proposed Solution

An end-to-end email lead scoring solution to help marketing stakeholders prioritize leads for sales emails based on their high probability of making a purchase, while nurturing low-probability leads to increase their likelihood of purchasing.

This solution involves an end-to-end data science implementation, from business understanding to model deployment.

We’ve already touched on uncovering the hidden costs. Other phases of the project include —

  • Data Understanding
  • Machine Learning (PyCaret)
  • Return on Investment (ROI) Analysis
  • Backend Deployment (FastAPI)
  • Frontend Deployment (Streamlit)

We’ll go through each phase briefly below. For a more detailed write-up, please see this link.

Data Understanding / Exploratory Data Analysis

The goal of this phase was to understand the relationship between the various data points available and their impact on our target (made_purchase). Let’s highlight a few below —

Proportion of Subscribers with Previous Purchase

Observation: Only 5% of leads have made a previous purchase, meaning we are dealing with a highly imbalanced dataset.

Tag Count vs Made Purchase

Observation: We can see that if a subscriber has 40 or more tags (events attended, such as webinars), they are 100% likely to make a purchase. That likelihood drops as tag_count decreases; a lead with 0 tags has only a 2% likelihood of making a purchase. For those with 0 tags (meaning they have not attended any events yet), we may not want to send sales emails just yet; instead, we may want to nurture them to attend more events before trying to get them to make a purchase. Overall, if the business can get leads to attend more events, it drastically increases their likelihood of making a purchase.
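This kind of relationship is straightforward to surface with a groupby. A minimal sketch, assuming a leads DataFrame with tag_count and made_purchase columns (the file path is hypothetical):

```python
import pandas as pd

leads = pd.read_csv("data/leads.csv")  # hypothetical path to the subscriber data

# Purchase rate and lead count for each tag_count value.
purchase_rate_by_tags = (
    leads.groupby("tag_count")["made_purchase"]
         .agg(purchase_rate="mean", n_leads="count")
)
print(purchase_rate_by_tags)
```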

Correlations

Observation: These correlation values further validate some of the data we saw earlier. We can see that tag_count and member_rating do show a fairly high correlation with made_purchase.

Feature Engineering

Feature Engineering is the process of selecting, manipulating and transforming raw data into features that can be used in supervised learning algorithms. Through exploratory data analysis, we gained a better understanding of what features might be predictive of a subscriber making a purchase. The next phase in the workflow involved creating processing pipelines to get our data in the right form for machine learning. This includes the creation of additional predictive features from existing features.

In this case, the following additional features were created (a pandas sketch follows the list) —

  • optin_days - Generated from optin_time. This is the number of days the subscriber has been on the company’s email list.
  • activity_per_time - Generated by dividing tag_count (count of events a subscriber has attended) by the newly created optin_days.
  • One-To-Many Features (tags) — These are binary features (0s or 1s) for each tag (event), indicating whether a user attended the event.
  • Reducing High Cardinality — Applied to country_code. Using a threshold of 6, this process lumps countries with fewer than 6 subscribers in the dataset into an "other" category.
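A condensed pandas sketch of these steps; the column names follow the descriptions above, while the file paths and the mailchimp_id join key are assumptions:

```python
import pandas as pd

leads = pd.read_csv("data/leads.csv")  # hypothetical paths
tags = pd.read_csv("data/tags.csv")    # one row per (mailchimp_id, tag) pair

# optin_days: days since the subscriber joined the email list.
leads["optin_time"] = pd.to_datetime(leads["optin_time"])
leads["optin_days"] = (pd.Timestamp.now() - leads["optin_time"]).dt.days

# activity_per_time: events attended per day on the list.
leads["activity_per_time"] = leads["tag_count"] / leads["optin_days"].clip(lower=1)

# One-to-many tag features: one 0/1 column per event tag.
tag_dummies = pd.crosstab(tags["mailchimp_id"], tags["tag"]).clip(upper=1)
leads = (
    leads.merge(tag_dummies, left_on="mailchimp_id", right_index=True, how="left")
         .fillna({col: 0 for col in tag_dummies.columns})
)

# Reduce high cardinality: lump countries with fewer than 6 subscribers into "other".
counts = leads["country_code"].value_counts()
rare = counts[counts < 6].index
leads["country_code"] = leads["country_code"].where(
    ~leads["country_code"].isin(rare), "other"
)
```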

The plot below shows examples of the new features created and their relationship with made_purchase (our target).

Looking at the first event (tag_learning_lab_09), we can see that if a lead attended that event, they have a 19% likelihood of making a purchase, versus only 5% if they did not. The same pattern holds for the other events. The difference in purchase rates between attendees and non-attendees indicates a correlation between event attendance and purchasing behavior.

Machine Learning


This phase of the analysis focuses on building machine learning models for email lead scoring. As a reminder, the goal is to predict and score email subscribers that are likely to make a purchase, based on the features identified and engineered in previous sections. This is therefore a binary classification problem. For modeling, we use the PyCaret Python package. PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows.

Testing Multiple Models

Several models were initially tested, using Area Under the Curve (AUC) as the key metric. A higher AUC indicates a model that is better at distinguishing positive and negative leads. The chart below shows the AUC along with other metrics from initial modeling. The top 3 models in terms of AUC are the Gradient Boosting Classifier (0.8044), CatBoost Classifier (0.8015), and Ada Boost Classifier (0.7965).
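In PyCaret, this entire comparison takes only a few lines. A minimal sketch, assuming the engineered data is in a leads DataFrame:

```python
from pycaret.classification import setup, compare_models

# Initialize the experiment: PyCaret infers column types and builds
# its preprocessing pipeline around the target.
clf = setup(data=leads, target="made_purchase", session_id=123)

# Cross-validate a suite of classifiers and rank them by AUC.
top_models = compare_models(sort="AUC", n_select=3)
```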

For details on model metrics, check out the full article.

Experiment Tracking/Logging (MLflow)

MLflow is an open-source platform designed to help data scientists and machine learning engineers track and manage their machine learning experiments. It provides tools for experiment logging, reproducibility, and model management. Developed by Databricks, MLflow aims to simplify the machine learning lifecycle by enabling users to keep track of experiments, compare different models, and efficiently share and deploy ML projects.

PyCaret can be seamlessly integrated with MLflow to leverage its powerful experiment tracking capabilities. The integration allows data scientists using PyCaret to log their experiments automatically into MLflow, making it easy to keep track of multiple experiments and compare different models efficiently.
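In practice, enabling this integration comes down to two arguments in PyCaret's setup call (the experiment name below is an assumption):

```python
from pycaret.classification import setup

clf = setup(
    data=leads,
    target="made_purchase",
    log_experiment=True,                   # log all runs to MLflow
    experiment_name="email_lead_scoring",  # assumed experiment name
)
# Each subsequent compare_models/create_model run is now tracked;
# inspect the runs in the browser with:  mlflow ui
```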

Return On Investment (ROI) Analysis

After creating predictive models, our focus shifts to a critical aspect of the project: Return on Investment (ROI) analysis. This involves tying machine learning models to financial value by determining the potential cost or savings of a machine learning model's predictions. This process is further divided into two phases:

  1. Identifying a threshold with which to categorize leads as Hot-Leads or Cold-Leads based on their score (probability of purchase) from the model. This allows us to determine the expected value (sales minus cost) of targeting only Hot-Leads.
  2. Optimizing the threshold for maximum efficiency and ROI, based on senior management's constraints.

Again, the primary goal here is to evaluate the financial implications of classifying leads as Hot-Leads (high probability of purchase) or Cold-Leads (low probability of purchase). Hot-Leads will be targeted with sales emails, while Cold-Leads will be targeted with value emails, such as free products or CTAs to attend webinars that are highly correlated with making a purchase. There is an inherent cost-versus-savings trade-off in targeting different types of leads. By not targeting Cold-Leads we potentially miss out on some revenue; however, there are also savings, as we nurture the Cold-Leads and potentially gain more sales in the future.

Initial Threshold & Cost Savings

This step involved calculating the cumulative gain of our machine learning model to enhance our return on investment (ROI). Here, the predictions are sorted by the model's score (probability of making a purchase). The cumulative gain then measures the proportion of Hot-Leads and Cold-Leads at a given threshold. An arbitrary threshold of 0.95 is used here as the cutoff for categorizing Hot-Leads and Cold-Leads.
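A sketch of the split, assuming a leads_scored DataFrame with a p_purchase probability column, and reading the 0.95 threshold as the share of cumulative gain assigned to Hot-Leads (one plausible interpretation of the description above):

```python
import pandas as pd

leads_scored = pd.read_csv("data/leads_scored.csv")  # hypothetical model output

threshold = 0.95
ranked = leads_scored.sort_values("p_purchase", ascending=False).copy()

# Cumulative gain: running share of expected purchases captured so far.
ranked["cumulative_gain"] = ranked["p_purchase"].cumsum() / ranked["p_purchase"].sum()
ranked["category"] = (ranked["cumulative_gain"] <= threshold).map(
    {True: "Hot-Lead", False: "Cold-Lead"}
)
print(ranked["category"].value_counts(normalize=True))
```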

The table below shows the proportion of hot and cold leads based on this arbitrary 0.95 threshold.

This approach refines our marketing strategy. Instead of targeting all leads with sales emails, we now focus only on sending sales emails to Hot-Leads, while simultaneously nurturing Cold-Leads, whom we expect to make a purchase within 90 days. The result is some initial loss in sales, since not all leads are targeted. Based on the Cold-Lead row in the table above, we can see that 5,528 (27%) of leads will NOT be targeted with sales emails, resulting in 49 potential lost purchases. However, this initial loss in sales will be offset as we nurture Cold-Leads and get them to make a purchase within 90 days.

Expected Value (Savings vs Cost)

As highlighted above, there is a cost-and-savings trade-off in targeting only Hot-Leads while nurturing Cold-Leads. Expected value, in this case, is the financial value associated with the various cost and savings scenarios. Let's dive into the expected value calculations for this project, based on our 0.95 threshold for determining hot and cold leads and the proportions of hot and cold leads shown in the previous table.

Before proceeding, we’ll also need some of the values stated earlier in the Business Understanding section -

Current Company KPIs

These current KPIs are then used, along with the Hot-Lead/Cold-Lead counts calculated earlier, to generate some preliminary expected value calculations —

Preliminary Expected Value Calculations

Finally, these preliminary calculations are used to compute the expected value. The table below breaks down cost vs. savings for this new strategy against our old strategy of targeting all leads with sales emails.

Expected Value with Email Lead Scoring Strategy

Key Takeaways: The proposed new strategy, focusing only on Hot-Leads, is expected to generate sales of approximately $237,266 in the first month. This approach differs from the previous strategy where we targeted all leads with sales emails. Despite this initial month’s sales being 5% lower than our usual monthly sales of $250,000, we will save $52,882 each month by not sending sales emails to Cold-Leads. More importantly, by nurturing Cold-Leads effectively and converting them into buyers within 90 days, we anticipate a significant increase in sales. We project our net sales to reach $290,148, marking a 16% rise compared to our current monthly sales.
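The net figure is simply the sum of these two components:

```python
hot_lead_sales    = 237_266  # expected first-month sales from Hot-Leads only
cold_lead_savings = 52_882   # monthly savings from not emailing Cold-Leads

net_sales = hot_lead_sales + cold_lead_savings  # $290,148
print(f"lift over baseline: {net_sales / 250_000 - 1:.0%}")  # ≈ 16%
```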

For details on how these values are calculated, check out this link.

Threshold Optimization

Our initial ROI analysis used a 95% threshold to classify leads as Hot (likely to buy) or Cold (less likely to buy). This section explores how changing this threshold affects our expected profits and cost savings.

The aim is to find the sweet spot that maximizes profit. However, we must also consider management’s concerns. For example, using the 95% threshold, we saw a 5% drop in sales in the first month. Senior management may be hesitant about any strategy that reduces monthly sales, even if it could lead to higher sales later on.

To address this, we're now focusing on optimizing the threshold. This means finding a balance that maintains a certain level of monthly sales while still aiming for long-term profit growth. We'll present different threshold options to management, showing how each one impacts monthly costs and potential savings. This way, management can make informed decisions, balancing immediate sales with long-term profitability.

The plot below shows the optimization results, with thresholds ranging from 0 to 1 on the x-axis and their corresponding expected values on the y-axis. A significant point is marked by a red dashed line at the 88% threshold (for categorizing leads as Hot or Cold), where we achieve the maximum expected value of $322,893. This maximum is achieved while also ensuring that at least 87% of usual monthly sales ($219,595) are retained under the new email lead scoring strategy. This 87% minimum in retained monthly sales is what we'll call the Monthly Sales Reduction Safeguard, balancing maximum value realization with sales stability.

Expected Value with Monthly Sales Reduction Safeguard of 87%

However, management might consider this 87% safeguard to be too low. They may prefer a higher safeguard, for example, 90%. This means the new strategy should retain at least 90% of our monthly sales. The plot also shows how the expected value changes when we apply this 90% safeguard.

Expected Value with Monthly Sales Reduction Safeguard of 90%

Notice that with a 90% safeguard, the expected value drops to $317,695. However, we see an increase in the current monthly sales retained: $224,532 in this case.
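Mechanically, this is a constrained grid search over thresholds. A minimal sketch, assuming an expected_value(threshold) helper that returns the expected value and the monthly sales retained at that threshold:

```python
import numpy as np

def optimize_threshold(expected_value, safeguard=0.90, monthly_sales=250_000):
    """Pick the threshold with the highest expected value among those
    that retain at least `safeguard` of current monthly sales."""
    best_t, best_ev = None, -np.inf
    for t in np.linspace(0, 1, 101):
        ev, sales_retained = expected_value(t)
        if sales_retained >= safeguard * monthly_sales and ev > best_ev:
            best_t, best_ev = t, ev
    return best_t, best_ev
```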

In conclusion, the core of our decision-making process lies in striking a balance between two crucial objectives: achieving higher expected value in the long term vs. maintaining current monthly sales levels in the short term. This trade-off presents a strategic challenge that all stakeholders must collectively navigate. The choice between prioritizing immediate sales stability and pursuing potentially greater profits down the line is pivotal. The optimization exercise shows the implications of various thresholds, but it ultimately falls to the collective agreement of all involved parties to determine the most suitable path forward.

Backend Deployment (FastAPI)

The goal of this phase was to create infrastructure to integrate our lead scoring models into the business process. This integration is achieved through the creation of Application Programming Interfaces (APIs). The APIs serve as communication gateways between our data models and the user-facing application, in this case, a Streamlit app.

API Endpoints

Our API, developed using [FastAPI](https://fastapi.tiangolo.com/), features several endpoints, each serving a specific purpose:

1. Main Endpoint (“/”)

This is the landing page of our API. It provides users with a welcoming interface and guides them to the API documentation. This endpoint is crucial for user orientation and ease of use.

2. Get Email Subscribers (“/get_email_subscribers”)

This GET endpoint exposes our email subscriber data. When accessed, it returns the data with scored leads in JSON format. It’s vital for stakeholders to view and understand the current leads database.

3. Data Reception (“/data”)

A POST endpoint designed to receive data. Users can submit data in JSON format, which is then processed and stored. This endpoint is essential for updating our leads database with new information.

4. Lead Scoring Prediction (“/predict”)

This POST endpoint is the heart of our API. It accepts lead data and returns scored leads, using our proprietary lead scoring models. It enables the application of our predictive model to new or existing data for real-time lead scoring.

5. Lead Scoring Strategy Calculation (“/calculate_lead_strategy”)

Another POST endpoint, it calculates and optimizes lead scoring strategies based on various parameters (e.g., sales reduction safeguard, email list size). This is crucial for strategic decision-making and marketing optimization.
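Putting the five endpoints together, a skeletal version of the API might look like the sketch below. The lead-scoring and strategy logic are stubbed out, and the file path, payload shapes, and query parameters are assumptions:

```python
import pandas as pd
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, JSONResponse

app = FastAPI()
LEADS_PATH = "data/leads_scored.csv"  # hypothetical path

@app.get("/")
async def main():
    # Landing page that points users to the auto-generated docs.
    return HTMLResponse("<h1>Email Lead Scoring API</h1><p>See <a href='/docs'>/docs</a>.</p>")

@app.get("/get_email_subscribers")
async def get_email_subscribers():
    # Expose the scored leads database as JSON.
    leads_df = pd.read_csv(LEADS_PATH)
    return JSONResponse(leads_df.to_dict(orient="records"))

@app.post("/data")
async def receive_data(request: Request):
    # Receive new leads as JSON (storage is stubbed out here).
    payload = await request.json()
    leads_df = pd.DataFrame(payload)
    return {"rows_received": len(leads_df)}

@app.post("/predict")
async def predict(request: Request):
    # Score incoming leads; the trained model would be loaded and applied here.
    payload = await request.json()
    leads_df = pd.DataFrame(payload)
    leads_df["p_purchase"] = 0.5  # placeholder for model.predict_proba(...)
    return JSONResponse(leads_df.to_dict(orient="records"))

@app.post("/calculate_lead_strategy")
async def calculate_lead_strategy(
    request: Request, safeguard: float = 0.90, monthly_sales: float = 250_000
):
    # Apply the Hot/Cold strategy given a sales safeguard
    # (threshold optimization is stubbed to a fixed 0.95 here).
    payload = await request.json()
    leads_df = pd.DataFrame(payload)
    leads_df["category"] = (leads_df["p_purchase"] >= 0.95).map(
        {True: "Hot-Lead", False: "Cold-Lead"}
    )
    return JSONResponse(leads_df.to_dict(orient="records"))
```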

APIs play a pivotal role in the data science lifecycle, especially in the deployment phase. They enable the seamless integration of data science models into business processes, making predictive insights accessible and actionable for decision-makers. APIs facilitate real-time data processing and interaction, which is essential for dynamic and responsive business strategies.

Frontend Deployment (Streamlit)

In this final phase of the email lead scoring project, we transition from the development of API endpoints for our lead scoring strategy to their practical application. Recognizing the importance of deployment in the lifecycle of any data science project, we integrated these APIs into a user-friendly Streamlit application. This step is instrumental in transforming our analytical insights into actionable tools, directly accessible to senior management and other marketing stakeholders.

Please note that the Streamlit application is currently in the final stages of development and I’m still learning how to host a Streamlit application that uses FastAPI on a cloud platform. As soon as the deployment is complete, a direct link will be added to this write-up.

When first accessed, the link brings the user to an authentication screen where they will need to input their secret keys.

Once logged in successfully, the user is prompted to upload a CSV file with the scored leads data. In an enterprise setting, the app would connect directly to the company's database.

Next, the user can adjust the monthly sales and monthly sales safeguard based on the strategy they would like to implement, then hit the Run Analysis button.

Once the analysis is complete, the user will be able to see the expected value table and plot based on their monthly sales and monthly sales safeguard inputs.

Additionally, the user can see and download the lead scoring strategy data, with lead information and each lead's Hot-Lead/Cold-Lead classification.
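A minimal sketch of this front end; the widget labels mirror the walkthrough above, while the local API URL and payload shapes are assumptions (authentication is omitted for brevity):

```python
import pandas as pd
import requests
import streamlit as st

st.title("Email Lead Scoring Strategy")

uploaded = st.file_uploader("Upload scored leads (CSV)", type="csv")
monthly_sales = st.number_input("Monthly sales ($)", value=250_000)
safeguard = st.slider("Monthly sales reduction safeguard", 0.0, 1.0, 0.90)

if uploaded is not None and st.button("Run Analysis"):
    leads_df = pd.read_csv(uploaded)
    # Hand the heavy lifting to the FastAPI backend.
    resp = requests.post(
        "http://localhost:8000/calculate_lead_strategy",  # assumed local API URL
        params={"safeguard": safeguard, "monthly_sales": monthly_sales},
        json=leads_df.to_dict(orient="records"),
    )
    strategy_df = pd.DataFrame(resp.json())
    st.dataframe(strategy_df)
    st.download_button(
        "Download lead strategy",
        strategy_df.to_csv(index=False),
        file_name="lead_strategy.csv",
    )
```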

In conclusion, the successful deployment of our Streamlit application is a crucial step in our email lead scoring project. This user-friendly platform is key in making our advanced data models accessible and practical for everyday use within the company. By providing senior management and marketing teams with this tool, we enable them to easily measure results and understand the financial impact of our lead scoring strategy.

Conclusion

This project has successfully demonstrated the integration of advanced data science techniques into practical business solutions for email lead scoring. The journey from conceptualization to the development of a functional model highlights several key achievements and insights, including —

  • Strategic application of data science.
  • Technical implementation and enablement.
  • Financial implications and future recommendations.

In conclusion, this project showcases the practical application of machine learning in a business context. The insights and methodologies developed here can serve as a blueprint for similar initiatives in the future, driving forward the agenda of data-centric decision-making in the business world.

Next Steps

  • Deploy Streamlit app to a cloud platform.
  • Continue learning how to integrate FastAPI with web applications.

Reproducible code available on GitHub.
