• SoniaSamipillai

Predict Customer Spend Behavior based on historical spend using Regression models in Python

Updated: May 7, 2021

"How much will the customer spend?"

Business Scenario

Making predictions is one of the primary goals in marketing data science. For some problems we are asked to predict customer spend or future sales for a company or make sales forecasts for product lines. Let us suppose that the business would like to predict the customer spending behavior.


We have sample historical transaction data from 2010 and 2011, and we want to predict customer spend for the year 2011. We can use the 2010 data to predict 2011 spending behavior in advance, provided the market or business has not changed significantly since the period the model was fitted on.


Since we use labelled historical data to predict future values, the methods we use fall under Supervised Machine Learning. The output we are looking for is continuous, so this is a Regression problem. Regression is a supervised learning technique used to predict continuous outcomes. The model learns a function that maps the input features to the corresponding output, based on the labelled data that has been provided.

Machine Learning Flow

  1. Data Exploration

  2. Data Wrangling

  3. Feature Engineering or Data Transformation

  4. Data Modelling

  5. Interpretation of results

  6. Hyperparameter tuning (if and when required)

  7. Deployment of the model

  8. Monitoring

We will focus on the steps up to the interpretation of results. Steps 6 to 8 are beyond the scope of this blog post, and I might address them in future posts. We have the data, the business scenario and an idea of the ML methodology, so let us see if the data we have can be engineered to get the output we are looking for.

A glance at our sample data

The variables we have are Order number, Stock code, Description, Quantity, Order date, UnitPrice, CustomerID and Country.

Data Wrangling


OrderNo          int64
StockCode       object
Description     object
Quantity         int64
OrderDate       object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

First observations from the data types listed above. We need to:

  1. Convert OrderDate to pandas datetime

  2. Add year column as we need to segregate the data for training and prediction

  3. Address missing values and outliers.

Before we proceed further into data wrangling, we need to establish the logic and a plan to get to the output from the given inputs, because the inputs that we need are kinda hidden in this data and have to be engineered.

Formulate a Theoretical model based on Logic

Initially we only knew that we have a set of input features and an output. We knew that we were going to predict a continuous variable and that historical data for the input-output pairs is available. We are good so far. Now that we have had a peek at the sample data, there are a few things to decide.

  1. What are the input variables (independent variables) and what is the output variable (dependent or response variable)?

  2. How can I get the inputs and output I need, from the given data?

I am trying to predict how much the customer will spend in the year 2011. Hence the output variable is Customer Spend. We need to know how frequently the customer buys, how much money they spend in each transaction, and how recently they did business with us. Yes, we are looking at RFM data: the Recency, Frequency and Monetary behavior of our customers. Did you know that the Recency factor is generally considered more important than the Frequency and Monetary factors? That is why R comes before F and M. A recent customer who bought a product last week for USD 10 is more valuable than a customer who bought a product 2 years ago for USD 100.

How can we extract this information from the given data? Here comes Feature Engineering. In Feature Engineering, we transform the existing raw data into features that capture the essence of the given data and help to predict the outcome of interest.

Data Wrangling and Feature Engineering in action

I have used the pandas package to process the data. The link to the notebook and the data are given at the end of this post. Here are the steps performed.

1. Convert OrderDate to pandas datetime format.

2. Calculate spend using the quantity and unitprice grouped for each order

3. Add year column as we need to segregate the data for training and prediction.

4. Calculate the recency from the last purchase the customer has made.

5. Calculate the spend per customerId by grouping the invoice orders for each of the customer id.

6. Calculate the frequency from number of orders the customers have placed.

7. Calculate the Average spend per order.

8. Calculate the customer spend for 2010 in order to predict the customer spend for 2011.
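The steps above can be sketched in pandas roughly as follows. This is a minimal illustration and not the notebook's actual code: the sample rows are made up, the feature column names (`spend_2010`, `number_of_purchases`, and so on) are my own labels, and measuring recency against a 31 December 2010 cutoff is an assumption.

```python
import pandas as pd

# Tiny made-up sample using the dataset's column names (values are illustrative only).
raw = pd.DataFrame({
    "OrderNo": [1, 1, 2, 3, 4, 5],
    "Quantity": [2, 1, 5, 1, 3, 2],
    "OrderDate": ["2010-03-01", "2010-03-01", "2010-11-20",
                  "2010-06-15", "2011-02-10", "2011-05-05"],
    "UnitPrice": [10.0, 5.0, 2.0, 20.0, 4.0, 15.0],
    "CustomerID": [100.0, 100.0, 100.0, 200.0, 100.0, 200.0],
})

# Step 1: convert OrderDate to pandas datetime.
raw["OrderDate"] = pd.to_datetime(raw["OrderDate"])

# Step 2: spend per line item, then summed per order.
raw["Spend"] = raw["Quantity"] * raw["UnitPrice"]
orders = raw.groupby(["OrderNo", "CustomerID"], as_index=False).agg(
    OrderDate=("OrderDate", "max"),
    Spend=("Spend", "sum"),
)

# Step 3: year column, used to split the training period from the target period.
orders["Year"] = orders["OrderDate"].dt.year
orders_2010 = orders[orders["Year"] == 2010]
orders_2011 = orders[orders["Year"] == 2011]

# Steps 4 to 7: recency, spend, frequency and average spend per customer for 2010.
cutoff = pd.Timestamp("2010-12-31")  # assumed reference date for recency
features = orders_2010.groupby("CustomerID").agg(
    spend_2010=("Spend", "sum"),
    number_of_purchases=("OrderNo", "nunique"),
    first_purchase=("OrderDate", "min"),
    last_purchase=("OrderDate", "max"),
)
features["days_since_first_purchase"] = (cutoff - features["first_purchase"]).dt.days
features["days_since_last_purchase"] = (cutoff - features["last_purchase"]).dt.days
features["avg_spend_per_order"] = features["spend_2010"] / features["number_of_purchases"]
features = features.drop(columns=["first_purchase", "last_purchase"])

# Step 8: the 2011 spend per customer is the response variable.
target = orders_2011.groupby("CustomerID")["Spend"].sum().rename("spend_2011")

# An outer join keeps customers seen in only one of the two years (as NaN rows).
dataset = features.join(target, how="outer")
print(dataset)
```

Each row of `dataset` is one customer, with the 2010 features as inputs and the 2011 spend as the output.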

The transformed dataset is shown below

Addressing Missing Values and Outliers

One needs to check for and address the missing values and outliers.

Missing Values - The nan and inf values

The missing values here are caused by customers who were active in only one of the two years, 2010 or 2011. This brings in customer churn, which I will discuss in another blog post. For now, we remove the missing values and proceed to the next step. This means that our model makes the prediction for 2011 under the assumption that the customer remains active.
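As a rough sketch of this step, assuming the features live in a pandas DataFrame with one row per customer (the column names and values below are invented for illustration), the inf values can first be turned into NaN and then all incomplete rows dropped:

```python
import numpy as np
import pandas as pd

# Made-up feature table: customer 200 has no 2011 spend, customer 300 has no
# 2010 history, and customer 400 has an infinite average.
features = pd.DataFrame(
    {
        "spend_2010": [35.0, 20.0, np.nan, 50.0],
        "avg_spend_per_order": [17.5, 20.0, np.nan, np.inf],
        "spend_2011": [12.0, np.nan, 40.0, 60.0],
    },
    index=pd.Index([100, 200, 300, 400], name="CustomerID"),
)

# Treat +/-inf as missing, then drop every row that has any missing value.
clean = features.replace([np.inf, -np.inf], np.nan).dropna()
print(clean)
```

Only customers with complete data in both years remain, which matches the assumption that the customer stays active.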


To see if the data is affected by outliers, we will look at its distribution.

As we can see in this chart, the minimum customer spend in data is 12.45 while the maximum is approx. 27k.

The majority of data falls under the range of 0 to around 2500.

Any value beyond three standard deviations from the measure of central tendency is considered to be an outlier. In data that is skewed by the presence of outliers, the statistician looks at the median as the measure of central tendency rather than the mean, because the mean tends to be affected by outliers. There are different treatments for outliers. Outliers definitely need to be looked at before building a model, as they do not reflect normal behavior and would have a disproportionate effect on the model. In this case we remove the outliers so that they do not affect the predictions. The data is now ready for the next step.
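One way to apply this rule is sketched below. The helper `remove_outliers` is a hypothetical name of my own, and measuring the distance from the median (rather than the mean) is an assumption based on the reasoning above. The sample values are made up.

```python
import pandas as pd


def remove_outliers(df, column, n_std=3.0):
    """Keep rows whose value lies within n_std standard deviations of the median."""
    center = df[column].median()
    spread = df[column].std()
    keep = (df[column] - center).abs() <= n_std * spread
    return df[keep]


# 19 ordinary customers (100, 200, ..., 1900) and one extreme spender.
spend = pd.DataFrame({"spend_2011": list(range(100, 2000, 100)) + [27000]})

filtered = remove_outliers(spend, "spend_2011")
print(len(spend), "rows before,", len(filtered), "rows after")
```

Removing rows is only one possible treatment. Capping the values or transforming the variable are common alternatives.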

Data Exploration of Features

We need to do a quick check on the features to assess their distributions and the relationships between them.

With seaborn's pairplot (sns.pairplot) we can see both histograms and scatterplots, which show whether two variables are correlated. In addition, we can check the correlation using the pandas corr() function.

The last row of the table is the response variable, '2011 customer spend'. We see that it has a positive correlation with '2010 customer spend', 'number of purchases' and 'avg spend per order', and a negative correlation with 'days since last purchase'. The variable 'days since first purchase' has a very weak correlation with the response variable, and hence we can drop it. Correlation only tells us about the strength and direction of the relationship. It does not imply causation.
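The correlation check can be sketched as follows. The data here is synthetic, generated only so that the example runs on its own, and the column names are assumptions. The real table comes from the engineered features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two features and the response (not the real data).
spend_2010 = rng.gamma(2.0, 300.0, n)
days_since_last_purchase = rng.integers(1, 365, n)
spend_2011 = 0.8 * spend_2010 - 2.0 * days_since_last_purchase + rng.normal(0, 50, n)

features = pd.DataFrame({
    "spend_2010": spend_2010,
    "days_since_last_purchase": days_since_last_purchase,
    "spend_2011": spend_2011,
})

# Pairwise Pearson correlations; the 'spend_2011' row relates each feature
# to the response variable.
corr = features.corr()
print(corr.loc["spend_2011"].round(2))
```

Passing the same DataFrame to `sns.pairplot(features)` draws the matching grid of histograms and scatterplots.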

Data Modelling

We will try the following Regression models with KFold cross-validation and select the best model based on predictive power. As the list below shows, we have chosen both linear and non-linear models for prediction. The evaluation metrics will tell us which model is a good fit for this data.

Regression Models used

1. Linear_Regression

2. Lasso_Regression

3. Ridge_Regression

4. Decision Tree Regressor

5. Random Forest Regressor

6. Extra Trees Regressor

7. XGB Regressor

8. SVM

9. KNN Regressor

10. Neural net - Multilayer Perceptrons
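A minimal sketch of comparing models with KFold validation in scikit-learn is shown below. It covers only a subset of the models listed above, uses synthetic data so that it runs on its own, and the hyperparameters (alpha values, number of trees, 5 folds) are arbitrary assumptions rather than the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate

# Synthetic data with a mostly linear relationship (a stand-in for the real features).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

models = {
    "Linear_Regression": LinearRegression(),
    "Lasso_Regression": Lasso(alpha=0.01),
    "Ridge_Regression": Ridge(alpha=1.0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# The same folds are reused for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = {"mae": "neg_mean_absolute_error", "r2": "r2"}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {
        "MAE": -scores["test_mae"].mean(),  # scikit-learn reports MAE negated
        "R2": scores["test_r2"].mean(),
    }
    print(f"{name}: MAE={results[name]['MAE']:.3f}, R2={results[name]['R2']:.3f}")
```

The remaining models can be added to the `models` dictionary in the same way, as long as they follow the scikit-learn estimator interface.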

Evaluation Metrics

The evaluation measures used are

1. Mean Absolute Error, MAE

2. Mean Squared Error, MSE

3. Root Mean Squared Error, RMSE

4. R-squared, R²

5. Pearson correlation coefficient, Pearsonr

The common metrics used to evaluate regression models rely on the concepts of residuals and errors, which are quantifications of how much a model errs in predicting a particular data point.

1. MAE or the Mean Absolute Error takes the absolute value of all residuals and calculates the average. Therefore, a value of zero would mean that your model predicts everything perfectly, and larger values mean a less accurate model. Since we have multiple models, we can look at the MAE and prefer the model with the lower value.

2. MSE or the Mean Squared Error addresses a limitation of MAE. MAE averages big and small errors alike, so there is no extra penalty for big errors. In order to penalize the larger residuals, we look at the squared error. By squaring the error term, large errors are weighted more heavily than small ones that add up to the same total amount of error.

3. MSE can be more difficult to interpret as the squared values of the larger residuals could be really huge numbers. Therefore, it is common to take the root of the mean squared error, resulting in the root mean squared error (RMSE).

4. The r2_score function computes the coefficient of determination, usually denoted as R².

It represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.

5. Pearson correlation coefficient

The Pearson correlation coefficient, r, can take on values between -1 and 1. The further away r is from zero, the stronger the linear relationship between the two variables. The sign of r corresponds to the direction of the relationship. If r is positive, then as one variable increases, the other tends to increase. If r is negative, then as one variable increases, the other tends to decrease. A perfect linear relationship (r=-1 or r=1) means that one of the variables can be perfectly explained by a linear function of the other.

scipy.stats.pearsonr also returns a second value, the p-value. Its documentation defines it as follows: "The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets." With regard to our results, a small p-value means that the positive correlation between the actual and the predicted values is unlikely to have occurred by chance.
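To make the five metrics concrete, here is a small worked example in NumPy. The actual and predicted values are made up. In practice the functions in sklearn.metrics and scipy.stats.pearsonr (which also returns the p-value) compute the same quantities.

```python
import numpy as np

# Made-up actual and predicted customer spend for four customers.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

residuals = y_true - y_pred              # [-10, 10, -30, 0]

mae = np.mean(np.abs(residuals))         # (10 + 10 + 30 + 0) / 4 = 12.5
mse = np.mean(residuals ** 2)            # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                      # back in the units of the target

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Pearson correlation between the actual and predicted values.
pearson_r = np.corrcoef(y_true, y_pred)[0, 1]

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.2f}, R2={r2:.3f}, r={pearson_r:.3f}")
```

The single 30-unit miss contributes 900 of the 1100 total squared error, which shows how MSE and RMSE weight large errors more heavily than MAE does.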

Results and Interpretation

Below are the metrics for each of the models on our predicted data. Analyzing the results, one can see that the SVM Regressor's Pearsonr and p-value imply that its predictions are not significantly correlated with the actual values, so we cannot use SVM for this data. Checking for the lowest MAE values, we see that the Neural net model has the lowest MAE of all. However, its R2 is much lower than that of the other models, hence we will not proceed with Neural nets. The next lowest MAE values come from Linear Regression, Lasso Regression and Ridge Regression. These models look promising on MAE, R2 and Pearsonr and would be more suitable for modelling this data.


Now that we have successfully built the models and decided which one to proceed with, there is still a lot of room for improving accuracy. We can increase predictive power by adding more data, adding better-quality features, removing features that have very little explanatory power, and tuning the hyperparameters of the algorithm. The code for the models and the entire work described can be found in the notebook link.

Hope this post was helpful! If you are interested in reading more, please subscribe to be notified when the next article is published.
