Portfolio Project — Data Science

Retail Customer
Lifetime Value
Optimization Engine

An end-to-end machine learning pipeline built on 1M+ wholesale retail transactions. Predicts customer lifetime value, identifies churn risk, segments the customer base, and optimises promotional targeting using uplift modelling.

1,067,371Raw Transactions
5,862Unique Customers
£17.4MTotal Revenue
4 ModelsBG/NBD · GG · XGBoost · T-Learner
01 — Dataset Overview

The Raw Data

We used the Online Retail II dataset from the UCI Machine Learning Repository. It contains all transactions from a UK-based wholesale gift retailer between December 2009 and December 2011. Their customers are businesses — gift shops, boutiques, and resellers — who buy in bulk to stock their shelves.

1,067,371Total Rows
8Columns
5,942Unique Customers
53,628Unique Invoices
£476Avg Order Value
2 YearsDate Range
ColumnTypeExampleWhat it means
Invoicestring489434Unique order ID. Invoices starting with 'C' are cancellations.
StockCodestring85048Unique product identifier.
Descriptionstring15CM CHRISTMAS GLASS BALLProduct name. Some rows have missing descriptions.
Quantityint12Units purchased. Negative values indicate returns.
InvoiceDatedatetime2009-12-01 07:45Date and time of the transaction.
Pricefloat6.95Price per unit in GBP. Zero prices indicate errors.
Customer IDfloat13085.0Unique customer identifier. 243,007 rows have no Customer ID.
CountrystringUnited KingdomCountry of the customer. UK represents 83.8% of revenue.
Important context: Each row is one product line from one invoice — not one order. A single order can contain 20+ rows, one per product. Average order value of £476 confirms this is a wholesale business — not a typical consumer retailer.
02 — Data Cleaning

From Raw to Reliable

Raw transaction data is never production-ready. We applied five cleaning steps, each with a clear business reason. We always kept the original raw data untouched and stored clean data in a new variable.

Raw Data — First 5 Rows (before any cleaning)

This is exactly what we loaded from the Excel file. Notice: Customer ID is a float (13085.0), some rows have no Customer ID (NaN), invoices starting with 'C' are cancellations, and there is no TotalPrice column yet.

InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountry
4894348504815CM CHRISTMAS GLASS BALL 20 LIGHTS122009-12-01 07:456.9513085.0United Kingdom
48943479323PPINK CHERRY LIGHTS122009-12-01 07:456.7513085.0United Kingdom
48943479323WWHITE CHERRY LIGHTS122009-12-01 07:456.7513085.0United Kingdom
48943422041RECORD FRAME 7" SINGLE SIZE482009-12-01 07:452.1013085.0United Kingdom
C48944921258VICTORIAN SEWING BOX LARGE-52009-12-01 09:0110.95NaNUnited Kingdom

Row 5 shows two problems at once — a cancellation (C489449) and a missing Customer ID (NaN). Both will be removed during cleaning.

1

Remove rows with no Customer ID

243,007 rows had no customer identifier. Without a Customer ID we cannot attribute the purchase to any individual — making the row useless for customer-level analysis. These are likely guest checkouts or system entries.

− 243,007 rows removed → 824,364 remaining
2

Remove cancelled invoices

Invoices starting with 'C' (e.g. C489434) represent cancellations or returns. These are not real purchases — including them would artificially inflate revenue and distort purchase frequency counts.

− 18,744 rows removed → 805,620 remaining
3

Remove invalid quantities and prices

71 rows had a price of zero or negative. A transaction with zero price is either a data entry error or a system glitch — it cannot represent a real sale.

− 71 rows removed → 805,549 remaining
4

Remove non-product stock codes

Some rows represent internal charges, not product sales: POST (postage — 1,838 rows), M (manual adjustments — 709), BANK CHARGES (32), PADS (17), DOT (16), D (discounts — 5). These are accounting entries that would corrupt any revenue analysis.

− 2,617 rows removed → 802,932 remaining
5

Fix data types and derive columns

Customer ID was stored as a float (13085.0) — converted to a clean string "13085". InvoiceDate was stored as text — parsed to proper datetime for date calculations. Three new columns derived: TotalPrice (Quantity × Price), invoice_dow (day of week 0–6), invoice_hour (0–23).

0 rows removed — type corrections and column additions only

Clean Data — First 5 Rows (after all 5 cleaning steps)

Now Customer ID is a clean string, InvoiceDate is a proper datetime, TotalPrice is calculated, cancellations are gone, and all rows have valid quantities and prices.

InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountryTotalPrice
4894348504815CM CHRISTMAS GLASS BALL 20 LIGHTS122009-12-01 07:456.9513085United Kingdom£83.40
48943479323PPINK CHERRY LIGHTS122009-12-01 07:456.7513085United Kingdom£81.00
48943479323WWHITE CHERRY LIGHTS122009-12-01 07:456.7513085United Kingdom£81.00
48943422041RECORD FRAME 7" SINGLE SIZE482009-12-01 07:452.1013085United Kingdom£100.80
48943421232STRAWBERRY CERAMIC TRINKET BOX242009-12-01 07:451.2513085United Kingdom£30.00
1,067,371Raw Rows
264,439Rows Removed
802,932Clean Rows
75.2%Data Retained
03 — Exploratory Data Analysis

Understanding the Business

Before building any models, we explored the data visually to understand revenue patterns, customer geography, and seasonal behaviour. This shapes all modelling decisions that follow.

What This Dataset Represents

This is a UK-based wholesale retailer. Their customers are businesses — gift shops, boutiques, and online resellers — who buy products in bulk and then resell them to end consumers at a higher price.

Customer 13085 bought 12 units of "15CM CHRISTMAS GLASS BALL" at £6.95 each → Total: £83.40

That is not an individual person buying one Christmas ornament for their home. That is a shop owner buying 12 of them to stock their shelves and sell individually to their retail customers.

This explains two things we observed in the data:

When we say "customer" in this project, we mean a business — not an individual consumer. CLV, churn, and segmentation all apply to these business buyers.

Monthly Revenue Trend

Total revenue grouped by month across the full 2-year observation period (Dec 2009 – Dec 2011). December 2011 appears low because the dataset only covers the first 9 days of that month.

Monthly Revenue (£)
Oct–Nov peaks clearly visible in both years — driven by Christmas stocking

Monthly Revenue Breakdown

YearMonthRevenue (£)% of TotalNotes
2009December£681,5303.9%Partial month (dataset starts Dec 1)
2010January£539,0083.1%Post-Christmas slowdown
2010February£499,9502.9%Slowest month in dataset
2010March£668,9783.8%
2010April£588,1263.4%
2010May£595,0873.4%
2010June£632,6863.6%
2010July£584,0493.3%
2010August£597,1303.4%
2010September£808,5454.6%Pre-Christmas ramp begins
2010October£1,015,9895.8%🔺 Peak month
2010November£1,163,1546.7%🔺 Highest month in dataset
2010December£877,8145.0%
2011January£564,2273.2%Post-Christmas drop again
2011February£443,5462.5%Second lowest month
2011March£585,2633.4%
2011April£455,6672.6%
2011May£660,9183.8%
2011June£654,9333.8%
2011July£593,2323.4%
2011August£637,1573.7%
2011September£941,8015.4%Pre-Christmas ramp begins again
2011October£1,006,3425.8%🔺 Peak month
2011November£1,143,2466.6%🔺 Second highest month
2011December£513,3802.9%Partial — only 9 days of data
Key insight — seasonal pattern repeats perfectly: October and November peak in both 2010 and 2011, with almost identical revenue figures (£1.01M–£1.16M). January and February are consistently the weakest months. This predictable seasonality means the business should plan inventory and staffing around this cycle every year.

Revenue by Country

The UK dominates with 83.8% of all revenue. The remaining 16.2% is spread across 37 other countries, mostly in Europe.

Revenue by Country (Top 10)
United Kingdom accounts for £14.6M of the £17.4M total
RankCountryRevenue (£)% of TotalBusiness Insight
1United Kingdom£14,627,41983.8%Home market — overwhelmingly dominant
2Ireland (EIRE)£602,0583.4%Close neighbour, easy logistics
3Netherlands£549,9533.2%Strong EU trading relationship with UK
4Germany£388,9602.2%Largest economy in Europe
5France£315,7141.8%Close neighbour
6Australia£168,4851.0%English-speaking market
7Spain£98,8410.6%
8Switzerland£93,6240.5%
9Sweden£86,0790.5%
10Denmark£68,5600.4%
27 other countries£456,0832.6%Tiny individual contributions
Business risk — over-reliance on UK market: 83.8% of revenue from one country is a concentration risk. If the UK economy slows down, the entire business is exposed. The business already has footholds in Ireland (3.4%), Netherlands (3.2%), and Germany (2.2%) — expanding these markets would significantly reduce this risk.
04 — Feature Engineering

Building Customer-Level Features

Raw data has one row per product line per invoice. Machine learning models need one row per customer. We transformed 802,932 transaction rows into a clean feature table of 5,862 customers — each described by 12 behavioural features. Final RFM table shape: (5862, 12).

Snapshot Date: 10th December 2011. All customer metrics are calculated relative to this single reference point — the day after the last transaction in the dataset (9th December 2011). We add one day to ensure no customer has a recency of 0. Think of it as the "as of this morning" date for the analysis.

How Each Feature Is Calculated

TotalPrice (derived column)
TotalPrice = Quantity × Price
Example: 12 units × £6.95 = £83.40
Added during cleaning. Every product line now has a monetary value. This is the foundation for all spend-based calculations. Not a customer-level feature — exists at transaction level.
Recency
Recency = Snapshot Date − MAX(InvoiceDate)
Example: 10 Dec 2011 − 8 Dec 2011 = 2 days
How many days since the customer last placed an order. Measured from snapshot date. Low recency = recently active. High recency = has gone quiet and may be churning.
Frequency
Frequency = COUNT(DISTINCT Invoice)
Example: Customer placed 8 unique invoices = Frequency 8
Total number of distinct orders placed by the customer. Each unique invoice number counts as one order regardless of how many product lines it contains.
Monetary
Monetary = SUM(TotalPrice) ÷ Frequency
Example: £5,633 total ÷ 8 orders = £704.16 avg order
Average order value in £. This is not total lifetime spend — it is the average amount spent each time the customer placed an order. Captures spending intensity per visit.
TotalSpend
TotalSpend = SUM(Quantity × Price)
Example: All orders combined = £5,633.32
Cumulative lifetime revenue from this customer. Unlike Monetary which is an average, TotalSpend is the running total across all orders ever placed. A high-frequency low-spend customer can still have high TotalSpend.
UniqueProducts
UniqueProducts = COUNT(DISTINCT StockCode)
Example: Bought 126 different StockCodes = 126 unique products
Number of different products the customer has purchased across all their orders. High value = explores the catalogue widely and buys variety. Low value = repeatedly buys the same product(s).
AvgBasketSize
AvgBasketSize = MEAN(Quantity per transaction row)
Example: Average 13 units per product line
Average number of physical units ordered per product line. Different from Monetary which measures £ value. A customer with AvgBasketSize=2000 is a bulk buyer of cheap products. One with AvgBasketSize=2 buys small quantities of expensive items.
Tenure
Tenure = MAX(InvoiceDate) − MIN(InvoiceDate)
Example: First order Jan 2010, last order Jan 2011 = 365 days tenure
Number of days between the customer's first and last purchase. Measures how long the customer was actively buying from this retailer. Different from Recency — Tenure measures their active lifespan, Recency measures their current silence.
PurchaseVelocity
PurchaseVelocity = MEAN(days between consecutive orders)
Example: Orders on Jan 1, Feb 1, Apr 1 → gaps: 31, 59 days → avg 45 days
Average number of days between consecutive orders. Represents the customer's natural buying rhythm. A customer with velocity=7 days buying weekly is very different from one with velocity=90 days buying quarterly — even if their recency is the same.
PreferredDayOfWeek
PreferredDayOfWeek = MODE(InvoiceDate.dayofweek)
Example: 4 out of 8 orders placed on Tuesday → PreferredDayOfWeek = 1
The day of the week (0=Monday, 6=Sunday) on which the customer most commonly places orders. Used for personalising promotion send timing — reach them the day before they naturally buy.
PreferredHour
PreferredHour = MODE(InvoiceDate.hour)
Example: Most orders placed at 14:00 → PreferredHour = 14
The hour of day (0–23) at which the customer most commonly places orders. Complements PreferredDayOfWeek for precise timing of marketing outreach.
PreferredDayName
PreferredDayName = MAP(PreferredDayOfWeek → name)
Example: 1 → "Tuesday"
Human-readable version of PreferredDayOfWeek. Maps 0–6 to Monday–Sunday. Added purely for readability in reports and dashboards — the model uses the numeric version.

RFM Feature Table — First 8 Rows

Shape: (5862, 12). One row per customer. This table is the input for all subsequent modelling steps.

Customer ID Recency Frequency Monetary TotalSpend UniqueProducts AvgBasketSize Tenure PurchaseVelocity PrefDOW PrefHour PrefDayName
1234632612£6,463£77,556272,184.940035.9013Monday
1234728£704£5,63312613.040257.0114Tuesday
12348755£332£1,6582458.836290.5014Monday
12349193£1,226£3,6791379.4570285.039Thursday
123503101£294£2941612.300.0216Wednesday
12352369£192£1,7337211.139349.1214Wednesday
123532042£203£4071314.5144144.0310Thursday
12356365£402£2,0154913.3425106.3214Wednesday
Reading a customer row — Customer 12347:
Recency=2 → bought 2 days before snapshot (very active) · Frequency=8 → placed 8 orders · Monetary=£704 → average £704 per order · TotalSpend=£5,633 → spent £5,633 total · UniqueProducts=126 → bought 126 different products · Tenure=402 → was an active customer for 402 days · PurchaseVelocity=57 → orders every 57 days on average · Preferred day: Tuesday at 14:00.
05 — Probabilistic CLV Modelling

BG/NBD + Gamma-Gamma

Standard ML models see "no purchase in 60 days" and predict churn. BG/NBD is smarter — it asks whether this silence is normal for this specific customer's buying rhythm, or genuinely alarming. It models two hidden processes simultaneously for every customer.

🛒

Purchase Process (NBD)

While alive, each customer buys at their own personal rate λ. Some buy weekly, some monthly, some quarterly. These rates vary across the population following a Gamma distribution — that's the Negative Binomial (NBD) part.

👻

Dropout Process (BG)

After each purchase, a customer has some probability p of never returning — quietly churning. Each customer has their own p, and these vary across the population following a Beta distribution — that's the Beta-Geometric (BG) part.

What the model needs (inputs)

frequency (BG/NBD)

Number of repeat purchases after the first. First purchase excluded — it's the starting point. Minimum 1 required. One-time buyers (1,681 customers) are filtered out before fitting.

recency (BG/NBD)

Days between first and last purchase. Not the same as RFM recency. RFM recency measures from snapshot date. BG/NBD recency measures within the customer's own history.

T (observation window)

Days between first purchase and snapshot date. Always ≥ recency. The gap (T − recency) = days the customer has been silent. This silence is the key signal for P(Alive).

Model parameters learned from your data

r = 1.517
Shape of purchase rate distribution
Controls how spread out buying frequencies are across your 4,181 customers. r ≈ 1.5 means moderate variety — there's a mix of fast and slow buyers but not extremely polarised.
α = 103.875
Scale of purchase rate distribution
Controls average purchase rate. High alpha (103) means customers buy roughly once every 103 days on average — slow but consistent. Expected for a wholesale retailer where businesses reorder seasonally.
a = 0.097
Beta distribution — churn tendency
Very small value. Very few customers have a high dropout tendency after each purchase. Most customers who buy once tend to come back.
b = 0.976
Beta distribution — retention tendency
Larger than a — confirms most customers tend to stay active after each purchase. Your customer base is loyal and sticky. Low overall churn risk population.

What the model outputs (per customer)

P(Alive)

Probability that the customer is still an active buyer. Between 0 and 1. A score of 0.985 means 98.5% confident still active. A score of 0.31 means likely churned.


Why it's powerful: Two customers with the same 300-day silence get very different P(Alive) if one naturally buys every 6 months and the other used to buy weekly.

Expected Future Purchases

How many times the model expects the customer to buy in the next N days (we used 90 days). A value of 1.47 means the customer is expected to place about 1–2 orders in the next quarter.


Why it's useful: Multiplied by average order value and profit margin gives us CLV — a financial number the business can act on directly.

Gamma-Gamma Spend Model

What it does: Estimates how much each customer will spend per order in the future, based on their historical spending pattern.

Key assumption verified: How often a customer buys (frequency) must be independent of how much they spend per order (monetary). We verified this — Pearson correlation was 0.023, well below the 0.30 threshold. Safe to proceed.

Parameters learned: p = 11.75 (spending consistency within each customer — high p means predictable spenders), q = 0.884 (spread of spending across customers — less than 1 means high variety between customers), v = 11.72 (population average spend anchor).
06 — CLV Calculation

90-Day Customer Lifetime Value

We combined both models to produce a single financial number per customer — how much profit they are expected to generate in the next 90 days.

CLV (90d) = E[future purchases] × E[avg order value] × profit margin × discount factor Where: E[future purchases] → from BG/NBD model (predicted_purchases_90d) E[avg order value] → from Gamma-Gamma model Profit margin → 20% (assumed gross margin) Discount rate → 1% monthly (money today > money tomorrow) Forecast horizon → 3 months (90 days)
4,181Customers Scored
£104Mean 90d CLV
£42Median 90d CLV
£21,705Max 90d CLV
81%Avg P(Alive)
Mean (£104) vs Median (£42): The mean is more than double the median. A small number of very high-value Champions are pulling the average up significantly. The median (£42) is a more honest representation of what a typical customer is worth. This gap is the Pareto effect in action.

Sample CLV scores

Customer IDFrequencyRecency (BG)TP(Alive)Pred. Purchases 90dCLV 90d
12347740240598.5%1.47£207.63
12346740072650.6%0.46£1,009.45
12348436343994.7%0.86£61.07
12353120440973.5%0.32£6.39
07 — Customer Segmentation

K-Means Clustering

We grouped all 5,862 customers into 4 segments using K-Means clustering. Both the Elbow Curve and Silhouette Score pointed to K=4 as optimal. We capped features at the 99th percentile before clustering to prevent extreme outliers from dominating the results.

Why we scaled features before clustering: K-Means measures distance between customers. Without scaling, TotalSpend (£77,000) would completely dominate UniqueProducts (27). StandardScaler converts every feature to zero mean and unit variance so all features contribute equally to the distance calculation.
Segment 3

Champions

Customers162 (2.8%)
Avg Recency37 days
Avg Frequency47 orders
Avg TotalSpend£46,443
Avg CLV 90d£1,148
Avg P(Alive)95%
✅ Retain & Reward. VIP treatment, exclusive access, dedicated account manager. Never send generic promotions.
Segment 1

Loyal Customers

Customers976 (16.6%)
Avg Recency43 days
Avg Frequency14 orders
Avg TotalSpend£5,659
Avg CLV 90d£139
Avg P(Alive)95%
🔵 Nurture & Grow. Personalised recommendations and moderate incentives. Some can become Champions.
Segment 2

At Risk

Customers2,382 (40.6%)
Avg Recency104 days
Avg Frequency4 orders
Avg TotalSpend£1,298
Avg CLV 90d£45
Avg P(Alive)88%
🟠 Win-Back Campaign. Time-limited offer within 2 weeks before they drift to Lost/Dormant.
Segment 0

Lost / Dormant

Customers2,342 (39.9%)
Avg Recency376 days
Avg Frequency1.7 orders
Avg TotalSpend£561
Avg CLV 90d£9
Avg P(Alive)30%
🔴 Low-cost reactivation only. 70% probability already churned. One bulk email max.
The Pareto insight: Champions are just 2.8% of the customer base but their average 90d CLV (£1,148) is 128× higher than Lost/Dormant (£9). Losing 10 Champions has the same revenue impact as losing 1,280 dormant customers.
08 — Churn Prediction

XGBoost Churn Classifier

We trained an XGBoost classifier to predict churn probability for every customer. XGBoost builds hundreds of decision trees sequentially, each correcting the mistakes of the previous one — a technique called gradient boosting.

Churn Label Definition

Churned = 1 IF Recency > 90 days AND P(Alive) < 0.5 Churned = 0 otherwise Churn rate: 10.5% (613 out of 5,862 customers)
Why a hybrid label? Using recency alone would mislabel slow buyers as churned. A customer who naturally orders every 4 months has a 120-day recency — but they're not churned, they're just slow. Adding the BG/NBD P(Alive) check ensures we only label a customer as churned if both the time signal AND the probabilistic model agree.
0.9994ROC-AUC
99%Overall Accuracy
90%Churn Precision
98%Churn Recall

SHAP Feature Importance

RankFeatureDirectionBusiness meaning
1RecencyHigh → more churnSingle strongest signal. Customers who haven't bought recently are far more likely to churn.
2FrequencyHigh → less churnCustomers who buy regularly are much less likely to leave.
3PurchaseVelocityLow → less churnShort gaps between orders = less likely to churn.
4TenureLong + silent → more churnA long-standing customer who suddenly goes quiet is a strong churn signal.
5–10TotalSpend, Monetary etc.Minimal impactSpending behaviour has almost no predictive power for churn.
Key insight: Churn is driven by time-based signals, not spending. Don't wait for a customer's order value to drop before acting. Watch recency and purchase velocity. When a customer breaks their natural buying rhythm — act immediately.
09 — Uplift Modelling

T-Learner Promotion Response

A regular conversion model tells you who will buy. Uplift modelling tells you who will buy because of the promotion — not just buy anyway. This distinction is worth millions in reduced promotional waste.

The 4 Customer Types

Type
Without Promo
With Promo
Uplift Score
Action
Persuadables
Won't buy
Will buy
High +
✅ Target these
Sure Things
Will buy
Will buy
Near zero
❌ Wasted money
Lost Causes
Won't buy
Won't buy
Near zero
❌ Don't bother
Sleeping Dogs
Will buy
Won't buy
Negative −
❌ Never target

How T-Learner Works

Step 1: Split data into two groups Treatment group → customers who received the promotion Control group → customers who did not Step 2: Train two separate XGBoost models μ1(x) = P(convert | received promotion, features x) μ0(x) = P(convert | no promotion, features x) Step 3: For each new customer calculate uplift score Uplift = μ1(customer) − μ0(customer) Step 4: Rank customers by uplift score — target the top N%
The Promotion Experiment (Simulated): In a real business you would use actual A/B test logs from your CRM. In this project we simulated a randomised controlled trial (RCT). Treatment was randomly assigned using a 50/50 coin flip. Conversion probability was modelled as a function of CLV and recency, with uplift highest for mid-range customers (the inverted-U formula) — reflecting the real insight that Champions buy anyway and dormant customers don't respond regardless.
10 — Business Impact Simulation

ROI Comparison of Targeting Strategies

We simulated targeting the top 20% of customers using three different strategies and compared promotional ROI. This translates model outputs into the financial language that businesses care about.

Assumptions: Promotion cost per customer = £2.00 Average revenue per conversion = £35.00 Target = top 20% of customers ROI = (Revenue − Cost) / Cost × 100%
Strategy
Revenue
Cost
ROI
🎯 Uplift targeting (top 20% by uplift score)
£560
£468
+19.7%
📊 Random baseline (no targeting)
£350
£468
−25.2%
💰 CLV targeting (top 20% by CLV)
£175
£468
−62.6%
Key insight — high CLV ≠ persuadable: CLV targeting performed worst because Champions (high CLV) buy anyway — they don't need a promotion. The uplift model correctly identified mid-range customers who genuinely respond, achieving the only profitable ROI. This is the core value proposition of uplift modelling over simple conversion or CLV targeting.
🏆

Use uplift scores to target

Only send promotions to customers with high positive uplift scores. These are the Persuadables — the promotion genuinely moves the needle for them.

🛡️

Protect Champions differently

Champions don't need discounts — they buy anyway. Invest in loyalty programmes, exclusive access, and personalised service instead.

Act fast on At Risk

At Risk customers still have meaningful CLV (£45 average). A targeted win-back campaign now costs far less than losing them permanently.