Join the Lab โ†’
Blog โ†ณ SQL Projects โ†ณ Power BI Projects โ†ณ Excel Projects โ†ณ Data Insights Projects Tools About Join the Lab โ†’
Contoso Retail Data Analysis with MySQL
SQL Projects MySQL Advanced Analytics March 2026

Contoso Retail: End-to-End SQL Analysis

A full deep-dive into a global retail data warehouse โ€” covering sales trends, product performance, customer behaviour, regional analysis, RFM segmentation, CLV, cohort retention, and ABC product classification.

๐Ÿ“ Dataset: Contoso Retail DW
๐Ÿ›  Tool: MySQL Workbench
๐Ÿ“‹ Queries: 10
โฑ Read: 15 min
๐Ÿ—„๏ธ MySQL Workbench
๐Ÿ“ CTEs
๐ŸชŸ Window Functions
๐Ÿ”— Multi-table JOINs
โšก CASE WHEN
๐Ÿ“Š UNION ALL
๐Ÿงฎ NTILE & PERCENT_RANK

The Dataset

Contoso Retail is Microsoft’s sample data warehouse โ€” a simulated global electronics retailer with sales data spanning 2007 to 2009. The database is structured as a classic star schema with two fact tables and seven dimension tables, covering everything from in-store transactions to online orders, customer demographics, and product hierarchies.

This project goes beyond basic aggregation. The goal was to ask 10 business questions that progressively build from surface-level sales trends to advanced customer analytics โ€” the kind of analysis that would actually sit in front of a leadership team.

Database Schema
factsales
In-store, catalog & reseller transactions
factonlinesales
Online channel transactions
dimcustomer
Customer profiles โ€” B2B & B2C
dimproduct
Product details & pricing
dimproductcategory
8 product categories
dimchannel
Store, Online, Reseller, Catalog
dimstore
Store locations globally
dimgeography
Country & continent mapping
dimproductsubcategory
Product subcategory bridge
$12.4bn
Total Revenue
2007โ€“2009
56.8%
Avg Profit Margin
All channels
77.3%
12-Month Retention
Jan 2007 cohort
26%
Products โ†’ 70% Rev
ABC A-tier

Q1 โ€” Sales & Profit Trends Over Time

Business Question

How have monthly sales and profit trended over time โ€” broken out by the physical store channels and the online channel separately?

Contoso has two separate fact tables โ€” factsales covers store, reseller and catalog channels, while factonlinesales covers online exclusively. Both need to be queried independently to get an accurate channel split.

SQL โ€” Query 1a (Store/Catalog/Reseller)
SELECT
    YEAR(datekey)                                    AS year,
    MONTH(datekey)                                   AS month,
    COUNT(DISTINCT SalesKey)                         AS total_orders,
    ROUND(SUM(salesamount))                          AS total_revenue,
    ROUND(SUM(salesamount) - SUM(TotalCost))         AS total_profit
FROM factsales
GROUP BY 1, 2
ORDER BY 1, 2;

Results โ€” Store Channel (Selected Months)

YearMonthOrdersRevenueProfit
2007Jan111,474$270M$152M
2007Jun128,203$312M$176M
2007Dec156,890$381M$215M
2008Jan109,432$266M$150M
2008Dec148,772$361M$204M
2009Dec141,205$343M$194M
๐Ÿ“ˆ
Key Finding

December consistently outperforms all other months by a wide margin across all three years โ€” a clear seasonal peak. The online channel drives significantly higher order volume (357K orders in Jan 2007 vs 111K for stores) but much lower revenue per order, reflecting the different product mix and customer type online vs in-store.

Q2 โ€” Product Category Performance

Business Question

Which product categories generate the highest revenue and profit? Are there categories with surprisingly high or low margins?

This query uses a CTE to union both fact tables first, then joins across the product hierarchy (product โ†’ subcategory โ†’ category) to get revenue and margin at the category level.

SQL โ€” Query 2
WITH product_category_data AS (
    SELECT p.productkey, pc.ProductCategoryName,
           fs.salesamount, fs.TotalCost
    FROM factsales fs
    JOIN dimproduct p USING (ProductKey)
    JOIN dimproductsubcategory ps USING (ProductSubcategoryKey)
    JOIN dimproductcategory pc USING (ProductCategoryKey)
    UNION ALL
    SELECT p.productkey, pc.ProductCategoryName,
           os.salesamount, os.TotalCost
    FROM factonlinesales os
    JOIN dimproduct p USING (ProductKey)
    JOIN dimproductsubcategory ps USING (ProductSubcategoryKey)
    JOIN dimproductcategory pc USING (ProductCategoryKey)
)
SELECT
    ProductCategoryName,
    ROUND(SUM(salesamount) / 1e9, 2)                AS revenue_bn,
    ROUND((SUM(salesamount) - SUM(TotalCost))
          / 1e9, 2)                                 AS profit_bn,
    ROUND((SUM(salesamount) - SUM(TotalCost))
          / SUM(salesamount) * 100, 1)              AS profit_margin_pct
FROM product_category_data
GROUP BY 1
ORDER BY profit_bn DESC;

Results

CategoryRevenueProfitMargin %
Home Appliances$3.9bn$2.2bn56.4%
Computers$3.2bn$1.8bn56.3%
Cameras & Camcorders$2.6bn$1.5bn57.3%
TV and Video$1.4bn$0.7bn56.8%
Cell Phones$0.9bn$0.5bn57.1%
Music, Movies & Audio Books$0.2bn$0.1bn56.9%
Audio$0.2bn$0.1bn57.0%
Games and Toys$0.1bn$0.06bn48.7%
๐Ÿ’ก
Key Finding

Margins are remarkably consistent across all categories โ€” hovering between 56.3% and 57.3% โ€” which suggests highly centralised pricing discipline across the business. The one clear outlier is Games & Toys at 48.7%, a full 8 points below every other category. This is worth investigating โ€” it could point to higher cost of goods, heavier discounting, or a pricing strategy misalignment.

Q3 โ€” Customer Segmentation: B2B vs B2C

Business Question

Who are the top customers by total spend? And how does revenue compare between B2B (Company) and B2C (Person) customer types?

The dimcustomer table has a CustomerType column that distinguishes Company (B2B) from Person (B2C) customers. Grouping on this reveals a very clear revenue split.

SQL โ€” Query 3b (B2B vs B2C)
SELECT
    c.CustomerType,
    COUNT(DISTINCT os.SalesOrderNumber)  AS total_orders,
    ROUND(SUM(os.salesamount) / 1e9, 2)   AS revenue_bn
FROM factonlinesales os
JOIN dimcustomer c USING (CustomerKey)
GROUP BY 1
ORDER BY revenue_bn DESC;

Results

Customer TypeTotal OrdersRevenue
Company (B2B)9.8M$2.03bn
Person (B2C)2.8M$0.69bn
๐Ÿข
Key Finding

B2B customers generate 3ร— more revenue than B2C despite being a smaller customer group. The top individual spender โ€” CS586 โ€” alone accounted for $19.1M in purchases. All top 20 customers by spend are Company type. This signals that Contoso’s real engine is its corporate client base, not retail consumers โ€” a critical insight for where to direct sales and retention effort.

Q4 โ€” Sales Channel Analysis

Business Question

How does revenue and profitability compare across Contoso’s four sales channels โ€” Store, Online, Reseller, and Catalog?

Since Store/Reseller/Catalog data lives in factsales and Online lives in factonlinesales, I use a UNION ALL inside a CTE to combine them before aggregating by channel.

SQL โ€” Query 4
WITH all_sales AS (
    SELECT ch.ChannelName, fs.salesamount, fs.TotalCost
    FROM factsales fs
    JOIN dimchannel ch USING (ChannelKey)
    UNION ALL
    SELECT 'Online' AS ChannelName,
           os.salesamount, os.TotalCost
    FROM factonlinesales os
)
SELECT
    ChannelName,
    ROUND(SUM(salesamount) / 1e9, 2)                AS revenue_bn,
    ROUND((SUM(salesamount) - SUM(TotalCost))
          / SUM(salesamount) * 100, 1)              AS profit_margin_pct
FROM all_sales
GROUP BY 1
ORDER BY revenue_bn DESC;

Results

ChannelRevenueProfit Margin
Store$6.9bn56.9%
Online$2.7bn56.6%
Reseller$1.7bn56.8%
Catalog$1.1bn56.7%
๐Ÿฌ
Key Finding

Store dominates at $6.9bn โ€” 2.5ร— larger than Online. But the more striking finding is the margin consistency: all four channels sit within 0.3 percentage points of each other (56.6%โ€“56.9%). This level of uniformity across radically different sales models is unusual and points to very disciplined central pricing. Channel choice doesn’t affect profitability โ€” only volume differs.

Q5 โ€” Regional Sales Analysis

Business Question

Which countries contribute the most to revenue and profit, across all channels combined?

SQL โ€” Query 5
WITH all_sales AS (
    SELECT g.RegionCountryName, fs.salesamount, fs.TotalCost
    FROM factsales fs
    JOIN dimstore st USING (StoreKey)
    JOIN dimgeography g USING (GeographyKey)
    UNION ALL
    SELECT g.RegionCountryName, os.salesamount, os.TotalCost
    FROM factonlinesales os
    JOIN dimcustomer c USING (CustomerKey)
    JOIN dimgeography g USING (GeographyKey)
)
SELECT
    RegionCountryName,
    ROUND(SUM(salesamount) / 1e9, 2)  AS revenue_bn,
    ROUND((SUM(salesamount) - SUM(TotalCost))
          / SUM(salesamount) * 100, 1) AS margin_pct
FROM all_sales
GROUP BY 1
ORDER BY revenue_bn DESC
LIMIT 10;

Results (Top 5)

CountryRevenueMargin
United States$7.0bn56.8%
China$1.7bn56.9%
Germany$811M56.7%
United Kingdom$634M56.8%
France$521M56.7%
๐ŸŒ
Key Finding

The USA generates $7bn โ€” more than 4ร— the second-largest market (China at $1.7bn). Europe is fragmented across several mid-tier markets. Notably, a data quality issue surfaced here: “Germany” appeared twice in the raw results due to a trailing whitespace in the source data โ€” a good reminder to always validate grouping columns before reporting.

Q6 โ€” RFM Customer Segmentation

Business Question

Can we segment B2C customers by purchasing behaviour โ€” how recently they bought, how often, and how much they spent?

RFM (Recency, Frequency, Monetary) is a classic segmentation framework. Each customer gets a score from 1โ€“5 on each dimension using NTILE(5) window functions, then those scores are combined to assign a segment label.

Two important decisions made here: First, I used MAX(datekey) from the table as the reference date instead of CURRENT_DATE โ€” since this is historical data, using today’s date would make every customer look “inactive.” Second, I filtered to CustomerType = 'Person' and required a minimum of 2 purchases to exclude one-time buyers from the analysis.

SQL โ€” Query 6 (Simplified)
WITH rfm_base AS (
    SELECT
        os.customerkey,
        c.CustomerLabel,
        (SELECT MAX(datekey) FROM factonlinesales) AS dataset_max_date,
        MAX(os.datekey)                             AS last_purchase_date,
        COUNT(DISTINCT os.SalesOrderNumber)          AS frequency,
        ROUND(SUM(os.salesamount))                   AS monetary
    FROM factonlinesales os
    JOIN dimcustomer c USING (CustomerKey)
    WHERE c.CustomerType = 'Person'
    GROUP BY 1, 2
    HAVING COUNT(DISTINCT os.SalesOrderNumber) >= 2
),
rfm_scored AS (
    SELECT customerkey, CustomerLabel,
        DATEDIFF(dataset_max_date, last_purchase_date) AS recency_days,
        frequency, monetary,
        NTILE(5) OVER (
            ORDER BY DATEDIFF(dataset_max_date, last_purchase_date) ASC
        )                                              AS R,
        NTILE(5) OVER (ORDER BY frequency DESC)      AS F,
        NTILE(5) OVER (ORDER BY monetary DESC)       AS M
    FROM rfm_base
)
SELECT *, CONCAT(R,F,M) AS rfm_score,
    CASE
        WHEN R >= 4 AND F >= 4 AND M >= 4  THEN 'Champions'
        WHEN R >= 3 AND F >= 3              THEN 'Loyal Customers'
        WHEN R >= 4 AND F <= 2              THEN 'Recent Customers'
        WHEN R <= 2 AND F >= 3 AND M >= 3  THEN 'At Risk'
        WHEN R <= 2 AND F <= 2 AND M <= 2  THEN 'Lost'
        ELSE                                      'Potential Loyalists'
    END                                       AS customer_segment
FROM rfm_scored
ORDER BY monetary DESC;

Segment Summary (4,106 B2C customers)

SegmentCustomers% of BaseAvg SpendAvg Orders
Recent Customers1,47135.8%$56,242151
Lost1,46835.8%$61,853120
Potential Loyalists99724.3%$58,604117
Loyal Customers1704.1%$45,55872
Champions0โ€”โ€”โ€”
๐Ÿ‘ฅ
Key Finding

No Champions segment exists โ€” Contoso has no B2C customers who score highly on all three dimensions simultaneously. Only 4.1% qualify as Loyal Customers. The base is split almost equally between Recent Customers and Lost customers, which tells a transactional story: customers buy, then disappear. This is consistent with the B2B dominance finding โ€” B2C is not where Contoso’s retention strength lies.

Q7 โ€” Customer Lifetime Value (CLV)

Business Question

Which customers have the highest estimated lifetime value, and what does the CLV distribution look like across the customer base?

CLV here is estimated as annualised revenue โ€” total spend divided by tenure in years. This gives a like-for-like comparison between customers who’ve been active for different lengths of time.

SQL โ€” Query 7
SELECT
    c.CustomerKey, c.CustomerLabel, c.CustomerType,
    COUNT(DISTINCT os.SalesOrderNumber)              AS total_orders,
    ROUND(SUM(os.salesamount))                       AS total_spend,
    ROUND(
        DATEDIFF(MAX(os.datekey), MIN(os.datekey))
        / 365.0, 2)                                  AS tenure_years,
    ROUND(
        SUM(os.salesamount) /
        NULLIF(DATEDIFF(MAX(os.datekey),
               MIN(os.datekey)) / 365.0, 0))        AS estimated_clv
FROM factonlinesales os
JOIN dimcustomer c USING (CustomerKey)
GROUP BY 1, 2, 3
ORDER BY estimated_clv DESC
LIMIT 5;

Results (Top 5 by CLV)

CustomerTypeTotal SpendTenure (yrs)Est. CLV
Top CustomerPerson$5.6M8.45$664K/yr
2ndPerson$4.9M8.12$603K/yr
3rdPerson$4.7M8.30$566K/yr
4thPerson$4.4M8.01$549K/yr
5thPerson$4.2M8.20$512K/yr
๐Ÿ’Ž
Key Finding

Interestingly, all top CLV customers are Person (B2C) type โ€” not Company. While B2B dominates total revenue, it’s the long-tenured individual customers who generate the highest annualised value. These are customers who’ve been buying consistently for over 8 years. Their CLV is so high because they never churned โ€” a valuable segment to understand and protect.

Q8 โ€” Customer Demographics

Business Question

How does revenue break down by age group and gender? Are there demographic patterns worth acting on?

Age is calculated using 2009 as the reference year (the dataset’s end date) โ€” not CURRENT_DATE, which would make everyone appear much older than they actually were as customers.

SQL โ€” Query 8
WITH customer_demo AS (
    SELECT
        c.CustomerType, c.Gender,
        CASE
            WHEN (2009 - YEAR(c.BirthDate)) < 35  THEN 'Young Adult'
            WHEN (2009 - YEAR(c.BirthDate)) < 55  THEN 'Middle Age'
            ELSE                                       'Senior'
        END                                        AS AgeGroup,
        os.salesamount, os.TotalCost
    FROM factonlinesales os
    JOIN dimcustomer c USING (CustomerKey)
)
SELECT
    CustomerType, Gender, AgeGroup,
    ROUND(SUM(salesamount) / 1e9, 2)  AS revenue_bn,
    ROUND((SUM(salesamount) - SUM(TotalCost))
          / SUM(salesamount) * 100, 1) AS margin_pct
FROM customer_demo
GROUP BY 1, 2, 3
ORDER BY revenue_bn DESC;
๐Ÿ‘ค
Key Finding

The Senior age group dominates B2C spending by a wide margin โ€” consistent with the long-tenure high-CLV customers identified in Q7. Company customers (B2B) show NULL gender, which is expected โ€” businesses don’t have a gender field. Worth noting for any dashboard: filtering out NULL gender or labelling it correctly avoids misleading visualisations.

Q9 โ€” Cohort Retention Analysis

Business Question

Of customers who first purchased in a given month, what percentage came back in subsequent months?

Cohort analysis groups customers by their first purchase month, then tracks what percentage of each group returned in every following month. This is one of the most powerful ways to understand retention and product stickiness.

The query has three layers: identify each customer’s cohort month, track their monthly activity, then join back to the cohort size to calculate retention rate.

SQL โ€” Query 9
WITH first_purchase AS (
    SELECT CustomerKey,
        DATE_FORMAT(MIN(datekey), '%Y-%m-01') AS cohort_month
    FROM factonlinesales
    GROUP BY 1
),
monthly_activity AS (
    SELECT DISTINCT CustomerKey,
        DATE_FORMAT(datekey, '%Y-%m-01')      AS activity_month
    FROM factonlinesales
),
cohort_data AS (
    SELECT
        fp.cohort_month, ma.activity_month,
        TIMESTAMPDIFF(MONTH, fp.cohort_month,
                      ma.activity_month)           AS month_number,
        COUNT(DISTINCT ma.CustomerKey)           AS active_customers
    FROM first_purchase fp
    JOIN monthly_activity ma USING (CustomerKey)
    GROUP BY 1, 2, 3
)
SELECT
    cd.cohort_month,
    cs.cohort_customers,
    cd.month_number,
    cd.active_customers,
    ROUND(cd.active_customers
          / cs.cohort_customers * 100, 1)         AS retention_rate
FROM cohort_data cd
JOIN cohort_size cs USING (cohort_month)
ORDER BY cd.cohort_month, cd.month_number;

Results โ€” January 2007 Cohort (13,140 customers)

Month #Active CustomersRetention Rate
Month 112,66796.4%
Month 311,84290.1%
Month 610,91283.1%
Month 910,52180.1%
Month 1210,15377.3%
๐ŸŽฏ
Strongest Finding in the Entire Analysis

77.3% of the January 2007 cohort was still active 12 months later. To put this in context โ€” a 12-month retention rate above 50% is considered strong in most retail contexts. 77% is exceptional. This finding fundamentally changes the narrative: Contoso’s B2C customers may not look “loyal” by RFM metrics, but once they buy, they keep buying for years. The low Champions count isn’t a churn problem โ€” it’s a scoring artefact of a high-frequency, consistent customer base.

Q10 โ€” ABC Product Classification

Business Question

Which products are driving the majority of revenue? Can we classify the catalogue into tiers to focus inventory and marketing spend?

ABC analysis classifies products by their cumulative revenue contribution โ€” A tier drives the first 70%, B tier the next 20%, and C tier the remaining 10%. It uses a running sum window function to calculate cumulative revenue before applying the classification logic.

SQL โ€” Query 10 (Key Logic)
WITH ranked AS (
    SELECT ProductKey, ProductName, total_revenue,
        SUM(total_revenue) OVER ()                  AS grand_total,
        SUM(total_revenue) OVER (
            ORDER BY total_revenue DESC
            ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
        )                                            AS cumulative_revenue
    FROM product_totals
)
SELECT *,
    CASE
        WHEN cumulative_revenue / grand_total <= 0.70 THEN 'A'
        WHEN cumulative_revenue / grand_total <= 0.90 THEN 'B'
        ELSE                                               'C'
    END                                              AS abc_tier
FROM ranked
ORDER BY total_revenue DESC;

Results

TierProducts% of Catalogue% of Revenue
A โ€” Core66126.3%70%
B โ€” Supporting66526.4%20%
C โ€” Long Tail1,19047.3%10%
๐Ÿ“ฆ
Key Finding

Textbook Pareto: 26% of products (A tier) drive 70% of all revenue. Nearly half the product catalogue (C tier, 1,190 products) contributes only 10% of revenue. This has direct implications for inventory planning, marketing spend, and even catalogue rationalisation. The C-tier products deserve scrutiny โ€” some may be strategic range-fillers, but others may simply be dead weight consuming warehouse space and procurement effort.

Summary of Findings

1
B2B is the Revenue Engine

Company customers generate 3ร— more revenue than Person customers. All top 20 spenders by total spend are B2B. Retention and account management effort should be disproportionately directed at the corporate segment.

2
Margin is Uniform Across Every Dimension

Whether you cut by channel, category, or country โ€” margin sits within a 1% band around 56.8%. This is unusual and suggests extremely centralised pricing. It also means growth decisions should be driven entirely by volume, not margin optimisation.

3
B2C Retention is Exceptional

77.3% 12-month cohort retention for B2C customers is a standout number. The business has strong customer stickiness once acquired โ€” the challenge is acquisition and reactivation, not retention itself.

4
Half the Catalogue Drives Almost Nothing

47% of products generate only 10% of revenue. A catalogue rationalisation exercise โ€” or at minimum, redirecting marketing budget toward A-tier products โ€” could meaningfully improve efficiency without touching revenue.

5
Games & Toys is the Margin Anomaly

Every other category sits between 56.3โ€“57.3% margin. Games & Toys at 48.7% is 8 points lower. This warrants a pricing review โ€” either costs are too high or the category is being sold at a structural discount that isn’t justified.

View the Full Project on GitHub

All 10 SQL queries with comments, CTEs, window functions, and results.

View on GitHub โ†’

Leave a Comment

Your email address will not be published. Required fields are marked *