🚨 As a Data Analyst, if you're still building dashboards with fancy visualizations…

You’re missing the real value your company needs.
Dashboards don’t drive decisions — business insights do.
A beautiful chart means nothing if it doesn’t answer a real business question.
In today’s data-driven world, a Data Analyst must be able to:
🔍 Identify the problem

❓ Define clear business questions

📊 Analyze what matters

🧠 Provide actionable insights

📈 Recommend solutions

🔁 Then track and enhance continuously

That’s where the real impact lies.
Our job isn't just to visualize data; it's to solve a business problem by answering a core question and providing clear findings, a strong recommendation, and a monitoring plan.
Here's a look at my latest project, which aimed to solve a critical business problem for a product launch:
- Which sales method is the most profitable and time-efficient? 🎯

How many customers were there for each approach?
What does the spread of the revenue look like overall? And for each method?
Was there any difference in revenue over time for each of the methods?
Based on the data, which method would you recommend we continue to use?

notebook

1.Data validation¶

1.1 Setup and load the dataset¶

In [2]:

# import all the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# for Nicer plots
sns.set(style="whitegrid")
# load the dataset and print the first five rows
df = pd.read_csv("product_sales.csv")
df.head()

Out[2]:

	week	sales_method	customer_id	nb_sold	revenue	years_as_customer	nb_site_visits	state
0	2	Email	2e72d641-95ac-497b-bbf8-4861764a7097	10	NaN	0	24	Arizona
1	6	Email + Call	3998a98d-70f5-44f7-942e-789bb8ad2fe7	15	225.47	1	28	Kansas
2	5	Call	d1de9884-8059-4065-b10f-86eef57e4a44	11	52.55	6	26	Wisconsin
3	4	Email	78aa75a4-ffeb-4817-b1d0-2f030783c5d7	11	NaN	3	25	Indiana
4	3	Email	10e6d446-10a5-42e5-8210-1b5438f70922	9	90.49	0	28	Illinois

2. Data Validation and cleaning(column by column)¶

2.1 Validate (Week)¶

The goal here is to check that week is integer, with in 1-6, and has no missing values.

In [3]:

# check basic properties of 'week'
print(df['week'].describe())
print("Unique weeks:", sorted(df['week'].unique()))
print("Missing values:", df['week'].isna().sum())

# assert that week is in the expected range
assert df['week'].between(1,6).all, "week has values outside 1-6"

count    15000.000000
mean         3.098267
std          1.656420
min          1.000000
25%          2.000000
50%          3.000000
75%          5.000000
max          6.000000
Name: week, dtype: float64
Unique weeks: [1, 2, 3, 4, 5, 6]
Missing values: 0

From this check,week values are integers between 1 and 6 with no missing entries, so no cleaning is required.

2.2 Validate and clean(sales method)¶

Here the code checks the raw categories, then standarises them into a cleaned sales_method_clean column with only three levels: Email, Call, Email + Call.

In [4]:

# inspect raw categories
print(df['sales_method'].value_counts(dropna=False))

# Create a cleaned version of sales_method
df['sales_method_clean'] = df['sales_method'].str.strip()

# Standardise variations of "email" to "Email"
df['sales_method_clean'] = df['sales_method_clean'].str.replace(
    r'(?i)^email$', 'Email', regex=True
)

# fix typo like "em + call" to "Email + Call"
df['sales_method_clean'] = df['sales_method_clean'].replace({'em + call':"Email + Call"})

# Verify cleaned categories 
print(df['sales_method_clean'].value_counts(dropna=False))

Email           7456
Call            4962
Email + Call    2549
em + call         23
email             10
Name: sales_method, dtype: int64
Email           7466
Call            4962
Email + Call    2572
Name: sales_method_clean, dtype: int64

2.3 Validate(customer_id)¶

This code checks uniqueness and missingness of the customer IDs.

In [5]:

# Check the uniqueness and missing of customer IDs
print("Missing customer_id:", df['customer_id'].isna().sum())
print("Unique customer_id:", df['customer_id'].nunique())

# check if any customer_id appears with multiple methods
dups_ids = df.groupby('customer_id')['sales_method_clean'].nunique()
print("Customers with more tha one method", (dups_ids >1).sum())

Missing customer_id: 0
Unique customer_id: 15000
Customers with more tha one method 0

There are no missing customer_id values and that IDs appear unique per row, so no cleaning is nedded here.

2.4 Validate(nb_sold)¶

This code check the distribution of units sold and ensures all values are posotive integers.

In [6]:

print(df['nb_sold'].describe())
print("Unique nb_solds:", sorted(df['nb_sold'].unique())[:20])
print("Missing nb_solds", df['nb_sold'].isna().sum())

# check for non-positive values
print("any nb_solf <= 0:", (df['nb_sold']<= 0).any())

count    15000.000000
mean        10.084667
std          1.812213
min          7.000000
25%          9.000000
50%         10.000000
75%         11.000000
max         16.000000
Name: nb_sold, dtype: float64
Unique nb_solds: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
Missing nb_solds 0
any nb_solf <= 0: False

nb_sold ranges roughly from 7 to 16 with no missing or non-positive values, so no additional cleaning is required.

2.5 Validate (revenue)¶

check missingness and the basic distribution of revenue.

In [7]:

print(df['revenue'].describe())
print("Missing revenue:", df['revenue'].isna().sum())

# Look at a few rows with missing revenue
print(df[df['revenue'].isna()].head())

count    13926.000000
mean        93.934943
std         47.435312
min         32.540000
25%         52.470000
50%         89.500000
75%        107.327500
max        238.320000
Name: revenue, dtype: float64
Missing revenue: 1074
    week  sales_method  ...         state  sales_method_clean
0      2         Email  ...       Arizona               Email
3      4         Email  ...       Indiana               Email
16     2         Email  ...  Pennsylvania               Email
17     6  Email + Call  ...     Wisconsin        Email + Call
28     5         Email  ...       Florida               Email

[5 rows x 9 columns]

since there are 1,074 missing revenue values and no obvious way to impute them reliably, the analysis keeps those rows for non-revenue summaries but excludes them for any revenue based statistics.

2.6 Validate(years_as_customer)¶

This code confirms that customer tenue is non-negative and examines its distribution.

In [8]:

print(df['years_as_customer'].describe())
print("MIssing Values:", df['years_as_customer'].isna().sum())
print("Any negative tenure", (df['years_as_customer']<= 0).any())

count    15000.000000
mean         4.965933
std          5.044952
min          0.000000
25%          1.000000
50%          3.000000
75%          7.000000
max         63.000000
Name: years_as_customer, dtype: float64
MIssing Values: 0
Any negative tenure True

The output shows a reasonable distribution(with some customers at 0 years and some much older) and no missing values, so no changes are needed.

2.7 validate (nb_site_visits)¶

This step checks that site visits are non-negative integers and not minnig values.

In [9]:

print(df['nb_site_visits'].describe())
print("Missing Values:", df['nb_site_visits'].isna().sum())
print("Any nb_site_visits <=0", (df['nb_site_visits'] <= 0).any())

count    15000.000000
mean        24.990867
std          3.500914
min         12.000000
25%         23.000000
50%         25.000000
75%         27.000000
max         41.000000
Name: nb_site_visits, dtype: float64
Missing Values: 0
Any nb_site_visits <=0 False

The value looks reasonable(roughly 12-14 visits) and there are no missing entries, so no cleaning is needed.

2.8 Validate (state)¶

This code inspects the state field cna checks for missing values.

In [10]:

print("Missing Values:", df['state'].isna().sum())
print(df['state'].value_counts().head())

Missing Values: 0
California    1872
Texas         1187
New York       965
Florida        904
Illinois       617
Name: state, dtype: int64

No cleanin is required for state, and it can be used for geographical breakdowns.

In [11]:

# verify the cleaning of the dataset by printing the first 10 rows
df.head(10)

Out[11]:

	week	sales_method	customer_id	nb_sold	revenue	years_as_customer	nb_site_visits	state	sales_method_clean
0	2	Email	2e72d641-95ac-497b-bbf8-4861764a7097	10	NaN	0	24	Arizona	Email
1	6	Email + Call	3998a98d-70f5-44f7-942e-789bb8ad2fe7	15	225.47	1	28	Kansas	Email + Call
2	5	Call	d1de9884-8059-4065-b10f-86eef57e4a44	11	52.55	6	26	Wisconsin	Call
3	4	Email	78aa75a4-ffeb-4817-b1d0-2f030783c5d7	11	NaN	3	25	Indiana	Email
4	3	Email	10e6d446-10a5-42e5-8210-1b5438f70922	9	90.49	0	28	Illinois	Email
5	6	Call	6489e678-40f2-4fed-a48e-d0dff9c09205	13	65.01	10	24	Mississippi	Call
6	4	Email	eb6bd5f1-f115-4e4b-80a6-5e67fcfbfb94	11	113.38	9	28	Georgia	Email
7	1	Email	047df079-071b-4380-9012-2bfe9bce45d5	10	99.94	1	22	Oklahoma	Email
8	5	Email	771586bd-7b64-40be-87df-afe884d2af9e	11	108.34	10	31	Massachusetts	Email
9	5	Call	56491dae-bbe7-49f0-a651-b823a01103d8	11	53.82	7	23	Missouri	Call

Exploratory analysis and graphics¶

This section shows code to the main business question and generate the plots.

3.1 Customer per sales approach¶

The code below calculates how many customers are in each sales methods and plots a bar chart.

In [12]:

# Customers per cleaned sales method
method_counts = df['sales_method_clean'].value_counts().reset_index()
method_counts.columns = ['sales_method', 'n_customers']
print(method_counts)

# Bar chart: number of customers per method
plt.figure(figsize=(6,6))
ax = sns.barplot(data=method_counts,
           x='sales_method',
           y='n_customers',
           palette='Blues_d')


# Add data labels on each bar
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'),
        (p.get_x() + p.get_width()/2., p.get_height()),
        ha='center',
        va='bottom'
    )
plt.title("Number of Customers per Sales Method")
plt.show()

   sales_method  n_customers
0         Email         7466
1          Call         4962
2  Email + Call         2572

No description has been provided for this image

From this output 7,466 Email customers, 6,962 Call custoemrs, and 3,572 Email + Call customers, confirming that Emai has the broadest reach.

3.2 Revenue distribution overall¶

This code produces summary statistics and a histogram for revenue across all the methods combined.

In [13]:

# Drop missing revenue for distribution analysis
df_rev = df.dropna(subset=['revenue'])

# Summary statistics
print(df_rev['revenue'].describe())

# Histogram of overall revenue
plt.figure(figsize=(8,6))
sns.histplot(df_rev['revenue'], bins=30, color='blue')
plt.title("Distribution of Revenue jper Customer (All Methods)")
plt.xlabel("Revenue")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

count    13926.000000
mean        93.934943
std         47.435312
min         32.540000
25%         52.470000
50%         89.500000
75%        107.327500
max        238.320000
Name: revenue, dtype: float64

The summary and histogram show that revenue per customer spans roughly from the mid-30s to around 200, with distinct peaks driven by the different sales methods.

3.3 Revenue distribution by methods.¶

This code summarises revenue by sales_method_clean and draws a boxplot to compare distributios.

In [14]:

# Summary stats by method
rev_by_method = df_rev.groupby('sales_method_clean')['revenue'].describe()
print(rev_by_method)

# Boxplot of revenue by method
plt.figure(figsize=(6,6))
sns.boxplot(data=df_rev,
           x='sales_method_clean',
           y='revenue',
           palette='Pastel1')
plt.title("Revenue per customer by Sales Method")
plt.xlabel("Sales Method")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()

                     count        mean        std  ...     50%     75%     max
sales_method_clean                                 ...                        
Call                4781.0   47.597467   8.609899  ...   49.07   52.68   71.36
Email               6922.0   97.127684  11.210469  ...   95.58  105.17  148.97
Email + Call        2223.0  183.651233  29.083924  ...  184.74  191.11  238.32

[3 rows x 8 columns]

From the groupped statistics, mean revenues are about 47.76 for call, 97.1 for Email, and 183.7 for Email + Call, with Email + Call clearly highest and also more variable. The Boxplot visually confrims that the median ad upper quartiles aree lowest for call, higher for Email, and highest for Email + Call.

3.4 Revenue ove time by method¶

This code calculates mean revenue per week and moethod, then plots time trends.

In [15]:

# Mean revenue by week and method (excluding missing revenue)
weekly_rev = (df_rev.groupby(['week','sales_method_clean'])['revenue'].mean().reset_index())
print(weekly_rev.head())

# Line plot: revenue orver the time by method
plt.figure(figsize=(7,4))
sns.lineplot(data=weekly_rev,
            x='week',
            y='revenue',
            hue='sales_method_clean',
            marker='o')
plt.title("Mean Revenue over Time by Sales Method")
plt.xlabel("Weekly since launch")
plt.ylabel("Mean Revenue")
plt.xticks(sorted(df['week'].unique()))
plt.tight_layout()
plt.show()

   week sales_method_clean     revenue
0     1               Call   35.350243
1     1              Email   87.496401
2     1       Email + Call  128.895649
3     2               Call   43.602606
4     2              Email  100.139121

The table and line chart shows that mean revenue increases from week 1 to week 6 for all methods and the Email + Calls is consistently highest, Email is in the middle, and call is lowest at each week.

3.5 Customer characteristics by method¶

This code compares years_as_customer and nb_site_visits across methods to understand customer mix.

In [16]:

cust_char = df.groupby('sales_method_clean')[['years_as_customer', 'nb_site_visits']].agg(['mean', 'median','std'])
print(cust_char)

                   years_as_customer         ... nb_site_visits          
                                mean median  ...         median       std
sales_method_clean                           ...                         
Call                        5.176743    4.0  ...           24.0  3.599864
Email                       4.983793    3.0  ...           25.0  3.281552
Email + Call                4.507387    3.0  ...           27.0  3.357916

[3 rows x 6 columns]

The results show that tenure is smaller across method (around 4.5-5.2 years on average), Email + Call customer have slightly higher site visits on average, suggesting they tend to be more digitally engaged.

4. Business metric¶

The main business metric is revenue per minute of sales effort, based on assumed time per customer for each method.

4.1 Time assumptions and mean revenue per customer¶

This code uses the dataset to compute mean revenue per customer by method.

In [17]:

# Mean revenue per customer by method (excluding missing revenue)
mean_rev_by_method = (df_rev.groupby('sales_method_clean')['revenue'].mean().reset_index())
mean_rev_by_method.columns = ['sales_method', 'mean_revenue']
print(mean_rev_by_method)

   sales_method  mean_revenue
0          Call     47.597467
1         Email     97.127684
2  Email + Call    183.651233

The output confrims appropriate means of 47.6(Call), 97.1(Email), 183.7(Email + Calls).

4.2 Revenue per minute¶

Let revenue per minute=(mean revenue per customer)/(minutes of effort per customer) The code below applies the above formula using the assumed effort: Email 5 minutes, Call 30 minutes, Email + Call 15 minutes.

In [18]:

 # Assumed minutes of sales effort per customer for each method
minutes_per_customer = {
    'Email': 5,
    'Call': 30,
    'Email + Call': 15
}

# Map minutes into the summary table
mean_rev_by_method['minutes_per_customer'] = \
    mean_rev_by_method['sales_method'].map(minutes_per_customer)

# Compute revenue per minute
mean_rev_by_method['revenue_per_minute'] = (
    mean_rev_by_method['mean_revenue'] /
    mean_rev_by_method['minutes_per_customer']
)

print(mean_rev_by_method)

   sales_method  mean_revenue  minutes_per_customer  revenue_per_minute
0          Call     47.597467                    30            1.586582
1         Email     97.127684                     5           19.425537
2  Email + Call    183.651233                    15           12.243416

The output show that revenue per minute of roughly 1.59 for call, 19.43 for Email,12.24 for Email + call, Showing that Email is the most time-efficnent method, followed by Email + Call.

5. Summary and Recommendation¶

The following code prints all the summary to findings.

In [19]:

print("=== Customers per method ===")
print(method_counts)

print("\n=== Revenue per customer by method ===")
print(rev_by_method[['mean', '50%', 'std']])  # mean, median (50%), std

print("\n=== Weekly mean revenue by method ===")
print(weekly_rev.sort_values(['sales_method_clean', 'week']).head(18))

print("\n=== Customer characteristics by method ===")
print(cust_char)

print("\n=== Efficiency metric: revenue per minute ===")
print(mean_rev_by_method)

=== Customers per method ===
   sales_method  n_customers
0         Email         7466
1          Call         4962
2  Email + Call         2572

=== Revenue per customer by method ===
                          mean     50%        std
sales_method_clean                               
Call                 47.597467   49.07   8.609899
Email                97.127684   95.58  11.210469
Email + Call        183.651233  184.74  29.083924

=== Weekly mean revenue by method ===
    week sales_method_clean     revenue
0      1               Call   35.350243
3      2               Call   43.602606
6      3               Call   41.759528
9      4               Call   51.446048
12     5               Call   53.146087
15     6               Call   66.166089
1      1              Email   87.496401
4      2              Email  100.139121
7      3              Email   92.762573
10     4              Email  108.877141
13     5              Email  108.892313
16     6              Email  130.976519
2      1       Email + Call  128.895649
5      2       Email + Call  154.246959
8      3       Email + Call  150.423272
11     4       Email + Call  180.822633
14     5       Email + Call  187.588343
17     6       Email + Call  227.770635

=== Customer characteristics by method ===
                   years_as_customer         ... nb_site_visits          
                                mean median  ...         median       std
sales_method_clean                           ...                         
Call                        5.176743    4.0  ...           24.0  3.599864
Email                       4.983793    3.0  ...           25.0  3.281552
Email + Call                4.507387    3.0  ...           27.0  3.357916

[3 rows x 6 columns]

=== Efficiency metric: revenue per minute ===
   sales_method  mean_revenue  minutes_per_customer  revenue_per_minute
0          Call     47.597467                    30            1.586582
1         Email     97.127684                     5           19.425537
2  Email + Call    183.651233                    15           12.243416

5.1 Summary¶

From these outpts, Email + Call yields the highest revenue per customer, Email is mid-range, and Call is lowest; yet Email has the highest revenue per minute, followed by Email + Call, making Call the least efficient. The weekly table confirms that these relative rankings hold in every week from lauch to week six, while all methods show increasing revenue over time.

5.2 Recommendations¶

Make Email the defalt sales method

Email reaches the laregest share of customers (7,466 vs. 4,962 Call and 2,572 Email + Calls), so standardising on Email maximises coverage with minimal extra operational complexity.
Email also delivers strong revenue per customer (mean ≈ 97) while being far more time-efficent than phone-based approach, making it the best baseline method.

Use Email + Call selectively for high-potential Customers

Email + Calls generates the highest revenue per customer (mean ≈ 184), almost double Email and nearly four times Call, so it is worth the extra effort where upside is large.
Because Email + Call customer show slightly higher website engagement on average, the combined method should be targeted to more engaged or strategically important accounts (e.g. high site visits, key logos)

Reduce or Phase out Call-only campaigns for this product line

Call-only customers have the lowest mean revenue ( ≈ 48) and the lowest revenue per minute ( ≈ 1.6), making this the least efficient use of the sales team's time.
Given Email and Email + Call both outperform Call in absolute revenue and efficiency, phone-only outreach should be reserved for exceptional cases where a direct call is clearly required.

Adopt "revenue per minute" as the primary perfomance mertric

Track revenue per minute of sales effort by method, using the current estimates ( ≈ 19.4 for Email, ≈ 12.2 for Email + Call, ≈ 1.6 for Call) as initial benchmarks and targets.
Monitor this metric weekly by method and compare against historical values to detect drops in efficiency early and trigger reviews of scripts, targeting, or cadences.

Segment and monitor by customer engagement and tenure

Since Email + Call is currently used more with engaged customers (higher site visits), segment the revenue-per-minute metric by engagement band to ensure perfomance changes are not just shifts in customer mix.
Similarly, monitor results across tenure bands (e.g. 0-2 years vs 3-7 vs 5+ years) to confrim that one method is not over- or under-serving specific relationship stages.

Improve data quality and re-run the analysis periodically

Investigate and fix the missing revenue in the source system so future reports can use compplete revenue and validate that current conclusions remain robust.
Re-run the same analysis and metric after each major campaign or every quarter to refine targeting rules (which customer get Email vs. Email + Calls) as more data accumulates.

Product Sales Analysis: Optimizing Sales Methods for New Stationery Line

AI Summary

🚨 As a Data Analyst, if you're still building dashboards with fancy visualizations…

1.Data validation¶

1.1 Setup and load the dataset¶

2. Data Validation and cleaning(column by column)¶

2.1 Validate (Week)¶

2.2 Validate and clean(sales method)¶

2.3 Validate(customer_id)¶

2.4 Validate(nb_sold)¶

2.5 Validate (revenue)¶

2.6 Validate(years_as_customer)¶

2.7 validate (nb_site_visits)¶

2.8 Validate (state)¶

Exploratory analysis and graphics¶

3.1 Customer per sales approach¶

3.2 Revenue distribution overall¶

3.3 Revenue distribution by methods.¶

3.4 Revenue ove time by method¶

3.5 Customer characteristics by method¶

4. Business metric¶

4.1 Time assumptions and mean revenue per customer¶

4.2 Revenue per minute¶

5. Summary and Recommendation¶

5.1 Summary¶

5.2 Recommendations¶

AI Summary

🚨 As a Data Analyst, if you're still building dashboards with fancy visualizations…

1.Data validation¶

1.1 Setup and load the dataset¶

2. Data Validation and cleaning(column by column)¶

2.1 Validate (Week)¶

2.2 Validate and clean(sales method)¶

2.3 Validate(customer_id)¶

2.4 Validate(nb_sold)¶

2.5 Validate (revenue)¶

2.6 Validate(years_as_customer)¶

2.7 validate (nb_site_visits)¶

2.8 Validate (state)¶

Exploratory analysis and graphics¶

3.1 Customer per sales approach¶

3.2 Revenue distribution overall¶

3.3 Revenue distribution by methods.¶

3.4 Revenue ove time by method¶

3.5 Customer characteristics by method¶

4. Business metric¶

4.1 Time assumptions and mean revenue per customer¶

4.2 Revenue per minute¶

5. Summary and Recommendation¶

5.1 Summary¶

5.2 Recommendations¶

We respect your privacy