Tuesday, April 10, 2012

Advanced CRM Analytics

Recently, I did some research (web mining) on CRM analytics. To my surprise, I found that almost 95% of the websites talk about InfoCubes, OLAP cubes, CRM reports, C-SAT scores, sales reports, sales dashboards, marketing dashboards, sales charts, blah blah blah… Well, I feel CRM analytics is not all about creating glittery BI dashboards.

What else can we do apart from regular BI reporting and basic averages tracking?

1.   Market basket analysis/Affinity analysis/Association rules
  • Use historical data and relevant statistical measures to quantify the association between products, and build association rules accordingly. These rules are used for up-selling and cross-selling.
  • Example rule: A + B + C --> E + F (this rule states that if products A, B, and C are chosen, products E and F are proposed; E and F are proposed only when all three products are selected)
  • Statistical techniques
    • Cross tabulation, multiple response tables
    • Correlation coefficient, regression, odds ratio, etc.
    • Chi-square test of independence, concordance and discordance tables
    • Sequence analysis, link analysis, etc.
    • Bayes' theorem and conditional probabilities
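As a sketch of how such rules can be mined, here is a minimal pure-Python pass over a toy transaction log. The transactions and the support/confidence thresholds are invented for illustration; real market-basket work would run an Apriori-style algorithm over millions of baskets:

```python
from itertools import combinations

# Toy transaction log; each transaction is the set of products bought together.
transactions = [
    {"A", "B", "C", "E"},
    {"A", "B", "C", "E", "F"},
    {"A", "B", "C", "F"},
    {"A", "C"},
    {"B", "D"},
    {"A", "B", "C", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) = support(lhs + rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

# Mine simple rules of the form {x, y} --> {z} above chosen thresholds.
items = sorted(set().union(*transactions))
rules = []
for x, y in combinations(items, 2):
    for z in items:
        if z in (x, y):
            continue
        lhs, rhs = {x, y}, {z}
        if support(lhs) > 0 and support(lhs | rhs) >= 0.5 and confidence(lhs, rhs) >= 0.7:
            rules.append((lhs, rhs, confidence(lhs, rhs)))

for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"(confidence {conf:.2f})")
```

The same counting logic extends to longer rules such as A + B + C --> E + F; only the enumeration of candidate itemsets grows.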
2.    Customer segmentation & profiling:
Customer segmentation helps us with target marketing, loyalty & retention programs, up-selling/cross-selling, etc.
  • Identifying homogeneous customer segments that are similar in specific ways relevant to marketing such as
    • Demographics (Age, Gender, Education, Income, Home ownership, etc.)
    • Psychographics (Lifestyle, Attitude, Beliefs, Personality, Buying motives, etc.)
    • Geographics (State, ZIP, City size, Rural vs. Urban, etc.) and Brand loyalty
  • Statistical methods used:
    • Cluster Analysis(K-means clustering, Hierarchical clustering),
    • CHAID, CART etc.,
    • Logistic Regression & Discriminant Analysis
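A minimal sketch of K-means clustering, one of the methods listed above, on made-up customer records. The two features (age, annual spend) and the choice of k=2 are illustrative assumptions; in practice you would standardize many more variables:

```python
import math
import random

# Hypothetical customers described by (age, annual spend in $1000s).
customers = [
    (22, 5), (25, 7), (27, 6), (24, 4),      # younger, lower spend
    (55, 40), (60, 45), (58, 42), (52, 38),  # older, higher spend
]

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(customers, k=2)
for c, cl in zip(centroids, clusters):
    print("centroid", c, "->", len(cl), "customers")
```

Each resulting segment (e.g. "young, low spend") can then be profiled and targeted with its own campaign.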
3.   Customer lifetime value analysis (Customer scorecard building)
Build predictive models based on customer profile data and historical behavior to assess how likely a customer is to exhibit a specific behavior in the future, in order to improve sales.
The scorecard takes customer profile variables (like age, gender, education, income, home ownership, lifestyle, attitude, beliefs, personality, buying motives, brand loyalty, geography, state, ZIP, city size, rural vs. urban, spending patterns, etc.) as input and gives a simple score that indicates the customer's value to the company.
  • Statistical methods used:
    • Logistic & linear regression model building
    • Weight of evidence & information value
    • Trees and segmentation
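To illustrate the weight-of-evidence and information-value idea, here is a minimal pure-Python calculation on made-up counts of "good" (retained) and "bad" (churned) customers per income band; all the numbers are invented:

```python
import math

# Hypothetical counts of good and bad customers per income band.
bands = {
    "low":    {"good": 100, "bad": 60},
    "medium": {"good": 300, "bad": 40},
    "high":   {"good": 200, "bad": 10},
}

total_good = sum(b["good"] for b in bands.values())
total_bad = sum(b["bad"] for b in bands.values())

# WoE_i = ln(%good_i / %bad_i);  IV = sum_i (%good_i - %bad_i) * WoE_i
woe = {}
iv = 0.0
for name, b in bands.items():
    pct_good = b["good"] / total_good
    pct_bad = b["bad"] / total_bad
    woe[name] = math.log(pct_good / pct_bad)
    iv += (pct_good - pct_bad) * woe[name]

for name in bands:
    print(f"{name:>6}: WoE = {woe[name]:+.3f}")
print(f"Information Value = {iv:.3f}")
```

A band with negative WoE concentrates "bad" customers; the IV summarizes how predictive the whole variable is, so it is commonly used to shortlist scorecard inputs.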
4.   Customer Satisfaction analysis & drivers of customer satisfaction
  • Survey analysis: analysis of customer response data to find the overall C-SAT score and satisfaction by various cuts
  • Analysis of C-SAT time-series data: calculation of control limits, seasonality & trends, etc.
  • Calculation of Net Promoter Score & C-SAT peer comparison
  • Identification of the factors with the greatest impact on customer satisfaction
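The Net Promoter Score mentioned above, for example, takes only a few lines to compute from 0-10 "would you recommend us?" ratings (the sample ratings below are made up):

```python
# Hypothetical 0-10 survey responses.
ratings = [10, 9, 9, 8, 7, 10, 6, 3, 9, 8, 10, 5, 9, 7, 10]

promoters = sum(r >= 9 for r in ratings)          # ratings 9-10
detractors = sum(r <= 6 for r in ratings)         # ratings 0-6
passives = len(ratings) - promoters - detractors  # ratings 7-8

# NPS = % promoters - % detractors, reported on a -100..+100 scale.
nps = 100 * (promoters - detractors) / len(ratings)
print(f"promoters={promoters}, passives={passives}, detractors={detractors}, NPS={nps:.1f}")
```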
5.   Text mining
  • Analysis of customer verbatim data to gauge overall customer satisfaction
  • Effective algorithms to classify comments as positive, negative & neutral
  • Summarization of overall comments into main themes (most frequent topics) and their positive & negative frequencies
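A toy illustration of the positive/negative/neutral split, using a tiny made-up keyword lexicon. A real system would use a proper NLP model, but the counting logic is the same:

```python
# Tiny illustrative sentiment lexicons (invented for this sketch).
POSITIVE = {"great", "good", "excellent", "love", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "slow", "rude"}

def classify(comment):
    """Label a comment by counting positive vs. negative lexicon hits."""
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

comments = [
    "great product and helpful support",
    "terrible delivery and rude staff",
    "it arrived on tuesday",
]
counts = {"positive": 0, "negative": 0, "neutral": 0}
for c in comments:
    counts[classify(c)] += 1
print(counts)
```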

Conclusion: There are several other advanced analytics techniques; the five above are the most generic ones and can be applied in almost any business.


I wish to thank my CRM guru Mukul Biswas.





Saturday, April 7, 2012



Don't trust outliers
Outlier and missing-value treatment is indispensable in model building. In this post I am going to demonstrate how a few outliers can wreck your final model.
I will prove my point in fewer than 10 straightforward points.
1)   out·li·er - /ˈoutˌlīər/  (Noun)
1.       Outlier is an observation that is numerically distant from the rest of the data.
2.       A person or thing situated away or detached from the main body or system.
3.       A person or thing excluded from a group; an outsider.



2) Sample Datasets
The two datasets below have 34 observations each. I am going to build two regression models (Y on X).
   Dataset-1             Dataset-2
    X       Y             X       Y
 3409    2593          3409    2593
 2130    3872          2130    3872
  237    5773          2139    3864
 3973    2037          3973    2037
  393    5617           393    5617
 3726    2281          3726    2281
 1211    4794          1211    4794
 3987    2014          3987    2014
 3329    2680          3329    2680
 4285    1721          4285    1721
 3565    2439          3565    2439
 3863    2146          3863    2146
 1920    4083          1920    4083
 2573    3431          2573    3431
 2951    3051          2951    3051
 1284    4725          1284    4725
 2639    3371          2639    3371
 3150    2851          3150    2851
  620    5381           620    5381
 3703    2303          3703    2303
 3700    2305          3700    2305
 1135    4874          1135    4874
 2139    3864          5100   45000
 3145    2857          3145    2857
  561    5449           561    5449
 4191    1814          4191    1814
  631    5375           631    5375
 1325    4683          1325    4683
 2399    3606          2399    3606
 2881    3121          2881    3121
 3687    2323          3687    2323
 1610    4395          1610    4395
 1999    4002          1999    4002
  564    5444           564    5444
3) Regression Lines
•   Dataset-1: Regression line: Y = (-1)X + 6006.9. Inference: Y decreases as X increases.
•   Dataset-2: Regression line: Y = (1)X + 2137.8. Inference: Y increases as X increases.
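Both fits can be reproduced with ordinary least squares in plain Python; the points are the two datasets from the table above (Dataset-2 is Dataset-1 with (237, 5773) replaced by the outlier (5100, 45000)):

```python
# Dataset-1: the 34 (X, Y) points from the table above.
ds1 = [
    (3409, 2593), (2130, 3872), (237, 5773), (3973, 2037), (393, 5617),
    (3726, 2281), (1211, 4794), (3987, 2014), (3329, 2680), (4285, 1721),
    (3565, 2439), (3863, 2146), (1920, 4083), (2573, 3431), (2951, 3051),
    (1284, 4725), (2639, 3371), (3150, 2851), (620, 5381), (3703, 2303),
    (3700, 2305), (1135, 4874), (2139, 3864), (3145, 2857), (561, 5449),
    (4191, 1814), (631, 5375), (1325, 4683), (2399, 3606), (2881, 3121),
    (3687, 2323), (1610, 4395), (1999, 4002), (564, 5444),
]
# Dataset-2: identical except one point is swapped for the outlier.
ds2 = [(5100, 45000) if p == (237, 5773) else p for p in ds1]

def ols(points):
    """Least-squares fit y = a*x + b: a = cov(x, y) / var(x), b = mean(y) - a*mean(x)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    return a, my - a * mx

for name, ds in (("Dataset-1", ds1), ("Dataset-2", ds2)):
    a, b = ols(ds)
    print(f"{name}: y = {a:.2f}x + {b:.1f}")
```

Running this shows the slope flipping sign between the two datasets, which is the whole point of the demonstration.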
4) Lines in action (Do you smell a Rat?)


5) Observations:
•   Yes, there is one outlier.
•   It is just one data point. How can a single data point change the whole regression line?
•   In fact, Dataset-1 and Dataset-2 are the same, except for one data point.
•   Yes, this one outlier almost inverted the relation between X & Y.

6) What to do now?  (Outlier treatment)
•   There is no standard method for treating outliers; the choice depends on the percentage of outliers in the data.
•   Flooring and capping are the most widely used methods. Sometimes we replace the outlier observation with the mean or median; sometimes we bin the variable to find the most suitable estimate for the outlier. (I will write about missing value and outlier treatment in my next post.)
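Flooring and capping can be sketched in a few lines. Here the cut-offs are chosen percentiles computed with a simple nearest-rank convention; both the percentile levels and the sample values are illustrative choices, not a standard:

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile on an already-sorted list (one simple convention)."""
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def floor_and_cap(values, low_pct=1, high_pct=99):
    """Clamp extreme values to the chosen lower/upper percentiles."""
    s = sorted(values)
    lo, hi = percentile(s, low_pct), percentile(s, high_pct)
    return [min(max(v, lo), hi) for v in values]

# Made-up sample with one huge value and one tiny value.
data = [2593, 3872, 2037, 5617, 2281, 45000, 2014, 2680, 1721, 12]
print(floor_and_cap(data, 10, 90))
```

The extreme values 45000 and 12 get pulled in to the 90th- and 10th-percentile values, while the rest of the data passes through unchanged.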
7) The outlier: here is the outlier observation, shown with its neighboring rows:
    X       Y
 1135    4874
 5100   45000   <-- outlier
 3145    2857
•   We can replace this outlier with the mean of the remaining observations, which is 3500
•   Or replace it with the overall median, which is 3401
•   Or replace it with the maximum of the remaining observations, which is 5617
•   We can also look at the 4500-5500 range in X and replace the outlier with the mean or median of this bin

8) Regression lines after outlier treatment
•   y = -0.8871x + 5787.4
•   y = -0.7866x + 5593
•   y = -0.8828x + 5779

None of the lines shows a positive coefficient.

9) Concluding remarks
•   Just by looking at the data we will not get a complete picture of outliers, missing values, and default data entries.
•   A scatter plot or frequency plot is crucial before starting model building.
•   Before building a model, one should take care of outliers, missing values, multicollinearity, heteroscedasticity, normality assumption tests, etc.
•   Last but not least, don't blindly interpret regression coefficients in isolation (like I did above).