Tuesday, April 10, 2012

Advanced CRM Analytics

Recently, I did some research (web mining) on CRM analytics. To my surprise, I found that almost 95% of the websites talk about InfoCubes, OLAP cubes, CRM reports, C-SAT scores, sales reports, sales dashboards, marketing dashboards, sales charts, blah blah blah… Well, I feel CRM analytics is not all about creating glittery BI dashboards.

What else can we do apart from regular BI reporting and basic averages tracking?

1.   Market basket analysis/Affinity analysis/Association rules
  • Use historical data and relevant statistical measures to quantify the association between products, and build association rules accordingly. These rules are used for up-selling and cross-selling.
  • Example rule: A + B + C --> E + F (this rule states that if products A, B, and C are chosen, products E and F are proposed; E and F are proposed only when all three products are selected)
  • Statistical techniques
    • Cross tabulation, multiple response tables
    • Correlation coefficient, regression, odds ratio, etc.
    • Chi-square test of independence, concordance and discordance tables
    • Sequence analysis, link analysis, etc.
    • Bayes' theorem and conditional probabilities
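As a sketch of how such rules can be mined, here is a minimal pure-Python pass over a toy transaction log. The transactions and the support/confidence thresholds are invented for illustration; real market-basket work would run an Apriori-style algorithm over millions of baskets:

```python
from itertools import combinations

# Toy transaction log; each transaction is the set of products bought together.
transactions = [
    {"A", "B", "C", "E"},
    {"A", "B", "C", "E", "F"},
    {"A", "B", "C", "F"},
    {"A", "C"},
    {"B", "D"},
    {"A", "B", "C", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) = support(lhs + rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

# Mine simple rules of the form {x, y} --> {z} above chosen thresholds.
items = sorted(set().union(*transactions))
rules = []
for x, y in combinations(items, 2):
    for z in items:
        if z in (x, y):
            continue
        lhs, rhs = {x, y}, {z}
        if support(lhs) > 0 and support(lhs | rhs) >= 0.5 and confidence(lhs, rhs) >= 0.7:
            rules.append((lhs, rhs, confidence(lhs, rhs)))

for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"(confidence {conf:.2f})")
```

The same counting logic extends to longer rules such as A + B + C --> E + F; only the enumeration of candidate itemsets grows.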
2.    Customer segmentation & profiling:
Customer segmentation helps us with target marketing, loyalty & retention programs, up-selling/cross-selling, etc.
  • Identifying homogeneous customer segments that are similar in specific ways relevant to marketing such as
    • Demographics (Age, Gender, Education, Income, Home ownership, etc.)
    • Psychographics (Lifestyle, Attitude, Beliefs, Personality, Buying motives, etc.)
    • Geographics (State, ZIP, City size, Rural vs. Urban, etc.) and Brand loyalty
  • Statistical methods used:
    • Cluster Analysis(K-means clustering, Hierarchical clustering),
    • CHAID, CART etc.,
    • Logistic Regression & Discriminant Analysis
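A minimal sketch of K-means clustering, one of the methods listed above, on made-up customer records. The two features (age, annual spend) and the choice of k=2 are illustrative assumptions; in practice you would standardize many more variables:

```python
import math
import random

# Hypothetical customers described by (age, annual spend in $1000s).
customers = [
    (22, 5), (25, 7), (27, 6), (24, 4),      # younger, lower spend
    (55, 40), (60, 45), (58, 42), (52, 38),  # older, higher spend
]

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(customers, k=2)
for c, cl in zip(centroids, clusters):
    print("centroid", c, "->", len(cl), "customers")
```

Each resulting segment (e.g. "young, low spend") can then be profiled and targeted with its own campaign.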
3.   Customer lifetime value analysis (Customer scorecard building)
Build predictive models based on customer profile data and historical behavior to assess how likely a customer is to exhibit a specific behavior in the future, in order to improve sales.
The scorecard takes customer profile variables (like age, gender, education, income, home ownership, lifestyle, attitude, beliefs, personality, buying motives, brand loyalty, geography, state, ZIP, city size, rural vs. urban, spending patterns, etc.) as input and gives a simple score that indicates the customer's value to the company.
  • Statistical methods used:
    • Logistic & linear regression model building
    • Weight of evidence & information value
    • Trees and segmentation
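To illustrate the weight-of-evidence and information-value idea, here is a minimal pure-Python calculation on made-up counts of "good" (retained) and "bad" (churned) customers per income band; all the numbers are invented:

```python
import math

# Hypothetical counts of good and bad customers per income band.
bands = {
    "low":    {"good": 100, "bad": 60},
    "medium": {"good": 300, "bad": 40},
    "high":   {"good": 200, "bad": 10},
}

total_good = sum(b["good"] for b in bands.values())
total_bad = sum(b["bad"] for b in bands.values())

# WoE_i = ln(%good_i / %bad_i);  IV = sum_i (%good_i - %bad_i) * WoE_i
woe = {}
iv = 0.0
for name, b in bands.items():
    pct_good = b["good"] / total_good
    pct_bad = b["bad"] / total_bad
    woe[name] = math.log(pct_good / pct_bad)
    iv += (pct_good - pct_bad) * woe[name]

for name in bands:
    print(f"{name:>6}: WoE = {woe[name]:+.3f}")
print(f"Information Value = {iv:.3f}")
```

A band with negative WoE concentrates "bad" customers; the IV summarizes how predictive the whole variable is, so it is commonly used to shortlist scorecard inputs.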
4.   Customer Satisfaction analysis & drivers of customer satisfaction
  • Survey analysis: analysis of customer response data to find the overall C-SAT score and satisfaction by various cuts
  • Analysis of C-SAT time-series data: calculation of control limits, seasonality & trends, etc.
  • Calculation of Net Promoter Score & C-SAT peer comparison
  • Identification of the factors with the greatest impact on customer satisfaction
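The Net Promoter Score mentioned above, for example, takes only a few lines to compute from 0-10 "would you recommend us?" ratings (the sample ratings below are made up):

```python
# Hypothetical 0-10 survey responses.
ratings = [10, 9, 9, 8, 7, 10, 6, 3, 9, 8, 10, 5, 9, 7, 10]

promoters = sum(r >= 9 for r in ratings)          # ratings 9-10
detractors = sum(r <= 6 for r in ratings)         # ratings 0-6
passives = len(ratings) - promoters - detractors  # ratings 7-8

# NPS = % promoters - % detractors, reported on a -100..+100 scale.
nps = 100 * (promoters - detractors) / len(ratings)
print(f"promoters={promoters}, passives={passives}, detractors={detractors}, NPS={nps:.1f}")
```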
5.   Text mining
  • Analysis of customer verbatim data to gauge overall customer satisfaction
  • Effective algorithms to classify comments as positive, negative & neutral
  • Summarization of overall comments into main themes (most frequent topics) and their positive & negative frequencies
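A toy illustration of the positive/negative/neutral split, using a tiny made-up keyword lexicon. A real system would use a proper NLP model, but the counting logic is the same:

```python
# Tiny illustrative sentiment lexicons (invented for this sketch).
POSITIVE = {"great", "good", "excellent", "love", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "slow", "rude"}

def classify(comment):
    """Label a comment by counting positive vs. negative lexicon hits."""
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

comments = [
    "great product and helpful support",
    "terrible delivery and rude staff",
    "it arrived on tuesday",
]
counts = {"positive": 0, "negative": 0, "neutral": 0}
for c in comments:
    counts[classify(c)] += 1
print(counts)
```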

Conclusion: There are several other advanced analytics techniques; the five above are the most generic ones and can be applied in almost any business.


I wish to thank my CRM guru Mukul Biswas.





Saturday, April 7, 2012



Don't trust outliers
Outlier and missing-value treatment is indispensable in model building. In this post I am going to demonstrate how a few outliers can wreck your final model.
I will prove my point in fewer than 10 straightforward points.
1)   out·li·er - /ˈoutˌlīər/  (Noun)
1.       Outlier is an observation that is numerically distant from the rest of the data.
2.       A person or thing situated away or detached from the main body or system.
3.       A person or thing excluded from a group; an outsider.



2) Sample Datasets
The two datasets below have 34 observations each. I am going to build two regression models (Y on X).
   Dataset-1             Dataset-2
    X       Y             X       Y
 3409    2593          3409    2593
 2130    3872          2130    3872
  237    5773          2139    3864
 3973    2037          3973    2037
  393    5617           393    5617
 3726    2281          3726    2281
 1211    4794          1211    4794
 3987    2014          3987    2014
 3329    2680          3329    2680
 4285    1721          4285    1721
 3565    2439          3565    2439
 3863    2146          3863    2146
 1920    4083          1920    4083
 2573    3431          2573    3431
 2951    3051          2951    3051
 1284    4725          1284    4725
 2639    3371          2639    3371
 3150    2851          3150    2851
  620    5381           620    5381
 3703    2303          3703    2303
 3700    2305          3700    2305
 1135    4874          1135    4874
 2139    3864          5100   45000
 3145    2857          3145    2857
  561    5449           561    5449
 4191    1814          4191    1814
  631    5375           631    5375
 1325    4683          1325    4683
 2399    3606          2399    3606
 2881    3121          2881    3121
 3687    2323          3687    2323
 1610    4395          1610    4395
 1999    4002          1999    4002
  564    5444           564    5444
3) Regression Lines
•   Dataset-1: Regression line: Y = (-1)X + 6006.9. Inference: Y decreases as X increases.
•   Dataset-2: Regression line: Y = (1)X + 2137.8. Inference: Y increases as X increases.
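Both fits can be reproduced with ordinary least squares in plain Python; the points are the two datasets from the table above (Dataset-2 is Dataset-1 with (237, 5773) replaced by the outlier (5100, 45000)):

```python
# Dataset-1: the 34 (X, Y) points from the table above.
ds1 = [
    (3409, 2593), (2130, 3872), (237, 5773), (3973, 2037), (393, 5617),
    (3726, 2281), (1211, 4794), (3987, 2014), (3329, 2680), (4285, 1721),
    (3565, 2439), (3863, 2146), (1920, 4083), (2573, 3431), (2951, 3051),
    (1284, 4725), (2639, 3371), (3150, 2851), (620, 5381), (3703, 2303),
    (3700, 2305), (1135, 4874), (2139, 3864), (3145, 2857), (561, 5449),
    (4191, 1814), (631, 5375), (1325, 4683), (2399, 3606), (2881, 3121),
    (3687, 2323), (1610, 4395), (1999, 4002), (564, 5444),
]
# Dataset-2: identical except one point is swapped for the outlier.
ds2 = [(5100, 45000) if p == (237, 5773) else p for p in ds1]

def ols(points):
    """Least-squares fit y = a*x + b: a = cov(x, y) / var(x), b = mean(y) - a*mean(x)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    return a, my - a * mx

for name, ds in (("Dataset-1", ds1), ("Dataset-2", ds2)):
    a, b = ols(ds)
    print(f"{name}: y = {a:.2f}x + {b:.1f}")
```

Running this shows the slope flipping sign between the two datasets, which is the whole point of the demonstration.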
4) Lines in action (Do you smell a Rat?)


5) Observations:
•   Yes, there is one outlier.
•   It is just one data point. How can a single data point change the whole regression line?
•   In fact, Dataset-1 and Dataset-2 are the same, except for one data point.
•   Yes, this one outlier almost inverted the relation between X & Y.

6) What to do now?  (Outlier treatment)
•   There is no standard method for treating outliers; the choice depends on the percentage of outliers in the data.
•   Flooring and capping are the most widely used methods. Sometimes we replace the outlier observation with the mean or median; sometimes we bin the variable to find the most suitable estimate for the outlier. (I will write about missing value and outlier treatment in my next post.)
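Flooring and capping can be sketched in a few lines. Here the cut-offs are chosen percentiles computed with a simple nearest-rank convention; both the percentile levels and the sample values are illustrative choices, not a standard:

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile on an already-sorted list (one simple convention)."""
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def floor_and_cap(values, low_pct=1, high_pct=99):
    """Clamp extreme values to the chosen lower/upper percentiles."""
    s = sorted(values)
    lo, hi = percentile(s, low_pct), percentile(s, high_pct)
    return [min(max(v, lo), hi) for v in values]

# Made-up sample with one huge value and one tiny value.
data = [2593, 3872, 2037, 5617, 2281, 45000, 2014, 2680, 1721, 12]
print(floor_and_cap(data, 10, 90))
```

The extreme values 45000 and 12 get pulled in to the 90th- and 10th-percentile values, while the rest of the data passes through unchanged.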
7) The outlier: here is the outlier observation, shown with its neighboring rows:
    X       Y
 1135    4874
 5100   45000   <-- outlier
 3145    2857
•   We can replace this outlier with the mean of the remaining observations, which is 3500
•   Or replace it with the overall median, which is 3401
•   Or replace it with the maximum of the remaining observations, which is 5617
•   We can also look at the 4500-5500 range in X and replace the outlier with the mean or median of this bin

8) Regression lines after outlier treatment
•   y = -0.8871x + 5787.4
•   y = -0.7866x + 5593
•   y = -0.8828x + 5779

None of the lines shows a positive coefficient.

9) Concluding remarks
•   Just by looking at the data we will not get a complete picture of outliers, missing values, and default data entries.
•   A scatter plot or frequency plot is crucial before starting model building.
•   Before building a model, one should take care of outliers, missing values, multicollinearity, heteroscedasticity, normality assumption tests, etc.
•   Last but not least, don't blindly interpret regression coefficients in isolation (like I did above).