Monday, June 4, 2012

WTF is Model building?


Model building is like driving a car while looking at the rear-view mirror all the time, hoping that the road in front of us will be the same as the road travelled so far.

Definition 1.1: Model Building: Predicting the probability of a future event using historical data 
Example 1.1: (RGV ki aag)

A few months back, I forced my friend Suresh to come to a movie on the day of its release. I thought it would be a good movie because I knew the director; he had made some really good movies in the past. Unfortunately, that movie was a disaster. Two weeks later, I insisted on another movie, and I bought the tickets myself this time ;). Well, disappointingly… this one was a tragedy too.


Next month, I am going to ask him for another movie… what will be his response? Can you predict the probability of Suresh's YES/NO? How do you know the probability of Suresh saying yes to my proposal? You know what? You just built a model (predicting the probability of a future event using historical data).
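The Suresh story is exactly this kind of estimate: a probability read straight off historical frequency. A minimal sketch in Python (the outcome history and the smoothing choice are my own illustration, not from the post):

```python
# Estimate P(Suresh says yes) from his past responses to movie invitations.
history = ["no", "no"]  # hypothetical outcomes of the two past proposals

# Raw empirical estimate: fraction of past "yes" responses.
p_yes_raw = history.count("yes") / len(history)

# With only two data points, Laplace smoothing avoids a hard 0% forecast.
p_yes_smoothed = (history.count("yes") + 1) / (len(history) + 2)

print(p_yes_raw, p_yes_smoothed)  # 0.0 0.25
```

With so little history, the smoothed estimate is the more honest one: it admits some chance Suresh says yes again.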

Definition 1.2: Model Building: Predicting the probability of a future event by assigning appropriate weights to all the important factors/variables in historical data

Example 1.2: (John Cena vs. Venkat Reddy)

Metrics                          John Cena              Venkat Reddy
Weight                           242 lbs                133 lbs
Height                           6' 1"                  5' 8"
Age                              34 years               26 years
Occupation                       Wrestler               Data Analyst
Time spent in gym                > 20 hours per week    < 20 minutes per week
Time spent in front of computer  < 2 hours per week     > 60 hours per week
Programming skills               None                   C++, SAS, SPSS, R

Wrestling match between them: What is the probability of Venkat winning the match?
A data mining challenge between them: What is the probability that Venkat will finish first?

When it comes to wrestling, height, weight & time spent in the gym matter the most; programming skills don't matter. But in data mining, statistical and other soft skills are important, and it doesn't matter how much your body weighs. (Model Building: Predicting the probability of a future event by assigning appropriate weights to all the important factors/variables in historical data)

Example 1.3: Below is the win vs. loss record of horses, compiled from the historical data of a particular horse racing track
  • Long legs: 75% - a horse with relatively long legs won 75% of the time
  • Breed A (Paint): 45%, Breed B (Morgan): 25%, others: 30%
  • T/L (tummy-to-length) ratio < 1/2: 75% - horses with a tummy-to-length ratio below 1/2 won 75% of the time
  • Gender: Male - 65%
  • Head size: Small 30%, Medium 45%, Large 25%
  • Country: Africa - 65%
Now, given the historical data, which one of these two horses would you bet on?
Horse Name       Jigga    Rodeo
Length of legs   150 cm   110 cm
Breed            A        F
T/L ratio        0.3      0.6
Gender           Male     Female
Head size        Large    Small
Country          India    India


Obviously, everyone’s bet is on Jigga. 

Given the historical data, how do you pick the best horse, or the top 3 horses, from the list below?


                 Horse-1  Horse-2  Horse-3  Horse-4  Horse-5  Horse-6  Horse-7  Horse-8
Length of legs   109      114      134      130      149      120      104      117
Breed            C        A        B        A        F        K        L        B
T/L ratio        0.1      0.8      0.5      1        0.3      0.3      0.3      0.6
Gender           Male     Female   Male     Female   Male     Female   Male     Female
Head size        L        S        M        M        L        L        S        M
Country          Africa   India    Aus      NZ       Africa   Africa   India    India


We have historical data across all attributes (height, weight, gender, breed, age, win/loss percentage, etc.). Given a horse and its characteristics (realizations of the above metrics), our objective is to predict its probability of winning the race.

Here is the approach: scrutinize the historical data to identify common characteristics of winning horses, then assign a weight to each of these characteristics/attributes based on its importance. In simple words, solve the equation below:
Probability of winning = W1(Att1) + W2(Att2) + W3(Att3) + W4(Att4) + ... + Wk(Attk)
This is nothing but a model; we calculate these weights using optimization techniques (regression, in general).
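The weighted-sum equation can be written directly in code. A toy sketch (the weights and the normalized attribute values are invented for illustration; in practice regression estimates the weights from historical races):

```python
# Toy linear scoring model: score = W1*Att1 + W2*Att2 + ... + Wk*Attk
weights = {"leg_length": 0.5, "tl_ratio": -2.0, "is_male": 0.8}  # hypothetical weights
horse = {"leg_length": 1.5, "tl_ratio": 0.3, "is_male": 1}       # normalized attributes

# The horse's score is the weighted sum of its attributes.
score = sum(weights[att] * horse[att] for att in weights)
print(round(score, 2))  # 0.5*1.5 - 2.0*0.3 + 0.8*1 = 0.95
```

Scoring every horse this way and ranking by score is exactly how you would pick the top 3 from the table above.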

Here are some of the main steps in model building.
  • Horses from Europe never won a race, so never bet on them (Exclusions)
  • I would never bet on a 'Breed-B female horse from Australia' (Objective function / bad definition)
  • Asian horses behave entirely differently from African horses (Segmentation)
  • Why not consider the length of the tail? (Variable selection techniques)
  • I think length of legs should be given more weight… no, the breed… no, the country… (Fitting the regression line)
  • I am betting on these three horses; they got the maximum score. Are they going to win? (Validation)

My future posts will cover each of these in detail.




I wish to thank my analytics guru Gopal Prasad Malakar

-Venkat Reddy
Trendwise Analytics

  

Tuesday, April 10, 2012

Advanced CRM Analytics

Recently, I did some research (web mining) on CRM analytics. To my surprise, I found that almost 95% of the websites are talking about Info cubes, OLAP cubes, CRM reports, C-SAT scores, sales reports, sales dashboards, marketing dashboards, sales charts blah blah blah…. Well, I feel CRM analytics is not all about creating glittery BI dashboards.

What else can we do apart from regular BI reporting and basic averages tracking?

1.   Market basket analysis/Affinity analysis/Association rules
  • Use historical data & relevant statistical measures to quantify the association between products, and build association rules accordingly. These rules are used for Up-Selling and Cross-Selling.
  • E.g. rule: A + B + C --> E + F (this rule states that if products A, B, and C are chosen, products E and F are proposed; E and F are proposed only when all three products are selected)
  • Statistical techniques
    • Cross tabulation , Multiple Response tables
    • Correlation coefficient, regression, odds ratio etc.,
    • Chi-square test of independence, concordance and discordance tables
    • Sequence Analysis, Link Analysis etc.,
    • Bayes and Conditional probabilities 
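The support/confidence measures behind such rules are simple conditional probabilities. A minimal sketch on made-up basket data (the transactions and products are illustrative, not from any real dataset):

```python
# Toy market-basket data: each transaction is a set of purchased products.
transactions = [
    {"A", "B", "C", "E"},
    {"A", "B", "C", "E", "F"},
    {"A", "B"},
    {"B", "C"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Confidence of the rule {A, B, C} -> {E}: P(E | A, B, C)
antecedent, consequent = {"A", "B", "C"}, {"E"}
confidence = support(antecedent | consequent) / support(antecedent)
print(confidence)  # both transactions containing A,B,C also contain E -> 1.0
```

High-confidence, high-support rules become the cross-sell suggestions.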
2.    Customer segmentation & profiling:
Customer segmentation will help us in Target marketing, loyalty & retention programs, Up selling/Cross-Selling etc.,
  • Identifying homogeneous customer segments that are similar in specific ways relevant to marketing such as
    • Demographics(Age, Gender, Education, Income, Home ownership, etc.)
    • Psychographics(Lifestyle, Attitude, Beliefs, Personality, Buying motives, etc.)
    • Geographics (Geography, State, ZIP, City size, Rural vs. Urban, etc.)
  • Statistical methods used:
    • Cluster Analysis(K-means clustering, Hierarchical clustering),
    • CHAID, CART etc.,
    • Logistic Regression & Discriminant Analysis
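As a sketch of the clustering step, here is a minimal K-means on made-up customer data (two features and k=2 are my own assumptions for illustration; real segmentations use many more variables):

```python
# Minimal K-means on toy customer data (age, annual spend), pure Python.
customers = [(22, 300), (25, 350), (24, 320), (55, 900), (60, 950), (58, 880)]

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

centroids, clusters = kmeans(customers, centroids=[(20, 300), (60, 900)])
print(centroids)  # one young/low-spend segment, one older/high-spend segment
```

The two recovered centroids are the segment profiles you would then hand to marketing.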
3.   Customer lifetime value analysis (Customer scorecard building)
Build predictive models based on customer profile data and historical behavior to assess how likely a customer is to exhibit a specific behavior in the future, in order to improve sales.
A scorecard takes customer profile variables (like Age, Gender, Education, Income, Home ownership, Lifestyle, Attitude, Beliefs, Personality, Buying motives, Brand Loyalty, Geography, State, ZIP, City size, Rural vs. Urban, spending patterns, etc.) as input and gives a simple score that indicates the customer's value to the company.
  • Statistical methods used:
    • Logistic & linear regression model building
    • Weight of evidence & information value
    • Trees and segmentation
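A scorecard of this kind is often just a logistic model rescaled to a convenient range. A toy sketch (the coefficients below are invented for illustration; in practice they come from fitting a logistic regression on historical behavior data):

```python
import math

# Toy scorecard: logistic model mapping customer attributes to a 0-1000 score.
coef = {"intercept": -2.0, "income_k": 0.03, "tenure_years": 0.15, "owns_home": 0.5}

def score(customer):
    # Linear predictor: intercept plus weighted customer attributes.
    z = coef["intercept"] + sum(coef[k] * v for k, v in customer.items())
    p = 1 / (1 + math.exp(-z))   # probability of the target behavior
    return round(p * 1000)       # rescale to a simple 0-1000 score

print(score({"income_k": 80, "tenure_years": 5, "owns_home": 1}))  # 839
```

The single number is what makes scorecards operational: one threshold decides who gets the retention offer.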
4.   Customer Satisfaction analysis & drivers of customer satisfaction
  • Survey Analysis: Analysis of customer response data to find the overall C-SAT and satisfaction by various cuts
  • Analysis of C-SAT time series data, calculation of control limits, seasonality & trends etc.,
  • Calculation of net promoter score & C-SAT peer comparison
  • Identification of the factors that most impact customer satisfaction
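The net promoter score mentioned above has a standard formula: % promoters (scores 9-10) minus % detractors (scores 0-6). A quick sketch on made-up survey responses:

```python
# Net Promoter Score from 0-10 survey responses (toy data, illustrative only).
responses = [10, 9, 9, 8, 7, 6, 10, 3, 9, 5]

promoters = sum(r >= 9 for r in responses)   # scores 9-10
detractors = sum(r <= 6 for r in responses)  # scores 0-6
nps = 100 * (promoters - detractors) / len(responses)
print(nps)  # (5 promoters - 3 detractors) out of 10 -> 20.0
```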
5.   Text mining
  • Analysis of customer verbatim data to define overall customer satisfaction
  • Effective algorithms to identify positive, negative & neutral comments
  • Summarization of overall comments into main themes (most frequent topics) and their positive & negative frequencies
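The simplest positive/negative/neutral tagger is lexicon-based counting. A deliberately naive sketch (the word lists are tiny stand-ins for a real sentiment lexicon; production systems are far more sophisticated):

```python
# Naive lexicon-based tagging of customer comments.
positive = {"good", "great", "love", "excellent"}
negative = {"bad", "poor", "hate", "terrible"}

def tag(comment):
    words = set(comment.lower().split())
    pos, neg = len(words & positive), len(words & negative)
    # Majority vote between matched positive and negative words.
    return "positive" if pos > neg else "negative" if neg > pos else "neutral"

comments = ["Great service, love it", "Terrible wait times", "It was okay"]
print([tag(c) for c in comments])  # ['positive', 'negative', 'neutral']
```

Counting these tags over all comments gives the positive & negative frequencies per theme.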

Conclusion: There are several other advanced analytics techniques; the five above are the most generic ones, and they can be applied in any business.


I wish to thank my CRM guru Mukul Biswas 





Saturday, April 7, 2012



Don't trust outliers
Outlier and missing value treatment is indispensable in model building. In this post I am going to demonstrate how a few outliers can wreck your final model.
I will prove my point in fewer than 10 simple points.
1)   out·li·er /ˈoutˌlīər/ (Noun)
1.       An observation that is numerically distant from the rest of the data.
2.       A person or thing situated away or detached from the main body or system.
3.       A person or thing excluded from a group; an outsider.



2) Sample Datasets
The two data sets below have 34 observations each. I am going to build two regression models (Y on X).
Dataset-1        Dataset-2
X      Y         X      Y
3409   2593      3409   2593
2130   3872      2130   3872
237    5773      2139   3864
3973   2037      3973   2037
393    5617      393    5617
3726   2281      3726   2281
1211   4794      1211   4794
3987   2014      3987   2014
3329   2680      3329   2680
4285   1721      4285   1721
3565   2439      3565   2439
3863   2146      3863   2146
1920   4083      1920   4083
2573   3431      2573   3431
2951   3051      2951   3051
1284   4725      1284   4725
2639   3371      2639   3371
3150   2851      3150   2851
620    5381      620    5381
3703   2303      3703   2303
3700   2305      3700   2305
1135   4874      1135   4874
2139   3864      5100   45000
3145   2857      3145   2857
561    5449      561    5449
4191   1814      4191   1814
631    5375      631    5375
1325   4683      1325   4683
2399   3606      2399   3606
2881   3121      2881   3121
3687   2323      3687   2323
1610   4395      1610   4395
1999   4002      1999   4002
564    5444      564    5444
3) Regression Lines
·         Data Set 1: Regression line: Y = (-1)X + 6006.9. Inference: Y decreases as X increases.
·         Data Set 2: Regression line: Y = (1)X + 2137.8. Inference: Y increases as X increases.
4) Lines in action (Do you smell a Rat?)


5) Observations:
·         Yes, there is one outlier.
·         It is just one data point. How can one data point change the whole regression line?
·         In fact, Dataset-1 and Dataset-2 are the same, except for one data point.
·         Yes, this outlier almost inverted the relationship between X & Y.
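The slope flip is easy to reproduce with ordinary least squares. A sketch in the same spirit as the post (the clean series below is my own simplified stand-in for Dataset-1, not its actual 34 rows; the extreme point (5100, 45000) is the one from the post):

```python
# Least-squares slope, with and without a single extreme point.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )

# Clean data: y falls exactly as x rises (y = 6000 - x).
xs = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
ys = [5500, 5000, 4500, 4000, 3500, 3000, 2500, 2000]
print(slope(xs, ys))  # -1.0

# Add one extreme point and the fitted slope turns positive.
print(slope(xs + [5100], ys + [45000]))
```

One point out of nine is enough to reverse the sign of the fitted relationship.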

6) What to do now?  (Outlier treatment)
·         There is no standard method for treating outliers; it depends on the % of outliers in the data.
·         Flooring and Capping are the most widely used methods. Sometimes we replace the outlier observation with the mean or median; sometimes we bin the variable to find the most suitable estimate for the outlier (I will write about missing value and outlier treatment in my next blog)
7) The outlier: so here is the outlier observation, shown with its neighboring rows:
X      Y
1135   4874
5100   45000   <-- outlier
3145   2857
·         We can replace this outlier with the mean of the rest of the observations, which is 3500
·         Or replace it with the overall median, which is 3401
·         Or replace it with the max of the rest of the observations, which is 5617
·         We can also look at the 4500-5500 range of X and replace the outlier with the mean or median of Y in that bin
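The replacement options above are one-liners in code. A sketch on a short illustrative excerpt (the six rows below are a toy sample, not the post's full dataset, so the fill values differ from the 3500/3401/5617 quoted above):

```python
import statistics

# Replacing an outlier Y with the mean/median of the remaining observations,
# plus capping at the next-largest value.
ys = [2593, 3872, 2037, 5617, 45000, 2014]
outlier = max(ys)                          # 45000 stands out
rest = [y for y in ys if y != outlier]

mean_fill = statistics.mean(rest)          # replace with the mean of the rest
median_fill = statistics.median(rest)      # or with the median of the rest
capped = [min(y, max(rest)) for y in ys]   # or cap at the next-largest value

print(mean_fill, median_fill, capped)
```

Which option to use depends on how skewed the rest of the distribution is; the median is the safest default.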

8) Regression Lines after outlier treatment
y = -0.8871x + 5787.4
y = -0.7866x + 5593
y = -0.8828x + 5779

None of the lines show a positive coefficient.

9) Concluding remarks
·         Just by looking at the data we will not get a complete picture of the outliers, missing values and default data entries
·         A scatter plot or frequency plot is crucial before starting model building
·         Before building a model one should take care of outliers, missing values, multicollinearity, heteroscedasticity, normality assumption tests etc.,
·         Last but not least, don't blindly interpret the regression coefficients in isolation (like I did above).