Monday, June 4, 2012

WTF is Model building?


What the FISH is Model Building?


Model building is like driving a car while looking in the rear-view mirror all the time, hoping that the road in front of us will be the same as the road travelled so far.

Definition 1.1: Model Building: Predicting the probability of a future event using historical data 
Example 1.1: (RGV ki aag)

A few months back, I forced my friend Suresh to come to a movie on the day of its release. I thought it would be a good movie because I knew the director, who had made some really good movies in the past. Unfortunately, that movie was a disaster. Two weeks later, I insisted on another movie, and this time I bought the tickets myself ;). Disappointingly… this one was a tragedy too.


Next month, I am going to ask him to come to another movie… what will his response be? Can you predict the probability of Suresh's YES/NO? How do you know the probability of Suresh saying yes to my proposal? You know what? You just built a model (predicting the probability of a future event using historical data).
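One simple way to put a number on that gut feeling is Laplace's rule of succession. This is only a sketch: it assumes each movie pick is an independent good/bad trial and that Suresh's willingness roughly tracks the chance of the next pick being good. Neither assumption comes from the story itself, and the function name is made up for illustration.

```python
def rule_of_succession(good_outcomes, trials):
    """Laplace's rule of succession: (s + 1) / (n + 2)."""
    return (good_outcomes + 1) / (trials + 2)

# History from the story: two movies picked, none of them turned out well.
p_good_movie = rule_of_succession(good_outcomes=0, trials=2)
print(f"Estimated chance the next pick is good: {p_good_movie:.2f}")
```

The extra 1 and 2 in the formula keep the estimate away from a flat 0% after only two bad movies.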

Definition 1.2: Model Building: Predicting the probability of a future event by assigning appropriate weights to all the important factors/variables in historical data

Example 1.2: (John Cena vs. Venkat Reddy)

| Metrics | John Cena | Venkat Reddy |
|---|---|---|
| Weight | 242 lbs | 133 lbs |
| Height | 6' 1" | 5' 8" |
| Age | 34 years | 26 years |
| Occupation | Wrestler | Data Analyst |
| Time spent in gym | > 20 hours per week | < 20 minutes per week |
| Time spent in front of computer | < 2 hours per week | > 60 hours per week |
| Programming skills | None | C++, SAS, SPSS, R |

Wrestling match between them: What is the probability of Venkat winning the match?
A data mining challenge between them: What is the probability that Venkat will finish first?

When it comes to wrestling, height, weight and time spent in the gym matter the most; programming skills don't matter. But in data mining, statistical and other soft skills are important, and it doesn't matter how much your body weighs. (Model Building: Predicting the probability of a future event by assigning appropriate weights to all the important factors/variables in historical data)
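The idea that the same attributes get different weights for different contests can be sketched roughly as follows. The attribute values come from the comparison table (with 20 and 0.3 hours standing in for "> 20 hours" and "< 20 minutes"), but the weights are hypothetical numbers picked purely for illustration, and the number of programming tools is used as a crude stand-in for data mining skill.

```python
# Attribute values from the comparison table; the gym hours are rough
# stand-ins for "> 20 hours" and "< 20 minutes" per week.
contestants = {
    "John Cena": {"weight_lbs": 242, "gym_hours": 20.0, "programming_skills": 0},
    "Venkat Reddy": {"weight_lbs": 133, "gym_hours": 0.3, "programming_skills": 4},
}

# Hypothetical weights (assumed, not fitted): how much each attribute
# matters for each kind of contest. Each set of weights sums to 1.
task_weights = {
    "wrestling": {"weight_lbs": 0.5, "gym_hours": 0.5, "programming_skills": 0.0},
    "data_mining": {"weight_lbs": 0.0, "gym_hours": 0.0, "programming_skills": 1.0},
}

def win_shares(task):
    """Return each contestant's share of the weighted score for a task."""
    scores = {name: 0.0 for name in contestants}
    for attr, weight in task_weights[task].items():
        total = sum(person[attr] for person in contestants.values())
        for name, person in contestants.items():
            # An attribute contributes its weight times the contestant's share of it.
            scores[name] += weight * person[attr] / total
    return scores

for task in task_weights:
    print(task, win_shares(task))
```

The two contestants are the same in both runs; only the weights change, and that alone flips who comes out ahead.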

Example 1.3: Below is the win vs. loss record of horses, obtained after grilling the historical data of a particular horse racing track:
  • Long legs: 75% (horses with relatively long legs won 75% of the time)
  • Breed A (Paint): 45%, Breed B (Morgan): 25%, others: 30%
  • T/L (tummy to length) ratio < 1/2: 75% (horses with a tummy-to-length ratio below 1/2 won 75% of the time)
  • Gender: Male 65%
  • Head size: Small 30%, Medium 45%, Large 25%
  • Country: Africa 65%
Now, given the historical data, which one of these two horses would you bet on?
| Horse Name | Jigga | Rodeo |
|---|---|---|
| Length of legs | 150 cm | 110 cm |
| Breed | A | F |
| T/L ratio | 0.3 | 0.6 |
| Gender | Male | Female |
| Head size | Large | Small |
| Country | India | India |


Obviously, everyone’s bet is on Jigga. 

Given the historical data, how do you pick the best horse, or the top 3 horses, from the list below?


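| Horse Name | Horse-1 | Horse-2 | Horse-3 | Horse-4 | Horse-5 | Horse-6 | Horse-7 | Horse-8 |
|---|---|---|---|---|---|---|---|---|
| Length of legs | 109 | 114 | 134 | 130 | 149 | 120 | 104 | 117 |
| Breed | C | A | B | A | F | K | L | B |
| T/L ratio | 0.1 | 0.8 | 0.5 | 1 | 0.3 | 0.3 | 0.3 | 0.6 |
| Gender | Male | Female | Male | Female | Male | Female | Male | Female |
| Head size | L | S | M | M | L | L | S | M |
| Country | Africa | India | Aus | NZ | Africa | Africa | India | India |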
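One naive baseline for this question, before any weights are fitted, is to add up the historical win share of each attribute value and rank the horses by the total. The sketch below rests on several assumptions that are not in the historical record: a hypothetical 125 cm cut-off for "relatively long legs", every attribute counting equally, and categories the record does not name (short legs, female, non-African, T/L ratio of 1/2 or more) getting the remaining share.

```python
# Historical win shares from the record above. Categories the record does
# not name get the remainder (e.g. short legs = 1 - 0.75): an assumption.
LONG_LEG_CM = 125                 # hypothetical cut-off for "long legs"
BREED = {"A": 0.45, "B": 0.25}    # every other breed: 0.30
HEAD = {"S": 0.30, "M": 0.45, "L": 0.25}

def score(legs, breed, tl_ratio, gender, head, country):
    """Sum the historical win share of each attribute value (equal weights)."""
    return (
        (0.75 if legs >= LONG_LEG_CM else 0.25)
        + BREED.get(breed, 0.30)
        + (0.75 if tl_ratio < 0.5 else 0.25)
        + (0.65 if gender == "Male" else 0.35)
        + HEAD[head]
        + (0.65 if country == "Africa" else 0.35)
    )

# The eight horses from the table above.
horses = {
    "Horse-1": (109, "C", 0.1, "Male", "L", "Africa"),
    "Horse-2": (114, "A", 0.8, "Female", "S", "India"),
    "Horse-3": (134, "B", 0.5, "Male", "M", "Aus"),
    "Horse-4": (130, "A", 1.0, "Female", "M", "NZ"),
    "Horse-5": (149, "F", 0.3, "Male", "L", "Africa"),
    "Horse-6": (120, "K", 0.3, "Female", "L", "Africa"),
    "Horse-7": (104, "L", 0.3, "Male", "S", "India"),
    "Horse-8": (117, "B", 0.6, "Female", "M", "India"),
}

ranked = sorted(horses, key=lambda name: score(*horses[name]), reverse=True)
top3 = ranked[:3]
print(top3)
```

With this cut-off and equal weights, Horse-5 comes out on top, followed by Horse-1 and Horse-3. Treating every attribute as equally important is exactly the simplification that a fitted model is meant to replace.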


We have historical data across all attributes (height, weight, gender, breed, age, win/loss percentage, etc.). Given a horse and its characteristics (the values it takes on the above metrics), our objective is to predict its probability of winning the race.

Here is the approach:
Scrutinize the historical data to identify the characteristics that winning horses have in common, then assign a weight to each of these characteristics/attributes based on its importance. In simple words, solve the equation below:
Probability of winning = W1(Att1) + W2(Att2) + W3(Att3) + W4(Att4) + … + Wk(Attk)
This is nothing but a model; we calculate these weights using some optimization technique (regression, in general).
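To make "calculate these weights" a little more concrete, here is a minimal sketch that fits W1…Wk by plain gradient descent on squared error (a linear regression without an intercept). The eight-row history is made-up toy data, not real race records, and the three yes/no attributes, the function names, the step count and the learning rate are all assumptions chosen for illustration.

```python
# Made-up toy history (NOT real race data). Each row is
# (long_legs, tl_ratio_below_half, male) -> won (1) or lost (0).
history = [
    ((1, 1, 1), 1),
    ((1, 1, 0), 1),
    ((1, 0, 1), 1),
    ((0, 1, 1), 0),
    ((1, 0, 0), 0),
    ((0, 1, 0), 0),
    ((0, 0, 1), 0),
    ((0, 0, 0), 0),
]

def predict(weights, attrs):
    """Score = W1*Att1 + W2*Att2 + ... + Wk*Attk."""
    return sum(w * a for w, a in zip(weights, attrs))

def fit(rows, steps=5000, lr=0.05):
    """Fit the weights by gradient descent on the mean squared error."""
    k = len(rows[0][0])
    weights = [0.0] * k
    for _ in range(steps):
        grads = [0.0] * k
        for attrs, won in rows:
            error = predict(weights, attrs) - won
            for j, a in enumerate(attrs):
                grads[j] += 2 * error * a / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

weights = fit(history)
print("Fitted weights:", [round(w, 3) for w in weights])
print("Score for a long-legged, low T/L, male horse:",
      round(predict(weights, (1, 1, 1)), 3))
```

In this toy history, long legs end up with the largest weight because every winner has them. A plain weighted sum is not guaranteed to stay between 0 and 1, which is one reason logistic regression is a common choice when the target is a probability.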

Here are some of the main steps in model building:
  • Horses from Europe never won a race, so never bet on them (Exclusions)
  • I would never bet on a 'Breed-B, female horse from Australia' (Objective function / bad definition)
  • Asian horses behave entirely differently from African horses (Segmentation)
  • Why not consider the length of the tail? (Variable selection techniques)
  • I think length of legs should be given more weight… no, the breed… no, the country… (Fitting the regression line)
  • I am betting on these three good horses because they got the maximum score. Are they going to win? (Validation)

My future posts will cover each of these in detail.




I wish to thank my analytics guru Gopal Prasad Malakar

-Venkat Reddy
Trendwise Analytics
