What the FISH is Model Building?
Definition 1.1: Model Building: Predicting the probability of a future event using historical data.
Example 1.1: (RGV ki aag)
A few months back, I forced my friend Suresh to come to a movie on the day of its release. I thought it would be a good movie because I knew the director, who had made some really good movies in the past. Unfortunately, that movie was a disaster. Two weeks later, I insisted that he come to another movie, and this time I bought the tickets myself ;). Disappointingly, this one was a tragedy too.
Next month, I am going to ask him to come to another movie… what will his response be? Can you predict the probability of Suresh's YES/NO? How do you know the probability of Suresh saying yes to my proposal? You know what? You just built a model (predicting the probability of a future event using historical data).
Definition 1.2: Model Building: Predicting the probability of a future event by assigning appropriate weight to all the important factors/variables in historical data.
Example 1.2: (John Cena vs. Venkat Reddy)
| Metrics | John Cena | Venkat Reddy |
|---|---|---|
| Weight | 242 lbs | 133 lbs |
| Height | 6' 1" | 5' 8" |
| Age | 34 years | 26 years |
| Occupation | Wrestler | Data Analyst |
| Time spent in gym | > 20 hours per week | < 20 minutes per week |
| Time spent in front of computer | < 2 hours per week | > 60 hours per week |
| Programming skills | None | C++, SAS, SPSS, R |
- A wrestling match between them: What is the probability of Venkat winning the match?
- A data mining challenge between them: What is the probability that Venkat will finish first?
When it comes to wrestling, height, weight and time spent in the gym matter the most; programming skills don't matter. But in data mining, statistical and other soft skills are important, and it doesn't matter how much your body weighs. (Model Building: Predicting the probability of a future event by assigning appropriate weight to all the important factors/variables in historical data.)
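Here is a minimal Python sketch of that idea: the same two people, scored twice with different weights. The weights are hand-picked for illustration and are not estimated from any data, the ">" and "<" bounds from the table are used as stand-in point values, and the share-of-total-score at the end is only a crude stand-in for a probability.

```python
# Attribute values read off the table; the ">"/"<" bounds are used as
# stand-in point values, which is an assumption made for illustration.
PEOPLE = {
    "John Cena": {"weight_lbs": 242, "height_in": 73, "gym_hours": 20.0,
                  "computer_hours": 2.0, "programming_tools": 0},
    "Venkat Reddy": {"weight_lbs": 133, "height_in": 68, "gym_hours": 20 / 60,
                     "computer_hours": 60.0, "programming_tools": 4},
}

# Hypothetical weights, picked by hand: not estimated from any data.
# Attributes that do not matter for a contest simply get no weight.
WEIGHTS = {
    "wrestling": {"weight_lbs": 0.4, "height_in": 0.2, "gym_hours": 0.4},
    "data_mining": {"computer_hours": 0.5, "programming_tools": 0.5},
}


def score(person, weights):
    """Weighted sum of attributes, each scaled by the largest value seen."""
    total = 0.0
    for attr, w in weights.items():
        largest = max(p[attr] for p in PEOPLE.values())
        total += w * PEOPLE[person][attr] / largest
    return total


def chance_of_winning(person, contest):
    """Share of the total score, used as a crude stand-in for a probability."""
    weights = WEIGHTS[contest]
    scores = {name: score(name, weights) for name in PEOPLE}
    return scores[person] / sum(scores.values())


p_wrestling = chance_of_winning("Venkat Reddy", "wrestling")
p_data_mining = chance_of_winning("Venkat Reddy", "data_mining")
print(f"Venkat, wrestling:   {p_wrestling:.2f}")
print(f"Venkat, data mining: {p_data_mining:.2f}")
```

Nothing about the two people changes between the two calls; only the weights do, and that alone flips the favourite.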
Example 1.3: Below is the win vs. loss record of horses, obtained by digging into the historical data of a particular horse racing track.
- Long legs: 75% (horses with relatively long legs won 75% of the time)
- Breed: Breed A (Paint) 45%, Breed B (Morgan) 25%, others 30%
- T/L (tummy to length) ratio < 1/2: 75% (horses with a tummy to length ratio < 1/2 won 75% of the time)
- Gender: Male 65%
- Head size: Small 30%, Medium 45%, Large 25%
- Country: Africa 65%
| Horse Name | Jigga | Rodeo |
|---|---|---|
| Length of legs | 150 cm | 110 cm |
| Breed | A | F |
| T/L ratio | 0.3 | 0.6 |
| Gender | Male | Female |
| Head size | Large | Small |
| Country | India | India |
Obviously, everyone’s bet is on Jigga.
Given the historical data, how do you pick the best horse, or the top 3 horses, from the list below?
| Horse Name | Horse-1 | Horse-2 | Horse-3 | Horse-4 | Horse-5 | Horse-6 | Horse-7 | Horse-8 |
|---|---|---|---|---|---|---|---|---|
| Length of legs | 109 | 114 | 134 | 130 | 149 | 120 | 104 | 117 |
| Breed | C | A | B | A | F | K | L | B |
| T/L ratio | 0.1 | 0.8 | 0.5 | 1 | 0.3 | 0.3 | 0.3 | 0.6 |
| Gender | Male | Female | Male | Female | Male | Female | Male | Female |
| Head size | L | S | M | M | L | L | S | M |
| Country | Africa | India | Aus | NZ | Africa | Africa | India | India |
We have historical data across all attributes (height, weight, gender, breed, age, win/loss percentage, etc.). Given a horse and its characteristics (realizations of the above metrics), our objective is to predict its probability of winning the race.
Here is the approach: scrutinize the historical data to identify common characteristics in winning horses, then assign a weight to each of these characteristics/attributes based on its importance. In simple words, solve the equation below:
Probability of winning = W1(Att1) + W2(Att2) + W3(Att3) + W4(Att4) + … + Wk(Attk)
This is nothing but a model; we calculate these weights using some optimization technique (regression, in general).
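To make the weighted sum concrete, here is a rough Python sketch that scores the eight horses listed earlier and picks the top 3. It rests on several assumptions that are not in the post: each attribute value is replaced by the historical win share quoted above, the complements (for example 25% for horses without long legs) are inferred, the cutoff for "long legs" is a made-up 130, and the weights W1 to W6 are simply set equal. A real model would estimate the weights from data instead.

```python
# Horses from the table: (legs, breed, T/L ratio, gender, head size, country).
HORSES = {
    "Horse-1": (109, "C", 0.1, "Male", "L", "Africa"),
    "Horse-2": (114, "A", 0.8, "Female", "S", "India"),
    "Horse-3": (134, "B", 0.5, "Male", "M", "Aus"),
    "Horse-4": (130, "A", 1.0, "Female", "M", "NZ"),
    "Horse-5": (149, "F", 0.3, "Male", "L", "Africa"),
    "Horse-6": (120, "K", 0.3, "Female", "L", "Africa"),
    "Horse-7": (104, "L", 0.3, "Male", "S", "India"),
    "Horse-8": (117, "B", 0.6, "Female", "M", "India"),
}

# Hypothetical cutoff: the post never says how long "long legs" are.
LONG_LEGS_CUTOFF = 130


def attribute_scores(legs, breed, tl_ratio, gender, head, country):
    """Map each attribute to the historical win share quoted in the post.

    Complements such as 25% for short legs are assumed, not stated.
    """
    return [
        0.75 if legs >= LONG_LEGS_CUTOFF else 0.25,
        {"A": 0.45, "B": 0.25}.get(breed, 0.30),
        0.75 if tl_ratio < 0.5 else 0.25,
        0.65 if gender == "Male" else 0.35,
        {"S": 0.30, "M": 0.45, "L": 0.25}[head],
        0.65 if country == "Africa" else 0.35,
    ]


# Placeholder weights W1..W6: equal here, whereas a fitted model would
# estimate a separate weight for each attribute.
WEIGHTS = [1 / 6] * 6


def score(horse):
    """W1(Att1) + W2(Att2) + ... + Wk(Attk) for one horse."""
    atts = attribute_scores(*HORSES[horse])
    return sum(w * a for w, a in zip(WEIGHTS, atts))


ranking = sorted(HORSES, key=score, reverse=True)
top3 = ranking[:3]
print(top3)
```

With equal weights every attribute counts the same, which is exactly the simplification that regression is meant to replace.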
Here are some of the main steps in model building.
- Horses from Europe have never won a race, so never bet on them (Exclusions)
- I would never bet on a 'Breed-B, female horse from Australia' (Objective function / Bad definition)
- Asian horses behave entirely differently from African horses (Segmentation)
- Why not consider the length of the tail? (Variable selection techniques)
- I think length of legs should be given more weight… no, the breed… no, the country… (Fitting regression line)
- I am betting on these three good horses; they got the maximum score. Are they going to win? (Validation)
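Three of the steps above (exclusions, fitting and validation) can be sketched in a few lines of plain Python. Everything in this example is an assumption made for illustration: the race records are made up, only two yes/no attributes are used, and the weights are fitted with a bare-bones logistic regression trained by gradient descent, which is one common choice and not necessarily the technique a given project would use.

```python
import math

# Made-up records, for illustration only:
# (country, long_legs, tl_below_half, won)
RECORDS = [
    ("Africa", 1, 1, 1), ("Africa", 1, 0, 1), ("India", 0, 1, 0),
    ("India", 0, 0, 0), ("Europe", 1, 1, 0), ("Africa", 1, 1, 1),
    ("India", 1, 0, 0), ("Aus", 0, 1, 1), ("Africa", 0, 0, 0),
    ("NZ", 1, 1, 0), ("Africa", 0, 0, 1), ("India", 1, 1, 1),
    ("Aus", 0, 0, 0), ("India", 1, 1, 0),
]

# Exclusions: drop the horses we would never bet on (Europe, as in the list).
kept = [r for r in RECORDS if r[0] != "Europe"]

# Validation needs records the model has not seen: hold out the last three.
train, holdout = kept[:-3], kept[-3:]


def predict(weights, features):
    """Logistic model: squash the weighted sum into a probability."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, steps=2000, rate=0.1):
    """Plain batch gradient descent on the log-loss."""
    weights = [0.0, 0.0, 0.0]  # intercept, long_legs, tl_below_half
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for _, legs, tl, won in rows:
            error = predict(weights, (legs, tl)) - won
            grad[0] += error
            grad[1] += error * legs
            grad[2] += error * tl
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


weights = fit(train)

# Validation: how often does a 50% cutoff agree with the held-out results?
hits = sum(
    (predict(weights, (legs, tl)) >= 0.5) == bool(won)
    for _, legs, tl, won in holdout
)
accuracy = hits / len(holdout)
print("weights:", [round(w, 2) for w in weights])
print("hold-out accuracy:", round(accuracy, 2))
```

Three held-out records are far too few to trust the accuracy figure; the point is only that the check is made on data the fit never saw.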
My future posts will cover each of these in detail.
-Venkat Reddy
Trendwise Analytics