Analytics: June 2012

What the FISH is Model Building?

Model building is like driving a car by looking in the rear view mirror all the time, hoping that the road in front of us will be the same as the road travelled so far.

Definition 1.1: Model Building: Predicting the probability of a future event using historical data

Example 1.1: (RGV ki aag)

A few months back, I forced my friend Suresh to come to a movie on the day of its release. I thought it would be a good movie because I knew the director; he had made some really good movies in the past. Unfortunately, that movie was a disaster. Two weeks later, I insisted that he come to another movie, and I bought the tickets myself this time ;). Well, disappointingly... this was a tragedy too.

Next month, I am going to ask him to another movie... what will his response be? Can you predict the probability of Suresh's YES/NO? How do you know the probability of Suresh saying yes to my proposal? You know what? You just built a model (predicting the probability of a future event using historical data).

Definition 1.2: Model Building: Predicting the probability of a future event by assigning appropriate weight to all the important factors/variables in historical data

Example 1.2: (John Cena vs. Venkat Reddy)

Metrics	John Cena	Venkat Reddy
Weight	242 lbs	133 lbs
Height	6' 1"	5' 8"
Age	34 Years	26 years
Occupation	Wrestler	Data Analyst
Time spent in Gym	> 20 Hours per week	< 20 minutes per week
Time spent in front of computer	< 2 hours per week	> 60 hours per week
Programming Skills	None	C++, SAS, SPSS, R

Wrestling match between them: What is the probability of Venkat winning the match?

A data mining challenge between them: What is the probability that Venkat will finish first?

When it comes to wrestling, height, weight and time spent in the gym matter the most; programming skills don't matter. But in data mining, statistical and other soft skills are important; it doesn't matter how much your body weighs. (Model building: predicting the probability of a future event by assigning appropriate weight to all the important factors/variables in historical data.)
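To make "assigning appropriate weight" concrete, here is a minimal sketch in Python. The attribute values come from the table above, with inequalities such as "> 20 hours" replaced by their bounds. The weights themselves are made up for illustration; they are not fitted from any data, so the resulting numbers only show how the same two people score differently once the weights change.

```python
# Attribute values from the table; bounds stand in for "> 20 hours" etc. (assumption).
contestants = {
    "John Cena": {"weight_lbs": 242, "height_in": 73, "gym_hours": 20.0,
                  "computer_hours": 2.0, "languages": 0},
    "Venkat Reddy": {"weight_lbs": 133, "height_in": 68, "gym_hours": 20 / 60,
                     "computer_hours": 60.0, "languages": 4},
}

# Hypothetical weights per contest; attributes left out carry zero weight.
task_weights = {
    "wrestling": {"weight_lbs": 0.3, "height_in": 0.2, "gym_hours": 0.5},
    "data mining": {"computer_hours": 0.4, "languages": 0.6},
}

def win_shares(weights):
    """Weighted sum of each contestant's share of every attribute."""
    scores = {}
    for name, attrs in contestants.items():
        score = 0.0
        for attr, weight in weights.items():
            total = sum(c[attr] for c in contestants.values())
            score += weight * attrs[attr] / total
        scores[name] = score
    total_score = sum(scores.values())
    return {name: s / total_score for name, s in scores.items()}

wrestling = win_shares(task_weights["wrestling"])
data_mining = win_shares(task_weights["data mining"])
print(wrestling)
print(data_mining)
```

Same people, same table, different weights: the wrestling weights favor Cena and the data mining weights favor Venkat.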

Example 1.3: Below is the win vs. loss record of horses after grilling the historical data of a particular horse racing track:

Long legs: 75% (horses with relatively long legs won 75% of the time)
Breed: A (Paint) 45%, B (Morgan) 25%, others 30%
T/L (tummy-to-length) ratio < 1/2: 75% (horses with a tummy-to-length ratio below 1/2 won 75% of the time)
Gender: Male 65%
Head size: Small 30%, Medium 45%, Large 25%
Country: Africa 65%

Now, given the historical data, which one of these two horses would you bet on?

Horse Name	Jigga	Rodeo
Length of legs	150 cm	110 cm
Breed	A	F
T/L ratio	0.3	0.6
Gender	Male	Female
Head size	Large	Small
Country	India	India

Obviously, everyone’s bet is on Jigga.
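Why is it so obvious? One quick way to see it is to count how many historically favored traits each horse has. The sketch below does exactly that. The 130 cm cut-off for "long legs" is my assumption (the record only says "relatively long"), and picking Medium as the favored head size simply follows the largest percentage in the record.

```python
LONG_LEGS_CM = 130  # assumption: the record never defines "relatively long"

def favored_trait_count(horse):
    """Count the attributes on which the horse matches the historical winners."""
    checks = [
        horse["legs_cm"] >= LONG_LEGS_CM,   # long legs: 75%
        horse["breed"] == "A",              # breed A has the largest share: 45%
        horse["tl_ratio"] < 0.5,            # T/L ratio below 1/2: 75%
        horse["gender"] == "Male",          # male: 65%
        horse["head"] == "Medium",          # medium head has the largest share: 45%
        horse["country"] == "Africa",       # Africa: 65%
    ]
    return sum(checks)

jigga = {"legs_cm": 150, "breed": "A", "tl_ratio": 0.3,
         "gender": "Male", "head": "Large", "country": "India"}
rodeo = {"legs_cm": 110, "breed": "F", "tl_ratio": 0.6,
         "gender": "Female", "head": "Small", "country": "India"}

print(favored_trait_count(jigga))  # 4
print(favored_trait_count(rodeo))  # 0
```

Jigga matches four of the six favored traits and Rodeo matches none, so the bet is easy here. It stops being easy when the horses are closer to each other.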

Given the historical data, how do you pick the best horse or the top 3 horses from the list below?

	Horse-1	Horse-2	Horse-3	Horse-4	Horse-5	Horse-6	Horse-7	Horse-8
Length of legs	109	114	134	130	149	120	104	117
Breed	C	A	B	A	F	K	L	B
T/L ratio	0.1	0.8	0.5	1	0.3	0.3	0.3	0.6
Gender	Male	Female	Male	Female	Male	Female	Male	Female
Head size	L	S	M	M	L	L	S	M
Country	Africa	India	Aus	NZ	Africa	Africa	India	India
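One naive way to rank these eight horses is to add up the historical win share of every trait a horse has and sort by the total. The sketch below treats all attributes as equally important, which is exactly the simplification a real model is supposed to remove. Assumptions on my side: the percentages are read as shares of wins, so the remainder goes to the other group (for example 25% for legs that are not long), "long legs" means 130 cm or more, and any breed other than A or B gets the combined "others" share.

```python
# Win shares in percent, read off the historical record.
BREED_SHARE = {"A": 45, "B": 25}
OTHER_BREEDS_SHARE = 30
HEAD_SHARE = {"S": 30, "M": 45, "L": 25}
LONG_LEGS_CM = 130  # assumption: "relatively long" is never defined

def score(horse):
    """Sum of the win shares of the horse's traits (equal weight per attribute)."""
    legs_cm, breed, tl_ratio, gender, head, country = horse
    total = 75 if legs_cm >= LONG_LEGS_CM else 25
    total += BREED_SHARE.get(breed, OTHER_BREEDS_SHARE)
    total += 75 if tl_ratio < 0.5 else 25
    total += 65 if gender == "Male" else 35
    total += HEAD_SHARE[head]
    total += 65 if country == "Africa" else 35
    return total

horses = {
    "Horse-1": (109, "C", 0.1, "Male", "L", "Africa"),
    "Horse-2": (114, "A", 0.8, "Female", "S", "India"),
    "Horse-3": (134, "B", 0.5, "Male", "M", "Aus"),
    "Horse-4": (130, "A", 1.0, "Female", "M", "NZ"),
    "Horse-5": (149, "F", 0.3, "Male", "L", "Africa"),
    "Horse-6": (120, "K", 0.3, "Female", "L", "Africa"),
    "Horse-7": (104, "L", 0.3, "Male", "S", "India"),
    "Horse-8": (117, "B", 0.6, "Female", "M", "India"),
}

ranked = sorted(horses, key=lambda name: score(horses[name]), reverse=True)
top3 = ranked[:3]
print(top3)  # ['Horse-5', 'Horse-1', 'Horse-3']
```

Under these assumptions Horse-5 comes out on top with a total of 335, followed by Horse-1 (285) and Horse-3 (270). Change the weights and the ranking can change, which is the whole point of what follows.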

We have historical data across all attributes (height, weight, gender, breed, age, win/loss percentage, etc.). Given a horse and its characteristics (the realized values of the above metrics), our objective is to predict its probability of winning the race.

Here is the approach

Scrutinize the historical data to identify the characteristics that winning horses have in common, then assign a weight to each of these characteristics/attributes based on its importance. In simple words, solve the equation below:

Probability of winning = W1(Att1) + W2(Att2) + W3(Att3) + W4(Att4) + … + Wk(Attk)

This is nothing but a model; we calculate these weights using optimization techniques (regression, in general).
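Here is a minimal sketch of how such weights could be calculated. It fits the weighted-sum equation by least squares using plain gradient descent, with an extra constant term W0 that the equation above does not show. The race history is synthetic: I made it up so that the true weights are known (0.2 baseline, 0.4 for long legs, 0.2 for male), and the attribute names are only for illustration. Real projects would typically use a statistics package and often logistic regression, so take this as a toy.

```python
# Synthetic (made-up) history: (long_legs, male, wins out of 5 races).
cells = [(0, 0, 1), (1, 0, 3), (0, 1, 2), (1, 1, 4)]

rows = []
for long_legs, male, wins in cells:
    for race in range(5):
        features = [1.0, float(long_legs), float(male)]  # leading 1.0 is the constant term
        outcome = 1.0 if race < wins else 0.0            # 1 = won, 0 = lost
        rows.append((features, outcome))

def fit_weights(rows, learning_rate=0.5, steps=2000):
    """Least-squares fit of outcome = W0 + W1*Att1 + W2*Att2 by gradient descent."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    for _ in range(steps):
        gradient = [0.0] * n_features
        for features, outcome in rows:
            predicted = sum(w * x for w, x in zip(weights, features))
            error = predicted - outcome
            for j, x in enumerate(features):
                gradient[j] += error * x / len(rows)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

weights = fit_weights(rows)
print([round(w, 3) for w in weights])
```

The fitted weights recover the values the synthetic data was built from, so a male horse with long legs gets a predicted winning probability of about 0.2 + 0.4 + 0.2 = 0.8.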

Here are the main steps in model building:

Horses from Europe have never won a race, so never bet on them (Exclusions)
I would never bet on a 'Breed-B, female horse from Australia' (Objective function / Bad definition)
Asian horses behave entirely differently from African horses (Segmentation)
Why not consider the length of the tail? (Variable selection techniques)
I think length of legs should be given more weight... no, the breed... no, the country... (Fitting the regression line)
I am betting on these three good horses; they got the maximum score. Are they going to win? (Validation)
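Three of these steps (exclusions, segmentation and validation) are easy to sketch in a few lines. The records, the field names and the every-third-row holdout rule below are all hypothetical; they only show where each step sits in the flow, not how it should be done on real data.

```python
from collections import defaultdict

# Hypothetical race records, invented for illustration.
records = [
    {"horse": "H1", "region": "Africa", "won": 1},
    {"horse": "H2", "region": "Asia", "won": 0},
    {"horse": "H3", "region": "Europe", "won": 0},
    {"horse": "H4", "region": "Africa", "won": 1},
    {"horse": "H5", "region": "Asia", "won": 1},
    {"horse": "H6", "region": "Europe", "won": 0},
    {"horse": "H7", "region": "Africa", "won": 0},
    {"horse": "H8", "region": "Asia", "won": 0},
]

def apply_exclusions(rows):
    """Exclusions: drop horses we would never bet on (here, European horses)."""
    return [r for r in rows if r["region"] != "Europe"]

def segment(rows):
    """Segmentation: group rows so each region can get its own model."""
    segments = defaultdict(list)
    for r in rows:
        segments[r["region"]].append(r)
    return dict(segments)

def holdout_split(rows, every=3):
    """Validation: keep every third row aside to check the model later."""
    train = [r for i, r in enumerate(rows) if i % every != 0]
    validation = [r for i, r in enumerate(rows) if i % every == 0]
    return train, validation

eligible = apply_exclusions(records)
segments = segment(eligible)
train, validation = holdout_split(eligible)
print(len(eligible), sorted(segments), len(train), len(validation))
```

The weights would be fitted on the train rows only, and the validation rows answer the last question in the list: are the horses with the maximum score really the ones that win?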

My future posts will cover each of these in detail.

I wish to thank my analytics guru, Gopal Prasad Malakar.

-Venkat Reddy

Trendwise Analytics


Monday, June 4, 2012
