Saturday, April 7, 2012



Don't trust outliers
Outlier and missing value treatment is indispensable in model building. In this post I am going to demonstrate how just a few outliers can screw up your final model.
I will prove my point in fewer than 10 straightforward points.
1)   out·li·er - /ˈoutˌər/  (Noun)
1.       An observation that is numerically distant from the rest of the data.
2.       A person or thing situated away or detached from the main body or system.
3.       A person or thing excluded from a group; an outsider.



2) Sample Datasets
The two datasets below have 34 observations each. I am going to build a regression model (Y on X) for each of them.
Dataset-1          Dataset-2
X       Y          X       Y
3409    2593       3409    2593
2130    3872       2130    3872
237     5773       2139    3864
3973    2037       3973    2037
393     5617       393     5617
3726    2281       3726    2281
1211    4794       1211    4794
3987    2014       3987    2014
3329    2680       3329    2680
4285    1721       4285    1721
3565    2439       3565    2439
3863    2146       3863    2146
1920    4083       1920    4083
2573    3431       2573    3431
2951    3051       2951    3051
1284    4725       1284    4725
2639    3371       2639    3371
3150    2851       3150    2851
620     5381       620     5381
3703    2303       3703    2303
3700    2305       3700    2305
1135    4874       1135    4874
2139    3864       5100    45000
3145    2857       3145    2857
561     5449       561     5449
4191    1814       4191    1814
631     5375       631     5375
1325    4683       1325    4683
2399    3606       2399    3606
2881    3121       2881    3121
3687    2323       3687    2323
1610    4395       1610    4395
1999    4002       1999    4002
564     5444       564     5444
                        
3) Regression Lines
·         Dataset-1: Regression line: Y = (-1) X + 6006.9. Inference: Y decreases when X increases.
·         Dataset-2: Regression line: Y = (1) X + 2137.8. Inference: Y increases when X increases.
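For readers who want to reproduce the two fits, here is a minimal sketch using plain Python. The function name fit_line and the variable names are my own choices, and the fit is ordinary least squares, which I am assuming is what produced the lines above. Dataset-2 is built from Dataset-1 by swapping the point (237, 5773) for the point (5100, 45000), which gives the same set of points as the table.

```python
# Dataset-1 as (X, Y) pairs, copied from the table in section 2.
DATASET_1 = [
    (3409, 2593), (2130, 3872), (237, 5773), (3973, 2037), (393, 5617),
    (3726, 2281), (1211, 4794), (3987, 2014), (3329, 2680), (4285, 1721),
    (3565, 2439), (3863, 2146), (1920, 4083), (2573, 3431), (2951, 3051),
    (1284, 4725), (2639, 3371), (3150, 2851), (620, 5381), (3703, 2303),
    (3700, 2305), (1135, 4874), (2139, 3864), (3145, 2857), (561, 5449),
    (4191, 1814), (631, 5375), (1325, 4683), (2399, 3606), (2881, 3121),
    (3687, 2323), (1610, 4395), (1999, 4002), (564, 5444),
]

# Dataset-2 holds the same points except that (237, 5773) is swapped
# for the outlier; row order does not affect a least-squares fit.
OUTLIER = (5100, 45000)
DATASET_2 = [p for p in DATASET_1 if p != (237, 5773)] + [OUTLIER]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope_1, intercept_1 = fit_line(DATASET_1)
slope_2, intercept_2 = fit_line(DATASET_2)
print(f"Dataset-1: y = {slope_1:.4f}x + {intercept_1:.1f}")
print(f"Dataset-2: y = {slope_2:.4f}x + {intercept_2:.1f}")
```

The only thing that differs between the two calls is a single point, yet the sign of the slope flips.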
4) Lines in action (Do you smell a Rat?)


5) Observations:
·         Yes, there is one outlier.
·         This is just one data point. How can one data point change the whole regression line?
·         In fact, Dataset-1 and Dataset-2 are the same except for one data point.
·         Yes, this outlier almost reversed the relation between X and Y.

6) What to do now?  (Outlier treatment)
·         There is no standard method for treating outliers; the right choice depends on the percentage of outliers in the data.
·         Flooring and capping are the most widely used methods. Sometimes we replace the outlier observation with the mean or median; sometimes we bin the variable to find the most suitable estimate for the outlier. (I will write about missing value and outlier treatment in my next post.)
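Flooring and capping can be sketched in a few lines. The post does not say which cutoffs to use, so the sketch below assumes Tukey-style fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR); percentile cutoffs are another common choice. The function name floor_and_cap and the parameter k are hypothetical, and the sample is only the first ten Y values of Dataset-2 plus the outlier, to keep the example short.

```python
import statistics


def floor_and_cap(values, k=1.5):
    """Clamp every value into [Q1 - k*IQR, Q3 + k*IQR].

    The fence rule is an assumption of this sketch, not a rule
    taken from the post.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    floor, cap = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, floor), cap) for v in values]


# First ten Y values of Dataset-2, followed by the outlier's Y value.
sample_y = [2593, 3872, 3864, 2037, 5617, 2281, 4794, 2014, 2680, 1721, 45000]
treated_y = floor_and_cap(sample_y)
print(treated_y)
```

Values inside the fences pass through unchanged; only the extreme value is pulled back to the cap.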
7) The outlier: here is the outlier observation (middle row), shown with its neighbouring rows in Dataset-2:
X       Y
1135    4874
5100    45000
3145    2857
·         We can replace the outlier's Y value with the mean of the rest of the Y values, which is 3500
·         Or replace it with the overall median of Y, which is 3401
·         Or replace it with the maximum of the rest of the Y values, which is 5617
·         We can also look at the 4500-5500 range in X and replace the outlier with the mean or median of that bin
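The first three candidate values can be checked with a short sketch. The variable names are my own; the numbers are the Y column of Dataset-2 from section 2.

```python
import statistics

# Y column of Dataset-2, in table order.
Y_DATASET_2 = [
    2593, 3872, 3864, 2037, 5617, 2281, 4794, 2014, 2680, 1721,
    2439, 2146, 4083, 3431, 3051, 4725, 3371, 2851, 5381, 2303,
    2305, 4874, 45000, 2857, 5449, 1814, 5375, 4683, 3606, 3121,
    2323, 4395, 4002, 5444,
]
OUTLIER_Y = 45000

# "The rest" means every Y value except the outlier.
rest = [y for y in Y_DATASET_2 if y != OUTLIER_Y]

mean_of_rest = statistics.mean(rest)
overall_median = statistics.median(Y_DATASET_2)  # outlier included
max_of_rest = max(rest)

print(f"mean of the rest: {mean_of_rest:.0f}")
print(f"overall median:   {overall_median:.0f}")
print(f"max of the rest:  {max_of_rest}")
```

Note that the median is computed over all 34 values: a single extreme point barely moves it, which is why it is a safe replacement even with the outlier left in.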

8) Regression Lines after outlier treatment
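·         Outlier replaced with the overall median (3401): y = -0.8871x + 5787.4
·         Outlier replaced with the max of the rest (5617): y = -0.7866x + 5593
·         Outlier replaced with the mean of the rest (3500): y = -0.8828x + 5779

None of the lines shows a positive slope.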
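Here is a rough sketch of the refit, plugging in the three replacement values quoted in section 7. I am assuming ordinary least squares again; fit_line, REPLACEMENTS and the other names are hypothetical helpers for this example.

```python
# Dataset-2 as (X, Y) pairs, in table order.
DATASET_2 = [
    (3409, 2593), (2130, 3872), (2139, 3864), (3973, 2037), (393, 5617),
    (3726, 2281), (1211, 4794), (3987, 2014), (3329, 2680), (4285, 1721),
    (3565, 2439), (3863, 2146), (1920, 4083), (2573, 3431), (2951, 3051),
    (1284, 4725), (2639, 3371), (3150, 2851), (620, 5381), (3703, 2303),
    (3700, 2305), (1135, 4874), (5100, 45000), (3145, 2857), (561, 5449),
    (4191, 1814), (631, 5375), (1325, 4683), (2399, 3606), (2881, 3121),
    (3687, 2323), (1610, 4395), (1999, 4002), (564, 5444),
]
OUTLIER = (5100, 45000)

# Replacement Y values as quoted in section 7.
REPLACEMENTS = {
    "mean of the rest": 3500,
    "overall median": 3401,
    "max of the rest": 5617,
}


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


fitted = {}
for label, new_y in REPLACEMENTS.items():
    # Keep the outlier's X, swap only its Y for the replacement value.
    treated = [(x, new_y) if (x, y) == OUTLIER else (x, y) for x, y in DATASET_2]
    fitted[label] = fit_line(treated)
    slope, intercept = fitted[label]
    print(f"{label}: y = {slope:.4f}x + {intercept:.1f}")
```

Only the Y value of the outlier is changed; its X value of 5100 stays in the data, which is why the slopes do not return exactly to -1.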

9) Concluding remarks
·         Just looking at the raw data does not give a complete picture of outliers, missing values and default data entries.
·         A scatter plot or frequency plot is crucial before starting model building.
·         Before building a model, one should take care of outliers, missing values, multicollinearity, heteroscedasticity, normality assumption tests, etc.
·         Last but not least, don't blindly interpret the regression coefficients in isolation (like I did above).


