Don't trust outliers
Outlier and missing value treatment is indispensable in model building. In this post I am going to demonstrate how a few outliers can screw up your final model.
I will prove my point in fewer than 10 straight, simple points.
1) out·li·er - /ˈoutˌlīər/ (noun)
1. An observation that is numerically distant from the rest of the data.
2. A person or thing situated away or detached from the main body or system.
3. A person or thing excluded from a group; an outsider.
2) Sample Datasets
The two datasets below have 34 observations each. I am going to build a regression model (Y on X) on each of them.
| Dataset-1 X | Dataset-1 Y | Dataset-2 X | Dataset-2 Y |
|---|---|---|---|
| 3409 | 2593 | 3409 | 2593 |
| 2130 | 3872 | 2130 | 3872 |
| 237 | 5773 | 2139 | 3864 |
| 3973 | 2037 | 3973 | 2037 |
| 393 | 5617 | 393 | 5617 |
| 3726 | 2281 | 3726 | 2281 |
| 1211 | 4794 | 1211 | 4794 |
| 3987 | 2014 | 3987 | 2014 |
| 3329 | 2680 | 3329 | 2680 |
| 4285 | 1721 | 4285 | 1721 |
| 3565 | 2439 | 3565 | 2439 |
| 3863 | 2146 | 3863 | 2146 |
| 1920 | 4083 | 1920 | 4083 |
| 2573 | 3431 | 2573 | 3431 |
| 2951 | 3051 | 2951 | 3051 |
| 1284 | 4725 | 1284 | 4725 |
| 2639 | 3371 | 2639 | 3371 |
| 3150 | 2851 | 3150 | 2851 |
| 620 | 5381 | 620 | 5381 |
| 3703 | 2303 | 3703 | 2303 |
| 3700 | 2305 | 3700 | 2305 |
| 1135 | 4874 | 1135 | 4874 |
| 2139 | 3864 | 5100 | 45000 |
| 3145 | 2857 | 3145 | 2857 |
| 561 | 5449 | 561 | 5449 |
| 4191 | 1814 | 4191 | 1814 |
| 631 | 5375 | 631 | 5375 |
| 1325 | 4683 | 1325 | 4683 |
| 2399 | 3606 | 2399 | 3606 |
| 2881 | 3121 | 2881 | 3121 |
| 3687 | 2323 | 3687 | 2323 |
| 1610 | 4395 | 1610 | 4395 |
| 1999 | 4002 | 1999 | 4002 |
| 564 | 5444 | 564 | 5444 |
3) Regression Lines
· Dataset-1: Regression line: Y = (-1)X + 6006.9. Inference: Y decreases when X increases.
· Dataset-2: Regression line: Y = (1)X + 2137.8. Inference: Y increases when X increases.
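The two fits can be reproduced with a few lines of code. What follows is a minimal sketch in plain Python, not necessarily the tool used for this post; `fit_line` is a made-up helper name, and it applies the textbook least-squares formulas for a single predictor.

```python
# Minimal sketch: ordinary least squares of Y on X for both datasets.
# fit_line is a hypothetical helper written for this example.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# The 33 observations that appear in both datasets, as (X, Y) pairs.
shared = [
    (3409, 2593), (2130, 3872), (2139, 3864), (3973, 2037), (393, 5617),
    (3726, 2281), (1211, 4794), (3987, 2014), (3329, 2680), (4285, 1721),
    (3565, 2439), (3863, 2146), (1920, 4083), (2573, 3431), (2951, 3051),
    (1284, 4725), (2639, 3371), (3150, 2851), (620, 5381), (3703, 2303),
    (3700, 2305), (1135, 4874), (3145, 2857), (561, 5449), (4191, 1814),
    (631, 5375), (1325, 4683), (2399, 3606), (2881, 3121), (3687, 2323),
    (1610, 4395), (1999, 4002), (564, 5444),
]

# Each dataset adds one observation of its own to the shared 33.
dataset1 = shared + [(237, 5773)]
dataset2 = shared + [(5100, 45000)]

slope1, intercept1 = fit_line(*zip(*dataset1))
slope2, intercept2 = fit_line(*zip(*dataset2))

print(f"Dataset-1: Y = {slope1:.4f} X + {intercept1:.1f}")
print(f"Dataset-2: Y = {slope2:.4f} X + {intercept2:.1f}")
```

The order of the rows does not matter for the fit, so the shared points are listed once and each dataset only appends its own extra observation.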
4) Lines in action
(Do you smell a rat?)
5) Observations:
· Yes, there is one outlier.
· This is just one data point. How can one data point change the whole regression line?
· In fact, Dataset-1 and Dataset-2 are the same except for one data point.
· Yes, this outlier almost reversed the relation between X and Y.
6) What to do now? (Outlier treatment)
· There is no standard method for treating outliers; the right choice depends on the percentage of outliers in the data.
· Flooring and capping (replacing values below a lower threshold, or above an upper one, with the threshold itself) are the most widely used methods. Sometimes we replace the outlier observation with the mean or median, and sometimes we bin the variables to find the most suitable estimate for the outlier. (I will write about missing value and outlier treatment in my next blog post.)
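Flooring and capping can be sketched as follows. This is only an illustration: the 5th and 95th percentiles (nearest-rank definition) are assumed as the floor and cap, and the helper names `percentile` and `floor_and_cap` are made up for the example. In practice the thresholds are a judgment call.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def floor_and_cap(values, lower_pct=5, upper_pct=95):
    """Pull values below the floor up to it and values above the cap down to it."""
    low = percentile(values, lower_pct)
    high = percentile(values, upper_pct)
    return [min(max(v, low), high) for v in values]

# Y values of Dataset-2, including the suspicious 45000.
y = [
    2593, 3872, 3864, 2037, 5617, 2281, 4794, 2014, 2680, 1721, 2439, 2146,
    4083, 3431, 3051, 4725, 3371, 2851, 5381, 2303, 2305, 4874, 45000, 2857,
    5449, 1814, 5375, 4683, 3606, 3121, 2323, 4395, 4002, 5444,
]

treated = floor_and_cap(y)
print("before:", min(y), max(y))
print("after: ", min(treated), max(treated))
```

Note that flooring and capping never drops an observation; it only moves extreme values to the chosen thresholds, so the sample size stays at 34.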
7) The outlier:
Here is the outlier observation (middle row), shown with its neighbouring rows from Dataset-2:
| X | Y |
|---|---|
| 1135 | 4874 |
| 5100 | 45000 |
| 3145 | 2857 |
· We can replace the outlier's Y value with the mean Y of the rest of the observations, which is 3500.
· Or replace it with the overall median, which is 3401.
· Or replace it with the maximum of the rest of the observations, which is 5617.
· We can also look at the 4500-5500 range in X and replace the outlier with the mean or median of Y in this bin (in this sample no other observation falls in that X range, so the bin would have to be widened).
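The first three replacement values can be checked, and the line refitted after each treatment, with a short script. This is a sketch under one assumption: only the outlier's Y value is replaced and its X value (5100) is kept. `fit_line` is a hypothetical helper using the textbook least-squares formulas.

```python
from statistics import mean, median

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Dataset-2 in row order.
x = [
    3409, 2130, 2139, 3973, 393, 3726, 1211, 3987, 3329, 4285, 3565, 3863,
    1920, 2573, 2951, 1284, 2639, 3150, 620, 3703, 3700, 1135, 5100, 3145,
    561, 4191, 631, 1325, 2399, 2881, 3687, 1610, 1999, 564,
]
y = [
    2593, 3872, 3864, 2037, 5617, 2281, 4794, 2014, 2680, 1721, 2439, 2146,
    4083, 3431, 3051, 4725, 3371, 2851, 5381, 2303, 2305, 4874, 45000, 2857,
    5449, 1814, 5375, 4683, 3606, 3121, 2323, 4395, 4002, 5444,
]

i = y.index(max(y))        # position of the outlier (Y = 45000)
rest = y[:i] + y[i + 1:]   # Y values of the other 33 observations

candidates = {
    "mean of the rest": mean(rest),
    "overall median": median(y),
    "max of the rest": max(rest),
}

fits = {}
for name, value in candidates.items():
    # Assumption: swap in the new Y value and leave X untouched.
    treated = y[:i] + [value] + y[i + 1:]
    fits[name] = fit_line(x, treated)
    slope, intercept = fits[name]
    print(f"{name}: Y -> {value:.0f}, line: y = {slope:.4f}x + {intercept:.1f}")
```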
8) Regression Lines after outlier treatment
· y = -0.8871x + 5787.4
· y = -0.7866x + 5593
· y = -0.8828x + 5779
None of these lines has a positive coefficient.
9) Concluding remarks
· Just by looking at the data we will not get the complete picture of outliers, missing values and default data entries.
· A scatter plot or frequency plot is crucial before starting model building.
· Before building a model, one should take care of outliers, missing values, multicollinearity, heteroscedasticity, normality assumption tests, etc.
· Last but not least, don't just blindly interpret the regression coefficients in isolation (like I did above).