Analytics

Don't trust outliers

Treating outliers and missing values is indispensable in model building. In this post I am going to demonstrate how a few outliers, or even a single one, can wreck your final model.

I will prove my point in fewer than 10 straightforward points.

1) out·li·er - /ˈoutˌlīər/ (Noun)

1. An outlier is an observation that is numerically distant from the rest of the data.

2. A person or thing situated away or detached from the main body or system.

3. A person or thing excluded from a group; an outsider.

2) Sample Datasets

The two data sets below have 34 observations each. I am going to build a regression model (Y on X) on each of them.

Dataset-1		Dataset-2
X	Y	X	Y
3409	2593	3409	2593
2130	3872	2130	3872
237	5773	2139	3864
3973	2037	3973	2037
393	5617	393	5617
3726	2281	3726	2281
1211	4794	1211	4794
3987	2014	3987	2014
3329	2680	3329	2680
4285	1721	4285	1721
3565	2439	3565	2439
3863	2146	3863	2146
1920	4083	1920	4083
2573	3431	2573	3431
2951	3051	2951	3051
1284	4725	1284	4725
2639	3371	2639	3371
3150	2851	3150	2851
620	5381	620	5381
3703	2303	3703	2303
3700	2305	3700	2305
1135	4874	1135	4874
2139	3864	5100	45000
3145	2857	3145	2857
561	5449	561	5449
4191	1814	4191	1814
631	5375	631	5375
1325	4683	1325	4683
2399	3606	2399	3606
2881	3121	2881	3121
3687	2323	3687	2323
1610	4395	1610	4395
1999	4002	1999	4002
564	5444	564	5444

3) Regression Lines

· Dataset-1: Regression line: Y = (-1) X + 6006.9. Inference: Y decreases when X increases.

· Dataset-2: Regression line: Y = (1) X + 2137.8. Inference: Y increases when X increases.
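For readers who want to reproduce the two lines, here is a minimal sketch of an ordinary least squares fit in plain Python. The data is copied from the table; the names `shared`, `dataset1`, `dataset2` and `fit_line` are my own for illustration, and storing the rows as 33 shared points plus one differing point is just a convenience (row order does not affect the fit).

```python
# Points that appear in both datasets, as (X, Y) pairs copied from the table.
shared = [
    (3409, 2593), (2130, 3872), (3973, 2037), (393, 5617), (3726, 2281),
    (1211, 4794), (3987, 2014), (3329, 2680), (4285, 1721), (3565, 2439),
    (3863, 2146), (1920, 4083), (2573, 3431), (2951, 3051), (1284, 4725),
    (2639, 3371), (3150, 2851), (620, 5381), (3703, 2303), (3700, 2305),
    (1135, 4874), (2139, 3864), (3145, 2857), (561, 5449), (4191, 1814),
    (631, 5375), (1325, 4683), (2399, 3606), (2881, 3121), (3687, 2323),
    (1610, 4395), (1999, 4002), (564, 5444),
]
dataset1 = shared + [(237, 5773)]     # the row found only in Dataset-1
dataset2 = shared + [(5100, 45000)]   # the row found only in Dataset-2


def fit_line(points):
    """Ordinary least squares fit of Y on X; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope1, intercept1 = fit_line(dataset1)
slope2, intercept2 = fit_line(dataset2)
print(f"Dataset-1: Y = {slope1:.4f} X + {intercept1:.1f}")
print(f"Dataset-2: Y = {slope2:.4f} X + {intercept2:.1f}")
```

The printed coefficients should land close to the two lines quoted above, with a negative slope for Dataset-1 and a positive one for Dataset-2.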

4) Lines in action (Do you smell a Rat?)

5) Observations:

· Yes, there is one outlier.

· This is just one data point. How can a single data point change the whole regression line?

· In fact, Dataset-1 and Dataset-2 are the same except for one data point.

· Yes, this one outlier flipped the sign of the relationship between X and Y.

6) What to do now? (Outlier treatment)

· There is no single standard method for treating outliers; the choice depends on the percentage of outliers in the data.

· Flooring and capping are the most widely used methods. Sometimes we replace the outlier observation with the mean or median; sometimes we bin the variables to find the most suitable estimate for the outlier. (I will write about missing value and outlier treatment in my next post.)
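As a rough illustration of flooring and capping, the sketch below clamps the Y values of Dataset-2 to a lower and an upper percentile. The 5th and 95th percentile cut-offs and the nearest-rank percentile definition are assumptions chosen for the example, not a recommendation from this post; the helper names `percentile` and `cap_and_floor` are hypothetical.

```python
# Y values of Dataset-2 in table order (includes the 45000 outlier).
y_values = [
    2593, 3872, 3864, 2037, 5617, 2281, 4794, 2014, 2680, 1721, 2439, 2146,
    4083, 3431, 3051, 4725, 3371, 2851, 5381, 2303, 2305, 4874, 45000, 2857,
    5449, 1814, 5375, 4683, 3606, 3121, 2323, 4395, 4002, 5444,
]


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (one common definition)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def cap_and_floor(values, lower, upper):
    """Raise values below `lower` to it (flooring), cut values above `upper` to it (capping)."""
    return [min(max(v, lower), upper) for v in values]


floor_at = percentile(y_values, 5)    # assumed lower cut-off
cap_at = percentile(y_values, 95)     # assumed upper cut-off
treated = cap_and_floor(y_values, floor_at, cap_at)
print("floor:", floor_at, "cap:", cap_at)
print("largest Y after treatment:", max(treated))
```

With these assumed cut-offs the extreme value is pulled down to the cap, at the cost of also nudging the smallest legitimate value up to the floor.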

7) The outlier: here is the outlier observation (X, Y), shown with its neighbouring rows from Dataset-2,

1135	4874
5100	45000
3145	2857

· We can replace the outlier's Y value with the mean Y of the rest of the observations, which is 3500

· Or replace it with the overall median of Y, which is 3401

· Or replace it with the maximum Y of the rest of the observations, which is 5617

· We can also look at the 4500-5500 range in X and replace the outlier with the mean or median of Y in that bin

8) Regression Lines after outlier treatment

· Median replacement (Y = 3401): y = -0.8871x + 5787.4

· Maximum replacement (Y = 5617): y = -0.7866x + 5593

· Mean replacement (Y = 3500): y = -0.8828x + 5779

None of the lines shows a positive coefficient.
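The treated fits can be reproduced with a short sketch like the one below. It swaps the outlier's Y value for the mean of the rest, the overall median, or the maximum of the rest, and refits each time. The variable and function names are my own, and the code assumes only Y is replaced while X stays at 5100.

```python
from statistics import mean, median

# The 33 non-outlier rows of Dataset-2, as (X, Y) pairs copied from the table.
rest = [
    (3409, 2593), (2130, 3872), (2139, 3864), (3973, 2037), (393, 5617),
    (3726, 2281), (1211, 4794), (3987, 2014), (3329, 2680), (4285, 1721),
    (3565, 2439), (3863, 2146), (1920, 4083), (2573, 3431), (2951, 3051),
    (1284, 4725), (2639, 3371), (3150, 2851), (620, 5381), (3703, 2303),
    (3700, 2305), (1135, 4874), (3145, 2857), (561, 5449), (4191, 1814),
    (631, 5375), (1325, 4683), (2399, 3606), (2881, 3121), (3687, 2323),
    (1610, 4395), (1999, 4002), (564, 5444),
]
outlier_x, outlier_y = 5100, 45000


def fit_line(points):
    """Ordinary least squares fit of Y on X; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rest_y = [y for _, y in rest]
replacements = {
    "mean of the rest": mean(rest_y),
    "overall median": median(rest_y + [outlier_y]),   # median including the outlier
    "max of the rest": max(rest_y),
}

# Refit with the outlier's Y swapped for each candidate replacement value.
results = {}
for label, new_y in replacements.items():
    results[label] = fit_line(rest + [(outlier_x, new_y)])
    slope, intercept = results[label]
    print(f"{label} (Y = {new_y:.0f}): y = {slope:.4f}x + {intercept:.1f}")
```

Whichever replacement is used, the slope should come out negative again, in line with the untreated Dataset-1 fit.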

9) Concluding remarks

· Just by looking at the raw data we will not get a complete picture of outliers, missing values and default data entries

· A scatter plot or frequency plot is crucial before starting to build the model

· Before building a model, one should take care of outliers, missing values, multicollinearity, heteroscedasticity, normality assumption tests, etc.

· Last but not least, don’t blindly interpret the regression coefficients in isolation (like I did above).
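A plot is the best first check, but a quick numeric screen can complement it. The sketch below flags Y values of Dataset-2 that fall outside the usual 1.5 x IQR fences. This rule and its 1.5 multiplier are a common convention I am assuming here, not something the analysis above relied on, and a one-variable screen like this will not catch every kind of outlier.

```python
from statistics import quantiles

# Y values of Dataset-2 in table order.
y_values = [
    2593, 3872, 3864, 2037, 5617, 2281, 4794, 2014, 2680, 1721, 2439, 2146,
    4083, 3431, 3051, 4725, 3371, 2851, 5381, 2303, 2305, 4874, 45000, 2857,
    5449, 1814, 5375, 4683, 3606, 3121, 2323, 4395, 4002, 5444,
]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(y_values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # assumed multiplier of 1.5
upper_fence = q3 + 1.5 * iqr

flagged = [y for y in y_values if y < lower_fence or y > upper_fence]
print("flagged Y values:", flagged)
```

On this data the only value outside the fences is the 45000 observation.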


Saturday, April 7, 2012
