Inspired by this. Read it first: http://www.analyticsvidhya.com/blog/2013/06/art-structured-thinking-analyzing/
- Figure out the questions involved in the analytics project. Decide which ones can be tackled
separately, which are intertwined with others, and which must be answered before others can be
tackled. Then pick one.
0.5 Lay out the data requirements and hypotheses before looking at what data is available.
- Actually look at the data summary (dataframe.describe()), which includes count, mean, std, min,
max, and quartiles for numeric columns. The mode is not included; get it with dataframe.mode().
- Look for patterns in the summary. Think about what each value means for your question. What
new questions do they lead to? How do they modify your question?
- Figure out the ML problem. Use this.
- Go back to steps 1 and 2 and redo them with the ML problem in mind.
- See whether you have enough data (signal vs. noise), or whether you need more samples or more
features (see http://scikit-learn.org/stable/modules/feature_selection.html).
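A minimal sketch of the "look at the data summary" step, assuming pandas is available. The DataFrame and its column names (`age`, `income`) are made up for illustration; substitute your own data.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "age": [22, 35, 35, 41, 58],
    "income": [28000, 52000, 52000, 61000, 97000],
})

# describe() reports count, mean, std, min, the quartiles (25%, 50%, 75%) and max.
summary = df.describe()
print(summary)

# The mode is not part of describe() for numeric columns, so compute it separately.
# mode() can return several rows when there are ties; take the first one here.
modes = df.mode().iloc[0]
print(modes)
```

Comparing the mean against the 50% row (the median) is a quick way to spot skew, which often leads to the follow-up questions the step above asks for.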
First model building time split:
1. Descriptive analysis of the data – 50% time
2. Data treatment (missing value and outlier fixing) – 40% time
3. Data modelling – 4% time
4. Estimation of performance – 6% time
Data Exploration steps:
Source Reference: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
Below are the steps involved to understand, clean and prepare your data for building your predictive model:
1. Variable identification
2. Univariate analysis
3. Bivariate analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
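The steps above can be sketched with pandas roughly as follows. This is an illustration only: the data, the column names, the choice of `churned` as the target, and the log transform are all assumptions, and steps 4 and 5 are left out of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "age": [22, 35, 35, 41, 58, 29],
    "income": [28000, 52000, 52000, 61000, 97000, 40000],
    "segment": ["a", "b", "b", "a", "b", "a"],
    "churned": [0, 1, 0, 0, 1, 0],
})

# 1. Variable identification: pick the target, then split predictors by type.
target = "churned"
predictors = df.drop(columns=[target])
numeric_cols = predictors.select_dtypes(include="number").columns.tolist()
categorical_cols = predictors.select_dtypes(exclude="number").columns.tolist()

# 2. Univariate analysis: look at one variable at a time.
age_summary = df["age"].describe()
segment_counts = df["segment"].value_counts()

# 3. Bivariate analysis: look at pairs of variables.
age_income_corr = df["age"].corr(df["income"])            # numeric vs. numeric
churn_by_segment = df.groupby("segment")[target].mean()   # categorical vs. target

# 6. Variable transformation: a log transform is one common way to reduce right skew.
df["log_income"] = np.log(df["income"])

# 7. Variable creation: derive a new feature from existing ones.
df["income_per_year_of_age"] = df["income"] / df["age"]
```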
Missing Value Treatment:
1. Deletion
2. Mean / mode / median imputation
3. Prediction model
4. KNN imputation
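A minimal pandas sketch of the first two options (deletion and mean/mode/median imputation). The data and column names are hypothetical, and the choice of median for the numeric column and mode for the categorical one is an assumption, not a rule. Prediction-model and KNN imputation are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0, 58.0],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# 1. Deletion: drop every row that has at least one missing value.
dropped = df.dropna()

# 2. Mean / mode / median imputation: fill gaps with a column-level statistic.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())           # numeric: median
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])  # categorical: mode
```

Deletion is simple but shrinks the sample (here from 5 rows to 3), while imputation keeps every row at the cost of reducing the variance of the filled column.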
Outlier Treatment (start by identifying the cause of the outlier):
1. Data entry error
2. Measurement error
3. Experimental error
4. Intentional outlier
5. Data processing error
6. Sampling error
7. Natural outlier
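Whatever the cause, outliers have to be found before they can be treated. The sketch below flags them with the 1.5 × IQR rule, a common convention that these notes do not prescribe, and then shows two possible treatments (dropping and capping). The series and its name are hypothetical.

```python
import pandas as pd

# Hypothetical measurements; 95.0 is the suspicious value.
values = pd.Series(
    [12.0, 13.0, 14.0, 15.0, 15.0, 16.0, 17.0, 18.0, 95.0],
    name="hypothetical_measurement",
)

# Flag points that fall more than 1.5 * IQR outside the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# Treatment option A: drop the flagged points (reasonable for entry or processing errors).
trimmed = values[~is_outlier]

# Treatment option B: cap the flagged points at the fences (keeps the sample size).
capped = values.clip(lower=lower, upper=upper)
```

A natural outlier is usually worth keeping or modelling separately, so neither option should be applied before checking which of the causes above is the likely one.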