Feature engineering using Spark
By Chris M Wang
View the Zeppelin notebook here: https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL213ODY2L2hvdXNlLXByaWNlcy9tYXN0ZXIvbm90ZS5qc29u
Since there is a mix of numerical (integer) and categorical (string) features, the following feature shaping techniques are identified: (1) One-Hot-Encoding and (2) String-Indexing for the categorical features, and (4) Normalization and (5) PCA for the numerical features.
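As a rough sketch (not the notebook's exact code), techniques (1) and (2) map onto Spark ML's StringIndexer and OneHotEncoder stages. The column name Neighborhood is used here only as an example categorical column:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// (2) String-Indexing: map a categorical string column to a numeric index
val indexer = new StringIndexer()
  .setInputCol("Neighborhood")        // example categorical column
  .setOutputCol("NeighborhoodIndex")
  .setHandleInvalid("keep")           // keep unseen categories instead of failing

// (1) One-Hot-Encoding: expand the index into a sparse 0/1 vector so the
// linear model does not treat category indices as ordered magnitudes
val encoder = new OneHotEncoder()
  .setInputCol("NeighborhoodIndex")
  .setOutputCol("NeighborhoodVec")
```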
Also, the features need to be assembled into a single vector column before being fed into the model; in Spark ML this is done with (3) the VectorAssembler.
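A minimal sketch of the assembly step, assuming the one-hot-encoded column from above plus two numeric columns (GrLivArea, OverallQual) chosen purely for illustration:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// (3) Collect the shaped columns into the single vector column Spark ML models expect
val assembler = new VectorAssembler()
  .setInputCols(Array("NeighborhoodVec", "GrLivArea", "OverallQual"))
  .setOutputCol("features")
```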
All of the above-mentioned feature shaping techniques meet the minimal requirement for the linear regression to function properly.
As the label (SalePrice) is continuous, a regression algorithm is required.
Linear Regression is selected for its speed and simplicity.
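A sketch of how the model could be wired together with the stages above using Spark ML's Pipeline API; the df variable, the 80/20 split, and the seed are assumptions rather than the notebook's settings:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setLabelCol("SalePrice")
  .setFeaturesCol("features")

// Chain the feature-shaping stages and the regressor into one pipeline
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, lr))

// df is the training DataFrame loaded elsewhere; an 80/20 split is assumed here
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
val model = pipeline.fit(train)
```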
The evaluation achieves an R2 of 0.9063243754152835, which can certainly be improved.
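An R2 score like this can be computed with Spark ML's RegressionEvaluator; this sketch reuses the model and test split from the previous snippet:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Score the held-out split and compute R2 on the predictions
val predictions = model.transform(test)

val r2 = new RegressionEvaluator()
  .setLabelCol("SalePrice")
  .setPredictionCol("prediction")
  .setMetricName("r2")
  .evaluate(predictions)

println(s"R2 on the held-out split: $r2")
```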
Below is how the evaluation metric R2 changes as the feature engineering techniques are applied cumulatively.
| Feature Engineering Techniques | R2 Score |
|---|---|
| 1+2+3 | 0.9063243754152835 |
| 1+2+3+4 | 0.6399090091007191 |
| 1+2+3+4+5 | 0.22497359098890068 |
The categorical (string) features benefit the most from (1) One-Hot-Encoding and (2) String-Indexing.
The numerical (integer) features benefit the most from (4) Normalization and (5) PCA.
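For reference, a sketch of how techniques (4) and (5) could be layered on top of the assembled features column; the p-norm and k values below are illustrative choices, not the notebook's settings:

```scala
import org.apache.spark.ml.feature.{Normalizer, PCA}

// (4) Normalization: rescale each assembled feature vector to unit L2 norm
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)

// (5) PCA: project the normalized vectors onto k principal components
val pca = new PCA()
  .setInputCol("normFeatures")
  .setOutputCol("pcaFeatures")
  .setK(10)   // illustrative choice of k
```

In a pipeline, these stages would sit between the assembler and the regressor, with the regressor's setFeaturesCol pointed at the last output column (pcaFeatures).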