House Prices

By Chris M Wang

Usage

Since the data contains a mix of numerical (integer) and categorical (string) features, the following feature shaping techniques are identified:

Also, the features need to be assembled into vectors before being fed into the model, by using:

All the above-mentioned feature shaping techniques meet the minimal requirements for linear regression to function properly.
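The shaping steps named in this document (string indexing, one-hot encoding, vector assembly) can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the toy rows, the column names `LotArea` and `Street`, and the helper functions `string_index` and `one_hot` are all hypothetical.

```python
# Hypothetical toy rows; column names and values are illustrative only.
rows = [
    {"LotArea": 8450, "Street": "Pave"},
    {"LotArea": 9600, "Street": "Grvl"},
    {"LotArea": 11250, "Street": "Pave"},
]


def string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# String indexing: "Pave" appears twice, so it gets index 0.
street_index = string_index([r["Street"] for r in rows])

# Vector assembly: concatenate the numerical value and the one-hot block
# into a single feature vector per row.
features = [
    [float(r["LotArea"])] + one_hot(street_index[r["Street"]], len(street_index))
    for r in rows
]
```

Each entry of `features` is a flat numeric vector, which is the form a linear regression expects as input.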

As the label (SalePrice) is continuous, a regression algorithm is required.
Linear Regression is selected for its speed and simplicity.
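To make the choice concrete, here is a minimal ordinary-least-squares sketch using NumPy. It is an assumption-laden illustration rather than the model code used for the reported results: the functions `fit_linear_regression` and `predict` and the toy data are hypothetical.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Ordinary least squares: return (weights, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the intercept is learned with the weights.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1], coef[-1]


def predict(X, weights, intercept):
    """Apply the fitted linear model to a feature matrix."""
    return np.asarray(X, dtype=float) @ weights + intercept


# Hypothetical toy data generated from y = 2*x0 + 3*x1 + 5.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [7.0, 8.0, 10.0, 12.0]
weights, intercept = fit_linear_regression(X, y)
```

On this noise-free toy data the fit recovers weights close to 2 and 3 and an intercept close to 5.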

The evaluation achieves an R2 of 0.9063243754152835, which leaves room for improvement.
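For reference, R2 (the coefficient of determination) is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Python; the function name `r2_score` and the sample prices are hypothetical and are not taken from the evaluation above.

```python
def r2_score(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices, for illustration only.
y_true = [200000.0, 150000.0, 300000.0]
mean_price = sum(y_true) / len(y_true)

perfect = r2_score(y_true, y_true)             # predictions match exactly
baseline = r2_score(y_true, [mean_price] * 3)  # always predict the mean
```

A perfect prediction gives an R2 of 1, and always predicting the mean gives 0, so a score near 0.9 means the model explains most of the variance in the label.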

(1) One-Hot-Encoding: Refer to Part 1
(2) String-Indexing: Refer to Part 1
(3) Vector-Assembly: Refer to Part 1
(4) Normalization: The numerical features have very different semantic meanings and widely varying means and variances. Normalization helps to equalize the influence of the features on the model.
(5) PCA: The original data has ~80 columns. PCA reduces the dimensionality of the data, with the aim of improving the performance of the model.
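Normalization and PCA can be sketched with NumPy as follows. This is a hedged illustration: it assumes z-score standardization (zero mean, unit variance per column), which may differ from the normalization actually used here, and the functions `standardize` and `pca` as well as the toy matrix are hypothetical.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns unscaled
    return (X - mean) / std


def pca(X, k):
    """Project centered data onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return Xc @ Vt[:k].T, explained[:k]


# Hypothetical toy matrix whose columns are on very different scales.
X = [
    [8450.0, 7.0, 2003.0],
    [9600.0, 6.0, 1976.0],
    [11250.0, 7.0, 2001.0],
    [9550.0, 7.0, 1915.0],
    [14260.0, 8.0, 2000.0],
]
Z = standardize(X)
reduced, ratio = pca(Z, k=2)
```

Here `ratio` holds the fraction of variance kept by each retained component. Checking it before choosing `k` shows how much information the reduction discards.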

Below is how the evaluation metric R2 changes as the feature engineering techniques are applied. In this table, R2 decreases as Normalization and PCA are added.

Feature Engineering Techniques	R2 Score
1+2+3	0.9063243754152835
1+2+3+4	0.6399090091007191
1+2+3+4+5	0.22497359098890068

The categorical (string) features benefit the most from (1) One-Hot-Encoding and (2) String-Indexing.

The numerical (integer) features benefit the most from (4) Normalization and (5) PCA.