Data Processing and Feature Extraction Approaches

Trial 1:

  • Dropped 'Id'.
  • One-hot encoded all non-numerical features.
  • Replaced NaN in all non-numerical features with 'None' before one-hot encoding them.
  • Filled NaN in all numerical features with the mean of the column.
  • Shifted year data to be relative to the minimum of that column.
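The Trial 1 steps can be sketched in pandas roughly as follows. This is only an illustration: `preprocess_trial_1` is a hypothetical helper, and the year column names ('YearBuilt', 'YrSold') and the toy columns are assumptions, since the notes do not list which columns were treated as years.

```python
import numpy as np
import pandas as pd

def preprocess_trial_1(df, year_cols=("YearBuilt", "YrSold")):
    """Hypothetical helper mirroring the Trial 1 bullets."""
    df = df.drop(columns=["Id"])
    # Shift each year column so its earliest year becomes 0.
    for col in year_cols:
        df[col] = df[col] - df[col].min()
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)
    # Numerical NaN -> column mean; non-numerical NaN -> the string 'None'.
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    df[cat_cols] = df[cat_cols].fillna("None")
    # One-hot encode the non-numerical columns.
    return pd.get_dummies(df, columns=list(cat_cols))

# Toy frame (assumed columns) standing in for the real training data.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "YearBuilt": [2000, 1990, 2010],
    "LotFrontage": [60.0, np.nan, 80.0],
    "Alley": ["Pave", np.nan, "Grvl"],
})
out = preprocess_trial_1(toy, year_cols=("YearBuilt",))
```

In this sketch the missing 'Alley' value becomes its own 'Alley_None' indicator column, while the missing 'LotFrontage' value is replaced by the column mean.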

Problems:

  • The data contains outliers.
  • Some numerical features are actually categorical.
  • Filling numerical data with the mean is not a good approach because:
    • A numerical feature usually contains NaN because the house does not have that feature.
    • Outliers affect the mean greatly.
  • The target column 'SalePrice' is not normally distributed.
  • Highly correlated features have a repeated impact on the model.
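To illustrate the point about outliers and means, here is a small sketch with made-up living-area values (not taken from the dataset). A single extreme value moves the mean far more than the median:

```python
import numpy as np

# Made-up living areas; the last value plays the role of an outlier.
areas = np.array([1200.0, 1500.0, 1700.0, 1600.0])
with_outlier = np.append(areas, 5600.0)

# How far each statistic moves once the outlier is included.
mean_shift = with_outlier.mean() - areas.mean()
median_shift = np.median(with_outlier) - np.median(areas)
print(mean_shift, median_shift)
```

Any NaN filled with the mean would inherit that shift, which is one reason mean imputation was dropped in the later trials.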

Trial 2:

  • One-hot encoded all categorical features.
  • Normalized the SalePrice distribution toward a normal curve by taking the log:
train['LogSalePrice'] = np.log(train['SalePrice'])

Use:

train['SalePrice'] = np.exp(train['LogSalePrice'])

to return to the original distribution.

  • Removed one feature from each set of features with a correlation above 0.8, based on the distribution graph.
    Of the two, the feature with the highest correlation with 'SalePrice' is removed.
  • Filled NaN in all numerical features with 0.
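A minimal sketch of the correlation filter and the log round trip, under stated assumptions: `drop_correlated` is a hypothetical helper, the toy columns are invented, and the removal rule follows the bullet as written (of a correlated pair, the feature more correlated with 'SalePrice' is dropped). Flipping the comparison would keep that feature instead.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, target="SalePrice", threshold=0.8):
    """Hypothetical helper: for each feature pair correlated above
    `threshold`, drop the one more correlated with `target`."""
    corr = df.corr().abs()
    features = [c for c in df.columns if c != target]
    dropped = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # Rule as stated in the notes: remove the more target-correlated one.
                more_correlated = a if corr.loc[a, target] >= corr.loc[b, target] else b
                dropped.add(more_correlated)
    return df.drop(columns=sorted(dropped))

# Toy numeric frame (assumed columns and values).
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 3],
    "GarageArea": [250, 480, 500, 700, 760],
    "LotArea": [9000, 7000, 11000, 8000, 10000],
    "SalePrice": [120000, 180000, 200000, 260000, 300000],
})
reduced = drop_correlated(toy)

# Log transform of the target and its inverse.
log_price = np.log(toy["SalePrice"])
restored = np.exp(log_price)
```

Here 'GarageCars' and 'GarageArea' are correlated above 0.8, so one of the pair is removed, and `restored` matches the original prices up to floating-point error.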

Trial 3:

All credit for this method goes to @Golden and her notebook

  • Filled all numerical Nan with 0.
  • Filled all categorical Nan with 'None'.
  • Removed the outliers recommended by the author:
train = train[train['GrLivArea'] < 4000]
  • Normalized SalePrice.
  • One-hot encoded all categorical features.
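Put together, the Trial 3 steps might look like the sketch below. `preprocess_trial_3` is a hypothetical helper and the toy columns are assumptions. It also assumes that "normalized SalePrice" means the same log transform used in Trial 2.

```python
import numpy as np
import pandas as pd

def preprocess_trial_3(train):
    """Hypothetical helper following the Trial 3 bullets in order."""
    train = train.copy()
    num_cols = train.select_dtypes(include="number").columns
    cat_cols = train.columns.difference(num_cols)
    train[num_cols] = train[num_cols].fillna(0)        # numerical NaN -> 0
    train[cat_cols] = train[cat_cols].fillna("None")   # categorical NaN -> 'None'
    train = train[train["GrLivArea"] < 4000].copy()    # outlier filter from the notes
    y = np.log(train.pop("SalePrice"))                 # assumed: log-normalized target
    X = pd.get_dummies(train, columns=list(cat_cols))  # one-hot encode
    return X, y

# Toy frame (assumed columns and values).
toy = pd.DataFrame({
    "GrLivArea": [1500, 4500, 2000],
    "MasVnrArea": [np.nan, 100.0, 0.0],
    "PoolQC": [np.nan, "Ex", np.nan],
    "SalePrice": [200000, 180000, 250000],
})
X, y = preprocess_trial_3(toy)
```

Filling with 0 and 'None' encodes "the house does not have this feature" directly, instead of inventing an average value for it.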

Model Approaches


Linear Regression:

  • Used hyperparameter tuning to tune a sklearn linear regression model.
  • Used polynomial features to expand the feature space.
  • Used Root Mean Square Error (RMSE) as the loss function, since it is the metric the predictions are evaluated by.
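One way such a search could be set up is sketched below. The notes do not say which sklearn estimator was tuned. Because an alpha value is reported in the results, this sketch assumes a regularized linear model (`Ridge`). The synthetic data, the scaler step, and the grid values are also assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the prepared features and log-price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + 12.0 + rng.normal(scale=0.05, size=120)

pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),  # expands the feature space
    ("scale", StandardScaler()),
    ("ridge", Ridge()),                                # assumed estimator
])
param_grid = {
    "poly__degree": [1, 2],
    "ridge__alpha": [0.1, 1.0, 10.0, 100.0, 1000.0],
}
# sklearn maximizes scores, so RMSE is exposed as a negated value.
search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

Searching the polynomial degree and alpha together in one grid keeps the comparison fair, since the best regularization strength usually changes with the number of expanded features.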

Result:

  • First-degree polynomial features showed the best result.
  • The optimal alpha is less than 1000.
  • Scores:
    • Data from Trial 1: 0.24922.
    • Data from Trial 3: 0.31011.

Neural Network:

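  • Implemented RMSE for both the default 'SalePrice' and 'LogSalePrice':
from keras import backend as K  # assumed import; K is the Keras backend

def root_mean_squared_error(y_true, y_pred):
    # RMSE on whatever scale the target is in.
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

def exp_root_mean_squared_error(y_true, y_pred):
    # Undo the log transform first, so the error is in 'SalePrice' units.
    return K.sqrt(K.mean(K.square(K.exp(y_pred) - K.exp(y_true))))
  • Established a baseline model: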
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               207872    
_________________________________________________________________
re_lu_1 (ReLU)               (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
re_lu_2 (ReLU)               (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 512)               262656    
_________________________________________________________________
re_lu_3 (ReLU)               (None, 512)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 513       
=================================================================
Total params: 733,697
Trainable params: 733,697
Non-trainable params: 0
_________________________________________________________________
  • 3 Dense layers of 512 ReLU neurons each, plus one output neuron.
  • Trained until 'val_loss' stopped improving for 50 epochs.
  • Default 'adam' optimizer.
  • root_mean_squared_error (RMSE) as the loss.
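The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below uses only the numbers printed in the summary. The input width is not stated in the notes, so it is derived from the first layer's count.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# dense_1 has 207,872 parameters and 512 units, which fixes the input width:
# n_inputs * 512 + 512 = 207872.
n_features = 207872 // 512 - 1

layer_sizes = [n_features, 512, 512, 512, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(n_features, counts, total)
```

This implies the network received 405 input features, and the per-layer counts add up to the 733,697 total parameters shown above.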

Attempts:

  • Structures:
    • Increased / decreased the number of neurons in each layer.
    • Increased / decreased the depth of the network.
  • Activation functions:
    • Sigmoid.
    • Default LeakyReLU alpha = 0.1.
    • LeakyReLU alpha = 0.5.
  • Optimizers:
    • Adam with increased / decreased learning rates.
    • Default SGD.
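For reference, the two LeakyReLU settings differ only in the slope applied to negative inputs. The NumPy sketch below is an illustration of the function itself, not the Keras layer used in the experiments:

```python
import numpy as np

def leaky_relu(x, alpha):
    """Identity for positive inputs, slope `alpha` for the rest."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
slope_01 = leaky_relu(x, alpha=0.1)   # small negative slope
slope_05 = leaky_relu(x, alpha=0.5)   # much closer to a linear unit
plain_relu = leaky_relu(x, alpha=0.0) # alpha = 0 recovers ReLU
```

With alpha = 0.5 negative inputs are only halved, so the activation is considerably less nonlinear than with alpha = 0.1.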

Result:

  • Baseline score: 0.21801.
  • Structures:
    • Increasing the model size and number of neurons resulted in exactly the same score.
    • Decreasing it resulted in a significantly different score.
  • Activation functions:
    • Sigmoid did not converge within 10000 epochs.
    • Default LeakyReLU resulted in a slightly better score: 0.21259.
    • LeakyReLU with alpha = 0.5 performed worse than the default, scoring 0.21337.
  • Optimizers:
    • The best Adam learning_rate was 0.0001, scoring 0.21106.
    • SGD did not converge within 10000 epochs.
  • Combined model:
    • Parameters: default LeakyReLU, Adam learning_rate = 0.0001, 3 layers of 512 neurons.
    • Score: 0.21406. Somehow, combining these settings scored worse than either change applied on its own.

Lasso Regression

This approach is built upon @Golden's notebook.

  • Used hyperparameter tuning to tune a Lasso regression model.
  • Golden's parameters.

Result:

  • Golden's Score: 0.11888.
  • Best parameters found by hyperparameter tuning:
Lasso(alpha=0.0005, fit_intercept=True, normalize=False)
  • Score: 0.11744.
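A search of this kind could be run with `GridSearchCV`, as in the sketch below. The synthetic data and the grid values are assumptions, not the grid actually used. The `normalize` argument is left out because recent scikit-learn releases no longer accept it.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: 10 features, only 3 of which carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
coef = np.array([0.3, -0.2, 0.1] + [0.0] * 7)
y = X @ coef + 12.0 + rng.normal(scale=0.05, size=150)

param_grid = {
    "alpha": [0.0001, 0.0005, 0.001, 0.005, 0.01],  # assumed grid
    "fit_intercept": [True, False],
}
search = GridSearchCV(Lasso(max_iter=10000), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

On data like this, with a large constant offset in the target, the search keeps the intercept. The selected alpha depends on the noise level and should not be read as a recommendation for the real dataset.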

Conclusion


  • A neural network is not an all-powerful solution.
  • Better data cleaning and feature engineering with a simple model can produce a much better result than a neural network.
  • The complexity of this data is manageable by humans, so careful data cleaning and feature engineering should be done.
  • Traditional approaches should be considered before deep learning for this type of data.