# Preprocessing

## One Hot Encoding

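If a categorical column has four distinct values, you could assign a number to each one (for example 1 to 4).

The problem is that a model might treat value_4 as better than value_1, or assume that nearby numbers are more similar than distant ones.

To fix this, Scikit-Learn offers one hot encoding: each value gets its own binary column, which is 1 when the row has that value and 0 otherwise.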


Source: Wikipedia, One-hot

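Example:

```python
from sklearn.preprocessing import OneHotEncoder

# housing_cat is assumed to hold the categorical column(s) as a 2D structure,
# e.g. a single-column DataFrame.
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)  # sparse matrix by default

# Categories learned for each encoded column
cat_encoder.categories_
```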
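
Since `housing_cat` is not defined in these notes, here is a minimal self-contained sketch of the same idea. The `colors` column is a hypothetical toy dataset, and `OrdinalEncoder` is used only to illustrate the integer encoding described above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with four categories (not from the original data).
colors = np.array([["red"], ["green"], ["blue"], ["yellow"], ["green"]])

# Integer encoding: each category gets a number, which implies an order.
ordinal = OrdinalEncoder().fit_transform(colors)

# One hot encoding: one binary column per category, no implied order.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colors).toarray()  # sparse matrix -> dense array

print(encoder.categories_)  # categories are stored in sorted order
print(ordinal.ravel())      # one number per row
print(one_hot)              # one row per sample, one column per category
```

Each row of `one_hot` contains exactly one 1, in the column of that row's category.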

## Feature Scaling

### Min Max Scaling

Values are shifted and rescaled so that they end up ranging from 0 to 1: subtract the minimum, then divide by the range (maximum minus minimum).

Source (Wikipedia)
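
As a rough sketch, min max scaling can be computed by hand and compared with Scikit-Learn's `MinMaxScaler`. The array `X` below is a hypothetical single feature chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature, one value per row.
X = np.array([[10.0], [20.0], [30.0], [50.0]])

# Manual formula: (x - min) / (max - min), computed per column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Same computation with Scikit-Learn; the default feature_range is (0, 1).
scaled = MinMaxScaler().fit_transform(X)

print(scaled.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses everything else into a narrow band.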

### Standardization (Z-Score Normalization)

Source (Wikipedia)

Subtracts the mean from each observation and then divides by the standard deviation.

Unlike min max scaling, values are not bound to a specific range. Standardization is also less affected by outliers.
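
A minimal sketch of standardization, computed by hand and with Scikit-Learn's `StandardScaler`. The array `X` is hypothetical and includes one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature; 100 is a deliberate outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual formula: (x - mean) / standard deviation, computed per column.
# np.std uses the population standard deviation (ddof=0) by default,
# which matches what StandardScaler computes.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(scaled.ravel())
print(np.allclose(manual, scaled))
```

After scaling, the column has a mean of 0 and a standard deviation of 1, but individual values can fall anywhere, including well outside the 0 to 1 range.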

# Validation

## K-Fold Cross Validation


Source: Scikit-Learn, cross_validation

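```python
from sklearn.model_selection import cross_val_score

# clf, X and y are assumed to be defined already:
# an estimator, the feature matrix and the labels.
# cv=5 runs 5-fold cross validation, scoring each fold with macro-averaged F1.
scores = cross_val_score(
    clf, X, y, cv=5, scoring='f1_macro')
scores
```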
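
With `cv=5`, the data is split into five folds; the model is trained on four of them and scored on the remaining one, and this is repeated five times. Below is a self-contained sketch. The iris dataset and the linear `SVC` are assumptions chosen for illustration, since the notes do not say which estimator or data `clf`, `X` and `y` refer to.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Assumed example data and model (not specified in the notes).
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1, random_state=42)

# One macro-averaged F1 score per fold, five folds in total.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")

print(scores)
print(scores.mean(), scores.std())
```

Reporting the mean and standard deviation of the fold scores gives a rough sense of both performance and how much it varies between splits.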