Lecture 5: Preprocessing and sklearn pipelines

Firas Moosvi (Slides adapted from Varada Kolhatkar)

Announcements

  • I am away this Thursday and next Tuesday for EAAI 2026
    • EAAI is the Symposium on Educational Advances in Artificial Intelligence, held in Singapore
    • “Effective Strategies for Teaching Machine Learning”
  • Guest instructors Dr. Giulia Toti (Thursday) and Dr. Mehrdad Oveisi (Tuesday) will teach!
  • Heads up: they won’t use Agora (or iClicker) for class participation; just a show of hands.

Recap

  • Decision trees: Split data into subsets based on feature values to create decision rules
  • \(k\)-NNs: Classify based on the majority vote from the \(k\) nearest neighbors
  • SVM RBFs: Create a boundary using an RBF kernel to separate classes

Synthesizing existing knowledge

Comparison of models (activity)

| Model          | Parameters and hyperparameters | Strengths | Weaknesses |
|----------------|--------------------------------|-----------|------------|
| Decision Trees |                                |           |            |
| KNNs           |                                |           |            |
| SVM RBF        |                                |           |            |

Recap

| Aspect                  | Decision Trees | K-Nearest Neighbors (KNN) | Support Vector Machines (SVM) with RBF Kernel |
|-------------------------|----------------|---------------------------|-----------------------------------------------|
| Sensitivity to Outliers |                |                           |                                               |
| Memory Usage            |                |                           |                                               |
| Training Time           |                |                           |                                               |
| Prediction Time         |                |                           |                                               |

Preprocessing motivation: example

You’re trying to find a suitable date based on:

  • Age (closer to yours is better).
  • Number of Facebook Friends (closer to your social circle is ideal).

Preprocessing motivation: example

  • You are 30 years old and have 250 Facebook friends.

| Person | Age | #FB Friends | Euclidean Distance Calculation | Distance |
|--------|-----|-------------|--------------------------------|----------|
| A      | 25  | 400         | \(\sqrt{5^2 + 150^2}\)         | 150.08   |
| B      | 27  | 300         | \(\sqrt{3^2 + 50^2}\)          | 50.09    |
| C      | 30  | 500         | \(\sqrt{0^2 + 250^2}\)         | 250.00   |
| D      | 60  | 250         | \(\sqrt{30^2 + 0^2}\)          | 30.00    |

Based on the distances, the two nearest neighbors (2-NN) are:

  • Person D (Distance: 30.00)
  • Person B (Distance: 50.09)

What’s the problem here?
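
The #FB Friends feature dominates the distance simply because its scale is much larger than Age. A minimal sketch (assuming NumPy and scikit-learn are available) that recomputes the distances before and after standardizing both features:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Candidate matches: [age, number of FB friends]
X = np.array([[25, 400],   # Person A
              [27, 300],   # Person B
              [30, 500],   # Person C
              [60, 250]])  # Person D
you = np.array([[30, 250]])

# Raw distances: the friends feature dominates the sum of squares
print(np.linalg.norm(X - you, axis=1))  # ≈ [150.08, 50.09, 250.00, 30.00]

# Standardize both features, then recompute the distances
scaler = StandardScaler().fit(X)
print(np.linalg.norm(scaler.transform(X) - scaler.transform(you), axis=1))
# Person D (30 years older!) is no longer the nearest neighbor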

What is preprocessing?

  • Preprocessing means transforming raw data into a form that machine learning models can use effectively, for example by imputing missing values, scaling numeric features, and encoding categorical features.

Clicker 5.1

Participate using Agora (code: agentic)

Take a guess: In your machine learning project, how much time will you typically spend on data preparation and transformation?

    1. ~80% of the project time
    2. ~20% of the project time
    3. ~50% of the project time
    4. None. Most of the time will be spent on model building

The question is adapted from here.

Clicker 5.2

Participate using Agora (code: agentic)

Select all of the following statements which are TRUE.

    1. StandardScaler ensures a fixed range (i.e., minimum and maximum values) for the features.
    2. StandardScaler calculates the mean and standard deviation for each feature separately.
    3. In general, it’s a good idea to apply scaling on numeric features before training \(k\)-NN or SVM RBF models.
    4. The transformed feature values might be hard to interpret for humans.
    5. After applying SimpleImputer, the transformed data has a different shape than the original data.

Clicker 5.3

Participate using Agora (code: agentic)

Select all of the following statements which are TRUE.

    1. You can have scaling of numeric features, one-hot encoding of categorical features, and a scikit-learn estimator within a single pipeline.
    2. Once you have a scikit-learn pipeline object with an estimator as the last step, you can call fit, predict, and score on it.
    3. You can carry out data splitting within a scikit-learn pipeline.
    4. We have to be careful about the order in which we put each transformation and the model in a pipeline.

Common transformations

Imputation: Fill the gaps! (🟩 🟧 🟦)

Fill in missing data using a chosen strategy:

  • Mean: Replace missing values with the average of the available data.
  • Median: Use the middle value.
  • Most Frequent: Use the most common value (mode).
  • KNN Imputation: Fill based on similar neighbors.

Example:

Imputation is like filling in the grade for an assessment you missed with your average, median, or most frequent grade.

from sklearn.impute import SimpleImputer

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
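
The KNN strategy from the list above lives in a separate class, KNNImputer. A minimal sketch with hypothetical toy data:

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data with one missing value
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

# Fill each missing value with the average of that feature
# over the 2 nearest rows (distances use the non-missing features)
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)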

Scaling: Everything to the same range! (📉 📈)

Ensure all features have a comparable range.

  • StandardScaler: Mean = 0, Standard Deviation = 1.

Example:

Scaling is like adjusting the number of everyone’s Facebook friends so that both the number of friends and their age are on a comparable scale. This way, one feature doesn’t dominate the other when making comparisons.

from sklearn.preprocessing import StandardScaler

# Standardize each feature: subtract its mean, divide by its standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
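
As a quick sanity check of that property, assuming the X_scaled array from above:

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column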

One-Hot encoding: 🍎 → 1️⃣ 0️⃣ 0️⃣

Convert categorical features into binary columns.

  • Creates new binary columns for each category.
  • Useful for handling categorical data in machine learning models.

Example:

Turn “Apple, Banana, Orange” into binary columns:

| Fruit     | 🍎 | 🍌 | 🍊 |
|-----------|-----|-----|-----|
| Apple 🍎  | 1   | 0   | 0   |
| Banana 🍌 | 0   | 1   | 0   |
| Orange 🍊 | 0   | 0   | 1   |

from sklearn.preprocessing import OneHotEncoder

# Create one binary column per category
# (the output is a sparse matrix by default)
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)
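
A small sketch reproducing the fruit table above (assuming scikit-learn 1.2+, where sparse_output replaced the older sparse argument):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['Apple'], ['Banana'], ['Orange']])

encoder = OneHotEncoder(sparse_output=False)  # dense array, easier to print
X_encoded = encoder.fit_transform(X)

print(encoder.get_feature_names_out())  # ['x0_Apple' 'x0_Banana' 'x0_Orange']
print(X_encoded)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]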

Ordinal encoding: Ranking matters! (⭐️⭐️⭐️ → 3️⃣)

Convert categories into integer values that have a meaningful order.

  • Assign integers based on order or rank.
  • Useful when there is an inherent ranking in the data.

Example:

Turn “Poor, Average, Good” into 1, 2, 3:

| Rating  | Ordinal |
|---------|---------|
| Poor    | 1       |
| Average | 2       |
| Good    | 3       |

from sklearn.preprocessing import OrdinalEncoder

# Pass the order explicitly: by default sklearn sorts categories alphabetically,
# which would ignore the Poor < Average < Good ranking (integers start at 0)
encoder = OrdinalEncoder(categories=[['Poor', 'Average', 'Good']])
X_ordinal = encoder.fit_transform(X)
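
For instance, fitting the encoder above on hypothetical ratings yields 0 for Poor, 1 for Average, and 2 for Good:

import numpy as np

X = np.array([['Poor'], ['Good'], ['Average']])
print(encoder.fit_transform(X))  # [[0.], [2.], [1.]]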

sklearn Transformers vs Estimators

Transformers

  • Are used to transform or preprocess data.
  • Implement the fit and transform methods.
    • fit(X): Learns parameters from the data.
    • transform(X): Applies the learned transformation to the data.
  • Examples:
    • Imputation (SimpleImputer): Fills missing values.
    • Scaling (StandardScaler): Standardizes features.

Estimators

  • Used to make predictions.
  • Implement fit and predict methods.
    • fit(X, y): Learns from labeled data.
    • predict(X): Makes predictions on new data.
  • Examples: DecisionTreeClassifier, SVC, KNeighborsClassifier
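
A minimal sketch contrasting the two interfaces on hypothetical toy data:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

# Transformer: fit learns the column statistics, transform applies them
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Estimator: fit learns from labeled data, predict labels new data
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_scaled, y)
print(clf.predict(scaler.transform([[3.5]])))  # [1]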

The golden rule in feature transformations

  • Never transform the entire dataset at once!
  • Why? It leads to data leakage — using information from the test set in your training process, which can artificially inflate model performance.
  • Fit transformers like scalers and imputers on the training set only.
  • Apply the transformations to both the training and test sets separately.

Example:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; never refit on test data
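
For contrast, this is the anti-pattern the golden rule forbids:

# WRONG: fitting on the full dataset lets test-set statistics
# leak into the transformation applied to the training data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X here contains both training and test rows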

sklearn Pipelines

  • Pipeline is a way to chain multiple steps (e.g., preprocessing + model fitting) into a single workflow.
  • Simplifies the code and improves readability.
  • Reduces the risk of data leakage by ensuring the training and test sets are transformed properly.
  • Automatically applies transformations in sequence.

Example:

Chaining a StandardScaler with a KNeighborsClassifier model.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

pipeline.fit(X_train, y_train)     # fits the scaler on X_train, then the classifier on the scaled data
y_pred = pipeline.predict(X_test)  # scales X_test with the training statistics, then predicts
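
A pipeline can also be passed straight to cross-validation utilities; the scaler is then refit on each training fold, so no information leaks from the validation folds. A sketch:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(scores.mean())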

Break

Let’s take a break!


Group Work: Class Demo & Live Coding

  • Let’s walk through strategies for handling missing values, categorical variables, text, scaling, and irrelevant features using a class demo.
  • For this demo, each student should click this link to create a new repo in their account, then clone that repo locally to follow along with today’s demo.

See you next time!