Lecture 3: ML fundamentals

Firas Moosvi (Slides adapted from Varada Kolhatkar)

Announcements

  • My weekly office hours will be announced soon!

  • Homework 2 (hw2) was released on Monday; it is due Jan 19 at 11:59 pm

    • You are welcome to broadly discuss it with your classmates but final answers and submissions must be your own.
    • Group submissions are not allowed for this assignment.
  • Advice on keeping up with the material

    • Practice!
    • Make sure you run the lecture notes on your laptop and experiment with the code.
    • Start early on homework assignments.
  • Last day to drop without a W standing is this Friday: January 16, 2026

Dropping lowest homework (Update)

  • CPSC 330 has 9 homework assignments that are all an integral part of your learning
  • To account for illnesses, other commitments, and to preserve your mental health, there has long been a policy of dropping your lowest HW score.
  • After some analysis of the data from previous terms (Learning Analytics!), there is a slight modification to this policy:

With the exception of HW5, we will drop your lowest homework grade - all students must complete HW5.

  • This is to encourage all students to complete HW5! It’s important!

Recap

  • Importance of generalization in supervised machine learning
  • Data splitting as a way to approximate generalization error
  • Train, test, validation, deployment data
  • Overfitting, underfitting, the fundamental tradeoff, and the golden rule.
  • Cross-validation

Finish up demo from last class

For this demo, each student should click this link to create a new repo in their accounts, then clone that repo locally to follow along with the demo from today.

Recap

A typical sequence of steps to train supervised machine learning models (sketched in code after the list):

  • training the model on the train split
  • tuning hyperparameters using the validation split
  • checking the generalization performance on the test split
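
A minimal sketch of this sequence, assuming scikit-learn, a small synthetic dataset, and an illustrative set of candidate depths (none of these specifics come from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset; in practice X and y come from your problem.
X, y = make_classification(n_samples=500, random_state=123)

# Carve out a test set first, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=123)

# Tune a hyperparameter (max_depth) using the validation split.
best_depth, best_score = None, -1.0
for depth in [1, 3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=123)
    model.fit(X_train, y_train)            # train on the train split
    score = model.score(X_valid, y_valid)  # evaluate on the validation split
    if score > best_score:
        best_depth, best_score = depth, score

# Check generalization performance on the test split (only once, at the end).
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=123)
final_model.fit(X_train, y_train)
print("Test accuracy:", final_model.score(X_test, y_test))
```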

Clicker 3.1

Participate using Agora (code: agentic)

Select all of the following statements which are TRUE.

    1. A decision tree model with no maximum depth (the default max_depth=None in sklearn) is likely to perform very well on the deployment data.
    2. Data splitting helps us assess how well our model would generalize.
    3. Deployment data is scored only once.
    4. Validation data could be used for hyperparameter optimization.
    5. It’s recommended that data be shuffled before splitting it into train and test sets.

Clicker 3.2

Participate using Agora (code: agentic)

Select all of the following statements which are TRUE.

    1. \(k\)-fold cross-validation calls fit \(k\) times.
    2. We use cross-validation to get a more robust estimate of model performance.
    3. If the mean train accuracy is much higher than the mean cross-validation accuracy, it’s likely a case of overfitting.
    4. The fundamental tradeoff of ML states that as training error goes down, validation error goes up.
    5. A decision stump on a complicated classification problem is likely to underfit.

Recap from videos

  • Why do we split the data? What are train/valid/test splits?
  • What are the benefits of cross-validation?
  • What is underfitting and overfitting?
  • What’s the fundamental trade-off in supervised machine learning?
  • What is the golden rule of machine learning?

Summary of train, validation, test, and deployment data

|            | fit | score | predict |
|------------|-----|-------|---------|
| Train      | ✔️  | ✔️    | ✔️      |
| Validation |     | ✔️    | ✔️      |
| Test       |     | once  | once    |
| Deployment |     |       | ✔️      |

Cross validation
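
A minimal cross-validation sketch with scikit-learn’s cross_validate, assuming a synthetic dataset; the decision tree and the 5 folds are illustrative choices, not part of the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=123)
model = DecisionTreeClassifier(max_depth=3, random_state=123)

# 5-fold cross-validation: fit is called 5 times, each time on 4 folds,
# and the held-out fold is scored.
scores = cross_validate(model, X, y, cv=5, return_train_score=True)
print("Mean train score:", scores["train_score"].mean())
print("Mean validation score:", scores["test_score"].mean())
```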

Overfitting and underfitting

  • An overfit model matches the training set so closely that it fails to make correct predictions on new unseen data.
  • An underfit model is too simple and does not even make good predictions on the training data (see the sketch below).
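
A minimal sketch contrasting the two, assuming scikit-learn and a synthetic dataset: a decision stump (max_depth=1) tends to underfit, while an unconstrained tree tends to overfit the training split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=10, random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=123)

for name, depth in [("underfit (stump)", 1), ("overfit (no max_depth)", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=123).fit(X_train, y_train)
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "valid:", round(model.score(X_valid, y_valid), 2))
```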

The fundamental tradeoff

As you increase the model complexity, training score tends to go up and the gap between train and validation scores tends to go up.

  • Underfitting: Both accuracies rise
  • Sweet spot: Validation accuracy peaks
  • Overfitting: Training \(\uparrow\), Validation \(\downarrow\)
  • Tradeoff: Balance complexity to avoid both (sketched in code below)
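
A minimal sketch of the tradeoff, assuming scikit-learn: sweep max_depth and watch the mean train score keep rising while the mean cross-validation score peaks and then drops:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=10, random_state=123)

for depth in [1, 2, 4, 8, 16, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=123)
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    print(f"max_depth={depth}: "
          f"train={scores['train_score'].mean():.2f}, "
          f"cv={scores['test_score'].mean():.2f}")
```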

The golden rule

  • Although our primary concern is the model’s performance on the test data, this data should not influence the training process in any way.

Source: Image generated by ChatGPT 5

  • Test data = final exam
  • You can practice all you want with training/validation data
  • But never peek at the test set before evaluation
  • Otherwise, it’s like sneaking answers before the exam \(\rightarrow\) not a real assessment of your learning.
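
A minimal sketch of the pattern the golden rule implies, assuming scikit-learn: all tuning happens with cross-validation on the training portion, and the test set is scored exactly once at the very end:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# Hyperparameter tuning touches only the training portion.
best_depth = max(
    [1, 3, 5, None],
    key=lambda d: cross_val_score(
        DecisionTreeClassifier(max_depth=d, random_state=123), X_train, y_train, cv=5
    ).mean(),
)

# The test set influences nothing above; it is scored once, at the very end.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=123).fit(X_train, y_train)
print("Test accuracy:", final_model.score(X_test, y_test))
```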

Additional Resource on Cross Validation

Reference: MLU-Explain - Cross Validation

Break

Let’s take a break!

Group Work: Class Demo & Live Coding

For this demo, each student should click this link to create a new repo in their accounts, then clone that repo locally to follow along with the demo from today.