Lecture 3: ML fundamentals
Firas Moosvi (Slides adapted from Varada Kolhatkar)
Announcements
Remember, my office hours are MWF from 1:00 - 1:30 PM in DMP 310 (after class)
Homework 2 (hw2) was released on Wednesday, it is due May 20, 10:00 pm
- You are welcome to broadly discuss it with your classmates but final answers and submissions must be your own.
- Group submissions are not allowed for this assignment.
Advice on keeping up with the material
- Practice!
- Make sure you run the lecture notes on your laptop and experiment with the code.
- Start early on homework assignments.
If you are still on the waitlist, it’s your responsibility to keep up with the material and submit assignments!
Last day to drop without a W standing is today: May 16, 2025
Dropping lowest homework (Update)
- CPSC 330 has 9 homework assignments that are all an integral part of your learning
- To account for illnesses, other commitments, and to preserve your mental health, there has long been a policy of dropping your lowest HW score.
- After some analysis from the data from previous terms (Learning Analytics!), there is a slight modification to this policy:
With the exception of HW5, we will drop your lowest homework grade - all students must complete HW5.
- This is to encourage all students to complete HW5! It’s important!
Recap
- Importance of generalization in supervised machine learning
- Data splitting as a way to approximate generalization error
- Train, test, validation, deployment data
- Overfitting, underfitting, the fundamental tradeoff, and the golden rule.
- Cross-validation
Recap
A typical sequence of steps to train supervised machine learning models
- training the model on the train split
- tuning hyperparamters using the validation split
- checking the generalization performance on the test split
iClicker 3.1
Clicker cloud join link: https://join.iclicker.com/YJHS
Select all of the following statements which are TRUE.
- A decision tree model with no depth (the default
max_depth
in sklearn) is likely to perform very well on the deployment data.
- Data splitting helps us assess how well our model would generalize.
- Deployment data is only scored once.
- Validation data could be used for hyperparameter optimization.
- It’s recommended that data be shuffled before splitting it into train and test sets.
iClicker 3.2
Clicker cloud join link: https://join.iclicker.com/YJHS
Select all of the following statements which are TRUE.
- \(k\)-fold cross-validation calls fit \(k\) times
- We use cross-validation to get a more robust estimate of model performance.
- If the mean train accuracy is much higher than the mean cross-validation accuracy it’s likely to be a case of overfitting.
- The fundamental tradeoff of ML states that as training error goes down, validation error goes up.
- A decision stump on a complicated classification problem is likely to underfit.
Break
Let’s take a break!
Group Work: Class Demo & Live Coding
For this demo, each student should click this link to create a new repo in their accounts, then clone that repo locally to follow along with the demo from today.