Lecture 3: ML fundamentals
Firas Moosvi (Slides adapted from Varada Kolhatkar)
Announcements
- Homework 2 (hw2) has been released (Due: Sept 16, 11:59pm)
  - You are welcome to discuss it broadly with your classmates, but final answers and submissions must be your own.
  - Group submissions are not allowed for this assignment.
- Advice on keeping up with the material:
  - Practice!
  - Make sure you run the lecture notes on your laptop and experiment with the code.
  - Start early on homework assignments.
- If you are still on the waitlist, it’s your responsibility to keep up with the material and submit assignments.
- Last day to drop without a W standing: Sept 16, 2023
Recap
- Importance of generalization in supervised machine learning
- Data splitting as a way to approximate generalization error
- Train, test, validation, deployment data
- Overfitting, underfitting, the fundamental tradeoff, and the golden rule.
- Cross-validation
Recap
A typical sequence of steps to train supervised machine learning models (see the sketch below):
- training the model on the train split
- tuning hyperparameters using the validation split
- checking the generalization performance on the test split
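A minimal sketch of this sequence with sklearn, assuming a feature matrix `X` and target `y` are already loaded (the candidate `max_depth` values are purely illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split off a test set, then carve a validation set out of what remains.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=123
)

# Tune a hyperparameter on the validation split.
best_depth, best_score = None, 0.0
for depth in [1, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_depth, best_score = depth, score

# Score the tuned model on the test split only once, at the very end.
final_model = DecisionTreeClassifier(max_depth=best_depth)
final_model.fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```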
iClicker 3.1
Clicker cloud join link: https://join.iclicker.com/VYFJ
Select all of the following statements which are TRUE.
- A decision tree model with no maximum depth (the default `max_depth` in sklearn) is likely to perform very well on the deployment data.
- Data splitting helps us assess how well our model would generalize.
- Deployment data is only scored once.
- Validation data could be used for hyperparameter optimization.
- It’s recommended that data be shuffled before splitting it into train and test sets.
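As a sanity check on the first statement, here is a sketch on synthetic data (illustrative only): with sklearn's default `max_depth=None`, the tree grows until its leaves are pure and typically memorizes the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier()  # default max_depth=None: unlimited depth
tree.fit(X_train, y_train)

# Train accuracy is typically (near-)perfect while test accuracy is lower;
# that gap warns against expecting strong deployment performance.
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))
```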
iClicker 3.2
Clicker cloud join link: https://join.iclicker.com/VYFJ
Select all of the following statements which are TRUE.
- \(k\)-fold cross-validation calls `fit` \(k\) times
- We use cross-validation to get a more robust estimate of model performance.
- If the mean train accuracy is much higher than the mean cross-validation accuracy it’s likely to be a case of overfitting.
- The fundamental tradeoff of ML states that as training error goes down, validation error goes up.
- A decision stump on a complicated classification problem is likely to underfit.
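A short sketch of these ideas with sklearn's `cross_validate` (synthetic data again, so the exact numbers are illustrative): 5-fold cross-validation calls `fit` five times, once per fold, and comparing the mean train score with the mean cross-validation score is how we spot overfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# k=5 folds, so fit is called 5 times, each time on 4/5 of the data.
scores = cross_validate(
    DecisionTreeClassifier(), X, y, cv=5, return_train_score=True
)

# A mean train accuracy far above the mean CV accuracy suggests overfitting.
print("mean train accuracy:", np.mean(scores["train_score"]))
print("mean CV accuracy:   ", np.mean(scores["test_score"]))
```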
Group Work: Class Demo & Live Coding
For this demo, each student should click this link to create a new repo in their account, then clone that repo locally to follow along with today's demo.
If you really don’t want to create a repo:
- Navigate to the `cpsc330-2024W1` repo.
- Run `git pull` to pull the latest files in the course repo.
- Look for the demo file in `lectures/102-Firas-lectures/class_demos/`.