Firas Moosvi (Slides adapted from Varada Kolhatkar)
Announcements
Things due this week
Homework 1 (hw1): Due Jan 12 , 11:59 pm
There are some assert statements in the notebook you can use to check your work! If they fail, you didn’t get the exercise correct.
You can find the tentative due dates for all deliverables here.
Please monitor Ed Discussion (especially pinned posts and instructor posts) for announcements.
I’ll assume that you’ve watched the pre-lecture videos.
Participation marks (5%) in Section 202
In this section (202) we will not be using iClickers for “attendance”
Instead, you will have weekly “Learning Logs” where you will reflect on the material covered for the week
The first Learning Log will be released tonight, and will be due Sunday Jan 11, 11:59 PM
Only students registered in section 202 will see these on PrairieLearn
Gradescope
Make sure you can submit your assignment before the hw1 due date!
You are required to work in a GitHub repository; please keep your GitHub repo up to date.
Some students are having trouble getting registered on Gradescope; we’re working on a fix. Come see me in the break or after class!
Checklist for you in the first week
Suggested Workflow for working with Jupyter Notebooks
Create a folder on your computer that will have all the CPSC 330 repos:
~/School/Year3/CPSC330/ <-- Consider this your CPSC parent folder
Create subfolders for: hw, class, practice
In the hw folder, you will then clone hw1, hw2, hw3, etc…
In the class folder, you will clone the cpsc330-2025W2 repo which contains all the class jupyter notebooks
Do not make any changes to files in this directory/repo; otherwise you will run into conflicts when you pull the latest notebooks before each class.
If you did make changes, you can reset to the last commit and DESTROY any changes you made (be careful with this command) using: git reset --hard
In the practice folder, you can copy any notebooks (.ipynb) and files (like data/*.csv) you want to try running locally and experiment
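The folder setup above can be sketched as shell commands. This is a sketch, not an official setup script: the path is the suggested one from the slides, and the clone commands are left commented out because the actual repo URLs come from the course page.

```shell
# Parent folder for all CPSC 330 material (adjust the path to your own setup)
CPSC330_DIR="$HOME/School/Year3/CPSC330"

# Subfolders for homework, class notebooks, and experiments
mkdir -p "$CPSC330_DIR/hw" "$CPSC330_DIR/class" "$CPSC330_DIR/practice"

# In class/, clone the course notebooks repo (placeholder URL -- use the
# actual clone link from the course page):
# git clone <course-repo-url> "$CPSC330_DIR/class/cpsc330-2025W2"

# Before each class, pull the latest notebooks; if you accidentally edited
# files, discard local changes first (destructive -- be careful!):
# cd "$CPSC330_DIR/class/cpsc330-2025W2" && git reset --hard && git pull

ls "$CPSC330_DIR"
```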
🎯 Learning Outcomes
By the end of this lesson, you will be able to:
Define key machine learning terminology: features, targets, predictions, training, error, classification vs. regression, supervised vs. unsupervised learning, hyperparameters vs. parameters, baselines, decision boundaries
Build a simple machine learning model in scikit-learn, explaining the fit–predict workflow and evaluating performance with the score method
Describe at a high level how decision trees are trained (fitting) and how they make predictions
Implement and visualize decision trees in scikit-learn using DecisionTreeClassifier and DecisionTreeRegressor
Recap: What is ML?
ML uses data to build models that find patterns, make predictions, or generate content.
It helps computers learn from data to make decisions.
No one model works for every situation.
Class Participation using Agora
Visit: https://agora.students.cs.ubc.ca
Login with your UBC CWL
Use enrol code: agentic
Clicker 2.1: ML or not
Select all of the following statements which are suitable problems for machine learning.
Identifying objects within digital images, such as facial recognition in security systems or categorizing images based on content.
Determining if individuals meet the necessary criteria for government or financial services based on strict guidelines.
Identifying unusual patterns that may indicate fraudulent transactions in banking and finance.
Automatically analyzing images from MRIs, CT scans, or X-rays to detect abnormalities like tumors or fractures.
Addressing mental health issues where human empathy, understanding, and adaptability are key.
Imagine you’re in the fortunate situation where, after graduating, you have a few job offers and need to decide which one to choose. You want to pick the job that will likely make you the happiest. To help with your decision, you collect data from like-minded people.
Can you think of relevant features for this problem?
Is this a classification problem or a regression problem?
|   | supportive_colleagues | salary | free_coffee | boss_vegan | happy?  |
|---|-----------------------|--------|-------------|------------|---------|
| 0 | 0                     | 70000  | 0           | 1          | Unhappy |
| 1 | 1                     | 60000  | 0           | 0          | Unhappy |
| 2 | 1                     | 80000  | 1           | 0          | Happy   |
| 3 | 1                     | 110000 | 0           | 1          | Happy   |
| 4 | 1                     | 120000 | 1           | 0          | Happy   |
| 5 | 1                     | 150000 | 1           | 1          | Happy   |
| 6 | 0                     | 150000 | 1           | 0          | Unhappy |
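For reference, here is one way the toy dataset above might be constructed as a pandas DataFrame. This is a sketch (the actual course notebook may load it differently); the values are exactly the ones from the table.

```python
import pandas as pd

# Toy happiness dataset: 4 features plus the target column "happy?"
toy_happiness_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy",
               "Happy", "Happy", "Unhappy"],
})
print(toy_happiness_df.shape)  # 7 examples, 5 columns
```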
(Optional) Inference vs. Prediction
Inference asks: Why does something happen?
Goal: understand and quantify the relationship between variables
Often involves estimating model parameters and testing hypotheses
Example: Which factors influence happiness, and by how much?
Prediction asks: What will happen?
Goal: accurately predict the target without needing to fully explain the relationships
Example: Will you be happy in a particular job?
Of course these goals are related, and in many situations we need both.
Training
In supervised ML, the goal is to learn a function that maps input features (\(X\)) to a target (\(y\)).
The relationship between \(X\) and \(y\) is often complex, making it difficult to define mathematically.
We use algorithms to approximate this complex relationship between \(X\) and \(y\).
Training is the process of applying an algorithm to learn the best function (or model) that maps \(X\) to \(y\).
In this course, I’ll help you develop an intuition for how these models work and demonstrate how to use them in a machine learning pipeline.
Error and accuracy
Machine learning models are not perfect—they will make mistakes.
To judge whether a model is useful, we need to track its performance.
For classification problems, the most common (and default in sklearn) metric is accuracy:
\[
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of examples}}
\]
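As a quick sanity check, the formula can be computed by hand or with scikit-learn’s `accuracy_score`. The labels below are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions
y_true = ["Happy", "Happy", "Unhappy", "Happy"]
y_pred = ["Happy", "Unhappy", "Unhappy", "Happy"]

# Accuracy = number of correct predictions / total number of examples
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 3 correct out of 4 -> 0.75

# scikit-learn computes the same quantity:
print(accuracy_score(y_true, y_pred))  # 0.75
```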
Separating \(X\) and \(y\)
In order to train a model we need to separate \(X\) and \(y\) from the dataframe.
```python
X = toy_happiness_df.drop(columns=["happy?"])  # Extract the feature set by removing the target column "happy?"
y = toy_happiness_df["happy?"]                 # Extract the target variable "happy?"
```
Baseline
Let’s try the simplest possible algorithm: always predict the most frequent target!
```python
from sklearn.dummy import DummyClassifier

model = DummyClassifier(strategy="most_frequent")  # Initialize the DummyClassifier to always predict the most frequent class
model.fit(X, y)  # Train the model on the feature set X and target variable y
toy_happiness_df['dummy_predictions'] = model.predict(X)  # Add the predicted values as a new column in the dataframe
toy_happiness_df
```
|   | supportive_colleagues | salary | free_coffee | boss_vegan | happy?  | dummy_predictions |
|---|-----------------------|--------|-------------|------------|---------|-------------------|
| 0 | 0                     | 70000  | 0           | 1          | Unhappy | Happy             |
| 1 | 1                     | 60000  | 0           | 0          | Unhappy | Happy             |
| 2 | 1                     | 80000  | 1           | 0          | Happy   | Happy             |
| 3 | 1                     | 110000 | 0           | 1          | Happy   | Happy             |
| 4 | 1                     | 120000 | 1           | 0          | Happy   | Happy             |
| 5 | 1                     | 150000 | 1           | 1          | Happy   | Happy             |
| 6 | 0                     | 150000 | 1           | 0          | Unhappy | Happy             |
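The full baseline workflow, including the `score` step, can be sketched as a self-contained snippet. The DataFrame construction is an assumption made so the example runs on its own; in the notebook, `toy_happiness_df`, `X`, and `y` are already defined.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Rebuild the toy dataset from the table above (for self-containedness)
toy_happiness_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy",
               "Happy", "Happy", "Unhappy"],
})
X = toy_happiness_df.drop(columns=["happy?"])
y = toy_happiness_df["happy?"]

model = DummyClassifier(strategy="most_frequent")
model.fit(X, y)

# score reports accuracy: the baseline gets the 4 "Happy" rows right
# and the 3 "Unhappy" rows wrong, so accuracy = 4/7
print(model.score(X, y))
```

Note that `fit` and `score` take both `X` and `y`, while `predict` takes only `X`.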
Decision trees intuition
One intuitive way to build a model is by asking a series of yes/no questions, forming a tree.
Which question would help you best separate the happy and unhappy examples?
Parameters
The questions (features and thresholds) used to split the data at each node.
Example: salary <= 75000 at the root node
Hyperparameters
Settings that control tree growth, like max_depth, which limits how deep the tree can go.
Decision boundary
A decision boundary is the line, curve, or surface that separates classes.
Points on one side \(\rightarrow\) Model predicts Class Happy
Points on the other side \(\rightarrow\) Model predicts Class Unhappy
Decision boundary with max_depth=1
Decision boundary with max_depth=2
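Putting the pieces together, here is a minimal sketch of fitting a `DecisionTreeClassifier` on the toy dataset. The DataFrame construction and the choice of `max_depth=2` are illustrative assumptions; `export_text` is one way to inspect the learned questions (the slides may use a graphical visualization instead).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild the toy dataset from the happiness table (for self-containedness)
toy_happiness_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy",
               "Happy", "Happy", "Unhappy"],
})
X = toy_happiness_df.drop(columns=["happy?"])
y = toy_happiness_df["happy?"]

# max_depth is a hyperparameter: it limits how deep the tree can grow
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# Print the learned splits (the tree's parameters: features and thresholds)
print(export_text(tree, feature_names=list(X.columns)))

# On this tiny dataset, depth 2 is enough for perfect training accuracy
print(tree.score(X, y))
```

To predict, the tree routes each example from the root down to a leaf by answering the learned questions.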
Clicker 2.4: Baselines and Decision trees
Participate using Agora (code: agentic): https://join.iclicker.com/FZMQ
Select all of the following statements which are TRUE.
Change in features (i.e., binarizing features above) would change DummyClassifier predictions.
predict takes only X as argument whereas fit and score take both X and y as arguments.
For the decision tree algorithm to work, the feature values must be binary.
The prediction in a decision tree works by routing the example from the root to the leaf.
Summary
Terminology
sklearn basic steps
Decision tree intuition
Break
Let’s take a break!
Group Work: Class Demo & Live Coding
In some of the classes, we will do a bit of live coding to get you used to practical machine learning. You are highly encouraged to follow along. We usually won’t finish everything during the demo, but we will get through a significant portion that you can finish off after class.
For this demo, each student should click this link to create a new repo in their accounts, then clone that repo locally to follow along with the demo from today.