Probability Estimation: Many classifiers do not just output a label but calculate a probability P(y = c | x), representing the likelihood that input x belongs to class c.
Loss Functions: Models are optimized using loss functions like Cross-Entropy Loss, which penalizes the difference between the predicted probability distribution and the actual labels.
The Sigmoid Function: In binary classification, the Sigmoid function is used to map any real-valued number into a probability range between 0 and 1.
Maximum Likelihood Estimation (MLE): This statistical principle is often used to find the parameter values that make the observed data most probable under the chosen model.
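The three ideas above fit together: the Sigmoid turns a raw score into a probability, and minimizing cross-entropy loss is exactly maximum likelihood estimation (it is the average negative log-likelihood of the labels). A minimal sketch in plain Python (function names are illustrative, not from any particular library):

```python
import math

def sigmoid(z):
    """Map any real-valued score into the (0, 1) probability range."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of the labels under the predicted
    probabilities -- minimizing this is MLE for the model's parameters."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Raw scores -> probabilities -> loss against the true labels
probs = [sigmoid(z) for z in [-2.0, 0.0, 3.0]]
loss = binary_cross_entropy([0, 0, 1], probs)
```

Note how a confident correct prediction contributes almost nothing to the loss, while a confident wrong one is penalized heavily; that asymmetry is what drives the parameters toward the maximum-likelihood fit.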
| Feature | Binary Classification | Multiclass Classification |
|---|---|---|
| Number of Classes | Exactly two (e.g., Yes/No) | Three or more (e.g., Red/Blue/Green) |
| Output Function | Sigmoid | Softmax |
| Complexity | Lower; single decision boundary | Higher; multiple boundaries or one-vs-all |
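Softmax generalizes the Sigmoid to three or more classes: it turns a vector of raw scores into a probability distribution that sums to 1. A small sketch (the example scores are made up):

```python
import math

def softmax(logits):
    """Convert a vector of real-valued scores into class probabilities.
    Subtracting the max score first is the standard numerical-stability trick."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes, e.g. Red/Blue/Green
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)  # one probability per class, summing to 1
```

The highest-scoring class gets the largest probability, so the predicted label is simply the argmax of the output.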
Generative vs. Discriminative: Generative models (like Naive Bayes) model the distribution of individual classes, while discriminative models (like Logistic Regression) learn the boundary directly between classes.
Parametric vs. Non-parametric: Parametric models (Logistic Regression) assume a specific functional form for the boundary, whereas non-parametric models (k-NN) grow in complexity with the size of the data.
Accuracy: The ratio of correctly predicted observations to the total observations; however, it can be misleading if classes are highly imbalanced.
Precision and Recall: Precision measures the accuracy of positive predictions (avoiding false positives), while Recall (Sensitivity) measures the ability to find all positive instances (avoiding false negatives).
F1-Score: The harmonic mean of Precision and Recall, providing a single metric that balances both concerns, especially useful when class distributions are uneven.
Confusion Matrix: A table used to describe the performance of a classification model by showing the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
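All four metrics above can be read off the confusion-matrix counts. A minimal sketch computing them by hand on a tiny made-up label set (the function name is illustrative):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Build the binary confusion-matrix counts and derive
    accuracy, precision, recall, and F1 from them."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy example: six observations, one false positive and one false negative
m = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Here precision, recall, and F1 all come out to 2/3, which matches the formulas: TP = 2, FP = 1, FN = 1, and F1 is the harmonic mean of the first two.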
Check for Class Imbalance: Always look at the distribution of labels; if 99% of data is Class A, a model with 99% accuracy might just be predicting Class A every time without learning anything.
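This failure mode is easy to demonstrate: on a made-up 99%-imbalanced dataset, a "model" that always predicts the majority class scores 99% accuracy while finding zero positives.

```python
# Hypothetical dataset: 990 negatives, 10 positives (99% imbalance)
y_true = [0] * 990 + [1] * 10

# A degenerate model that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / 10
```

Accuracy is 0.99 but recall on the positive class is 0.0, which is why the metrics above (recall, F1) matter on imbalanced data.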
Verify Feature Scaling: For distance-based algorithms like k-NN or SVM, ensure features are scaled (e.g., normalized) so that features with larger numerical ranges do not dominate the distance calculation.
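A minimal sketch of min-max scaling on two hypothetical feature columns, one with a much larger numeric range than the other:

```python
def min_max_scale(values):
    """Rescale one feature column to the [0, 1] range so its raw
    magnitude no longer dominates distance calculations."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant feature carries no distance info
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 55_000, 120_000]  # large numeric range (made-up values)
ages = [25, 40, 60]                  # small numeric range
scaled_incomes = min_max_scale(incomes)
scaled_ages = min_max_scale(ages)
```

Before scaling, a Euclidean distance between two people would be driven almost entirely by income; after scaling, both features span [0, 1] and contribute comparably.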
Analyze the Confusion Matrix: When asked to evaluate a model, don't just look at accuracy; identify if the model is failing specifically on one class or confusing two specific categories.
Select the Right 'k': In k-NN, remember that a small k (e.g., k = 1) makes the model sensitive to noise (overfitting), while a very large k makes the decision boundaries too smooth (underfitting).
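The effect of k can be seen on a toy 1-D dataset containing one noisy label (all data and function names here are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify a 1-D query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy dataset: class 'A' on the left, 'B' on the right,
# plus one noisy 'B' point at x = 1.1 inside the 'A' region
train = [(0.0, "A"), (0.5, "A"), (1.1, "B"),
         (2.0, "B"), (2.5, "B"), (3.0, "B")]

pred_k1 = knn_predict(train, 1.0, k=1)  # follows the single noisy neighbor
pred_k3 = knn_predict(train, 1.0, k=3)  # majority vote smooths the noise out
```

With k = 1 the query at x = 1.0 is assigned 'B' purely because of the noisy point, while k = 3 outvotes it and returns 'A'; pushing k toward the full dataset size would instead predict the global majority class everywhere.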