Key Definitions
All important terms you need to know
ML Basics
Machine Learning
A field of AI that enables computers to learn from data without being explicitly programmed.
Spam filter that learns from examples of spam emails
Supervised Learning
Learning from labeled data where both input (X) and output (Y) are provided. The model learns the function Y = f(X).
Classification (spam/not spam), Regression (price prediction)
Unsupervised Learning
Learning from unlabeled data to find hidden patterns or structures.
Customer segmentation, anomaly detection
Feature
An input variable (column) used to make predictions. Also called predictor, attribute, or independent variable.
Age, income, location in customer data
Target
The output variable we want to predict. Also called label, ground truth, or dependent variable.
Whether customer will churn (Yes/No)
Training Example
A single row of data containing features and (in supervised learning) the target value.
One customer record with all attributes
Model
The function y = f(x) learned from training data that maps inputs to outputs.
Decision tree, neural network, SVM
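A minimal scikit-learn sketch tying these terms together (the customer data below is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Features (X): one row per training example, one column per feature
# (here: age and income). Target (y): the label we want to predict.
X = [[25, 40000], [47, 82000], [35, 61000], [52, 95000]]
y = [0, 1, 0, 1]  # e.g., 0 = will not churn, 1 = will churn

model = DecisionTreeClassifier()      # a model: a learnable function y = f(x)
model.fit(X, y)                       # supervised learning: fit f to (X, y) pairs
print(model.predict([[30, 55000]]))   # predict the target for a new example
```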
Statistics
Mean
The average value. Sum all values and divide by count.
Mean of [2,4,6] = (2+4+6)/3 = 4
Median
The middle value when data is sorted. Less affected by outliers than mean.
Median of [1,2,100] = 2 (not 34.3 like mean)
Mode
The most frequently occurring value in a dataset.
Mode of [1,2,2,3,3,3] = 3
Variance
Average of squared differences from the mean. Measures spread of data.
High variance = data is spread out
Standard Deviation
Square root of variance. Same unit as the data, easier to interpret.
σ = 5 means typical values are within 5 units of mean
Correlation
Measures the strength and direction of the linear relationship between two variables. Range: -1 to +1.
ρ = 0.9 means strong positive relationship
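All of the statistics above can be checked with Python's standard statistics module (correlation requires Python 3.10+); the inputs below reuse the examples from the definitions:

```python
import statistics as st

print(st.mean([2, 4, 6]))            # (2+4+6)/3 = 4
print(st.median([1, 2, 100]))        # 2; the mean (~34.3) is pulled up by the outlier
print(st.mode([1, 2, 2, 3, 3, 3]))   # 3, the most frequent value

xs = [1, 2, 3, 4, 5]
print(st.pvariance(xs))              # mean of squared differences from the mean: 2.0
print(st.pstdev(xs))                 # square root of variance, same unit as the data

ys = [2, 4, 5, 4, 6]
print(st.correlation(xs, ys))        # Pearson correlation, always between -1 and +1
```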
Population
The entire group being studied.
All customers of a company
Sample
A subset of the population used for analysis.
1000 randomly selected customers
Data Preprocessing
EDA
Exploratory Data Analysis - examining data to summarize characteristics and find patterns.
Checking distributions, correlations, missing values
One-Hot Encoding
Converting categorical variables to binary columns (0 or 1).
Color: Red→[1,0,0], Blue→[0,1,0], Green→[0,0,1]
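A quick sketch with pandas (one common way to do this; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green"]})
# One binary column per category; each row contains exactly one 1
print(pd.get_dummies(df, columns=["color"], dtype=int))
```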
Label Encoding
Converting categories to numbers. May create false ordinal relationships.
Red=0, Blue=1, Green=2 (implies order)
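The scikit-learn equivalent (note that it assigns codes alphabetically, so the mapping differs from the hand-written example above):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
print(encoder.fit_transform(["Red", "Blue", "Green", "Blue"]))  # [2 0 1 0]
print(encoder.classes_)  # ['Blue' 'Green' 'Red'] -- sorted, so Blue=0, Green=1, Red=2
```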
StandardScaler
Transforms data to have mean=0 and std=1.
x_scaled = (x - mean) / std
MinMaxScaler
Transforms data to range [0, 1].
x_scaled = (x - min) / (max - min)
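Both scalers implement exactly the formulas above; a minimal side-by-side comparison:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1
print(MinMaxScaler().fit_transform(X).ravel())    # [0, 0.25, 0.5, 0.75, 1]
```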
Data Leakage
When information from the test set influences training, causing overly optimistic results.
Scaling entire dataset before train/test split
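The leak-free order is: split first, then fit the scaler on the training portion only (a sketch with random placeholder data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)    # statistics come from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse those statistics; never refit on test
```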
Imputation
Filling in missing values with estimated values.
Replace missing age with mean age
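A mean-imputation sketch with scikit-learn's SimpleImputer (the ages are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[25.0], [np.nan], [40.0], [31.0]])
imputer = SimpleImputer(strategy="mean")    # replace NaN with the column mean
print(imputer.fit_transform(ages).ravel())  # [25. 32. 40. 31.]
```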
Model Evaluation
True Positive (TP)
Model correctly predicted positive class.
Predicted fraud, actually was fraud
True Negative (TN)
Model correctly predicted negative class.
Predicted not fraud, actually not fraud
False Positive (FP)
Model incorrectly predicted positive. Type I Error.
Predicted fraud, but was not fraud (false alarm)
False Negative (FN)
Model incorrectly predicted negative. Type II Error.
Predicted not fraud, but was fraud (missed)
Precision
Of all positive predictions, how many were correct? TP/(TP+FP)
When we predict fraud, how often are we right?
Recall
Of all actual positives, how many did we find? TP/(TP+FN)
Of all frauds, how many did we catch?
F1 Score
Harmonic mean of precision and recall. Balances both metrics.
F1 = 2 × (P × R) / (P + R)
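Plugging made-up confusion-matrix counts into the three formulas:

```python
# Hypothetical counts from a fraud model (tn is not needed for these metrics)
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # 80/100 = 0.80: when we predict fraud, how often right?
recall = tp / (tp + fn)      # 80/120 ~= 0.67: of all frauds, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))  # harmonic mean ~= 0.727
```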
Cross-Validation
Technique for evaluating a model by training and testing it on different data splits.
5-fold CV: train on 4 parts, test on 1, repeat 5 times
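A 5-fold example on scikit-learn's built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold CV: train on 4 folds, test on the held-out fold, rotate 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```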
Algorithms
Linear Regression
Predicts a continuous output as a weighted sum of the inputs: y = a + bx
Predicting house price from square footage
Logistic Regression
Predicts the probability of a binary outcome using the sigmoid function.
Probability of customer churn
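A quick sketch of both regressions on toy numbers (prices and labels invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: house price from square footage
sqft = np.array([[800], [1200], [1500], [2000]])
price = np.array([160000, 240000, 300000, 400000])
lin = LinearRegression().fit(sqft, price)
print(lin.intercept_, lin.coef_)   # the learned a and b in y = a + bx

# Logistic regression: probability of a binary outcome via the sigmoid
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(X, y)
print(log.predict_proba([[6]]))    # [P(class 0), P(class 1)]
```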
Decision Tree
Makes decisions by splitting data based on feature values. Easy to interpret.
If age > 30 AND income > 50k, then approve loan
Random Forest
Ensemble of decision trees, each trained on a random subset of the data. A bagging method.
100 trees vote, majority wins
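A single tree next to a 100-tree forest on the built-in iris data (scores will vary slightly by split):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# 100 trees, each on a bootstrap sample of the data; majority vote decides
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(tree.score(X_te, y_te), forest.score(X_te, y_te))
```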
SVM
Support Vector Machine finds the hyperplane that best separates the classes.
Points closest to boundary are support vectors
Naive Bayes
Probabilistic classifier assuming features are independent.
Fast, good for text classification
KNN
K-Nearest Neighbors classifies a point by the majority class of its k closest training examples.
K=5: look at 5 nearest points, majority class wins
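The k=5 voting behavior, reproduced on a one-dimensional toy dataset:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# k=5: the five nearest training points vote on the class of a new point
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[4]]))  # neighbors 3, 2, 1, 10, 11 -> classes 0,0,0,1,1 -> 0
```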
Optimization
Hyperparameter
Settings configured before training (not learned from data).
Learning rate, max_depth, number of trees
Parameter
Values learned during training.
Weights and biases in neural network
Grid Search
Tests all combinations of hyperparameter values.
Try max_depth=[3,5,7] × n_estimators=[50,100,200]
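The exact grid from the example, run with scikit-learn's GridSearchCV (iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}

# 3 x 3 = 9 combinations, each scored with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```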
Overfitting
Model learns training data too well, fails on new data. High variance.
Train accuracy 99%, test accuracy 60%
Underfitting
Model too simple to capture patterns. High bias.
Both train and test accuracy are low
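One way to spot either condition is to compare train and test scores; a sketch on a built-in dataset (exact numbers will vary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training data: near-perfect train score
# with a noticeably lower test score is the overfitting signature
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))

# A depth-1 stump may be too simple: both scores low is the underfitting signature
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
print(stump.score(X_tr, y_tr), stump.score(X_te, y_te))
```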
SMOTE
Synthetic Minority Over-sampling Technique - creates synthetic samples of minority class.
Balance 100 fraud vs 10000 normal cases
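A sketch with the imbalanced-learn package (a separate install from scikit-learn), on synthetic data with roughly the imbalance from the example:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy data: about 99% negative vs 1% positive
X, y = make_classification(n_samples=10000, weights=[0.99], random_state=0)
print(Counter(y))

# SMOTE synthesizes new minority-class points by interpolating between neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```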