Key Definitions
All important terms you need to know
ML Basics
Machine Learning
A field of AI that enables computers to learn from data without being explicitly programmed.
Spam filter that learns from examples of spam emails
Supervised Learning
Learning from labeled data where both input (X) and output (Y) are provided. The model learns the function Y = f(X).
Classification (spam/not spam), Regression (price prediction)
Unsupervised Learning
Learning from unlabeled data to find hidden patterns or structures.
Customer segmentation, anomaly detection
Feature
An input variable (column) used to make predictions. Also called predictor, attribute, or independent variable.
Age, income, location in customer data
Target
The output variable we want to predict. Also called label, ground truth, or dependent variable.
Whether customer will churn (Yes/No)
Training Example
A single row of data containing features and (in supervised learning) the target value.
One customer record with all attributes
Model
The function y = f(x) learned from training data that maps inputs to outputs.
Decision tree, neural network, SVM
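A minimal scikit-learn sketch tying these terms together (the customer data below is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Features (X): one row per training example, one column per feature
# (here: age and income). Target (y): the label we want to predict.
X = [[25, 40000], [47, 82000], [35, 61000], [52, 95000]]
y = [0, 1, 0, 1]  # e.g., 0 = will not churn, 1 = will churn

model = DecisionTreeClassifier()      # a model: a learnable function y = f(x)
model.fit(X, y)                       # supervised learning: fit f to (X, y) pairs
print(model.predict([[30, 55000]]))   # predict the target for a new example
```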
Statistics
Mean
The average value. Sum all values and divide by count.
Mean of [2,4,6] = (2+4+6)/3 = 4
Median
The middle value when data is sorted. Less affected by outliers than mean.
Median of [1,2,100] = 2 (not 34.3 like mean)
Mode
The most frequently occurring value in a dataset.
Mode of [1,2,2,3,3,3] = 3
Variance
Average of squared differences from the mean. Measures spread of data.
High variance = data is spread out
Standard Deviation
Square root of variance. Same unit as the data, easier to interpret.
σ = 5 means typical values are within 5 units of mean
Correlation
Measures the strength and direction of the linear relationship between two variables. Range: -1 to +1.
ρ = 0.9 means strong positive relationship
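All of the statistics above can be checked with Python's standard statistics module (correlation requires Python 3.10+); the inputs below reuse the examples from the definitions:

```python
import statistics as st

print(st.mean([2, 4, 6]))            # (2+4+6)/3 = 4
print(st.median([1, 2, 100]))        # 2; the mean (~34.3) is pulled up by the outlier
print(st.mode([1, 2, 2, 3, 3, 3]))   # 3, the most frequent value

xs = [1, 2, 3, 4, 5]
print(st.pvariance(xs))              # mean of squared differences from the mean: 2.0
print(st.pstdev(xs))                 # square root of variance, same unit as the data

ys = [2, 4, 5, 4, 6]
print(st.correlation(xs, ys))        # Pearson correlation, always between -1 and +1
```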
Population
The entire group being studied.
All customers of a company
Sample
A subset of the population used for analysis.
1000 randomly selected customers
Data Preprocessing
EDA
Exploratory Data Analysis - examining data to summarize characteristics and find patterns.
Checking distributions, correlations, missing values
One-Hot Encoding
Converting categorical variables to binary columns (0 or 1).
Color: Red→[1,0,0], Blue→[0,1,0], Green→[0,0,1]
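A quick sketch with pandas (one common way to do this; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green"]})
# One binary column per category; each row contains exactly one 1
print(pd.get_dummies(df, columns=["color"], dtype=int))
```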
Label Encoding
Converting categories to numbers. May create false ordinal relationships.
Red=0, Blue=1, Green=2 (implies order)
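The scikit-learn equivalent (note that it assigns codes alphabetically, so the mapping differs from the hand-written example above):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
print(encoder.fit_transform(["Red", "Blue", "Green", "Blue"]))  # [2 0 1 0]
print(encoder.classes_)  # ['Blue' 'Green' 'Red'] -- sorted, so Blue=0, Green=1, Red=2
```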
StandardScaler
Transforms data to have mean=0 and std=1.
x_scaled = (x - mean) / std
MinMaxScaler
Transforms data to range [0, 1].
x_scaled = (x - min) / (max - min)
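Both scalers implement exactly the formulas above; a minimal side-by-side comparison:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1
print(MinMaxScaler().fit_transform(X).ravel())    # [0, 0.25, 0.5, 0.75, 1]
```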
Data Leakage
When information from the test set influences training, causing overly optimistic results.
Scaling entire dataset before train/test split
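The leak-free order is: split first, then fit the scaler on the training portion only (a sketch with random placeholder data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)    # statistics come from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse those statistics; never refit on test
```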
Imputation
Filling in missing values with estimated values.
Replace missing age with mean age
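A mean-imputation sketch with scikit-learn's SimpleImputer (the ages are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[25.0], [np.nan], [40.0], [31.0]])
imputer = SimpleImputer(strategy="mean")    # replace NaN with the column mean
print(imputer.fit_transform(ages).ravel())  # [25. 32. 40. 31.]
```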
Model Evaluation
True Positive (TP)
Model correctly predicted positive class.
Predicted fraud, actually was fraud
True Negative (TN)
Model correctly predicted negative class.
Predicted not fraud, actually not fraud
False Positive (FP)
Model incorrectly predicted positive. Type I Error.
Predicted fraud, but was not fraud (false alarm)
False Negative (FN)
Model incorrectly predicted negative. Type II Error.
Predicted not fraud, but was fraud (missed)
Precision
Of all positive predictions, how many were correct? TP/(TP+FP)
When we predict fraud, how often are we right?
Recall
Of all actual positives, how many did we find? TP/(TP+FN)
Of all frauds, how many did we catch?
F1 Score
Harmonic mean of precision and recall. Balances both metrics.
F1 = 2 × (P × R) / (P + R)
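Plugging made-up confusion-matrix counts into the three formulas:

```python
# Hypothetical counts from a fraud model (tn is not needed for these metrics)
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # 80/100 = 0.80: when we predict fraud, how often right?
recall = tp / (tp + fn)      # 80/120 ~= 0.67: of all frauds, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))  # harmonic mean ~= 0.727
```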
Cross-Validation
Technique for evaluating a model by training and testing it on different data splits.
5-fold CV: train on 4 parts, test on 1, repeat 5 times
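A 5-fold example on scikit-learn's built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold CV: train on 4 folds, test on the held-out fold, rotate 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```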
Algorithms
Linear Regression
Predicts a continuous output as a weighted sum of the inputs: y = a + bx
Predicting house price from square footage
Logistic Regression
Predicts the probability of a binary outcome using the sigmoid function.
Probability of customer churn
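A quick sketch of both regressions on toy numbers (prices and labels invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: house price from square footage
sqft = np.array([[800], [1200], [1500], [2000]])
price = np.array([160000, 240000, 300000, 400000])
lin = LinearRegression().fit(sqft, price)
print(lin.intercept_, lin.coef_)   # the learned a and b in y = a + bx

# Logistic regression: probability of a binary outcome via the sigmoid
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(X, y)
print(log.predict_proba([[6]]))    # [P(class 0), P(class 1)]
```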
Decision Tree
Makes decisions by splitting data based on feature values. Easy to interpret.
If age > 30 AND income > 50k, then approve loan
Random Forest
Ensemble of decision trees, each trained on a random subset of the data. A bagging method.
100 trees vote, majority wins
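A single tree next to a 100-tree forest on the built-in iris data (scores will vary slightly by split):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# 100 trees, each on a bootstrap sample of the data; majority vote decides
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(tree.score(X_te, y_te), forest.score(X_te, y_te))
```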
SVM
Support Vector Machine finds the hyperplane that best separates the classes.
Points closest to boundary are support vectors
Naive Bayes
Probabilistic classifier assuming features are independent.
Fast, good for text classification
KNN
K-Nearest Neighbors classifies a point by the majority class of its k closest training examples.
K=5: look at 5 nearest points, majority class wins
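The k=5 voting behavior, reproduced on a one-dimensional toy dataset:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# k=5: the five nearest training points vote on the class of a new point
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[4]]))  # neighbors 3, 2, 1, 10, 11 -> classes 0,0,0,1,1 -> 0
```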
Optimization
Hyperparameter
Settings configured before training (not learned from data).
Learning rate, max_depth, number of trees
Parameter
Values learned during training.
Weights and biases in neural network
Grid Search
Tests all combinations of hyperparameter values.
Try max_depth=[3,5,7] × n_estimators=[50,100,200]
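The exact grid from the example, run with scikit-learn's GridSearchCV (iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}

# 3 x 3 = 9 combinations, each scored with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```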
Overfitting
Model learns training data too well, fails on new data. High variance.
Train accuracy 99%, test accuracy 60%
Underfitting
Model too simple to capture patterns. High bias.
Both train and test accuracy are low
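One way to spot either condition is to compare train and test scores; a sketch on a built-in dataset (exact numbers will vary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training data: near-perfect train score
# with a noticeably lower test score is the overfitting signature
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))

# A depth-1 stump may be too simple: both scores low is the underfitting signature
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
print(stump.score(X_tr, y_tr), stump.score(X_te, y_te))
```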
SMOTE
Synthetic Minority Over-sampling Technique - creates synthetic samples of minority class.
Balance 100 fraud vs 10000 normal cases
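A sketch with the imbalanced-learn package (a separate install from scikit-learn), on synthetic data with roughly the imbalance from the example:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy data: about 99% negative vs 1% positive
X, y = make_classification(n_samples=10000, weights=[0.99], random_state=0)
print(Counter(y))

# SMOTE synthesizes new minority-class points by interpolating between neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```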