Quick Reference
All key information on one page
Essential Formulas
Mean (Average)
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Variance
\[ \sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} \]
Standard Deviation
\[ \sigma = \sqrt{\sigma^2} \]
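The three formulas above can be sketched in plain Python (a minimal hand-rolled version; in practice `statistics` or NumPy would be used):

```python
# Sketch: mean, population variance (divide by n), and standard
# deviation, computed exactly as in the formulas above.
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))      # 5.0
print(variance(data))  # 4.0
print(std_dev(data))   # 2.0
```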
Accuracy
\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision
\[ Precision = \frac{TP}{TP + FP} \]
Recall (Sensitivity)
\[ Recall = \frac{TP}{TP + FN} \]
F1 Score
\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]
Specificity
\[ Specificity = \frac{TN}{TN + FP} \]
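The five classification metrics above follow directly from the four confusion-matrix counts; a minimal sketch (the counts 40/45/5/10 are made-up example values):

```python
# Sketch: accuracy, precision, recall, F1, and specificity computed
# from confusion-matrix counts, matching the formulas above.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

acc, prec, rec, f1, spec = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, rec, spec)  # 0.85 0.8 0.9
```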
StandardScaler
\[ x_{scaled} = \frac{x - \mu}{\sigma} \]
MinMaxScaler
\[ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} \]
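Both scaling formulas can be applied by hand to a plain list (a toy stand-in for scikit-learn's `StandardScaler` and `MinMaxScaler`):

```python
# Sketch: standard scaling (mean 0, std 1) and min-max scaling
# (range [0, 1]) per the formulas above, using population std.
import math

def standard_scale(xs):
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(min_max_scale([10, 20, 30]))   # [0.0, 0.5, 1.0]
print(standard_scale([10, 20, 30]))  # ~[-1.2247, 0.0, 1.2247]
```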
Pearson Correlation
\[ \rho = \frac{cov(X,Y)}{\sigma_X \cdot \sigma_Y} \]
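The correlation formula above, computed directly as covariance over the product of standard deviations (population statistics; NumPy's `corrcoef` gives the same result):

```python
# Sketch: Pearson correlation = cov(X, Y) / (sigma_X * sigma_Y).
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # ~1.0 (perfect linear relationship)
```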
Confusion Matrix
| | Predicted + | Predicted - |
|---|---|---|
| Actual + | TP (True Positive) | FN (Type II Error) |
| Actual - | FP (Type I Error) | TN (True Negative) |
Precision
TP/(TP+FP) - of predicted positives, how many are correct?
Recall
TP/(TP+FN) - of actual positives, how many were found?
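The four cells of the matrix above can be tallied from label lists (1 = positive class; `sklearn.metrics.confusion_matrix` does this in practice):

```python
# Sketch: counting TP, FN, FP, TN from true and predicted labels,
# following the table layout above (rows = actual, columns = predicted).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```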
ML Basics
Machine Learning
A field of AI that enables computers to learn from data without being explicitly programmed.
Supervised Learning
Learning from labeled data where both input (X) and output (Y) are provided. The model learns the function Y = f(X).
Unsupervised Learning
Learning from unlabeled data to find hidden patterns or structures.
Feature
An input variable (column) used to make predictions. Also called predictor, attribute, or independent variable.
Statistics
Mean
The average value. Sum all values and divide by count.
Median
The middle value when data is sorted. Less affected by outliers than mean.
Mode
The most frequently occurring value in a dataset.
Variance
Average of squared differences from the mean. Measures spread of data. (Sample variance divides by n-1 instead of n.)
Data Preprocessing
EDA
Exploratory Data Analysis - examining data to summarize characteristics and find patterns.
One-Hot Encoding
Converting categorical variables to binary columns (0 or 1).
Label Encoding
Converting categories to numbers. May create false ordinal relationships.
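The two encodings above, sketched in plain Python on a toy column (in practice `pandas.get_dummies` or scikit-learn's encoders would be used):

```python
# Sketch: label encoding vs. one-hot encoding of a categorical column.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))           # ['blue', 'green', 'red']

# Label encoding: one integer per category (implies a false order!)
label_map = {c: i for i, c in enumerate(categories)}
labels = [label_map[c] for c in colors]
print(labels)   # [2, 1, 0, 1]

# One-hot encoding: one binary column per category, no implied order
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```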
StandardScaler
Transforms data to have mean=0 and std=1.
Model Evaluation
True Positive (TP)
Model correctly predicted positive class.
True Negative (TN)
Model correctly predicted negative class.
False Positive (FP)
Model incorrectly predicted positive. Type I Error.
False Negative (FN)
Model incorrectly predicted negative. Type II Error.
Algorithms
Linear Regression
Predicts continuous output as weighted sum of inputs: y = a + bx
Logistic Regression
Predicts probability of binary outcome using sigmoid function.
Decision Tree
Makes decisions by splitting data based on feature values. Easy to interpret.
Random Forest
Ensemble of decision trees trained on random subsets. Bagging method.
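The linear-regression entry above (y = a + bx) has a closed-form least-squares solution; a minimal sketch on made-up points:

```python
# Sketch: simple linear regression via the closed form
# b = cov(x, y) / var(x),  a = mean(y) - b * mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0  (the points lie exactly on y = 1 + 2x)
```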
Optimization
Hyperparameter
Settings configured before training (not learned from data).
Parameter
Values learned during training.
Grid Search
Tests all combinations of hyperparameter values.
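Grid search is just exhaustive combination testing; a minimal sketch where a made-up `score` function stands in for cross-validated model accuracy (scikit-learn's `GridSearchCV` wraps the same idea):

```python
# Sketch: trying every hyperparameter combination and keeping the best.
from itertools import product

param_grid = {"max_depth": [2, 4, 8], "min_samples": [1, 5]}

def score(params):
    # Hypothetical stand-in for cross-validated accuracy.
    return 1.0 / (params["max_depth"] + params["min_samples"])

best_params, best_score = None, float("-inf")
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    s = score(params)
    if s > best_score:
        best_params, best_score = params, s

print(best_params)  # {'max_depth': 2, 'min_samples': 1}
```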
Overfitting
Model learns training data too well, fails on new data. High variance.
Algorithms - Scaling Required?
| Algorithm | Scaling? | Why |
|---|---|---|
| Linear/Logistic Regression | Yes | Gradient descent |
| SVM | Yes | Distance-based |
| KNN | Yes | Distance-based |
| Neural Network | Yes | Gradient descent |
| Decision Tree | No | Split-based |
| Random Forest | No | Tree-based |
| Naive Bayes | No | Probability-based |
Key Tips
Data Leakage Prevention
Always: Split → Fit on train → Transform both
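The rule above, sketched with hand-rolled standardization so the scaling statistics come from the training split only (with scikit-learn: `fit_transform` on train, `transform` on test):

```python
# Sketch: fit scaling statistics on train only, then transform both
# splits, so no information from the test set leaks into preprocessing.
import math

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]

mu = sum(train) / len(train)                         # fit on train only
sigma = math.sqrt(sum((x - mu) ** 2 for x in train) / len(train))

train_scaled = [(x - mu) / sigma for x in train]     # transform both
test_scaled = [(x - mu) / sigma for x in test]
print(test_scaled)  # far outside [-2, 2]: the test point is an outlier
```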
Metric Selection
Recall for disease/fraud detection (don't miss positives)
Precision for spam filters (don't bother users with false alarms)
ML vs Deep Learning
ML: Less data, faster, manual features
DL: More data, GPU needed, auto features