Quick Reference
All key information on one page
Essential Formulas
Mean (Average)
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Variance
\[ \sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} \]
Standard Deviation
\[ \sigma = \sqrt{\sigma^2} \]
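The three formulas above can be sketched in plain Python (a minimal hand-rolled version; in practice `statistics` or NumPy would be used):

```python
# Sketch: mean, population variance (divide by n), and standard
# deviation, computed exactly as in the formulas above.
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))      # 5.0
print(variance(data))  # 4.0
print(std_dev(data))   # 2.0
```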
Accuracy
\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision
\[ Precision = \frac{TP}{TP + FP} \]
Recall (Sensitivity)
\[ Recall = \frac{TP}{TP + FN} \]
F1 Score
\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]
Specificity
\[ Specificity = \frac{TN}{TN + FP} \]
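The five classification metrics above follow directly from the four confusion-matrix counts; a minimal sketch (the counts 40/45/5/10 are made-up example values):

```python
# Sketch: accuracy, precision, recall, F1, and specificity computed
# from confusion-matrix counts, matching the formulas above.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

acc, prec, rec, f1, spec = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, rec, spec)  # 0.85 0.8 0.9
```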
StandardScaler
\[ x_{scaled} = \frac{x - \mu}{\sigma} \]
MinMaxScaler
\[ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} \]
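Both scaling formulas can be applied by hand to a plain list (a toy stand-in for scikit-learn's `StandardScaler` and `MinMaxScaler`):

```python
# Sketch: standard scaling (mean 0, std 1) and min-max scaling
# (range [0, 1]) per the formulas above, using population std.
import math

def standard_scale(xs):
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(min_max_scale([10, 20, 30]))   # [0.0, 0.5, 1.0]
print(standard_scale([10, 20, 30]))  # ~[-1.2247, 0.0, 1.2247]
```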
Pearson Correlation
\[ \rho = \frac{cov(X,Y)}{\sigma_X \cdot \sigma_Y} \]
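The correlation formula above, computed directly as covariance over the product of standard deviations (population statistics; NumPy's `corrcoef` gives the same result):

```python
# Sketch: Pearson correlation = cov(X, Y) / (sigma_X * sigma_Y).
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # ~1.0 (perfect linear relationship)
```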
Confusion Matrix
| | Predicted + | Predicted - |
|---|---|---|
| Actual + | TP (True Positive) | FN (Type II Error) |
| Actual - | FP (Type I Error) | TN (True Negative) |
Precision
TP/(TP+FP) - of predicted positives, how many are correct?
Recall
TP/(TP+FN) - of actual positives, how many were found?
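The four cells of the matrix above can be tallied from label lists (1 = positive class; `sklearn.metrics.confusion_matrix` does this in practice):

```python
# Sketch: counting TP, FN, FP, TN from true and predicted labels,
# following the table layout above (rows = actual, columns = predicted).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```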
ML Basics
Machine Learning
A field of AI that enables computers to learn from data without being explicitly programmed.
Supervised Learning
Learning from labeled data where both input (X) and output (Y) are provided. The model learns the function Y = f(X).
Unsupervised Learning
Learning from unlabeled data to find hidden patterns or structures.
Feature
An input variable (column) used to make predictions. Also called predictor, attribute, or independent variable.
Statistics
Mean
The average value. Sum all values and divide by count.
Median
The middle value when data is sorted. Less affected by outliers than mean.
Mode
The most frequently occurring value in a dataset.
Variance
Average of squared differences from the mean. Measures spread of data. (Sample variance divides by n-1 instead of n.)
Data Preprocessing
EDA
Exploratory Data Analysis - examining data to summarize characteristics and find patterns.
One-Hot Encoding
Converting categorical variables to binary columns (0 or 1).
Label Encoding
Converting categories to numbers. May create false ordinal relationships.
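The two encodings above, sketched in plain Python on a toy column (in practice `pandas.get_dummies` or scikit-learn's encoders would be used):

```python
# Sketch: label encoding vs. one-hot encoding of a categorical column.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))           # ['blue', 'green', 'red']

# Label encoding: one integer per category (implies a false order!)
label_map = {c: i for i, c in enumerate(categories)}
labels = [label_map[c] for c in colors]
print(labels)   # [2, 1, 0, 1]

# One-hot encoding: one binary column per category, no implied order
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```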
StandardScaler
Transforms data to have mean=0 and std=1.
Model Evaluation
True Positive (TP)
Model correctly predicted positive class.
True Negative (TN)
Model correctly predicted negative class.
False Positive (FP)
Model incorrectly predicted positive. Type I Error.
False Negative (FN)
Model incorrectly predicted negative. Type II Error.
Algorithms
Linear Regression
Predicts continuous output as weighted sum of inputs: y = a + bx
Logistic Regression
Predicts probability of binary outcome using sigmoid function.
Decision Tree
Makes decisions by splitting data based on feature values. Easy to interpret.
Random Forest
Ensemble of decision trees trained on random subsets. Bagging method.
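The linear-regression entry above (y = a + bx) has a closed-form least-squares solution; a minimal sketch on made-up points:

```python
# Sketch: simple linear regression via the closed form
# b = cov(x, y) / var(x),  a = mean(y) - b * mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0  (the points lie exactly on y = 1 + 2x)
```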
Optimization
Hyperparameter
Settings configured before training (not learned from data).
Parameter
Values learned during training.
Grid Search
Tests all combinations of hyperparameter values.
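Grid search is just exhaustive combination testing; a minimal sketch where a made-up `score` function stands in for cross-validated model accuracy (scikit-learn's `GridSearchCV` wraps the same idea):

```python
# Sketch: trying every hyperparameter combination and keeping the best.
from itertools import product

param_grid = {"max_depth": [2, 4, 8], "min_samples": [1, 5]}

def score(params):
    # Hypothetical stand-in for cross-validated accuracy.
    return 1.0 / (params["max_depth"] + params["min_samples"])

best_params, best_score = None, float("-inf")
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    s = score(params)
    if s > best_score:
        best_params, best_score = params, s

print(best_params)  # {'max_depth': 2, 'min_samples': 1}
```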
Overfitting
Model learns training data too well, fails on new data. High variance.
Algorithms - Scaling Required?
| Algorithm | Scaling? | Why |
|---|---|---|
| Linear/Logistic Regression | Yes | Gradient descent |
| SVM | Yes | Distance-based |
| KNN | Yes | Distance-based |
| Neural Network | Yes | Gradient descent |
| Decision Tree | No | Split-based |
| Random Forest | No | Tree-based |
| Naive Bayes | No | Probability-based |
Key Tips
Data Leakage Prevention
Always: Split → Fit on train → Transform both
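The rule above, sketched with hand-rolled standardization so the scaling statistics come from the training split only (with scikit-learn: `fit_transform` on train, `transform` on test):

```python
# Sketch: fit scaling statistics on train only, then transform both
# splits, so no information from the test set leaks into preprocessing.
import math

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]

mu = sum(train) / len(train)                         # fit on train only
sigma = math.sqrt(sum((x - mu) ** 2 for x in train) / len(train))

train_scaled = [(x - mu) / sigma for x in train]     # transform both
test_scaled = [(x - mu) / sigma for x in test]
print(test_scaled)  # far outside [-2, 2]: the test point is an outlier
```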
Metric Selection
Recall for disease/fraud detection (don't miss positives)
Precision for spam filters (don't bother users with false alarms)
ML vs Deep Learning
ML: Less data, faster, manual features
DL: More data, GPU needed, auto features