LaTeX for Machine Learning: Math Formulas and Notation Guide
Introduction
Machine learning and deep learning rely heavily on mathematical notation to express algorithms, loss functions, optimization procedures, and model architectures. Mastering LaTeX for machine learning formulas is essential for writing academic papers, research documents, and technical documentation in the field of AI and data science.
This comprehensive guide covers everything you need to know about typesetting machine learning mathematics in LaTeX, from basic loss functions and gradient descent to complex neural network architectures, matrix operations, and probability notation. Whether you're working on research papers, thesis documents, or technical blogs, this guide will help you create professional, readable mathematical expressions.
Loss Functions
Loss functions measure how well a model's predictions match the true values. Here are common loss functions in machine learning:
Mean Squared Error (MSE)
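One common way to typeset MSE, assuming n samples with true values y_i and predictions \hat{y}_i:
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2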
Mean Absolute Error (MAE)
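With the same symbols, a typical form replaces the square with an absolute value:
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|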
Cross-Entropy Loss
For binary classification:
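One standard rendering, assuming labels y_i \in \{0, 1\} and predicted probabilities \hat{y}_i:
\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]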
For multi-class classification:
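Assuming C classes and one-hot labels y_{i,c}, a typical form is:
\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}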
Hinge Loss (SVM)
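For labels y_i \in \{-1, +1\} and raw model scores \hat{y}_i, the hinge loss is usually written as:
\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - y_i \hat{y}_i\right)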
Gradient Descent
Gradient descent is the fundamental optimization algorithm in machine learning. Here's how to typeset it:
Basic Gradient Descent
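The update rule in its most common form, with learning rate \alpha and objective J(\theta):
\theta := \theta - \alpha \nabla_\theta J(\theta)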
Gradient with Respect to Parameters
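The gradient collects the partial derivatives with respect to each parameter; one way to write it for \theta \in \mathbb{R}^d:
\nabla_\theta J(\theta) = \left( \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2}, \ldots, \frac{\partial J}{\partial \theta_d} \right)^\top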
Stochastic Gradient Descent (SGD)
Update using a single sample or mini-batch:
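For a single example (x^{(i)}, y^{(i)}), the update is typically typeset as:
\theta := \theta - \alpha \nabla_\theta J\left(\theta; x^{(i)}, y^{(i)}\right)
and, for a mini-batch \mathcal{B}:
\theta := \theta - \alpha \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta J\left(\theta; x^{(i)}, y^{(i)}\right)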
Momentum
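There are several momentum conventions; one common version, with velocity v_t and momentum coefficient \beta:
v_{t+1} = \beta v_t + \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t - \alpha v_{t+1}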
Adam Optimizer
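The standard Adam update, with gradient g_t = \nabla_\theta J(\theta_t), decay rates \beta_1 and \beta_2, and small constant \epsilon:
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta_{t+1} = \theta_t - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}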
Neural Networks
Neural networks are the foundation of deep learning. Here's how to typeset their mathematical notation:
Single Layer Forward Pass
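For input \mathbf{x}, weight matrix \mathbf{W}, and bias \mathbf{b}, the pre-activation is typically written as:
\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}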
With activation function:
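Applying an activation function \sigma element-wise:
\mathbf{a} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})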
Multi-Layer Network
Forward propagation through multiple layers:
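Using superscripts for layer indices and setting \mathbf{a}^{(0)} = \mathbf{x}, a common convention is:
\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = f^{(l)}\left(\mathbf{z}^{(l)}\right)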
Backpropagation
Error propagation and gradient computation:
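One standard set of backpropagation equations, assuming error terms \delta^{(l)}, loss \mathcal{L}, output layer L, and element-wise products written with \odot:
\delta^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot f'\left(\mathbf{z}^{(L)}\right)
\delta^{(l)} = \left( \mathbf{W}^{(l+1)} \right)^\top \delta^{(l+1)} \odot f'\left(\mathbf{z}^{(l)}\right)
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} \left( \mathbf{a}^{(l-1)} \right)^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}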
Activation Functions
Activation functions introduce non-linearity into neural networks:
Sigmoid
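The sigmoid is conventionally written as:
\sigma(x) = \frac{1}{1 + e^{-x}}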
ReLU (Rectified Linear Unit)
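Typeset with \text{} for the function name:
\text{ReLU}(x) = \max(0, x)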
Tanh
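Using the built-in \tanh command:
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}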
Softmax
For multi-class classification:
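For logits \mathbf{z} \in \mathbb{R}^C, the i-th softmax output is usually written as:
\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}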
Regularization
Regularization techniques prevent overfitting in machine learning models:
L1 Regularization (Lasso)
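One common form, adding a penalty on the parameter magnitudes with strength \lambda:
\mathcal{L}_{\text{L1}} = \mathcal{L} + \lambda \sum_{j} |\theta_j| = \mathcal{L} + \lambda \|\theta\|_1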
L2 Regularization (Ridge)
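The corresponding squared-norm penalty:
\mathcal{L}_{\text{L2}} = \mathcal{L} + \lambda \sum_{j} \theta_j^2 = \mathcal{L} + \lambda \|\theta\|_2^2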
Elastic Net
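Combining both penalties with separate coefficients \lambda_1 and \lambda_2:
\mathcal{L}_{\text{EN}} = \mathcal{L} + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2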
Dropout
During training, randomly set some activations to zero:
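One way to write this, with a mask \mathbf{m}^{(l)} drawn element-wise from a Bernoulli distribution:
\tilde{\mathbf{a}}^{(l)} = \mathbf{m}^{(l)} \odot \mathbf{a}^{(l)}, \qquad m_i^{(l)} \sim \text{Bernoulli}(p)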
where \mathbf{m}^{(l)} is a binary mask whose entries are 1 with probability p (the keep probability).
Matrix Operations
Machine learning heavily uses matrix operations. Here are common notations:
Matrix Multiplication
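Assuming \mathbf{A} \in \mathbb{R}^{m \times p} and \mathbf{B} \in \mathbb{R}^{p \times n}, matrix multiplication is typically written as:
\mathbf{C} = \mathbf{A}\mathbf{B}, \qquad C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}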
Element-wise Operations
Hadamard product (element-wise multiplication):
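Written with \odot, assuming \mathbf{A} and \mathbf{B} have the same shape:
(\mathbf{A} \odot \mathbf{B})_{ij} = A_{ij} B_{ij}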
Transpose and Inverse
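Typeset with \top for the transpose and a superscript -1 for the inverse:
\mathbf{A}^\top, \qquad \mathbf{A}^{-1}, \qquad \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}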
Frobenius Norm
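For \mathbf{A} \in \mathbb{R}^{m \times n}:
\|\mathbf{A}\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2}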
Probability and Statistics
Machine learning uses probability theory extensively:
Maximum Likelihood Estimation
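For data x_1, \ldots, x_n assumed i.i.d. under a model p(x \mid \theta), a typical formulation is:
\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta)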
Log-likelihood (often used instead):
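Taking the logarithm turns the product into a sum:
\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)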
Bayes' Theorem
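Written for parameters \theta and a dataset \mathcal{D}:
P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})}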
Expectation and Variance
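For a random variable X with density p(x), common notation is:
\mathbb{E}[X] = \int x \, p(x) \, dx, \qquad \text{Var}(X) = \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right]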
Best Practices
Consistent Notation
Use consistent notation throughout your document. Common conventions:
- Bold lowercase: \mathbf{x} for vectors
- Bold uppercase: \mathbf{W} for matrices
- Greek letters: \theta for parameters, \alpha for the learning rate
- Calligraphic: \mathcal{D} for datasets
Layer Notation
Use superscripts for layer indices: \mathbf{a}^{(l)} for activations at layer l.
Element-wise Operations
Use \odot for Hadamard product and \oslash for element-wise division.
Function Names
Use \text{} for function names like ReLU, softmax, etc.:
\text{ReLU}(x), \quad \text{softmax}(\mathbf{z})
Common Mistakes to Avoid
- Mixing vector and scalar notation: Use \mathbf{x} for vectors and regular x for scalars.
- Incorrect gradient notation: Use \nabla_\theta for gradients with respect to parameters, not just \nabla.
- Missing layer indices: Always specify layer indices in neural network notation: \mathbf{W}^{(l)}, not just \mathbf{W}.
- Incorrect probability notation: Use P for probability and p for probability density functions.
- Forgetting element-wise operations: Use \odot for element-wise multiplication, not regular \cdot.
Related Topics
Matrices in LaTeX
Learn how to write matrices and matrix operations in LaTeX, essential for neural network notation.
\mathbf{W}, \begin{pmatrix}
Derivatives in LaTeX
Master derivative notation for gradients and backpropagation in machine learning.
\nabla, \frac{\partial}{\partial}
Summations in LaTeX
Learn how to write summation notation for loss functions and expectations.
\sum