Understanding Cost Function

1. Why Do We Even Need a Cost Function?

Machine learning is fundamentally an optimization problem. Training a model repeats the same four steps:

Assume a model with parameters
Make predictions
Measure how wrong those predictions are
Adjust parameters to reduce that wrongness

That measurement of wrongness is where loss and cost functions come in.

2. From Model to Error: The Big Picture

Let’s start with a simple supervised learning pipeline:

Input (x) ──► Model f(x; θ) ──► Prediction (ŷ)
                         │
                         ▼
                   Compare with y
                         │
                         ▼
                   Loss Function
                         │
                         ▼
                   Cost Function
                         │
                         ▼
                  Optimization (GD)

Where:

x → input features
y → ground truth
ŷ → predicted output
θ → model parameters (weights, bias, etc.)

3. Loss Function vs Cost Function

This distinction is often confused, so let’s be precise.

Loss Function (ℓ)

A loss function measures error for a single training example.

$$\ell(y^{(i)}, \hat{y}^{(i)})$$

Example:

Squared error for one sample

$$\ell = (y - \hat{y})^2$$

Cost Function (J)

A cost function aggregates the per-example losses over the entire dataset of n training examples, usually by averaging them.

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y^{(i)}, \hat{y}^{(i)})$$

Loss = local (per sample)
Cost = global (over dataset)
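To make the local/global split concrete, here is a minimal sketch in NumPy. The function names and the three sample values are made up for illustration; the only thing taken from the text is that the loss is computed per sample and the cost is the mean of those losses.

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    """Per-sample loss: returns one value for each training example."""
    return (y_true - y_pred) ** 2

def cost(y_true, y_pred):
    """Cost: the mean of the per-sample losses over the whole dataset."""
    return float(np.mean(squared_error_loss(y_true, y_pred)))

# Illustrative values only (not from the article's dataset)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

losses = squared_error_loss(y_true, y_pred)  # array([1., 0., 4.]), one entry per sample
total = cost(y_true, y_pred)                 # (1 + 0 + 4) / 3, a single number
```

The loss gives a vector with one entry per example, while the cost collapses that vector into a single scalar that can be minimized.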

4. Cost Function for Linear Regression (Canonical Example)

Model Definition

$$\hat{y} = f(x) = wx + b$$

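Where:

w = weight (slope of the line)
b = bias (intercept)

Together these are the parameters θ from the pipeline above: θ = (w, b).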

Loss Function (Squared Error)

For one data point:

$$\ell^{(i)} = (y^{(i)} - (wx^{(i)} + b))^2$$

Cost Function (Mean Squared Error)

$$\boxed{J(w,b) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - (wx^{(i)} + b)\right)^2 }$$

Why squared error?

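Penalizes large errors disproportionately (doubling the error quadruples the loss)
Differentiable everywhere
Convex in (w, b) for linear regression, so any local minimum is also the global minimum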
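The first point is easy to check numerically. The sketch below compares squared error with absolute error on three hypothetical residuals; absolute error is brought in only as a point of comparison and is not part of the article's method.

```python
import numpy as np

# Hypothetical residuals (y - y_hat); the last one is an outlier
errors = np.array([1.0, 2.0, 10.0])

squared = errors ** 2       # 1, 4, 100
absolute = np.abs(errors)   # 1, 2, 10

# Fraction of the total penalty that comes from the largest residual
squared_share = squared[-1] / squared.sum()     # 100 / 105
absolute_share = absolute[-1] / absolute.sum()  # 10 / 13
```

Under squared error the single large residual accounts for roughly 95% of the total penalty, against roughly 77% under absolute error. That is also why MSE is sensitive to outliers.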

5. Geometry of the Cost Function

Cost as a Function of Parameters

For a fixed training set, the cost is not a function of x. The data are constants, so the cost depends only on the parameters.

$$J = J(w, b)$$

This means:

Every (w, b) pair has one cost value
Training = finding the (w, b) with minimum cost

Cost Surface:

X-axis → weight w
Y-axis → bias b
Z-axis → cost J(w,b)
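One way to get a feel for this surface is to evaluate J on a grid of (w, b) values and look for the lowest point. The sketch below assumes a toy dataset generated by y = 2x and an arbitrary grid range and resolution; a real cost surface would of course depend on your own data.

```python
import numpy as np

# Toy dataset, assumed for illustration: y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Grid of candidate parameters (range and step are arbitrary choices)
w_values = np.linspace(0.0, 4.0, 41)    # step 0.1
b_values = np.linspace(-2.0, 2.0, 41)   # step 0.1
W, B = np.meshgrid(w_values, b_values)  # each has shape (41, 41)

# Broadcast so every grid point gets predictions for all 4 samples
predictions = W[..., None] * X + B[..., None]  # shape (41, 41, 4)
J = np.mean((y - predictions) ** 2, axis=-1)   # cost surface, shape (41, 41)

# Locate the lowest point on the grid
row, col = np.unravel_index(np.argmin(J), J.shape)
best_w, best_b = W[row, col], B[row, col]
```

On this grid the minimum sits at w ≈ 2, b ≈ 0, the bottom of the bowl. Passing `W`, `B`, and `J` to a 3D surface or contour plot would draw the picture described above.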

Cost (Over Dataset)

Aggregates all per-sample losses into a single number
Averages out the noise of individual samples
Is the quantity the optimizer actually minimizes

6. Optimization: Why Cost Function Must Be Differentiable

To minimize cost, we use gradient descent.

$$w := w - \alpha \frac{\partial J}{\partial w}$$

$$b := b - \alpha \frac{\partial J}{\partial b}$$

Where:

α = learning rate

Gradients for Linear Regression

$$\frac{\partial J}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x^{(i)}\left(y^{(i)} - \hat{y}^{(i)}\right)$$

$$\frac{\partial J}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)$$
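If you want to see where these come from, the derivation is a single application of the chain rule to the MSE cost J(w, b) defined earlier, using ŷ = wx + b:

$$\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w}\left(y^{(i)} - (wx^{(i)} + b)\right)^2 = \frac{1}{n} \sum_{i=1}^{n} 2\left(y^{(i)} - \hat{y}^{(i)}\right)\cdot\left(-x^{(i)}\right) = -\frac{2}{n} \sum_{i=1}^{n} x^{(i)}\left(y^{(i)} - \hat{y}^{(i)}\right)$$

$$\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2\left(y^{(i)} - \hat{y}^{(i)}\right)\cdot(-1) = -\frac{2}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)$$

The minus sign appears because the prediction enters the residual with a negative sign.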

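Gradient descent needs a usable gradient wherever it lands, so the cost should be differentiable (or at least have a subgradient) almost everywhere
At a non-differentiable point, such as the kink of absolute error at zero, the update rule is undefined unless a subgradient is used instead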
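Putting the update rule and the two gradients together gives a basic training loop. This is a minimal sketch rather than a production implementation: the function names, the learning rate `alpha=0.05`, the step count, the zero initialization, and the toy dataset (y = 2x) are all assumptions chosen so that the example converges.

```python
import numpy as np

def gradients(X, y, w, b):
    """Gradients of the MSE cost with respect to w and b."""
    residual = y - (w * X + b)            # y - y_hat for every sample
    dJ_dw = -2.0 * np.mean(X * residual)  # -(2/n) * sum(x * (y - y_hat))
    dJ_db = -2.0 * np.mean(residual)      # -(2/n) * sum(y - y_hat)
    return dJ_dw, dJ_db

def gradient_descent(X, y, alpha=0.05, steps=2000):
    """Plain batch gradient descent starting from w = 0, b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Compute both gradients first so w and b are updated simultaneously
        dJ_dw, dJ_db = gradients(X, y, w, b)
        w -= alpha * dJ_dw
        b -= alpha * dJ_db
    return w, b

# Toy dataset, assumed for illustration: y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b = gradient_descent(X, y)  # converges towards w = 2, b = 0
```

The learning rate matters here: on this particular dataset, values much above 0.1 make the updates overshoot and diverge, while very small values converge slowly.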

7. Python Code: Cost Function

Dataset

import numpy as np

# Input features
X = np.array([1, 2, 3, 4])

# Ground truth labels
y = np.array([2, 4, 6, 8])

Cost Function Implementation

def compute_cost(X, y, w, b):
    """
    Computes Mean Squared Error cost function

    Parameters:
    X : input features
    y : ground truth values
    w : weight
    b : bias

    Returns:
    cost : mean squared error
    """
    n = len(X)
    total_cost = 0

    for i in range(n):
        y_hat = w * X[i] + b
        total_cost += (y[i] - y_hat) ** 2

    return total_cost / n
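The explicit loop mirrors the formula one sample at a time, which is good for learning. For larger datasets the same computation is usually written with NumPy array operations instead. The version below is a sketch of that idea; the name `compute_cost_vectorized` is made up here and it is meant to return the same values as `compute_cost`.

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    """Mean Squared Error cost computed without an explicit Python loop."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (w * X + b)   # all prediction errors at once
    return float(np.mean(residuals ** 2))

X = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])

cost_perfect = compute_cost_vectorized(X, y, w=2, b=0)  # 0.0
cost_half = compute_cost_vectorized(X, y, w=1, b=0)     # 7.5
```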

Try Different Parameters

# Try different parameter values
print("Cost with w=2, b=0:", compute_cost(X, y, w=2, b=0))
print("Cost with w=1, b=0:", compute_cost(X, y, w=1, b=0))
print("Cost with w=0, b=1:", compute_cost(X, y, w=0, b=1))

Output:

Cost with w=2, b=0: 0.0
Cost with w=1, b=0: 7.5
Cost with w=0, b=1: 21.0

| Parameters | Explanation | Cost |
| --- | --- | --- |
| w=2, b=0 | Perfect model (`y = 2x`) | 0.0 |
| w=1, b=0 | Underestimates y | 7.5 |
| w=0, b=1 | Very poor model (predicts 1 for every input) | 21.0 |

8. Final Mental Model

Loss tells you how wrong one prediction is. Cost tells you how bad your model is overall. Optimization finds parameters that minimize cost.
