Understanding Cost Function

1. Why Do We Even Need a Cost Function?

Machine learning is fundamentally an optimization problem.

  • Assume a model with parameters

  • Make predictions

  • Measure how wrong those predictions are

  • Adjust parameters to reduce that wrongness

That measurement of wrongness is where loss and cost functions come in.


2. From Model to Error: The Big Picture

Let’s start with a simple supervised learning pipeline:

Input (x) ──► Model f(x; θ) ──► Prediction (ŷ)
                         │
                         ▼
                   Compare with y
                         │
                         ▼
                   Loss Function
                         │
                         ▼
                   Cost Function
                         │
                         ▼
                  Optimization (GD)

Where:

  • x → input features

  • y → ground truth

  • ŷ → predicted output

  • θ → model parameters (weights, bias, etc.)


3. Loss Function vs Cost Function

This distinction is often confused, so let’s be precise.

Loss Function (ℓ)

A loss function measures error for a single training example.

$$\ell(y^{(i)}, \hat{y}^{(i)})$$

Example:

  • Squared error for one sample

$$\ell = (y - \hat{y})^2$$

Cost Function (J)

A cost function aggregates loss over the entire dataset.

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y^{(i)}, \hat{y}^{(i)})$$

Loss = local (per sample)
Cost = global (over dataset)
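
To make the distinction concrete, here is a minimal NumPy sketch. The function names `squared_loss` and `cost`, and the sample values, are illustrative assumptions rather than part of any library.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Per-sample squared-error loss; returns one value per example."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return (y_true - y_pred) ** 2

def cost(y_true, y_pred):
    """Cost J: the mean of the per-sample losses over the whole dataset."""
    return squared_loss(y_true, y_pred).mean()

# Hypothetical targets and predictions, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

losses = squared_loss(y_true, y_pred)  # one loss per sample (local)
J = cost(y_true, y_pred)               # a single number for the dataset (global)
print("per-sample losses:", losses)
print("cost:", J)
```

The loss is an array with one entry per example, while the cost collapses that array into the single scalar the optimizer works with.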


4. Cost Function for Linear Regression (Canonical Example)

Model Definition

$$\hat{y} = f(x) = wx + b$$

Where:

  • w = weight

  • b = bias

Loss Function (Squared Error)

For one data point:

$$\ell^{(i)} = (y^{(i)} - (wx^{(i)} + b))^2$$

Cost Function (Mean Squared Error)

$$\boxed{J(w,b) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - (wx^{(i)} + b)\right)^2 }$$

Why squared error?

  • Penalizes large errors more heavily than small ones, because the penalty grows quadratically

  • Differentiable everywhere

  • Convex for linear regression
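
Two of these properties can be spot-checked numerically. The sketch below assumes a small dataset with y = 2x and two arbitrary parameter pairs picked only for illustration; the helper name `mse` is hypothetical. A single midpoint check does not prove convexity, it only shows the inequality holding for one pair of points.

```python
import numpy as np

def mse(w, b, X, y):
    """Mean squared error of the line y_hat = w * x + b."""
    return np.mean((y - (w * X + b)) ** 2)

# Squared error grows quadratically: doubling the error quadruples the loss
errors = np.array([1.0, 2.0, 4.0])
squared = errors ** 2
print("errors:", errors, "squared losses:", squared)

# Midpoint convexity spot check on two arbitrary (hypothetical) parameter pairs
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w1, b1 = -1.0, 3.0
w2, b2 = 4.0, -2.0
mid = mse((w1 + w2) / 2, (b1 + b2) / 2, X, y)          # cost at the midpoint
avg = (mse(w1, b1, X, y) + mse(w2, b2, X, y)) / 2      # average of endpoint costs
print("J(midpoint) =", mid, "<= average of endpoints =", avg)
```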


5. Geometry of the Cost Function

Cost as a Function of Parameters

For a fixed training set, the cost is not a function of x. The data are constants, so the cost varies only with the parameters.

$$J = J(w, b)$$

This means:

  • Every (w, b) pair has one cost value

  • Training = finding the (w, b) with minimum cost

Cost Surface:

  • X-axis → weight w

  • Y-axis → bias b

  • Z-axis → cost J(w,b)
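
As a rough sketch of how such a surface could be computed, the block below evaluates J(w, b) on a grid of candidate parameters. The grid ranges and resolution are arbitrary assumptions, and the dataset is a small y = 2x example chosen for illustration.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Grid of candidate parameters (ranges and resolution are arbitrary choices)
w_values = np.linspace(0.0, 4.0, 41)
b_values = np.linspace(-2.0, 2.0, 41)
W, B = np.meshgrid(w_values, b_values)

# Broadcast to shape (41, 41, 4): one prediction per grid point per sample
predictions = W[..., None] * X + B[..., None]
J = np.mean((y - predictions) ** 2, axis=-1)  # cost surface, shape (41, 41)

# Locate the lowest point on the grid
row, col = np.unravel_index(np.argmin(J), J.shape)
print("lowest grid cost:", J[row, col])
print("at w =", W[row, col], "and b =", B[row, col])
```

The arrays `W`, `B`, and `J` could then be handed to a 3D plotting routine such as Matplotlib's `plot_surface` to draw the bowl-shaped surface.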


Cost (Over Dataset)

  • Aggregates all per-sample losses into a single number

  • Averages out noise from individual samples

  • Is the quantity that optimization minimizes


6. Optimization: Why Cost Function Must Be Differentiable

To minimize cost, we use gradient descent.

$$w := w - \alpha \frac{\partial J}{\partial w}$$

$$ b := b - \alpha \frac{\partial J}{\partial b}$$

Where:

  • α = learning rate

Gradients for Linear Regression

$$\frac{\partial J}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x^{(i)}(y^{(i)} - \hat{y}^{(i)})$$

$$\frac{\partial J}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})$$
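
These expressions follow from applying the chain rule to the MSE cost. As a short derivation, with $\hat{y}^{(i)} = wx^{(i)} + b$, differentiating each squared term with respect to w gives:

$$\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} 2\left(y^{(i)} - \hat{y}^{(i)}\right) \cdot \frac{\partial}{\partial w}\left(y^{(i)} - (wx^{(i)} + b)\right) = \frac{1}{n} \sum_{i=1}^{n} 2\left(y^{(i)} - \hat{y}^{(i)}\right) \cdot \left(-x^{(i)}\right)$$

The same steps apply for b, where the inner derivative is $-1$ instead of $-x^{(i)}$:

$$\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2\left(y^{(i)} - \hat{y}^{(i)}\right) \cdot (-1)$$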

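  • Gradient descent needs a gradient, so the cost function should be differentiable, ideally smooth

  • At a non-differentiable point the update rule is undefined; costs with isolated kinks are usually handled with subgradients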
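
The update rule and the gradients above can be combined into a short training loop. This is a minimal sketch of batch gradient descent, not a tuned implementation: the function names, the starting point (w = 0, b = 0), the learning rate, and the step count are all assumptions chosen so the loop converges on a small y = 2x dataset.

```python
import numpy as np

def gradients(X, y, w, b):
    """Gradients of the MSE cost with respect to w and b."""
    residual = y - (w * X + b)            # y - y_hat for every sample
    dJ_dw = -2.0 * np.mean(X * residual)  # -(2/n) * sum(x * (y - y_hat))
    dJ_db = -2.0 * np.mean(residual)      # -(2/n) * sum(y - y_hat)
    return dJ_dw, dJ_db

def gradient_descent(X, y, alpha=0.05, steps=2000):
    """Batch gradient descent starting from w = 0, b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dJ_dw, dJ_db = gradients(X, y, w, b)
        w -= alpha * dJ_dw  # w := w - alpha * dJ/dw
        b -= alpha * dJ_db  # b := b - alpha * dJ/db
    return w, b

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w, b = gradient_descent(X, y)
print("learned w:", w, "learned b:", b)
```

On this data the learned parameters should approach w ≈ 2 and b ≈ 0. A learning rate that is too large makes the updates diverge, so α typically needs tuning for other datasets.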


7. Python Code: Cost Function

Dataset

import numpy as np

# Input features
X = np.array([1, 2, 3, 4])

# Ground truth labels
y = np.array([2, 4, 6, 8])

Cost Function Implementation

def compute_cost(X, y, w, b):
    """
    Computes Mean Squared Error cost function

    Parameters:
    X : input features
    y : ground truth values
    w : weight
    b : bias

    Returns:
    cost : mean squared error
    """
    n = len(X)
    total_cost = 0

    for i in range(n):
        y_hat = w * X[i] + b
        total_cost += (y[i] - y_hat) ** 2

    return total_cost / n

Try Different Parameters

# Try different parameter values
print("Cost with w=2, b=0:", compute_cost(X, y, w=2, b=0))
print("Cost with w=1, b=0:", compute_cost(X, y, w=1, b=0))
print("Cost with w=0, b=1:", compute_cost(X, y, w=0, b=1))

Output:

Cost with w=2, b=0: 0.0
Cost with w=1, b=0: 7.5
Cost with w=0, b=1: 21.0

| Parameters | Explanation | Cost |
| --- | --- | --- |
| w=2, b=0 | Perfect model (y = 2x) | 0.0 |
| w=1, b=0 | Underestimates y | Higher |
| w=0, b=1 | Very poor model | Very high |
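
The explicit loop in `compute_cost` mirrors the formula closely, which helps when learning. For larger datasets the same computation is usually written with array operations. The sketch below is one possible vectorized equivalent; the name `compute_cost_vectorized` is hypothetical.

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    """MSE cost computed with array operations instead of a Python loop."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = y - (w * X + b)            # prediction errors for all samples at once
    return float(np.mean(residual ** 2))  # average of the squared errors

X = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])

# Same three parameter settings as in the loop-based example
for w, b in [(2, 0), (1, 0), (0, 1)]:
    print(f"Cost with w={w}, b={b}:", compute_cost_vectorized(X, y, w, b))
```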

8. Final Mental Model

Loss tells you how wrong one prediction is. Cost tells you how bad your model is overall. Optimization finds parameters that minimize cost.

