Understanding the Cost Function

1. Why Do We Even Need a Cost Function?
Machine learning is fundamentally an optimization problem.
- Assume a model with parameters
- Make predictions
- Measure how wrong those predictions are
- Adjust parameters to reduce that wrongness
That measurement of wrongness is where loss and cost functions come in.
2. From Model to Error: The Big Picture
Let’s start with a simple supervised learning pipeline:
Input (x) ──► Model f(x; θ) ──► Prediction (ŷ)
                                      │
                                      ▼
                                Compare with y
                                      │
                                      ▼
                                Loss Function
                                      │
                                      ▼
                                Cost Function
                                      │
                                      ▼
                              Optimization (GD)
Where:
- x → input features
- y → ground truth
- ŷ → predicted output
- θ → model parameters (weights, bias, etc.)
3. Loss Function vs Cost Function
These two terms are often conflated, so let’s be precise.
Loss Function (ℓ)
A loss function measures error for a single training example.
$$\ell(y^{(i)}, \hat{y}^{(i)})$$
Example:
- Squared error for one sample
$$\ell = (y - \hat{y})^2$$
Cost Function (J)
A cost function aggregates loss over the entire dataset.
$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y^{(i)}, \hat{y}^{(i)})$$
Loss = local (per sample)
Cost = global (over dataset)
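To make the distinction concrete, here is a minimal Python sketch. The function names `squared_loss` and `cost` and the sample values are assumptions chosen for illustration; they do not come from any library.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Per-sample loss: one number for each training example."""
    return (y_true - y_pred) ** 2

def cost(y_true, y_pred):
    """Cost: the mean of the per-sample losses over the dataset."""
    return float(np.mean(squared_loss(y_true, y_pred)))

# Illustrative values (assumed for this example)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

losses = squared_loss(y_true, y_pred)   # one loss per sample
total = cost(y_true, y_pred)            # one number for the whole dataset
print("Per-sample losses:", losses)
print("Cost:", total)
```

The loss is an array with one entry per example, while the cost collapses that array into a single scalar.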
4. Cost Function for Linear Regression (Canonical Example)
Model Definition
$$\hat{y} = f(x) = wx + b$$
Where:
- w = weight
- b = bias
Loss Function (Squared Error)
For one data point:
$$\ell^{(i)} = (y^{(i)} - (wx^{(i)} + b))^2$$
Cost Function (Mean Squared Error)
$$\boxed{J(w,b) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - (wx^{(i)} + b)\right)^2 }$$
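Why squared error?
- Penalizes large errors more heavily than small ones (the penalty grows quadratically)
- Differentiable everywhere
- Convex in (w, b) for linear regression, so any local minimum is also the global minimum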
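The first point can be checked numerically. The sketch below compares squared error with absolute error, which is brought in here only as a point of comparison. The residual values are made up for illustration.

```python
import numpy as np

# Illustrative residuals (y - y_hat); the last one is an outlier
errors = np.array([1.0, 1.0, 1.0, 10.0])

squared = errors ** 2        # per-sample squared error
absolute = np.abs(errors)    # per-sample absolute error (for comparison)

# Share of the total penalty contributed by the outlier alone
outlier_share_squared = squared[-1] / squared.sum()
outlier_share_absolute = absolute[-1] / absolute.sum()

print(f"Squared error:  outlier share = {outlier_share_squared:.2f}")
print(f"Absolute error: outlier share = {outlier_share_absolute:.2f}")
```

Under squared error the single large residual accounts for a much bigger share of the total penalty than under absolute error. This is what "penalizes large errors more" means in practice.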
5. Geometry of the Cost Function
Cost as a Function of Parameters
During training the dataset is fixed, so the cost is not a function of x. It is a function of the parameters.
$$J = J(w, b)$$
This means:
- Every (w, b) pair has exactly one cost value
- Training = finding the (w, b) with minimum cost
Cost Surface:
- X-axis → weight w
- Y-axis → bias b
- Z-axis → cost J(w,b)
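One way to get a feel for this surface is to evaluate the cost on a grid of (w, b) values and look for the lowest point. The sketch below reuses the toy dataset from the Python section of this article. The grid ranges and resolution are arbitrary choices made for illustration.

```python
import numpy as np

# Toy dataset from the article's Python section (y = 2x)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Grid of candidate parameters (ranges are an arbitrary choice)
w_values = np.linspace(0.0, 4.0, 41)
b_values = np.linspace(-2.0, 2.0, 41)
W, B = np.meshgrid(w_values, b_values)   # both have shape (41, 41)

# Broadcast so that predictions has shape (41, 41, 4)
predictions = W[..., None] * X + B[..., None]
J = np.mean((y - predictions) ** 2, axis=-1)   # cost at every grid point

# Locate the lowest point on the sampled surface
row, col = np.unravel_index(np.argmin(J), J.shape)
print("Best w:", W[row, col], "Best b:", B[row, col], "Cost:", J[row, col])
```

The array `J` holds the Z-axis values of the surface, and the grid search lands near w = 2, b = 0 for this dataset. Gradient descent reaches the same point without evaluating every grid cell.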
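Cost (Over Dataset)
- Aggregates all losses into a single number
- Smooths noise: averaging dilutes the effect of any single noisy sample
- Drives optimization: it is the quantity gradient descent minimizes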
6. Optimization: Why Cost Function Must Be Differentiable
To minimize cost, we use gradient descent.
$$w := w - \alpha \frac{\partial J}{\partial w}$$
$$b := b - \alpha \frac{\partial J}{\partial b}$$
Where:
α = learning rate
Gradients for Linear Regression
$$\frac{\partial J}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x^{(i)}\left(y^{(i)} - \hat{y}^{(i)}\right)$$
$$\frac{\partial J}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)$$
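These gradients follow from applying the chain rule to the MSE cost. A brief derivation, using the same symbols as above:

$$\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} 2\left(y^{(i)} - (wx^{(i)} + b)\right) \cdot \frac{\partial}{\partial w}\left(y^{(i)} - (wx^{(i)} + b)\right) = -\frac{2}{n} \sum_{i=1}^{n} x^{(i)}\left(y^{(i)} - \hat{y}^{(i)}\right)$$

The inner derivative with respect to w is $-x^{(i)}$. With respect to b it is $-1$, which gives the second gradient:

$$\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2\left(y^{(i)} - (wx^{(i)} + b)\right) \cdot (-1) = -\frac{2}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)$$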
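Gradient descent needs a gradient at every point it visits, so a smooth cost such as MSE is the easy case.
Costs with isolated non-differentiable points (for example, absolute error at zero) have no gradient there and need subgradients or a smoothed variant.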
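The update rule and the gradients above can be put together in a short loop. This is a minimal sketch of batch gradient descent, not a production implementation. The function name `gradient_descent`, the learning rate `alpha=0.05`, the step count, and the zero starting point are all assumptions picked so that this toy dataset converges.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, steps=2000):
    """Batch gradient descent for y_hat = w*x + b under the MSE cost."""
    w, b = 0.0, 0.0          # starting point (an arbitrary choice)
    n = len(X)
    for _ in range(steps):
        residual = y - (w * X + b)                  # y - y_hat for every sample
        grad_w = -(2.0 / n) * np.sum(X * residual)  # dJ/dw
        grad_b = -(2.0 / n) * np.sum(residual)      # dJ/db
        # Simultaneous update: both gradients were computed from the old (w, b)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Toy dataset from the article's Python section (y = 2x)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b = gradient_descent(X, y)
print(f"w = {w:.4f}, b = {b:.4f}")
```

On this dataset the loop settles close to w = 2, b = 0. A learning rate that is too large makes the updates diverge, so `alpha` usually needs tuning for other data.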
7. Python Code: Cost Function
Dataset
import numpy as np
# Input features
X = np.array([1, 2, 3, 4])
# Ground truth labels
y = np.array([2, 4, 6, 8])
Cost Function Implementation
def compute_cost(X, y, w, b):
    """
    Computes Mean Squared Error cost function

    Parameters:
        X : input features
        y : ground truth values
        w : weight
        b : bias

    Returns:
        cost : mean squared error
    """
    n = len(X)
    total_cost = 0
    for i in range(n):
        y_hat = w * X[i] + b                # prediction for sample i
        total_cost += (y[i] - y_hat) ** 2   # squared error for sample i
    return total_cost / n
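The explicit loop mirrors the formula term by term. With NumPy the same computation can also be written with array operations. The sketch below is an equivalent alternative; the name `compute_cost_vectorized` is an assumption used to keep it distinct from the loop version.

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    """Mean squared error computed with array operations instead of a loop."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (w * X + b)            # y - y_hat for all samples at once
    return float(np.mean(residuals ** 2))

X = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])

cost_perfect = compute_cost_vectorized(X, y, w=2, b=0)
cost_half_slope = compute_cost_vectorized(X, y, w=1, b=0)
print(cost_perfect, cost_half_slope)
```

Both versions return the same values. The vectorized form tends to be faster on large arrays because the arithmetic runs inside NumPy rather than in a Python loop.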
Try Different Parameters
# Try different parameter values
print("Cost with w=2, b=0:", compute_cost(X, y, w=2, b=0))
print("Cost with w=1, b=0:", compute_cost(X, y, w=1, b=0))
print("Cost with w=0, b=1:", compute_cost(X, y, w=0, b=1))
Output:
Cost with w=2, b=0: 0.0
Cost with w=1, b=0: 7.5
Cost with w=0, b=1: 21.0
| Parameters | Explanation | Cost |
| --- | --- | --- |
| w=2, b=0 | Perfect model (y = 2x) | 0.0 |
| w=1, b=0 | Underestimates y | Higher |
| w=0, b=1 | Very poor model | Very high |
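As a hand check of the middle row: with w = 1 and b = 0 the predictions are ŷ = (1, 2, 3, 4) against y = (2, 4, 6, 8), so the residuals are 1, 2, 3, 4 and

$$J(1, 0) = \frac{1}{4}\left(1^2 + 2^2 + 3^2 + 4^2\right) = \frac{30}{4} = 7.5$$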
8. Final Mental Model
Loss tells you how wrong one prediction is. Cost tells you how bad your model is overall. Optimization finds parameters that minimize cost.




