Sunday, February 15, 2026

Exercise for L1, L2, and Huber loss functions in machine learning


***You are given the numerical targets $y = [0, 1, 3, -8, 1]$ and predictions $\hat{y} = [1, 1, 1, -1, -1]$. Compute and select the correct empirical risk using the L2 loss.

Great, let’s compute the empirical risk using L2 loss step by step for your data:

1. Recall the formula

The empirical risk with L2 loss is the mean squared error (MSE):

$$R = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where:

  • $y_i$ = true target

  • $\hat{y}_i$ = prediction

  • $n$ = number of samples

2. Plug in your values

True targets: $y = [0, 1, 3, -8, 1]$

Predictions: $\hat{y} = [1, 1, 1, -1, -1]$

Compute squared errors for each sample:

  1. $(0 - 1)^2 = (-1)^2 = 1$

  2. $(1 - 1)^2 = 0^2 = 0$

  3. $(3 - 1)^2 = 2^2 = 4$

  4. $(-8 - (-1))^2 = (-7)^2 = 49$

  5. $(1 - (-1))^2 = 2^2 = 4$

3. Sum and average

$\text{SSE} = 1 + 0 + 4 + 49 + 4 = 58$

$R = \frac{58}{5} = 11.6$

Final Answer: The empirical risk (L2 loss) is 11.6.
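
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch; the function name `mse` and the variable names are illustrative choices, not part of the exercise.

```python
# Targets and predictions from the exercise.
y_true = [0, 1, 3, -8, 1]
y_pred = [1, 1, 1, -1, -1]


def mse(targets, predictions):
    """Empirical risk under L2 loss: the mean of the squared residuals."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared_errors) / len(squared_errors)


risk_l2 = mse(y_true, y_pred)
print(risk_l2)  # 11.6
```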

Would you like me to also show you how this compares to using L1 loss (absolute error) for the same data, so you can see the difference in sensitivity to large errors?

yes

Great idea — let’s compare the empirical risk using L2 loss (which we already computed as 11.6) with the empirical risk using L1 loss (absolute error) for the same data. This will show you how the two losses behave differently.

1. Recall the L1 Loss Formula

The empirical risk with L1 loss is the mean absolute error (MAE):

$$R_{L1} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

2. Compute Absolute Errors

True targets: $y = [0, 1, 3, -8, 1]$

Predictions: $\hat{y} = [1, 1, 1, -1, -1]$

Absolute errors:

  1. $|0 - 1| = 1$

  2. $|1 - 1| = 0$

  3. $|3 - 1| = 2$

  4. $|-8 - (-1)| = |-7| = 7$

  5. $|1 - (-1)| = |2| = 2$

3. Sum and Average

Sum of absolute errors $= 1 + 0 + 2 + 7 + 2 = 12$

$R_{L1} = \frac{12}{5} = 2.4$
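
The same check works for the L1 risk. Again a minimal sketch, with `mae` as an illustrative function name:

```python
# Targets and predictions from the exercise.
y_true = [0, 1, 3, -8, 1]
y_pred = [1, 1, 1, -1, -1]


def mae(targets, predictions):
    """Empirical risk under L1 loss: the mean of the absolute residuals."""
    abs_errors = [abs(t - p) for t, p in zip(targets, predictions)]
    return sum(abs_errors) / len(abs_errors)


risk_l1 = mae(y_true, y_pred)
print(risk_l1)  # 2.4
```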

4. Comparison

  • L2 loss (MSE): 11.6

  • L1 loss (MAE): 2.4

5. Interpretation

  • L2 loss penalizes large errors more strongly (the squared error of 49 from the −8 vs. −1 case dominates).

  • L1 loss treats all errors proportionally, so the large error doesn’t overwhelm the average.

  • That’s why L2 is more sensitive to outliers, while L1 is more robust.
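
One way to put a number on "dominates" is to ask what fraction of each total comes from the largest residual. The sketch below is only an illustration of that idea; the share metric is an assumption introduced here, not something the exercise asks for.

```python
errors = [-1, 0, 2, -7, 2]  # residuals y - y_hat from the exercise

abs_losses = [abs(e) for e in errors]  # per-sample L1 losses
sq_losses = [e ** 2 for e in errors]   # per-sample L2 losses

# Fraction of each total contributed by the largest residual (-7).
share_l1 = max(abs_losses) / sum(abs_losses)  # 7 / 12
share_l2 = max(sq_losses) / sum(sq_losses)    # 49 / 58

print(f"L1 share: {share_l1:.1%}")  # L1 share: 58.3%
print(f"L2 share: {share_l2:.1%}")  # L2 share: 84.5%
```

Under L2 the single outlier accounts for roughly 84% of the total loss, compared with roughly 58% under L1.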

Final Answer:

  • Empirical risk with L2 loss = 11.6

  • Empirical risk with L1 loss = 2.4

Would you like me to also show you a visual comparison (plotting the errors) so you can see how L1 vs. L2 loss curves behave differently?

yes

Perfect — let’s visualize the difference between L1 loss (absolute error) and L2 loss (squared error) so you can see how they behave differently.

1. Loss Curves

  • L1 loss (MAE):

$L_{L1}(e) = |e|$
  • Grows linearly with error.

  • Treats all errors proportionally.

  • More robust to outliers.

  • L2 loss (MSE):

$L_{L2}(e) = e^2$
  • Grows quadratically with error.

  • Penalizes large errors much more strongly.

  • Sensitive to outliers.

2. Visual Comparison (Conceptual)

Imagine plotting error e on the x-axis and loss on the y-axis:

  • L1 curve: A “V” shape — straight lines with slope ±1.

  • L2 curve: A parabola — smooth curve that grows faster as error increases.

The two curves cross at $|e| = 1$: below that, L2 is actually smaller than L1 (for example, 0.25 vs. 0.5 at $e = 0.5$). But for large errors, L2 shoots upward much faster.
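
A quick numeric look at the two curves at a few error sizes shows where they cross. The sample points below are arbitrary choices for illustration.

```python
# Evaluate both per-sample losses at a few error magnitudes.
sample_errors = [0.5, 1.0, 2.0, 7.0]
losses = {e: (abs(e), e ** 2) for e in sample_errors}  # e -> (L1, L2)

for e, (l1, l2) in losses.items():
    print(f"e = {e}: L1 = {l1}, L2 = {l2}")
```

At e = 0.5 the squared loss (0.25) is below the absolute loss (0.5), the two are equal at e = 1, and by e = 7 the squared loss is seven times larger.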

3. Applying to Your Example

Errors: $[-1, 0, 2, -7, 2]$

  • L1 loss values: [1,0,2,7,2] → average = 2.4

  • L2 loss values: [1,0,4,49,4] → average = 11.6

Notice how the single large error (-7) dominates the L2 loss (49), while in L1 it’s just 7.

4. Intuition

  • L1 (MAE): “Fair” to all errors, robust to outliers.

  • L2 (MSE): Punishes big mistakes heavily, useful when you want to strongly discourage large deviations.

✨ So visually:

  • L1 = straight “V” line.

  • L2 = curved parabola. And numerically in your case: 2.4 vs. 11.6.

Would you like me to also show you how Huber loss combines both (L1 for large errors, L2 for small ones) — a hybrid approach often used in practice?

yes

Great, let’s explore Huber loss — the hybrid between L1 and L2 — and see how it compares to what we just computed.

1. Definition

Huber loss is designed to combine the best of both worlds:

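$$L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le \delta \\ \delta \left( |e| - \frac{1}{2}\delta \right) & \text{if } |e| > \delta \end{cases}$$

  • For small errors ($|e| \le \delta$): behaves like L2 (quadratic).

  • For large errors ($|e| > \delta$): behaves like L1 (linear).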

δ is a threshold parameter that decides when to switch.
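
The piecewise definition translates almost line for line into code. This is a minimal sketch for a single residual; the function name `huber` and the default `delta=1.5` are illustrative assumptions.

```python
def huber(e, delta=1.5):
    """Huber loss for a single residual e with threshold delta."""
    if abs(e) <= delta:
        return 0.5 * e ** 2                # quadratic branch, like L2
    return delta * (abs(e) - 0.5 * delta)  # linear branch, like L1


print(huber(1.0))   # 0.5
print(huber(-7.0))  # 9.375
```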

2. Why Use Huber Loss?

  • L2 loss: Sensitive to outliers (big errors dominate).

  • L1 loss: Robust to outliers, but not differentiable at zero, which makes gradient-based optimization less smooth.

  • Huber loss: Smooth like L2 for small errors, robust like L1 for large errors.

It’s often used in regression tasks where you expect some outliers but still want efficient optimization.
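
The optimization argument is easiest to see in the derivatives of the three losses with respect to the error. The sketch below uses illustrative function names and assumes δ = 1.5; at e = 0 the L1 derivative is undefined, and returning 0 there is one common subgradient convention.

```python
def grad_l1(e):
    # Derivative of |e|: the sign of e (0 is returned at e = 0 by convention).
    return (e > 0) - (e < 0)


def grad_l2(e):
    # Derivative of e^2.
    return 2 * e


def grad_huber(e, delta=1.5):
    # Derivative of the Huber loss: e clipped to the range [-delta, delta].
    return max(-delta, min(delta, e))


for e in [-7, -0.1, 0.1, 7]:
    print(e, grad_l1(e), grad_l2(e), grad_huber(e))
```

The L1 gradient jumps from -1 to +1 across zero, the L2 gradient grows without bound as the error grows, and the Huber gradient is continuous near zero while staying capped at ±δ.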

3. Example with Your Data

Errors: $[-1, 0, 2, -7, 2]$. Let’s pick $\delta = 1.5$.

  • Error = -1 → $|e| = 1 \le 1.5$ → L2 part → $0.5 \cdot 1^2 = 0.5$

  • Error = 0 → $|e| = 0 \le 1.5$ → L2 part → $0$

  • Error = 2 → $|e| = 2 > 1.5$ → L1 part → $1.5 \cdot (2 - 0.75) = 1.5 \cdot 1.25 = 1.875$

  • Error = -7 → $|e| = 7 > 1.5$ → L1 part → $1.5 \cdot (7 - 0.75) = 1.5 \cdot 6.25 = 9.375$

  • Error = 2 → same as above → $1.875$

Sum = $0.5 + 0 + 1.875 + 9.375 + 1.875 = 13.625$

Average = $13.625 / 5 = 2.725$
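
The worked example can be reproduced over the whole error list. A minimal sketch, with `huber_losses` and `risk_huber` as illustrative names:

```python
errors = [-1, 0, 2, -7, 2]  # residuals from the exercise
delta = 1.5                 # threshold chosen in the example

# Apply the quadratic branch when |e| <= delta, the linear branch otherwise.
huber_losses = [
    0.5 * e ** 2 if abs(e) <= delta else delta * (abs(e) - 0.5 * delta)
    for e in errors
]

risk_huber = sum(huber_losses) / len(huber_losses)
print(huber_losses)  # [0.5, 0.0, 1.875, 9.375, 1.875]
print(risk_huber)    # 2.725
```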

4. Comparison

  • L2 loss (MSE): 11.6

  • L1 loss (MAE): 2.4

  • Huber loss (δ=1.5): 2.725

5. Interpretation

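  • For this data, Huber loss sits between L1 and L2.

  • It doesn’t explode like L2 when there’s a big outlier (-7 error).

  • It’s slightly higher than L1 here because the linear branch has slope δ = 1.5 rather than 1, so the -7 error costs 9.375 instead of 7; the smaller errors actually cost less under Huber (0.5 vs. 1 and 1.875 vs. 2).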

  • This makes it a robust yet smooth alternative.

Final Insight: Huber loss is often the best compromise when you want robustness to outliers but still smooth optimization.

Would you like me to also show you graphical intuition (how the three curves — L1, L2, Huber — look side by side) so you can visualize the difference clearly?

yes

Here’s the graphical intuition for how L1, L2, and Huber loss behave side by side:

1. Shapes of the Loss Functions

  • L1 loss (MAE):

    • Graph looks like a sharp “V”.

    • Increases linearly with error.

    • Outliers don’t explode the loss — they just add proportionally.

  • L2 loss (MSE):

    • Graph looks like a smooth parabola.

    • Increases quadratically with error.

    • Outliers dominate because squaring amplifies large errors.

  • Huber loss:

    • Graph looks like a parabola near the origin (small errors), then transitions into straight lines (like L1) for large errors.

    • Smooth at the switch point (δ).

    • Balances robustness (like L1) with smooth optimization (like L2).
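
The smoothness at the switch point can be checked numerically: both branches should give the same value at |e| = δ, and the slope just below δ should match the slope just above it. This is a rough sketch using one-sided finite differences; the step size `h` and the function name `huber` are assumptions made for illustration.

```python
def huber(e, delta=1.5):
    """Huber loss for a single residual e with threshold delta."""
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)


delta = 1.5
h = 1e-6  # small step for the finite-difference slopes

# Value of each branch exactly at the switch point.
quad_value = 0.5 * delta ** 2              # quadratic branch
lin_value = delta * (delta - 0.5 * delta)  # linear branch

# One-sided slopes just below and just above delta.
slope_left = (huber(delta) - huber(delta - h)) / h
slope_right = (huber(delta + h) - huber(delta)) / h

print(quad_value, lin_value)  # 1.125 1.125
print(abs(slope_left - slope_right) < 1e-5)  # True
```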

2. Intuition with Your Example

Errors: $[-1, 0, 2, -7, 2]$

  • L1: Each error contributes proportionally → average = 2.4.

  • L2: The large error (-7) dominates → average = 11.6.

  • Huber (δ=1.5): Small errors treated quadratically, large ones linearly → average ≈ 2.73.

So visually:

  • L1 = straight lines.

  • L2 = steep curve.

  • Huber = curve near zero, straight lines further out.

3. Practical Takeaway

  • Use L2 when you want to heavily penalize large deviations.

  • Use L1 when you want robustness to outliers.

  • Use Huber when you want a compromise: smooth optimization but not dominated by outliers.

✨ If you imagine plotting them:

  • All three meet at the origin (error = 0).

  • L2 grows fastest, L1 grows steadily, Huber grows like L2 near zero but switches to L1 growth beyond δ.

Would you like me to give you a small Python snippet that plots L1, L2, and Huber loss curves together, so you can visualize them directly?

yes

Here’s a simple Python snippet you can run to visualize L1, L2, and Huber loss curves side by side. This will make the differences crystal clear:

```python
import numpy as np
import matplotlib.pyplot as plt

# Define error range
errors = np.linspace(-10, 10, 200)

# L1 loss
L1 = np.abs(errors)

# L2 loss
L2 = errors**2

# Huber loss (delta = 1.5)
delta = 1.5
Huber = np.where(np.abs(errors) <= delta,
                 0.5 * errors**2,
                 delta * (np.abs(errors) - 0.5 * delta))

# Plot
plt.figure(figsize=(8,6))
plt.plot(errors, L1, label="L1 Loss (MAE)", color="blue")
plt.plot(errors, L2, label="L2 Loss (MSE)", color="red")
plt.plot(errors, Huber, label="Huber Loss (δ=1.5)", color="green")
plt.axvline(x=0, color="black", linestyle="--", linewidth=0.8)

plt.title("Comparison of L1, L2, and Huber Loss Functions")
plt.xlabel("Error")
plt.ylabel("Loss")
plt.legend()
plt.grid(True)
plt.show()
```

🔎 What you’ll see:

  • Blue (L1): Sharp “V” shape, grows linearly.

  • Red (L2): Smooth parabola, grows quadratically.

  • Green (Huber): Parabola near zero, then switches to linear growth beyond δ.

This visualization makes it obvious why:

  • L2 punishes large errors heavily.

  • L1 is robust to outliers.

  • Huber balances both.

Would you like me to also prepare a numerical table of losses for your dataset (errors = [-1, 0, 2, -7, 2]) under L1, L2, and Huber, so you can compare them side by side without plotting?

yes

Here’s a numerical comparison table of the losses for your dataset under L1, L2, and Huber loss. This way you can see side by side how each error contributes.

Dataset

  • True targets: $y = [0, 1, 3, -8, 1]$

  • Predictions: $\hat{y} = [1, 1, 1, -1, -1]$

  • Errors: $[-1, 0, 2, -7, 2]$

Loss Values

| Error | L1 Loss (absolute error) | L2 Loss (e²) | Huber Loss (δ=1.5) |
|-------|--------------------------|--------------|--------------------|
| -1 | 1 | 1 | 0.5 |
| 0 | 0 | 0 | 0 |
| 2 | 2 | 4 | 1.875 |
| -7 | 7 | 49 | 9.375 |
| 2 | 2 | 4 | 1.875 |
| Sum | 12 | 58 | 13.625 |
| Average (Empirical Risk) | 2.4 | 11.6 | 2.725 |

🔎 Interpretation

  • L1 (MAE): 2.4 → robust, treats all errors proportionally.

  • L2 (MSE): 11.6 → dominated by the large error (-7).

  • Huber (δ=1.5): 2.725 → balances small errors (quadratic) and large errors (linear).

✨ This table makes it clear:

  • L2 explodes with outliers.

  • L1 stays steady.

  • Huber sits in between, offering robustness with smooth optimization.
