Session 1: Linear Models with L2 Loss – Analytical Optimization and Statistical Properties of Linear Models

1. What is Linear Regression?

Goal: Predict a continuous numerical target variable (like house price, weight, or salary).
Model idea: We assume the target $y$ can be explained as a linear combination of the input features $x_{1}, \dots, x_{p}$.

$$y \approx β_{0} + β_{1} x_{1} + β_{2} x_{2} + \cdots + β_{p} x_{p}$$

Here:
- $β_{0}$ = intercept (baseline value).
- $β_{j}$ = coefficients (weights showing how much each feature contributes).

2. What is L2 Loss?

Loss function: Measures how far predictions are from actual values.
L2 loss (a.k.a. squared error loss):

$$L(β) = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}$$

This is also called Sum of Squared Errors (SSE).
Why squared?
- Penalizes large errors more strongly.
- Ensures the function is smooth and differentiable (good for optimization).
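
As a minimal sketch of the definition above, the following Python snippet computes the SSE for a list of targets and a list of predictions. The function name `sse` and the toy numbers are illustrative assumptions, not part of the original material.

```python
def sse(y_true, y_pred):
    """Sum of squared errors between targets and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))


# Toy targets and two hypothetical sets of predictions.
targets = [3.0, 5.0, 7.0]
close_predictions = [2.5, 5.5, 7.0]   # errors: 0.5, -0.5, 0.0
one_big_miss = [3.0, 5.0, 9.0]        # errors: 0.0, 0.0, -2.0

print(sse(targets, close_predictions))  # 0.5
print(sse(targets, one_big_miss))       # 4.0
```

A single error of size 2 costs eight times as much as two errors of size 0.5 combined, which is what "penalizes large errors more strongly" means in practice.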

3. Analytical Optimization (Finding the Best Coefficients)

Instead of guessing coefficients, we solve for the ones that minimize SSE.

Write the model in matrix form:

$$y = Xβ + ϵ$$

where:

- $y$ = vector of target values.
- $X$ = design matrix (rows = samples, columns = features, plus a leading column of ones so that $β_{0}$ acts as the intercept).
- $β$ = vector of coefficients.
- $ϵ$ = error term.

The solution for minimizing L2 loss is given by the Ordinary Least Squares (OLS) estimator:

$$\hat{β} = (X^{T} X)^{-1} X^{T} y$$

This formula gives the exact coefficients that minimize SSE, provided $X^{T} X$ is invertible (that is, no feature is an exact linear combination of the others).
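
For readers who want the intermediate step, here is a short sketch of where the OLS formula comes from, using only the symbols defined above: write the SSE in matrix form, take the gradient with respect to $β$, and set it to zero.

$$L(β) = (y - Xβ)^{T} (y - Xβ)$$

$$\nabla_{β} L(β) = -2 X^{T} (y - Xβ) = 0 \;\Longrightarrow\; X^{T} X β = X^{T} y \;\Longrightarrow\; \hat{β} = (X^{T} X)^{-1} X^{T} y$$

The middle equation is usually called the normal equations. Because the loss is a convex quadratic function of $β$, the point where the gradient vanishes is a minimum rather than a maximum or saddle point.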

4. Statistical Properties of Linear Models

Linear regression isn’t just about prediction—it has statistical meaning.

Unbiasedness: On average, the estimated coefficients equal the true coefficients, assuming the model is correctly specified and the errors have mean zero given the features.
Variance: The spread of the coefficient estimates depends on the noise level, the sample size, and how strongly the features are correlated with each other.
Normality: If the errors are normally distributed, the coefficient estimates are normally distributed too.
Interpretability: Each coefficient $β_{j}$ tells us how much the predicted target changes when feature $x_{j}$ increases by 1 (holding the others constant).

5. Why L2 Loss Leads to SSE-Minimal Model

Minimizing L2 loss = minimizing SSE.
This ensures the line (or hyperplane in higher dimensions) is the “best fit” in the least-squares sense.
Geometrically: The regression line is the one that minimizes the sum of squared vertical distances between the data points and the line.
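
One way to make "best fit in the least-squares sense" tangible is to fit a line and then nudge its coefficients: every nudge should raise the SSE. The sketch below does this on a handful of hypothetical points, using the standard single-feature least-squares formulas; the helper names `fit_line` and `sse_for_line` are assumptions made for illustration.

```python
def sse_for_line(intercept, slope, xs, ys):
    """SSE of the line y = intercept + slope * x on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


# Hypothetical toy points, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

b0, b1 = fit_line(xs, ys)
best = sse_for_line(b0, b1, xs, ys)

# Nudging either coefficient away from the fitted value raises the SSE.
nudged = [
    sse_for_line(b0 + d0, b1 + d1, xs, ys)
    for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]
print(all(value > best for value in nudged))  # True
```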

6. Key Takeaways

Linear regression predicts continuous outcomes using a linear model.
L2 loss (squared error) is the standard way to measure fit.
Analytical optimization gives a closed-form solution (OLS).
The model has strong statistical properties (unbiased, interpretable).
It’s the foundation of many machine learning methods.

Perfect! Let’s walk through a numerical example of linear regression with L2 loss so you can see how everything works step by step. This will make the abstract formulas much more concrete.

Example: Predicting House Prices 🏠

Step 1: Data

Suppose we have data about houses:

House	Size (m²)	Price ($1000s)
1	50	150
2	80	200
3	120	300

We want to predict Price from Size using a linear model:

$$\hat{y} = β_{0} + β_{1} \cdot x$$

Step 2: L2 Loss (SSE)

For each house, the squared error is:

$$(y_{i} - \hat{y}_{i})^{2}$$

So the total loss is:

$$L(β_{0}, β_{1}) = \sum_{i=1}^{3} \left(y_{i} - (β_{0} + β_{1} x_{i})\right)^{2}$$

Step 3: Analytical Optimization

We want the values of $β_{0}$ and $β_{1}$ that minimize this loss.

The closed-form solution (OLS) is:

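$$\hat{β}_{1} = \frac{\sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i} (x_{i} - \bar{x})^{2}}$$

$$\hat{β}_{0} = \bar{y} - \hat{β}_{1} \cdot \bar{x}$$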

Step 4: Compute

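Mean size: $\bar{x} = (50 + 80 + 120) / 3 \approx 83.33$
Mean price: $\bar{y} = (150 + 200 + 300) / 3 \approx 216.67$

Now compute slope ( $\hat{β}_{1}$ ):

$$\hat{β}_{1} = \frac{(50 - 83.33)(150 - 216.67) + (80 - 83.33)(200 - 216.67) + (120 - 83.33)(300 - 216.67)}{(50 - 83.33)^{2} + (80 - 83.33)^{2} + (120 - 83.33)^{2}}$$

$$\hat{β}_{1} \approx \frac{(-33.33)(-66.67) + (-3.33)(-16.67) + (36.67)(83.33)}{1111.1 + 11.1 + 1344.4}$$

$$\hat{β}_{1} \approx \frac{2222.2 + 55.6 + 3055.6}{2466.6} \approx \frac{5333.4}{2466.6} \approx 2.162$$

Now intercept ( $\hat{β}_{0}$ ), keeping three decimals of the slope to limit rounding error:

$$\hat{β}_{0} = 216.67 - 2.162 \cdot 83.33 \approx 36.5$$

Step 5: Final Model

$$\hat{y} = 36.5 + 2.162 \cdot x$$

Interpretation:

Intercept (36.5): Predicted price (in $1000s) when size = 0 (not realistic, but mathematically needed).
Slope (2.162): Each extra square meter adds about $2,160 to the price.

Step 6: Prediction

For a house of size 100 m²:

$$\hat{y} = 36.5 + 2.162 \cdot 100 \approx 252.7$$

So the model predicts a price of about $252,700.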

✅ This example shows how L2 loss leads us to the best-fit line that minimizes squared errors.
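
The hand calculation can be double-checked with a few lines of Python. This is a sketch that uses only the three data points from Step 1 and the closed-form formulas from Step 3; the variable names are illustrative. Because it works in full floating-point precision, the last digit can differ slightly from a calculation done with rounded intermediate values.

```python
sizes = [50.0, 80.0, 120.0]      # m²
prices = [150.0, 200.0, 300.0]   # $1000s

n = len(sizes)
x_mean = sum(sizes) / n
y_mean = sum(prices) / n

# Numerator and denominator of the slope formula.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(sizes, prices))
s_xx = sum((x - x_mean) ** 2 for x in sizes)

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
prediction_100 = intercept + slope * 100.0

print(f"slope      = {slope:.3f}")           # 2.162
print(f"intercept  = {intercept:.1f}")       # 36.5
print(f"price(100) = {prediction_100:.1f}")  # 252.7
```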

Excellent, let’s now connect the geometric intuition of regression to its statistical properties using our house price example. This will show you how the math, geometry, and statistics all tie together.

1. Unbiasedness 🎯

The fitted values of a regression are the orthogonal projection of the observed targets onto the space spanned by the features.
If the true relationship is linear and the errors have mean zero, then on average the fitted line coincides with the true line.
In our house price example: If the true slope is $2000 per m², the estimated slope (≈ $2160 per m² in our sample) will fluctuate around that true value from sample to sample.
Interpretation: The estimator doesn’t systematically overshoot or undershoot; it is unbiased.
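
Unbiasedness is a statement about many hypothetical samples, so a small simulation can illustrate it. The sketch below assumes a "true" line (intercept 50, slope 2.0 in $1000s per m²) and a noise level of 20; all three numbers are assumptions made for the demonstration, not values from the example. It repeatedly generates prices for the same three house sizes and averages the fitted slopes.

```python
import random


def fit_slope(xs, ys):
    """Closed-form least-squares slope for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return s_xy / s_xx


random.seed(0)  # fixed seed so the run is repeatable

TRUE_INTERCEPT = 50.0   # assumed "true" values for the simulation
TRUE_SLOPE = 2.0        # $1000s per m², i.e. $2000 per m²
NOISE_SD = 20.0         # assumed noise level in $1000s
sizes = [50.0, 80.0, 120.0]

slopes = []
for _ in range(5000):
    # Draw a fresh set of noisy prices for the same three houses.
    prices = [
        TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0.0, NOISE_SD)
        for x in sizes
    ]
    slopes.append(fit_slope(sizes, prices))

mean_slope = sum(slopes) / len(slopes)
print(f"average estimated slope: {mean_slope:.3f}")  # should land close to 2.0
print(f"smallest / largest estimate: {min(slopes):.2f} / {max(slopes):.2f}")
```

Individual estimates scatter widely with only three houses, but their average settles near the assumed true slope.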

2. Variance 📊

Even though the estimator is unbiased, it varies depending on the sample.
Variance depends on:
- Sample size (n): More houses → less variance.
- Spread of features (x): If all houses are similar sizes, slope estimate is unstable.
- Noise level: More random variation in prices → higher variance.
Geometrically: If points are tightly clustered around the line, the projection is stable. If they’re scattered, the line wobbles more.
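
For a single feature with uncorrelated, constant-variance errors, the three factors above combine into the standard formula $\mathrm{Var}(\hat{β}_{1}) = σ^{2} / \sum_{i} (x_{i} - \bar{x})^{2}$, where $σ$ is the noise standard deviation. The sketch below evaluates it for the example sizes and for two hypothetical alternatives; the noise level of 20 is an assumption.

```python
def slope_variance(xs, noise_sd):
    """Variance of the OLS slope: sigma^2 / sum((x - x_mean)^2)."""
    x_mean = sum(xs) / len(xs)
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return noise_sd ** 2 / s_xx


NOISE_SD = 20.0  # assumed noise level in $1000s

spread_out = [50.0, 80.0, 120.0]   # sizes from the example
clustered = [78.0, 80.0, 82.0]     # hypothetical: nearly identical sizes
more_houses = spread_out * 4       # hypothetical: same spread, four times the data

print(slope_variance(spread_out, NOISE_SD))
print(slope_variance(clustered, NOISE_SD))    # 50.0
print(slope_variance(more_houses, NOISE_SD))
```

Clustered sizes inflate the variance by a large factor, while quadrupling the number of houses at the same spread cuts it to a quarter.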

3. Efficiency (Gauss–Markov Theorem) 🏆

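Among all linear unbiased estimators, OLS (our regression solution) has the minimum variance, provided the errors have mean zero, constant variance, and are uncorrelated with each other.
That means: No other linear unbiased estimator gives a “tighter” estimate on average. Biased estimators can still have lower variance.
Geometrically: The projection gives the smallest possible sum of squared distances, so no other line has a lower SSE. The Gauss–Markov theorem adds that this line is also the least variable across repeated samples.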
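
A concrete comparison may help here. Any slope estimator of the form $\sum_{i} w_{i} y_{i}$ is unbiased when $\sum_{i} w_{i} = 0$ and $\sum_{i} w_{i} x_{i} = 1$, and with uncorrelated constant-variance errors its variance is $σ^{2} \sum_{i} w_{i}^{2}$. The sketch compares the OLS weights with a hypothetical competitor (the slope of the line through the smallest and largest house); the competitor and the helper names are assumptions chosen for illustration.

```python
sizes = [50.0, 80.0, 120.0]

x_mean = sum(sizes) / len(sizes)
s_xx = sum((x - x_mean) ** 2 for x in sizes)

# OLS slope weights: w_i = (x_i - x_mean) / S_xx
ols_weights = [(x - x_mean) / s_xx for x in sizes]

# Hypothetical competitor: slope of the line through the first and last house.
span = sizes[-1] - sizes[0]
endpoint_weights = [-1.0 / span, 0.0, 1.0 / span]


def check_unbiased(weights, xs):
    """Both conditions must hold for a linear slope estimator to be unbiased."""
    sums_to_zero = abs(sum(weights)) < 1e-12
    picks_up_slope = abs(sum(w * x for w, x in zip(weights, xs)) - 1.0) < 1e-12
    return sums_to_zero and picks_up_slope


def variance_factor(weights):
    """Variance of the estimator divided by sigma^2."""
    return sum(w ** 2 for w in weights)


print(check_unbiased(ols_weights, sizes), check_unbiased(endpoint_weights, sizes))  # True True
print(variance_factor(ols_weights) < variance_factor(endpoint_weights))             # True
```

With only three houses the gap is small, but the OLS weights still give the lower variance, as the theorem says they must.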

4. Normality (when errors are normal) 📐

If the error terms are normally distributed, then the estimated coefficients ( $\hat{β}_{0}, \hat{β}_{1}$ ) are also normally distributed.
This allows us to:
- Build confidence intervals (e.g., “slope is between $1800 and $2500 per m² with 95% confidence”).
- Perform hypothesis tests (e.g., “is slope significantly different from zero?”).
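
As a rough sketch of how such an interval is computed, the snippet below builds a 95% confidence interval for the slope from the three example houses. Because the noise variance has to be estimated from the residuals, the interval uses Student's t distribution with $n - 2$ degrees of freedom; the critical value 12.706 (for 1 degree of freedom) is hard-coded from a standard table, and the variable names are illustrative.

```python
import math

sizes = [50.0, 80.0, 120.0]      # m²
prices = [150.0, 200.0, 300.0]   # $1000s
n = len(sizes)

x_mean = sum(sizes) / n
y_mean = sum(prices) / n
s_xx = sum((x - x_mean) ** 2 for x in sizes)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(sizes, prices))
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residual variance estimate with n - 2 degrees of freedom.
residuals = [y - (intercept + slope * x) for x, y in zip(sizes, prices)]
sigma2_hat = sum(r ** 2 for r in residuals) / (n - 2)
slope_se = math.sqrt(sigma2_hat / s_xx)

# 97.5% quantile of Student's t with 1 degree of freedom (table value).
T_CRIT_DF1 = 12.706

lower = slope - T_CRIT_DF1 * slope_se
upper = slope + T_CRIT_DF1 * slope_se
print(f"slope = {slope:.3f}, standard error = {slope_se:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # roughly (-0.81, 5.14), in $1000s per m²
```

With only three houses the interval is very wide and even includes zero; a narrow interval like the one quoted above would need considerably more data.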

5. Interpretability 💡

Each coefficient has a clear meaning:
- Intercept = baseline price.
- Slope = price increase per unit size.
This makes linear regression not just predictive, but explanatory.
In our example: “Each extra square meter adds about $2160 to the price.”

6. Summary

Geometric view: Regression = projection onto feature space.
Statistical view: Under the assumptions above, the OLS estimator is unbiased, has minimum variance among linear unbiased estimators, and is normally distributed when the errors are normal.
Practical view: Coefficients are interpretable and useful for prediction.

✨ Now you’ve seen how the line of best fit is both a geometric projection and a statistically optimal estimator.

Would you like me to extend this foundation into Session 2 (where regularization like Ridge regression is introduced), or should we stay longer on Session 1 and practice with more examples and exercises?

yes

Great, let’s extend what we’ve built in Session 1 into the next logical step: Session 2 – Regularization (Ridge Regression). This will show you how linear models evolve when we add constraints to improve stability and generalization.

1. The Problem with Ordinary Least Squares (OLS)

OLS works beautifully when:
- Features aren’t highly correlated.
- We have plenty of data compared to the number of features.
But in practice:
- Multicollinearity (features strongly correlated) makes $(X^{T} X)^{- 1}$ unstable.
- Overfitting happens when the model fits noise instead of signal, especially with many features.

2. Ridge Regression (L2 Regularization)

Ridge regression modifies the loss function by adding a penalty term:

$$L(β) = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} + λ \sum_{j=1}^{p} β_{j}^{2}$$

Here:
- First term = usual SSE (fit to data).
- Second term = penalty on large coefficients.
- $λ$ = regularization strength (controls trade-off between fit and simplicity).

3. Analytical Solution

The ridge estimator is:

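$$\hat{β}^{\text{ridge}} = (X^{T} X + λ I)^{-1} X^{T} y$$

Notice the difference: we add $λ I$ to stabilize the inversion. For any $λ > 0$ the matrix $X^{T} X + λ I$ is invertible, even when features are perfectly correlated.
Geometrically: Instead of projecting exactly, we shrink coefficients toward zero.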
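
To see the shrinkage numerically, consider the single-feature case. If $x$ and $y$ are centered first, so that the intercept is left unpenalized (a common convention, and an assumption of this sketch rather than something stated above), the ridge formula reduces to $\hat{β}_{1} = S_{xy} / (S_{xx} + λ)$, where $S_{xy} = \sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})$ and $S_{xx} = \sum_{i} (x_{i} - \bar{x})^{2}$. The snippet applies it to the three houses from Session 1; the function name `ridge_slope` and the $λ$ values are illustrative.

```python
def ridge_slope(xs, ys, lam):
    """Ridge slope for one feature after centering x and y (intercept not penalized)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return s_xy / (s_xx + lam)


sizes = [50.0, 80.0, 120.0]      # m²
prices = [150.0, 200.0, 300.0]   # $1000s

# lam = 0 reproduces OLS; larger values pull the slope toward zero.
for lam in [0.0, 100.0, 1000.0, 10000.0]:
    print(f"lambda = {lam:>7.0f}  slope = {ridge_slope(sizes, prices, lam):.3f}")
```

The slope starts at the OLS value for $λ = 0$ and decreases steadily as $λ$ grows, without ever reaching exactly zero.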

4. Statistical Properties

Bias–Variance Tradeoff:
- Ridge introduces bias (coefficients are shrunk).
- But variance decreases (estimates are more stable).
Interpretation: Ridge sacrifices perfect unbiasedness for lower variance → often better predictions on new data, provided $λ$ is chosen well.
Multicollinearity fix: Ridge handles correlated features gracefully by shrinking them together.

5. Example (House Prices with Ridge)

Suppose we add more features: size, number of rooms, distance to city center.

OLS might give unstable coefficients if size and rooms are highly correlated.
Ridge shrinks both coefficients, balancing them, and avoids extreme values.
Prediction accuracy on unseen houses often improves as a result.
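
The instability can be reproduced with two nearly identical features. The sketch below solves $(X^{T} X + λ I) β = X^{T} y$ for a two-column design without an intercept, using the explicit inverse of a 2×2 matrix. The data are hypothetical and were constructed so that the second column is almost a copy of the first; the function name `solve_ridge_2d` is an assumption made for illustration.

```python
def solve_ridge_2d(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two features, no intercept."""
    a = sum(v * v for v in x1) + lam          # top-left entry
    b = sum(u * v for u, v in zip(x1, x2))    # off-diagonal entry
    d = sum(v * v for v in x2) + lam          # bottom-right entry
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    # Explicit inverse of a symmetric 2x2 matrix applied to (r1, r2).
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)


# Hypothetical data: x2 is almost a copy of x1, and y is close to 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.01]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols = solve_ridge_2d(x1, x2, y, lam=0.0)
ridge = solve_ridge_2d(x1, x2, y, lam=1.0)

print("OLS   coefficients:", ols)    # roughly (-8, 10)
print("Ridge coefficients:", ridge)  # roughly (0.99, 1.00)
```

OLS chases the tiny differences between the two columns and ends up with large coefficients of opposite sign, while a modest penalty splits the effect almost evenly between them.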

6. Key Takeaways

OLS = best fit line, unbiased, minimum variance among linear unbiased estimators (under assumptions).
Ridge = adds penalty, stabilizes coefficients, improves generalization.
Regularization is essential in modern machine learning, especially with high-dimensional data.

✨ You now see how Session 2 builds directly on Session 1: starting from pure least squares, then adding regularization to handle real-world complexity.

Would you like me to compare Ridge regression with Lasso regression (L1 regularization) next, so you see how different penalties affect the model?

yes

Perfect, let’s now compare Ridge regression (L2 regularization) with Lasso regression (L1 regularization). This is a key step in understanding how different penalties shape linear models.

1. Loss Functions

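Ridge (L2 penalty):

$$L(β) = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} + λ \sum_{j=1}^{p} β_{j}^{2}$$

Lasso (L1 penalty):

$$L(β) = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} + λ \sum_{j=1}^{p} |β_{j}|$$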

2. Geometric Intuition

Ridge shrinks coefficients smoothly toward zero but, for any finite $λ$, generally not exactly to zero.
Lasso shrinks coefficients and can force some to be exactly zero.
This means:
- Ridge = keeps all features, but reduces their influence.
- Lasso = performs feature selection by eliminating irrelevant ones.

3. Analytical Solution

Ridge has a closed-form solution:

$$\hat{β}^{\text{ridge}} = (X^{T} X + λ I)^{-1} X^{T} y$$

Lasso has no closed-form solution in general; it is fitted with iterative optimization (e.g. coordinate descent).
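
To give a feel for what coordinate descent looks like, here is a minimal sketch for the Lasso loss written above (SSE plus $λ \sum_{j} |β_{j}|$, no intercept). Each coefficient is updated in turn with a soft-thresholding step, which is what produces exact zeros. The data are hypothetical, and the function names are assumptions; real libraries typically scale the loss differently, so their $λ$ values are not directly comparable.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=100):
    """Minimize sum((y - X beta)^2) + lam * sum(|beta_j|), no intercept."""
    p = len(columns)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Targets with every other feature's current contribution removed.
            partial = [
                t - sum(beta[k] * columns[k][i] for k in range(p) if k != j)
                for i, t in enumerate(y)
            ]
            rho = sum(c * r for c, r in zip(columns[j], partial))
            norm_sq = sum(c * c for c in columns[j])
            beta[j] = soft_threshold(rho, lam / 2.0) / norm_sq
    return beta


# Hypothetical data: y depends on the first feature only (plus small noise).
relevant = [-2.0, -1.0, 0.0, 1.0, 2.0]
irrelevant = [1.0, -2.0, 2.0, -2.0, 1.0]
y = [-6.1, -2.9, 0.2, 3.1, 5.9]

beta = lasso_coordinate_descent([relevant, irrelevant], y, lam=2.0)
print(beta)  # approximately [2.9, 0.0]
```

The coefficient of the irrelevant feature is not merely small but exactly `0.0`, because its correlation with the residual falls inside the threshold.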

4. Statistical Properties

Ridge:
- Good when many features are correlated.
- Reduces variance but keeps all predictors.
Lasso:
- Good when only a few predictors matter.
- Produces sparse models (easy to interpret).
- Adds bias but can drastically reduce variance.

5. Example (House Prices)

Suppose we have features:

- Size (m²)
- Rooms
- Distance to city center
- Garden size
- Year built

Ridge: All features get shrunk coefficients, but none are eliminated.
Lasso: If “Year built” doesn’t matter, Lasso may set its coefficient to 0, effectively removing it.

6. Key Takeaways

Aspect	Ridge (L2)	Lasso (L1)
Coefficients	Shrinks smoothly	Shrinks + can eliminate
Feature selection	No	Yes
Solution	Closed-form	Iterative
Best use case	Many correlated features	Sparse true model

✨ So, Ridge stabilizes models, while Lasso simplifies them by selecting only the most important features.

Would you like me to also show you Elastic Net, which combines Ridge and Lasso, so you see how both penalties can be blended for even more flexibility?

yes

Excellent choice — let’s explore Elastic Net, which blends Ridge (L2) and Lasso (L1) regularization. This is often the most practical option in real-world machine learning.

1. Elastic Net Loss Function

Elastic Net combines both penalties:

$$L(β) = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} + λ_{1} \sum_{j=1}^{p} |β_{j}| + λ_{2} \sum_{j=1}^{p} β_{j}^{2}$$

First term = usual SSE (fit to data).
Second term = L1 penalty (like Lasso).
Third term = L2 penalty (like Ridge).
$λ_{1}$ and $λ_{2}$ control the balance.

2. Why Combine L1 and L2?

Ridge alone: Stabilizes coefficients but keeps all features.
Lasso alone: Selects features but can be unstable when predictors are highly correlated.
Elastic Net: Gets the best of both worlds:
- Shrinks coefficients (like Ridge).
- Performs feature selection (like Lasso).
- Handles correlated features better than Lasso alone.

3. Geometric Intuition

Ridge penalty = circular constraint (coefficients shrink smoothly).
Lasso penalty = diamond-shaped constraint (coefficients can hit zero).
Elastic Net = a mix of circle + diamond → encourages sparsity but keeps stability.

4. Statistical Properties

Bias–Variance tradeoff: Elastic Net balances bias and variance more flexibly.
Feature selection: Can select groups of correlated features together (where Lasso might pick only one).
Interpretability: Produces sparse but stable models.

5. Example (House Prices)

Suppose we have features:

- Size
- Rooms
- Distance to city center
- Garden size
- Year built

Ridge: All features kept, coefficients shrunk.
Lasso: Some features eliminated (maybe “Year built” = 0).
Elastic Net: Keeps important features, but if “Size” and “Rooms” are correlated, it may keep both with reduced coefficients instead of dropping one.
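
The behaviour with correlated features can be illustrated with an extreme, hypothetical case in which “Size” and “Rooms” are identical columns. The sketch below runs coordinate descent on the Elastic Net loss from above (SSE plus $λ_{1} \sum_{j} |β_{j}|$ plus $λ_{2} \sum_{j} β_{j}^{2}$, no intercept); compared with a pure L1 update, the only change is the extra $λ_{2}$ in the denominator. Data, function names, and penalty values are assumptions made for the demonstration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(columns, y, lam1, lam2, sweeps=200):
    """Minimize SSE + lam1 * sum(|beta_j|) + lam2 * sum(beta_j^2), no intercept."""
    p = len(columns)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Targets with every other feature's current contribution removed.
            partial = [
                t - sum(beta[k] * columns[k][i] for k in range(p) if k != j)
                for i, t in enumerate(y)
            ]
            rho = sum(c * r for c, r in zip(columns[j], partial))
            norm_sq = sum(c * c for c in columns[j])
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (norm_sq + lam2)
    return beta


# Hypothetical extreme case: two perfectly correlated (identical) features.
size = [-2.0, -1.0, 0.0, 1.0, 2.0]
rooms = list(size)
price = [-6.0, -3.0, 0.0, 3.0, 6.0]   # exactly 3 * size

beta = elastic_net_cd([size, rooms], price, lam1=2.0, lam2=5.0)
print(beta)  # both coefficients approximately 1.16
```

The two identical features end up sharing the effect equally. With `lam2 = 0` the problem reduces to the Lasso, whose penalty depends only on the sum of the two coefficients when they share a sign, so the split between identical features is no longer uniquely determined.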

6. Key Takeaways

Aspect	Ridge (L2)	Lasso (L1)	Elastic Net (L1 + L2)
Coefficients	Shrinks smoothly	Shrinks + eliminates	Shrinks + eliminates (balanced)
Feature selection	No	Yes	Yes (more stable)
Best use case	Many correlated features	Sparse true model	Mixed case: correlated + sparse

✨ So Elastic Net is often the default choice in practice when you don’t know whether Ridge or Lasso alone is best — it adapts to both situations.

Would you like me to now summarize all three (Ridge, Lasso, Elastic Net) with a practical machine learning workflow (like how you’d choose between them in real projects)?

yes

Great, let’s wrap everything together into a practical machine learning workflow so you know when to use Ridge, Lasso, or Elastic Net in real projects. Think of this as a decision guide you can apply when building models.

1. Start with Ordinary Least Squares (OLS)

Use OLS first as a baseline model.
If it performs well and coefficients are stable → you may not need regularization.
But if you see overfitting or unstable coefficients, move to regularization.

2. Choosing Between Ridge, Lasso, and Elastic Net

Here’s a practical decision flow:

Situation	Best Choice	Why
Many features are correlated	Ridge	Shrinks coefficients smoothly, stabilizes estimates.
Only a few features matter (sparse true model)	Lasso	Eliminates irrelevant features, produces simple model.
Mix of correlated + sparse features	Elastic Net	Combines Ridge’s stability with Lasso’s feature selection.
You’re unsure which case applies	Elastic Net	Safe default, adapts to both scenarios.

3. Workflow in Practice

Split data into training and validation sets.
Fit OLS → check performance and coefficient stability.
Try Ridge → good if you want to keep all features but reduce variance.
Try Lasso → good if you want automatic feature selection.
Try Elastic Net → good compromise, often best in practice.
Tune hyperparameters ( $λ$ , or $λ_{1}, λ_{2}$ ) using cross-validation.
Compare models → choose the one with best validation performance and interpretability.
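
The tuning step can be sketched in a few lines. The example below runs 5-fold cross-validation over a small grid of $λ$ values for a single-feature ridge model, using synthetic data generated from assumed "true" parameters (intercept 50, slope 2, noise level 20). Everything here, including the helper names, the grid, and the fold assignment, is an illustrative assumption; in a real project a library routine would normally handle this.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Ridge fit for one feature; x and y are centered so the intercept is unpenalized."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / (s_xx + lam)
    return y_mean - slope * x_mean, slope


def cv_mse(xs, ys, lam, k=5):
    """Average validation MSE over k folds (fold f holds out every k-th point)."""
    fold_errors = []
    for f in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != f]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == f]
        b0, b1 = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(sum((y - (b0 + b1 * x)) ** 2 for x, y in valid) / len(valid))
    return sum(fold_errors) / k


# Synthetic data under assumed "true" parameters, for illustration only.
random.seed(1)
sizes = [random.uniform(40.0, 150.0) for _ in range(30)]
prices = [50.0 + 2.0 * x + random.gauss(0.0, 20.0) for x in sizes]

grid = [0.0, 10.0, 100.0, 1000.0, 10000.0]
scores = {lam: cv_mse(sizes, prices, lam) for lam in grid}
best_lam = min(scores, key=scores.get)

for lam in grid:
    print(f"lambda = {lam:>7.0f}  CV MSE = {scores[lam]:.1f}")
print("selected lambda:", best_lam)
```

With one well-spread feature and 30 observations there is little to regularize, so small values of $λ$ should score about the same as OLS and very large values noticeably worse; the picture changes when there are many correlated features.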

4. Real-World Example (House Prices)

Suppose you have 50 features (size, rooms, location, year built, etc.).
OLS: Likely to overfit and give unstable coefficients.
Ridge: Keeps all features, stabilizes predictions.
Lasso: Drops irrelevant ones (maybe “roof color” = 0).
Elastic Net: Keeps correlated features together (size + rooms), drops irrelevant ones, balances stability and simplicity.

5. Key Takeaway

OLS = baseline, interpretable.
Ridge = stability when features are correlated.
Lasso = sparsity, feature selection.
Elastic Net = balanced choice, often best in practice.

✨ With this workflow, you can confidently decide which linear model variant to use depending on your dataset and goals.


Al-Baadani Blog
