Sunday, February 15, 2026

 


1. What is Linear Regression?

  • Goal: Predict a continuous numerical target variable (like house price, weight, or salary).

  • Model idea: We assume the target y can be explained as a linear combination of input features x.

$$y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

  • Here:

    • $\beta_0$ = intercept (baseline value).

    • $\beta_j$ = coefficients (weights showing how much each feature contributes).
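To make the "linear combination" idea concrete, here is a minimal sketch of how one prediction is computed. The function name `predict` and the coefficient values are made up for illustration; they do not come from any fitted model.

```python
def predict(beta0, betas, x):
    """Return beta0 + sum_j betas[j] * x[j] for a single sample."""
    if len(betas) != len(x):
        raise ValueError("betas and x must have the same length")
    return beta0 + sum(b * xj for b, xj in zip(betas, x))


# Hypothetical coefficients and one sample with p = 2 features.
y_hat = predict(beta0=10.0, betas=[2.0, -1.0], x=[3.0, 4.0])
print(y_hat)  # 10 + 2*3 - 1*4 = 12.0
```

Each feature is multiplied by its own weight, and the intercept is added once at the end.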

2. What is L2 Loss?

  • Loss function: Measures how far predictions are from actual values.

  • L2 loss (a.k.a. squared error loss):

$$L(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
  • This is also called Sum of Squared Errors (SSE).

  • Why squared?

    • Penalizes large errors more strongly.

    • Ensures the function is smooth and differentiable (good for optimization).
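A small sketch can show what "penalizes large errors more strongly" means in practice. The targets and predictions below are invented for illustration: both prediction sets miss by a total of 3 units, but one spreads the miss over three points and the other puts it all on one point.

```python
def sse(y_true, y_pred):
    """Sum of squared errors between targets and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))


# Hypothetical targets and two hypothetical sets of predictions.
y_true = [10.0, 20.0, 30.0]
small_errors = [11.0, 19.0, 31.0]   # three errors of size 1
one_big_error = [10.0, 20.0, 33.0]  # a single error of size 3

sse_small = sse(y_true, small_errors)
sse_big = sse(y_true, one_big_error)
print(sse_small)  # 1 + 1 + 1 = 3.0
print(sse_big)    # 0 + 0 + 9 = 9.0
```

The single large miss costs three times as much as the three small ones, even though the total absolute error is the same.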

3. Analytical Optimization (Finding the Best Coefficients)

Instead of guessing coefficients, we solve for the ones that minimize SSE.

  • Write the model in matrix form:

$$y = X\beta + \epsilon$$

where:

  • y = vector of target values.

  • X = design matrix (rows = samples, columns = features, plus a column of ones when the model includes the intercept $\beta_0$).

  • β = vector of coefficients.

  • ϵ = error term.

  • The solution for minimizing L2 loss is given by the Ordinary Least Squares (OLS) estimator:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

This formula gives the exact coefficients that minimize SSE (if $X^T X$ is invertible).
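The formula can be sketched in plain Python by solving the normal equations $(X^T X)\beta = X^T y$ directly instead of forming the inverse. The helper names `solve_linear_system` and `ols_fit` are assumptions for this sketch, and the three data points are the ones from the house price example later in this post.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so the inputs are left untouched.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (or nearly so)")
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    # Back-substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def ols_fit(X, y):
    """Return beta_hat solving the normal equations (X^T X) beta = X^T y."""
    p = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve_linear_system(xtx, xty)


# Each row is [1, size]; the leading 1 carries the intercept.
X = [[1.0, 50.0], [1.0, 80.0], [1.0, 120.0]]
y = [150.0, 200.0, 300.0]
beta_hat = ols_fit(X, y)
print(beta_hat)  # roughly [36.49, 2.162]
```

Solving the system rather than inverting $X^T X$ is the usual choice for numerical stability. In real projects a linear algebra library would normally replace the hand-written solver.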

4. Statistical Properties of Linear Models

Linear regression isn’t just about prediction—it has statistical meaning.

  • Unbiasedness: On average, the estimated coefficients equal the true coefficients (assuming the model is correctly specified and the errors have mean zero given the features).

  • Variance: The spread of coefficient estimates depends on sample size and feature correlations.

  • Normality: If errors are normally distributed, then estimates follow a normal distribution too.

  • Interpretability: Each coefficient $\beta_j$ tells us how much the target changes when feature $x_j$ increases by 1 (holding others constant).

5. Why L2 Loss Leads to SSE-Minimal Model

  • Minimizing L2 loss = minimizing SSE.

  • This ensures the line (or hyperplane in higher dimensions) is the “best fit” in the least-squares sense.

  • Geometrically: The regression line is the one that minimizes the squared vertical distances between data points and the line.

6. Key Takeaways

  • Linear regression predicts continuous outcomes using a linear model.

  • L2 loss (squared error) is the standard way to measure fit.

  • Analytical optimization gives a closed-form solution (OLS).

  • The model has strong statistical properties (unbiased, interpretable).

  • It’s the foundation of many machine learning methods.

Perfect! Let’s walk through a numerical example of linear regression with L2 loss so you can see how everything works step by step. This will make the abstract formulas much more concrete.

Example: Predicting House Prices 🏠

Step 1: Data

Suppose we have data about houses:

| House | Size (m²) | Price ($1000s) |
|-------|-----------|----------------|
| 1     | 50        | 150            |
| 2     | 80        | 200            |
| 3     | 120       | 300            |

We want to predict Price from Size using a linear model:

$$\hat{y} = \beta_0 + \beta_1 x$$

Step 2: L2 Loss (SSE)

For each house, the squared error is:

$$(y_i - \hat{y}_i)^2$$

So the total loss is:

$$L(\beta_0, \beta_1) = \sum_{i=1}^{3} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$

Step 3: Analytical Optimization

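We want the values of $\beta_0$ and $\beta_1$ that minimize this loss.

The closed-form solution (OLS) is:

$$\hat{\beta}_1 = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$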

Step 4: Compute

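  • Mean size: $\bar{x} = (50 + 80 + 120)/3 \approx 83.3$

  • Mean price: $\bar{y} = (150 + 200 + 300)/3 \approx 216.7$

Now compute the slope ($\hat{\beta}_1$). Intermediate values are rounded for display; the results use the unrounded means.

$$\hat{\beta}_1 = \frac{(50 - 83.3)(150 - 216.7) + (80 - 83.3)(200 - 216.7) + (120 - 83.3)(300 - 216.7)}{(50 - 83.3)^2 + (80 - 83.3)^2 + (120 - 83.3)^2}$$

$$\hat{\beta}_1 \approx \frac{(-33.3)(-66.7) + (-3.3)(-16.7) + (36.7)(83.3)}{1111.1 + 11.1 + 1344.4}$$

$$\hat{\beta}_1 \approx \frac{2222.2 + 55.6 + 3055.6}{2466.7} \approx 2.162$$

Now the intercept ($\hat{\beta}_0$):

$$\hat{\beta}_0 = 216.67 - 2.162 \times 83.33 \approx 36.5$$

Step 5: Final Model

$$\hat{y} = 36.5 + 2.162x$$

Interpretation:

  • Intercept (36.5): Base price when size = 0 (not realistic, but mathematically needed).

  • Slope (2.162): Each extra square meter adds about $2,160 to the price.

Step 6: Prediction

For a house of size 100 m²:

$$\hat{y} = 36.5 + 2.162 \times 100 \approx 252.7$$

So the model predicts a price of about $252,700.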

✅ This example shows how L2 loss leads us to the best-fit line that minimizes squared errors.
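The hand calculation is easy to check with a few lines of Python. This is a minimal sketch of the simple-regression formulas from Step 3; the function name `fit_simple_ols` is an assumption for this sketch. It carries full precision instead of rounding the means first, which matters here because small rounding differences in the slope shift the intercept noticeably.

```python
def fit_simple_ols(xs, ys):
    """Closed-form OLS for one feature: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


sizes = [50.0, 80.0, 120.0]     # m²
prices = [150.0, 200.0, 300.0]  # $1000s

intercept, slope = fit_simple_ols(sizes, prices)
prediction = intercept + slope * 100.0
print(round(intercept, 2), round(slope, 3), round(prediction, 1))
# 36.49 2.162 252.7
```

With unrounded arithmetic the slope is 80/37 ≈ 2.162 and the intercept is 1350/37 ≈ 36.49, so a 100 m² house is predicted at roughly $252,700.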

Excellent, let’s now connect the geometric intuition of regression to its statistical properties using our house price example. This will show you how the math, geometry, and statistics all tie together.

1. Unbiasedness 🎯

  • Geometric view: The fitted values are the orthogonal projection of the target vector y onto the space spanned by the columns of X (the feature space).

  • If the true relationship is linear and errors have mean zero, then on average the regression line will hit the true line.

  • In our house price example: If the true slope is $2000 per m², our estimated slope (≈ $2160 per m²) will fluctuate around that true value from sample to sample.

  • Interpretation: The estimator doesn’t systematically overshoot or undershoot — it’s unbiased.

2. Variance 📊

  • Even though the estimator is unbiased, it varies depending on the sample.

  • Variance depends on:

    • Sample size (n): More houses → less variance.

    • Spread of features (x): If all houses are similar sizes, slope estimate is unstable.

    • Noise level: More random variation in prices → higher variance.

  • Geometrically: If points are tightly clustered around the line, the projection is stable. If they’re scattered, the line wobbles more.
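Unbiasedness and variance are statements about what happens over many repeated samples, so a small simulation can make them tangible. Everything in this sketch is an assumption chosen for illustration: a "true" line of 50 + 2.0 × size, Gaussian noise with standard deviation 20, sizes evenly spaced between 50 and 150 m², and the names `fit_slope` and `simulate_slopes`.

```python
import random
import statistics


def fit_slope(xs, ys):
    """OLS slope for a single feature."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def simulate_slopes(n, reps, rng, beta0=50.0, beta1=2.0, noise_sd=20.0):
    """Refit the slope on `reps` fresh samples of n houses each."""
    # Sizes are evenly spaced between 50 and 150 m² (an assumption).
    xs = [50.0 + 100.0 * i / (n - 1) for i in range(n)]
    slopes = []
    for _ in range(reps):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, noise_sd) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return slopes


rng = random.Random(0)  # fixed seed so the run is repeatable
small = simulate_slopes(n=5, reps=2000, rng=rng)
large = simulate_slopes(n=50, reps=2000, rng=rng)

print("n=5:  mean", round(statistics.fmean(small), 3),
      "variance", round(statistics.variance(small), 4))
print("n=50: mean", round(statistics.fmean(large), 3),
      "variance", round(statistics.variance(large), 4))
```

Both averages should land close to the true slope of 2.0 (unbiasedness), while the spread of the estimates should be clearly smaller with 50 houses than with 5 (variance shrinking with sample size).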

3. Efficiency (Gauss–Markov Theorem) 🏆

  • Among all linear unbiased estimators, OLS (our regression solution) has the minimum variance, provided the errors have mean zero, equal variance, and are uncorrelated. Normality is not required.

  • That means: No other linear unbiased estimator gives a consistently “tighter” estimate.

  • Geometrically: The projection minimizes the squared distance between y and the fitted values, so no other line can do better in terms of SSE.
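The claim that no other line does better in terms of SSE can be checked numerically on the three houses. In this sketch the OLS coefficients are written as the exact fractions 1350/37 and 80/37 (what the closed-form formulas give for this data), and the perturbation sizes are arbitrary choices for illustration.

```python
def sse(beta0, beta1, xs, ys):
    """Sum of squared errors of the line beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


sizes = [50.0, 80.0, 120.0]
prices = [150.0, 200.0, 300.0]

# OLS coefficients for these three points, as exact fractions.
b0_ols, b1_ols = 1350 / 37, 80 / 37

best = sse(b0_ols, b1_ols, sizes, prices)
print(round(best, 2))  # 135.14

# Nudge the intercept and/or slope: the SSE only goes up.
nudges = [(5.0, 0.0), (-5.0, 0.0), (0.0, 0.1), (0.0, -0.1), (3.0, -0.05)]
perturbed = [sse(b0_ols + d0, b1_ols + d1, sizes, prices) for d0, d1 in nudges]
for (d0, d1), value in zip(nudges, perturbed):
    print(d0, d1, round(value, 2))
```

Every nudged line has a larger SSE than the OLS line, which is what "best fit in the least-squares sense" means.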

4. Normality (when errors are normal) 📐

  • If the error terms are normally distributed, then the estimated coefficients ($\hat{\beta}_0$, $\hat{\beta}_1$) are also normally distributed.

  • This allows us to:

    • Build confidence intervals (e.g., “slope is between $1800 and $2500 per m² with 95% confidence”).

    • Perform hypothesis tests (e.g., “is slope significantly different from zero?”).
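Here is a rough sketch of how a confidence interval and a t-statistic for the slope could be computed on the three-house data. It assumes normally distributed errors, as above. The critical value 12.706 (the 97.5th percentile of Student's t with 1 degree of freedom) is hard-coded from standard tables because the Python standard library has no t-distribution. The slope is in $1000s per m².

```python
import math

sizes = [50.0, 80.0, 120.0]
prices = [150.0, 200.0, 300.0]
n = len(sizes)

x_bar = sum(sizes) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in sizes)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual variance uses n - 2 degrees of freedom (two fitted coefficients).
residuals = [y - (intercept + slope * x) for x, y in zip(sizes, prices)]
sigma2_hat = sum(r ** 2 for r in residuals) / (n - 2)
se_slope = math.sqrt(sigma2_hat / s_xx)

# Assumption: t critical value for 95% confidence with n - 2 = 1 df,
# taken from standard tables rather than computed.
T_CRIT = 12.706
lower = slope - T_CRIT * se_slope
upper = slope + T_CRIT * se_slope
t_stat = slope / se_slope

print("standard error:", round(se_slope, 3))
print("95% interval:  ", round(lower, 2), "to", round(upper, 2))
print("t statistic:   ", round(t_stat, 2))
```

With only three houses the interval is very wide (roughly -0.8 to 5.1) and includes zero, so despite the visually good fit this tiny sample cannot rule out a zero slope at the 5% level. More data narrows the interval.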

5. Interpretability 💡

  • Each coefficient has a clear meaning:

    • Intercept = baseline price.

    • Slope = price increase per unit size.

  • This makes linear regression not just predictive, but explanatory.

  • In our example: “Each extra square meter adds about $2,160 to the price.”

6. Summary

  • Geometric view: Regression = projection onto feature space.

  • Statistical view: Projection ensures unbiasedness, minimum variance, and normality (under assumptions).

  • Practical view: Coefficients are interpretable and useful for prediction.

✨ Now you’ve seen how the line of best fit is both a geometric projection and a statistically optimal estimator.

Would you like me to extend this foundation into Session 2 (where regularization like Ridge regression is introduced), or should we stay longer on Session 1 and practice with more examples and exercises?

yes

Great, let’s extend what we’ve built in Session 1 into the next logical step: Session 2 – Regularization (Ridge Regression). This will show you how linear models evolve when we add constraints to improve stability and generalization.

1. The Problem with Ordinary Least Squares (OLS)

  • OLS works beautifully when:

    • Features aren’t highly correlated.

    • We have plenty of data compared to the number of features.

  • But in practice:

    • Multicollinearity (features strongly correlated) makes $(X^T X)^{-1}$ unstable.

    • Overfitting happens when the model fits noise instead of signal, especially with many features.

2. Ridge Regression (L2 Regularization)

  • Ridge regression modifies the loss function by adding a penalty term:

$$L(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
  • Here:

    • First term = usual SSE (fit to data).

    • Second term = penalty on large coefficients.

    • λ = regularization strength (controls trade-off between fit and simplicity).

3. Analytical Solution

The ridge estimator is:

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$
  • Notice the difference: we add λI to stabilize inversion.

  • Geometrically: Instead of projecting exactly, we shrink coefficients toward zero.
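For a single feature the ridge formula collapses to something small enough to compute by hand. This sketch assumes the intercept is left unpenalized (the penalty sum starts at j = 1), which is handled by centering x and y first; after centering, $X^T X$ is just the sum of squared deviations and $X^T y$ the sum of cross-products. The function name `ridge_simple` and the λ values are illustrative choices.

```python
def ridge_simple(xs, ys, lam):
    """Ridge fit for one feature; the intercept is left unpenalized."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    # After centering, (X^T X + lam * I)^{-1} X^T y reduces to this ratio.
    slope = s_xy / (s_xx + lam)
    intercept = y_bar - slope * x_bar
    return intercept, slope


sizes = [50.0, 80.0, 120.0]
prices = [150.0, 200.0, 300.0]

slopes = {}
for lam in [0.0, 500.0, 5000.0]:
    b0, b1 = ridge_simple(sizes, prices, lam)
    slopes[lam] = b1
    print(lam, round(b0, 2), round(b1, 3))
```

With λ = 0 the slope is the OLS value (about 2.162); as λ grows, the denominator grows and the slope is pulled toward zero (about 0.714 at λ = 5000).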

4. Statistical Properties

  • Bias–Variance Tradeoff:

    • Ridge introduces bias (coefficients are shrunk).

    • But variance decreases (estimates are more stable).

  • Interpretation: Ridge sacrifices perfect unbiasedness for lower variance → better predictions on new data.

  • Multicollinearity fix: Ridge handles correlated features gracefully by shrinking them together.

5. Example (House Prices with Ridge)

Suppose we add more features: size, number of rooms, distance to city center.

  • OLS might give unstable coefficients if size and rooms are highly correlated.

  • Ridge shrinks both coefficients, balancing them, and avoids extreme values.

  • Prediction accuracy on unseen houses often improves, provided λ is chosen well (for example by cross-validation).

6. Key Takeaways

  • OLS = best fit line, unbiased, minimum variance (under assumptions).

  • Ridge = adds penalty, stabilizes coefficients, improves generalization.

  • Regularization is essential in modern machine learning, especially with high-dimensional data.

✨ You now see how Session 2 builds directly on Session 1: starting from pure least squares, then adding regularization to handle real-world complexity.

Would you like me to compare Ridge regression with Lasso regression (L1 regularization) next, so you see how different penalties affect the model?

yes

Perfect, let’s now compare Ridge regression (L2 regularization) with Lasso regression (L1 regularization). This is a key step in understanding how different penalties shape linear models.

1. Loss Functions

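  • Ridge (L2 penalty):

$$L(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

  • Lasso (L1 penalty):

$$L(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$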

2. Geometric Intuition

  • Ridge shrinks coefficients smoothly toward zero but, for finite λ, generally does not set them exactly to zero.

  • Lasso shrinks coefficients and can force some to be exactly zero.

  • This means:

    • Ridge = keeps all features, but reduces their influence.

    • Lasso = performs feature selection by eliminating irrelevant ones.

3. Analytical Solution

  • Ridge has a closed-form solution:

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$
  • Lasso does not have a closed-form solution; it requires iterative optimization (like coordinate descent).
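To give a feel for what "iterative optimization" looks like, here is a minimal coordinate descent sketch for the Lasso loss exactly as written above (SSE plus λ times the sum of absolute coefficients, with no 1/2 or 1/n scaling). Library implementations often scale the loss differently, so their λ is not directly comparable. The data is invented and already centered, so no intercept is fitted; the function names are assumptions for this sketch.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0.0 inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|.

    Assumes X and y are already centered, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlate feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            z = sum(X[i][j] ** 2 for i in range(n))
            # Setting the derivative to zero gives a threshold of lam / 2.
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Hypothetical centered data: y = 3 * feature0 + 0.1 * feature1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.1, -2.9, 2.9, -3.1]

beta_ols = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=2.0)
print(beta_ols)    # approximately [3.0, 0.1]
print(beta_lasso)  # approximately [2.75, 0.0]
```

The weak second feature is set to exactly zero by the soft-thresholding step, while the strong first feature is only shrunk. That exact zero is the feature-selection behaviour that Ridge does not have.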

4. Statistical Properties

  • Ridge:

    • Good when many features are correlated.

    • Reduces variance but keeps all predictors.

  • Lasso:

    • Good when only a few predictors matter.

    • Produces sparse models (easy to interpret).

    • Adds bias but can drastically reduce variance.

5. Example (House Prices)

Suppose we have features:

  • Size (m²)

  • Rooms

  • Distance to city center

  • Garden size

  • Year built

  • Ridge: All features get shrunk coefficients, but none are eliminated.

  • Lasso: If “Year built” doesn’t matter, Lasso may set its coefficient to 0, effectively removing it.

6. Key Takeaways

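| Aspect            | Ridge (L2)               | Lasso (L1)              |
|-------------------|--------------------------|-------------------------|
| Coefficients      | Shrinks smoothly         | Shrinks + can eliminate |
| Feature selection | No                       | Yes                     |
| Solution          | Closed-form              | Iterative               |
| Best use case     | Many correlated features | Sparse true model       |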

✨ So, Ridge stabilizes models, while Lasso simplifies them by selecting only the most important features.

Would you like me to also show you Elastic Net, which combines Ridge and Lasso, so you see how both penalties can be blended for even more flexibility?

yes

Excellent choice — let’s explore Elastic Net, which blends Ridge (L2) and Lasso (L1) regularization. This is often the most practical option in real-world machine learning.

1. Elastic Net Loss Function

Elastic Net combines both penalties:

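$$L(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$$

  • First term = usual SSE (fit to data).

  • Second term = L1 penalty (like Lasso).

  • Third term = L2 penalty (like Ridge).

  • $\lambda_1$ and $\lambda_2$ control the balance.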
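A sketch of how the two penalties act together: in a coordinate descent update for the loss above, the L1 part soft-thresholds the numerator and the L2 part inflates the denominator. The function name `elastic_net_cd` and the data are invented for illustration, and the data is assumed to be centered so there is no intercept. Libraries usually parametrize the two penalties differently, so these λ values do not map directly onto their settings.

```python
def elastic_net_cd(X, y, lam1, lam2, n_sweeps=100):
    """Minimize SSE + lam1 * sum|beta_j| + lam2 * sum beta_j^2 (centered data)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlate feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            )
            z = sum(X[i][j] ** 2 for i in range(n))
            # L1 part: soft-threshold rho at lam1 / 2.
            t = lam1 / 2.0
            shrunk = max(rho - t, 0.0) if rho > 0 else min(rho + t, 0.0)
            # L2 part: lam2 inflates the denominator, as in ridge.
            beta[j] = shrunk / (z + lam2)
    return beta


# Hypothetical centered data: y = 3 * feature0 + 0.1 * feature1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.1, -2.9, 2.9, -3.1]

results = {
    "ols": elastic_net_cd(X, y, lam1=0.0, lam2=0.0),
    "lasso": elastic_net_cd(X, y, lam1=2.0, lam2=0.0),
    "ridge": elastic_net_cd(X, y, lam1=0.0, lam2=4.0),
    "elastic_net": elastic_net_cd(X, y, lam1=2.0, lam2=4.0),
}
for name, beta in results.items():
    print(name, [round(b, 4) for b in beta])
```

Setting `lam2 = 0` recovers the Lasso update and `lam1 = 0` recovers Ridge. With both switched on, the weak feature is still dropped to exactly zero and the strong one is shrunk further than under either penalty alone.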

2. Why Combine L1 and L2?

  • Ridge alone: Stabilizes coefficients but keeps all features.

  • Lasso alone: Selects features but can be unstable when predictors are highly correlated.

  • Elastic Net: Gets the best of both worlds:

    • Shrinks coefficients (like Ridge).

    • Performs feature selection (like Lasso).

    • Handles correlated features better than Lasso alone.

3. Geometric Intuition

  • Ridge penalty = circular constraint (coefficients shrink smoothly).

  • Lasso penalty = diamond-shaped constraint (coefficients can hit zero).

  • Elastic Net = a mix of circle + diamond → encourages sparsity but keeps stability.

4. Statistical Properties

  • Bias–Variance tradeoff: Elastic Net balances bias and variance more flexibly.

  • Feature selection: Can select groups of correlated features together (where Lasso might pick only one).

  • Interpretability: Produces sparse but stable models.

5. Example (House Prices)

Suppose we have features:

  • Size

  • Rooms

  • Distance to city center

  • Garden size

  • Year built

  • Ridge: All features kept, coefficients shrunk.

  • Lasso: Some features eliminated (maybe “Year built” = 0).

  • Elastic Net: Keeps important features, but if “Size” and “Rooms” are correlated, it may keep both with reduced coefficients instead of dropping one.

6. Key Takeaways

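| Aspect            | Ridge (L2)               | Lasso (L1)           | Elastic Net (L1 + L2)            |
|-------------------|--------------------------|----------------------|----------------------------------|
| Coefficients      | Shrinks smoothly         | Shrinks + eliminates | Shrinks + eliminates (balanced)  |
| Feature selection | No                       | Yes                  | Yes (more stable)                |
| Best use case     | Many correlated features | Sparse true model    | Mixed case: correlated + sparse  |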

✨ So Elastic Net is often the default choice in practice when you don’t know whether Ridge or Lasso alone is best — it adapts to both situations.

Would you like me to now summarize all three (Ridge, Lasso, Elastic Net) with a practical machine learning workflow (like how you’d choose between them in real projects)?

yes

Great, let’s wrap everything together into a practical machine learning workflow so you know when to use Ridge, Lasso, or Elastic Net in real projects. Think of this as a decision guide you can apply when building models.

1. Start with Ordinary Least Squares (OLS)

  • Use OLS first as a baseline model.

  • If it performs well and coefficients are stable → you may not need regularization.

  • But if you see overfitting or unstable coefficients, move to regularization.

2. Choosing Between Ridge, Lasso, and Elastic Net

Here’s a practical decision flow:

| Situation                                      | Best Choice | Why                                                       |
|------------------------------------------------|-------------|-----------------------------------------------------------|
| Many features are correlated                   | Ridge       | Shrinks coefficients smoothly, stabilizes estimates.      |
| Only a few features matter (sparse true model) | Lasso       | Eliminates irrelevant features, produces simple model.    |
| Mix of correlated + sparse features            | Elastic Net | Combines Ridge’s stability with Lasso’s feature selection. |
| You’re unsure which case applies               | Elastic Net | Safe default, adapts to both scenarios.                   |

3. Workflow in Practice

  1. Split data into training and validation sets.

  2. Fit OLS → check performance and coefficient stability.

  3. Try Ridge → good if you want to keep all features but reduce variance.

  4. Try Lasso → good if you want automatic feature selection.

  5. Try Elastic Net → good compromise, often best in practice.

  6. Tune hyperparameters ($\lambda$, or $\lambda_1$ and $\lambda_2$) using cross-validation.

  7. Compare models → choose the one with best validation performance and interpretability.
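Step 6 of the workflow (tuning λ by cross-validation) could be sketched like this for a one-feature ridge model. The eight (size, price) pairs are invented for illustration, as are the candidate λ values, the fold count, and the helper names `ridge_simple` and `k_fold_mse`. In a real project a library routine would normally handle both the model and the cross-validation.

```python
def ridge_simple(xs, ys, lam):
    """One-feature ridge fit with an unpenalized intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / (s_xx + lam)
    return y_bar - slope * x_bar, slope


def k_fold_mse(xs, ys, lam, k):
    """Mean squared error on held-out points, averaged over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point goes to this fold
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        b0, b1 = ridge_simple(train_x, train_y, lam)
        total += sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out)
    return total / n


# Hypothetical data: sizes in m², prices in $1000s.
sizes = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]
prices = [150.0, 165.0, 190.0, 200.0, 235.0, 250.0, 280.0, 300.0]

candidates = [0.0, 10.0, 100.0, 1000.0]
scores = {lam: k_fold_mse(sizes, prices, lam, k=4) for lam in candidates}
best_lam = min(scores, key=scores.get)

for lam, score in scores.items():
    print(lam, round(score, 2))
print("selected lambda:", best_lam)
```

Each candidate λ is scored only on points the model did not see during fitting, and the λ with the lowest held-out error is kept. The same loop works for Lasso or Elastic Net by swapping the fitting function.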

4. Real-World Example (House Prices)

  • Suppose you have 50 features (size, rooms, location, year built, etc.).

  • OLS: Overfits, unstable coefficients.

  • Ridge: Keeps all features, stabilizes predictions.

  • Lasso: Drops irrelevant ones (maybe “roof color” = 0).

  • Elastic Net: Keeps correlated features together (size + rooms), drops irrelevant ones, balances stability and simplicity.

5. Key Takeaway

  • OLS = baseline, interpretable.

  • Ridge = stability when features are correlated.

  • Lasso = sparsity, feature selection.

  • Elastic Net = balanced choice, often best in practice.

✨ With this workflow, you can confidently decide which linear model variant to use depending on your dataset and goals.

مدونة البعداني
