diff --git a/docs/ml/linear.md b/docs/ml/linear.md index 6d8e11a..92413bd 100644 --- a/docs/ml/linear.md +++ b/docs/ml/linear.md @@ -6,20 +6,20 @@ Give me 20 minutes, I will make Linear Regression click for you. -Linear regression seems simple—it's just fitting a line to data. But then you encounter a wall of jargon: *loss functions*, *gradient descent*, *cost surface*, *learning rates*, and the *normal equation*. The goal is to get past the terminology and see the simple, powerful machine at work. +Linear regression seems simple—it's just fitting a line to data. But then you encounter a wall of jargon: _loss functions_, _gradient descent_, _cost surface_, _learning rates_, and the _normal equation_. The goal is to get past the terminology and see the simple, powerful machine at work. We will start with a concrete problem: predicting a student's final exam score based on the hours they studied. Here is our data: | Hours Studied (x) | Exam Score (y) | | :--------------- | :------------- | -| 1 | 2 | -| 2 | 4 | -| 3 | 5 | -| 4 | 4 | -| 5 | 5 | +| 1 | 2 | +| 2 | 4 | +| 3 | 5 | +| 4 | 4 | +| 5 | 5 | -Our goal is to find a function that takes *hours studied* and outputs a predicted *exam score*. +Our goal is to find a function that takes _hours studied_ and outputs a predicted _exam score_. By the end of this tutorial, you will understand the fundamental components of not just linear regression, but many machine learning models. You will be able to explain, use, and even code the following from scratch. @@ -28,19 +28,22 @@ By the end of this tutorial, you will understand the fundamental components of n You will master the three pillars of this model and a one-shot analytical solution. 1. **The Model (Hypothesis Function):** The formula that makes predictions. For a single input feature like 'hours studied', it's a simple line. For multiple features, it's a plane or hyperplane. - * **Formula:** $\hat{y} = w \cdot x + b$ - * **Vector Form:** $\hat{y} = \mathbf{w}^T \mathbf{x} + b$ + + - **Formula:** $\hat{y} = w \cdot x + b$ + - **Vector Form:** $\hat{y} = \mathbf{w}^T \mathbf{x} + b$ 2. **The Loss Function (Cost Function):** A function that measures how bad our model's predictions are. Our goal is to make this number as small as possible. We will use the Mean Squared Error (MSE). - * **Formula:** $J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ + + - **Formula:** $J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ 3. **The Optimizer (Gradient Descent):** The algorithm that systematically finds the best values for `w` and `b` by minimizing the loss function. It's like walking down a hill to find the lowest point. - * **Update Rule:** - $w \leftarrow w - \alpha \frac{\partial J}{\partial w}$ - $b \leftarrow b - \alpha \frac{\partial J}{\partial b}$ + + - **Update Rule:** + $w \leftarrow w - \alpha \frac{\partial J}{\partial w}$ + $b \leftarrow b - \alpha \frac{\partial J}{\partial b}$ 4. **The Analytical Solution (Normal Equation):** A direct, one-shot formula to calculate the best `w` and `b` without any iteration. It is a powerful shortcut. - * **Formula:** $\theta = (X^T X)^{-1} X^T y$ + - **Formula:** $\theta = (X^T X)^{-1} X^T y$ These components are the engine of linear regression. Let's build it, piece by piece. @@ -51,12 +54,12 @@ These components are the engine of linear regression. Let's build it, piece by p **A Concrete Example:** Let's use our student data. 
| Hours Studied (x) | Exam Score (y) | -| :--------------- | :------------- | -| 1 | 2 | -| 2 | 4 | -| 3 | 5 | -| 4 | 4 | -| 5 | 5 | +| :---------------- | :------------- | +| 1 | 2 | +| 2 | 4 | +| 3 | 5 | +| 4 | 4 | +| 5 | 5 | If we plot this, we get a scatter of points. @@ -83,10 +86,10 @@ Visually, the green line is a better fit. Our model is the equation for that lin The formula for any straight line is: $\hat{y} = w \cdot x + b$ -* $\hat{y}$ (y-hat): This is our **predicted** output value (e.g., predicted exam score). -* $x$: This is our input value (e.g., hours studied). -* $w$: The **weight** (or slope). It controls the steepness of the line. A bigger `w` means that for every hour studied, the predicted score increases more. -* $b$: The **bias** (or y-intercept). It's the value of $\hat{y}$ when $x=0$. You can think of it as a baseline prediction. +- $\hat{y}$ (y-hat): This is our **predicted** output value (e.g., predicted exam score). +- $x$: This is our input value (e.g., hours studied). +- $w$: The **weight** (or slope). It controls the steepness of the line. A bigger `w` means that for every hour studied, the predicted score increases more. +- $b$: The **bias** (or y-intercept). It's the value of $\hat{y}$ when $x=0$. You can think of it as a baseline prediction. Finding the "best fit line" is just a search for the optimal values of `w` and `b`. @@ -98,10 +101,9 @@ $\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + b$ The principle is identical: find the weights (`w1`, `w2`) and bias (`b`) that make the predictions closest to the actual house prices. - ## **Section 3: The Loss Function: Quantifying the "Error"** -**The Big Picture:** Our eyes can tell us the green line is better than the red one. But to find the *best* line, a computer needs a precise, mathematical way to measure how "bad" any given line is. This measurement is called the **loss function**. A high loss means a bad fit. A low loss means a good fit. +**The Big Picture:** Our eyes can tell us the green line is better than the red one. But to find the _best_ line, a computer needs a precise, mathematical way to measure how "bad" any given line is. This measurement is called the **loss function**. A high loss means a bad fit. A low loss means a good fit. Our goal is to find the `w` and `b` that give the lowest possible loss. @@ -109,7 +111,7 @@ Our goal is to find the `w` and `b` that give the lowest possible loss. The most common loss function for regression is the Mean Squared Error. The formula looks intimidating, but the idea is simple. We calculate it in three steps for every point in our data: -1. **Calculate the error:** For a single point, find the difference between the predicted value and the actual value. This vertical distance is called the *residual*. +1. **Calculate the error:** For a single point, find the difference between the predicted value and the actual value. This vertical distance is called the _residual_. `error = predicted_y - actual_y` or $\hat{y}_i - y_i$ 2. **Square the error:** We square the error to get rid of negative signs (so errors don't cancel each other out) and to penalize large errors much more than small ones. An error of 3 becomes 9, while an error of 10 becomes 100. `squared_error = (error)^2` @@ -119,40 +121,38 @@ The most common loss function for regression is the Mean Squared Error. The form Let's prove with math that our visual intuition was right. 
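Before the hand calculation, here is the same comparison as a minimal from-scratch Python sketch (the helper names `predict` and `mse` are our own, not from any library). The tables below reproduce its output step by step:

```python
def predict(x, w, b):
    # The model (hypothesis function): y_hat = w * x + b
    return w * x + b


def mse(xs, ys, w, b):
    # Mean Squared Error: the average of the squared residuals
    errors = [predict(x, w, b) - y for x, y in zip(xs, ys)]
    return sum(e ** 2 for e in errors) / len(xs)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(mse(xs, ys, w=1.0, b=1.0))  # Line 1 (bad guess):    0.8
print(mse(xs, ys, w=0.6, b=2.5))  # Line 2 (better guess): ~0.57
```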
-* **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]` -* **Line 1 (Bad Guess):** $\hat{y} = 1 \cdot x + 1$ (Here, `w=1`, `b=1`) -* **Line 2 (Better Guess):** $\hat{y} = 0.6 \cdot x + 2.5$ (Here, `w=0.6`, `b=2.5`) +- **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]` +- **Line 1 (Bad Guess):** $\hat{y} = 1 \cdot x + 1$ (Here, `w=1`, `b=1`) +- **Line 2 (Better Guess):** $\hat{y} = 0.6 \cdot x + 2.5$ (Here, `w=0.6`, `b=2.5`) #### Calculation for Line 1 (w=1, b=1) -| x | y (actual) | $\hat{y} = 1x+1$ (predicted) | Error ($\hat{y}-y$) | Squared Error | -|:-:|:----------:|:----------------------------:|:-------------------:|:---------------:| -| 1 | 2 | 2 | 0 | 0 | -| 2 | 4 | 3 | -1 | 1 | -| 3 | 5 | 4 | -1 | 1 | -| 4 | 4 | 5 | 1 | 1 | -| 5 | 5 | 6 | 1 | 1 | -| | | | **Sum:** | **4** | +| x | y (actual) | $\hat{y} = 1x+1$ (predicted) | Error ($\hat{y}-y$) | Squared Error | +| :-: | :--------: | :--------------------------: | :-----------------: | :-----------: | +| 1 | 2 | 2 | 0 | 0 | +| 2 | 4 | 3 | -1 | 1 | +| 3 | 5 | 4 | -1 | 1 | +| 4 | 4 | 5 | 1 | 1 | +| 5 | 5 | 6 | 1 | 1 | +| | | | **Sum:** | **4** | **MSE for Line 1 = (Sum of Squared Errors) / n = 4 / 5 = 0.8** #### Calculation for Line 2 (w=0.6, b=2.5) -| x | y (actual) | $\hat{y} = 0.6x+2.5$ (predicted) | Error ($\hat{y}-y$) | Squared Error | -|:-:|:----------:|:--------------------------------:|:-------------------:|:---------------:| -| 1 | 2 | 3.1 | 1.1 | 1.21 | -| 2 | 4 | 3.7 | -0.3 | 0.09 | -| 3 | 5 | 4.3 | -0.7 | 0.49 | -| 4 | 4 | 4.9 | 0.9 | 0.81 | -| 5 | 5 | 5.5 | 0.5 | 0.25 | -| | | | **Sum:** | **2.85** | +| x | y (actual) | $\hat{y} = 0.6x+2.5$ (predicted) | Error ($\hat{y}-y$) | Squared Error | +| :-: | :--------: | :------------------------------: | :-----------------: | :-----------: | +| 1 | 2 | 3.1 | 1.1 | 1.21 | +| 2 | 4 | 3.7 | -0.3 | 0.09 | +| 3 | 5 | 4.3 | -0.7 | 0.49 | +| 4 | 4 | 4.9 | 0.9 | 0.81 | +| 5 | 5 | 5.5 | 0.5 | 0.25 | +| | | | **Sum:** | **2.85** | **MSE for Line 2 = (Sum of Squared Errors) / n = 2.85 / 5 = 0.57** **The Result:** Line 2 has a lower MSE (0.57) than Line 1 (0.8). The math confirms it is a better fit. The goal of training, which we will cover next, is to find the values for `w` and `b` that produce the minimum possible MSE. - - ## **Section 4: The Training: Finding the Best Line with Gradient Descent** **The Big Picture:** We now have a model (`y = w*x + b`) and a way to score it (MSE). The final piece is the process for finding the specific `w` and `b` that result in the lowest possible MSE score. This process is called **training**, and the most common algorithm for it is **Gradient Descent**. @@ -164,67 +164,70 @@ Imagine the loss function as a giant, hilly landscape. Every possible combinatio The problem is, we're in a thick fog. We can't see the whole landscape. All we can do is feel the slope of the ground right where we're standing. Gradient Descent is a simple strategy: + 1. **Check the slope:** Feel which direction is steepest downhill. In math, this slope is called the **gradient**. 2. **Take a small step:** Take one step in that downhill direction. 3. **Repeat:** From your new position, repeat the process. By taking many small steps, you will eventually walk down the hill and settle at the bottom of the valley. 
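Before formalizing this, you can literally "feel the slope" in code. The sketch below (a quick numerical experiment of our own, not something the tutorial requires) nudges `w` and `b` by a tiny amount and measures how the MSE responds, giving a finite-difference estimate of the gradient:

```python
def mse(xs, ys, w, b):
    # Mean Squared Error for the line y_hat = w * x + b
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    return sum(e ** 2 for e in errors) / len(xs)


def feel_the_slope(xs, ys, w, b, eps=1e-6):
    # Central finite differences: nudge each parameter, re-measure the loss
    dw = (mse(xs, ys, w + eps, b) - mse(xs, ys, w - eps, b)) / (2 * eps)
    db = (mse(xs, ys, w, b + eps) - mse(xs, ys, w, b - eps)) / (2 * eps)
    return dw, db


xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
print(feel_the_slope(xs, ys, w=0.0, b=0.0))  # ~(-26.4, -8.0)
```

Both slopes are negative at `(0, 0)`, so "downhill" means increasing both parameters; these two numbers reappear in the hand-worked walkthrough below.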
-*(For a deep dive on the calculus behind calculating the gradient, check out our video: [link to hypothetical video on derivatives and partial derivatives])* +_(For a deep dive on the calculus behind calculating the gradient, check out our video: [link to hypothetical video on derivatives and partial derivatives])_ **Formalizing the Algorithm** This "walking" process translates into a simple update rule for our parameters, `w` and `b`: -* **Update Rule for `w`:** `w_new = w_old - learning_rate * gradient_w` -* **Update Rule for `b`:** `b_new = b_old - learning_rate * gradient_b` +- **Update Rule for `w`:** `w_new = w_old - learning_rate * gradient_w` +- **Update Rule for `b`:** `b_new = b_old - learning_rate * gradient_b` Two new terms here: -* **`learning_rate` ($\alpha$):** This controls the size of our downhill step. If it's too big, we might overshoot the valley. If it's too small, it could take forever to get to the bottom. It's a hyperparameter you choose. -* **`gradient_w` and `gradient_b`:** These are the calculated slopes for `w` and `b`. They tell us how a small change in `w` or `b` will affect the MSE. The formulas for these gradients, derived from the MSE function, are: - * $gradient_w = \frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_i$ - * $gradient_b = \frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$ + +- **`learning_rate` ($\alpha$):** This controls the size of our downhill step. If it's too big, we might overshoot the valley. If it's too small, it could take forever to get to the bottom. It's a hyperparameter you choose. +- **`gradient_w` and `gradient_b`:** These are the calculated slopes for `w` and `b`. They tell us how a small change in `w` or `b` will affect the MSE. The formulas for these gradients, derived from the MSE function, are: + - $gradient_w = \frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_i$ + - $gradient_b = \frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$ **A Concrete Walkthrough: One Step of Gradient Descent** Let's perform a single training step. -* **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]` -* **Hyperparameter:** Let's choose a `learning_rate` of `0.01`. -* **Step 0: Initialize.** We start with a random guess. Let's begin at `w = 0.0` and `b = 0.0`. This is our "before" state. +- **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]` +- **Hyperparameter:** Let's choose a `learning_rate` of `0.01`. +- **Step 0: Initialize.** We start with a random guess. Let's begin at `w = 0.0` and `b = 0.0`. This is our "before" state. The MSE for this initial line ($\hat{y} = 0$) is high: $MSE_{before} = \frac{(0-2)^2 + (0-4)^2 + (0-5)^2 + (0-4)^2 + (0-5)^2}{5} = \frac{4+16+25+16+25}{5} = 17.2$ -* **Step 1: Calculate the Gradients.** We use our formulas and data to find the slope at our current position (`w=0, b=0`). +- **Step 1: Calculate the Gradients.** We use our formulas and data to find the slope at our current position (`w=0, b=0`). 
-| x | y | $\hat{y} = 0x+0$ | Error ($\hat{y}-y$) | Error * x |
-|:-:|:-:|:----------------:|:-------------------:|:---------:|
-| 1 | 2 | 0 | -2 | -2 |
-| 2 | 4 | 0 | -4 | -8 |
-| 3 | 5 | 0 | -5 | -15 |
-| 4 | 4 | 0 | -4 | -16 |
-| 5 | 5 | 0 | -5 | -25 |
-| | | | **Sum = -20** | **Sum = -66** |
+| x | y | $\hat{y} = 0x+0$ | Error ($\hat{y}-y$) | Error \* x |
+| :-: | :-: | :--------------: | :-----------------: | :-----------: |
+| 1 | 2 | 0 | -2 | -2 |
+| 2 | 4 | 0 | -4 | -8 |
+| 3 | 5 | 0 | -5 | -15 |
+| 4 | 4 | 0 | -4 | -16 |
+| 5 | 5 | 0 | -5 | -25 |
+| | | | **Sum = -20** | **Sum = -66** |

Now, plug the sums into the gradient formulas (`n=5`):
-* `gradient_w` = (2 / 5) * (-66) = **-26.4**
-* `gradient_b` = (2 / 5) * (-20) = **-8.0**

-These gradients tell us the direction of steepest *ascent*. To go downhill, we move in the opposite direction.
+- `gradient_w` = (2 / 5) \* (-66) = **-26.4**
+- `gradient_b` = (2 / 5) \* (-20) = **-8.0**

-* **Step 2: Update the Parameters.** We use our update rule to take one small step.
+These gradients tell us the direction of steepest _ascent_. To go downhill, we move in the opposite direction.

-* `w_new = w_old - learning_rate * gradient_w`
-  `w_new = 0.0 - 0.01 * (-26.4) = 0.264`
+- **Step 2: Update the Parameters.** We use our update rule to take one small step.

-* `b_new = b_old - learning_rate * gradient_b`
-  `b_new = 0.0 - 0.01 * (-8.0) = 0.08`
+- `w_new = w_old - learning_rate * gradient_w`
+  `w_new = 0.0 - 0.01 * (-26.4) = 0.264`
+
+- `b_new = b_old - learning_rate * gradient_b`
+  `b_new = 0.0 - 0.01 * (-8.0) = 0.08`

**The Result: Before and After**

-* **Before (Step 0):** `w = 0.0`, `b = 0.0`, `MSE = 17.2`
-* **After (Step 1):** `w = 0.264`, `b = 0.08`, `MSE = 11.45` (calculated by plugging the new `w` and `b` into the MSE formula)
+- **Before (Step 0):** `w = 0.0`, `b = 0.0`, `MSE = 17.2`
+- **After (Step 1):** `w = 0.264`, `b = 0.08`, `MSE = 10.49` (calculated by plugging the new `w` and `b` into the MSE formula)

-As you can see, after just **one** step, our line is already significantly better—the MSE has dropped from 17.2 to 11.45. The process simply repeats this exact calculation many times (`epochs`), with `w` and `b` getting closer to the optimal values with every step.
+As you can see, after just **one** step, our line is already significantly better—the MSE has dropped from 17.2 to 10.49. The process simply repeats this exact calculation many times (`epochs`), with `w` and `b` getting closer to the optimal values with every step.
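Here is the whole training loop as a from-scratch Python sketch (our own code, using the same data, starting point, and `learning_rate = 0.01` as the walkthrough; the 10,000-epoch count is an arbitrary choice for illustration). Its first iteration reproduces the step we just computed by hand:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

w, b = 0.0, 0.0       # Step 0: initialize
learning_rate = 0.01  # alpha

for epoch in range(10_000):
    # Errors (residuals) for every point at the current w and b
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]

    # Gradients of the MSE with respect to w and b
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2 / n) * sum(errors)

    # The update rule: one small step downhill
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

    if epoch == 0:
        print(w, b)  # ~0.264, 0.08: matches Step 1 above

print(round(w, 4), round(b, 4))  # ~0.6, 2.2 after many epochs
```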
@@ -232,19 +235,19 @@ As you can see, after just **one** step, our line is already significantly bette

If we continue the process with a learning rate of `0.05`, the table below shows how `w` and `b` gradually converge to the optimal values (w = 0.6, b = 2.2):

-| Step | w | b | MSE |
-|-----:|------:|------:|------:|
-| 0 | 0.0000 | 0.0000 | 17.2000 |
-| 1 | 1.3200 | 0.4000 | 1.6464 |
-| 2 | 1.0680 | 0.3640 | 1.1047 |
-| 5 | 1.0807 | 0.4655 | 1.0276 |
-| 10 | 1.0412 | 0.6072 | 0.9418 |
-| 20 | 0.9720 | 0.8569 | 0.8084 |
-| 50 | 0.8231 | 1.3946 | 0.5981 |
-| 100 | 0.6951 | 1.8566 | 0.5015 |
-| 200 | 0.6173 | 2.1376 | 0.4807 |
-| 500 | 0.6001 | 2.1996 | 0.4800 |
-| 1000 | 0.6000 | 2.2000 | 0.4800 |
+| Step | w | b | MSE |
+| ----------: | ---------: | ---------: | ---------: |
+| 0 | 0.0000 | 0.0000 | 17.2000 |
+| 1 | 1.3200 | 0.4000 | 1.6464 |
+| 2 | 1.0680 | 0.3640 | 1.1047 |
+| 5 | 1.0807 | 0.4655 | 1.0276 |
+| 10 | 1.0412 | 0.6072 | 0.9418 |
+| 20 | 0.9720 | 0.8569 | 0.8084 |
+| 50 | 0.8231 | 1.3946 | 0.5981 |
+| 100 | 0.6951 | 1.8566 | 0.5015 |
+| 200 | 0.6173 | 2.1376 | 0.4807 |
+| 500 | 0.6001 | 2.1996 | 0.4800 |
+| 1000 | 0.6000 | 2.2000 | 0.4800 |
| **Optimal** | **0.6000** | **2.2000** | **0.4800** |

Notice how the MSE drops dramatically in the first step, then gradually refines. By step 1000, we've essentially converged to the optimal solution that the Normal Equation gives us instantly.

@@ -258,6 +261,7 @@ It's the mathematical equivalent of seeing the entire loss landscape from above

**The Derivation (Briefly)**

The intuition comes from basic calculus: the minimum of a function is where its slope (derivative) is zero. The Normal Equation is what you get when you:
+
1. Write the MSE loss function using matrix notation.
2. Take the derivative with respect to your parameters (`w` and `b`).
3. Set that derivative to zero.
@@ -266,15 +270,15 @@ The intuition comes from basic calculus: the minimum of a function is where its

The resulting formula is: $\theta = (X^T X)^{-1} X^T y$

-* $\theta$ (theta): A vector containing all our model parameters. In our case, $\theta = \begin{bmatrix} b \\ w \end{bmatrix}$.
-* $X$: The **design matrix**, which is our input data `xs` with an extra column of ones added for the bias term.
-* $y$: The vector of our actual output values.
+- $\theta$ (theta): A vector containing all our model parameters. In our case, $\theta = \begin{bmatrix} b \\ w \end{bmatrix}$.
+- $X$: The **design matrix**, which is our input data `xs` with an extra column of ones added for the bias term.
+- $y$: The vector of our actual output values.

**Applying the Normal Equation to Our Example**

Let's solve for the optimal `w` and `b` for our student data in one go.

-* **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]`
+- **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]`

**Step 1: Construct the matrix `X` and vector `y`**

@@ -322,8 +326,7 @@ X^T X =
1 & 3 \\
1 & 4 \\
1 & 5
-\end{bmatrix}
-=
+\end{bmatrix} =
\begin{bmatrix}
5 & 15 \\
15 & 55
\end{bmatrix}
$$

@@ -334,14 +337,14 @@

For a 2x2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the inverse is $\frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

Determinant ($ad-bc$) = (5 \* 55) - (15 \* 15) = 275 - 225 = 50.
-$$(X^T X)^{-1} = \frac{1}{50} +$$ +(X^T X)^{-1} = \frac{1}{50} \begin{bmatrix} 55 & -15 \\ -15 & 5 -\end{bmatrix} -= +\end{bmatrix} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 @@ -351,7 +354,7 @@ $$ **Step 4: Calculate $X^T y$** $$ -X^T y = +X^T y = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 @@ -362,13 +365,11 @@ X^T y = 5 \\ 4 \\ 5 -\end{bmatrix} -= +\end{bmatrix} = \begin{bmatrix} 2+4+5+4+5 \\ 2+8+15+16+25 -\end{bmatrix} -= +\end{bmatrix} = \begin{bmatrix} 20 \\ 66 @@ -386,18 +387,15 @@ $$ \begin{bmatrix} 20 \\ 66 -\end{bmatrix} -= +\end{bmatrix} = \begin{bmatrix} (1.1 * 20) + (-0.3 * 66) \\ (-0.3 * 20) + (0.1 * 66) -\end{bmatrix} -= +\end{bmatrix} = \begin{bmatrix} 22 - 19.8 \\ -6 + 6.6 -\end{bmatrix} -= +\end{bmatrix} = \begin{bmatrix} 2.2 \\ 0.6 @@ -407,19 +405,20 @@ $$ **The Result:** The Normal Equation gives us the exact optimal parameters in one calculation: -* $b = 2.2$ -* $w = 0.6$ + +- $b = 2.2$ +- $w = 0.6$ The best fit line for our data is $\hat{y} = 0.6 \cdot x + 2.2$. This is the mathematical bottom of the loss valley that Gradient Descent was slowly stepping towards. **Practical Tradeoffs** -| Feature | Gradient Descent | Normal Equation | -| :-------------------- | :--------------------------------------------- | :--------------------------------------------- | -| **Process** | Iterative, takes many small steps. | Direct, one-shot calculation. | -| **Scalability** | Works well with huge datasets (millions of features). | Computationally expensive for many features (inverting a large matrix is slow). | -| **Learning Rate** | Requires choosing a learning rate, $\alpha$. | No hyperparameters to tune. | -| **When to Use** | The default for most large-scale ML problems. | Excellent for smaller datasets where the number of features is not too large (e.g., < 10,000). | +| Feature | Gradient Descent | Normal Equation | +| :---------------- | :---------------------------------------------------- | :--------------------------------------------------------------------------------------------- | +| **Process** | Iterative, takes many small steps. | Direct, one-shot calculation. | +| **Scalability** | Works well with huge datasets (millions of features). | Computationally expensive for many features (inverting a large matrix is slow). | +| **Learning Rate** | Requires choosing a learning rate, $\alpha$. | No hyperparameters to tune. | +| **When to Use** | The default for most large-scale ML problems. | Excellent for smaller datasets where the number of features is not too large (e.g., < 10,000). | ## **Section 6: Conclusion: From Our Simple Model to the Real World** @@ -428,19 +427,22 @@ In the last 20 minutes, we have built a complete machine learning model from the We started with a simple dataset and a clear goal: predict an exam score from hours studied. To achieve this, we assembled a three-part engine: 1. **The Model (Hypothesis Function):** We defined a straight line as our hypothesis for how the data behaves. - * **Formula:** $\hat{y} = w \cdot x + b$ + + - **Formula:** $\hat{y} = w \cdot x + b$ 2. **The Loss Function (Cost Function):** We chose Mean Squared Error (MSE) to give us a single, precise number that quantifies how "wrong" our model's predictions are. - * **Formula:** $J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ + + - **Formula:** $J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ 3. **The Optimizer (Gradient Descent):** We used Gradient Descent to iteratively step towards the best `w` and `b` that minimize the MSE. 
- * **Update Rule:** - $w \leftarrow w - \alpha \frac{\partial J}{\partial w}$ - $b \leftarrow b - \alpha \frac{\partial J}{\partial b}$ + + - **Update Rule:** + $w \leftarrow w - \alpha \frac{\partial J}{\partial w}$ + $b \leftarrow b - \alpha \frac{\partial J}{\partial b}$ 4. **The Analytical Solution (Normal Equation):** We also saw how the Normal Equation can solve for the optimal parameters directly in one calculation. - * **Formula:** $\theta = (X^T X)^{-1} X^T y$ + - **Formula:** $\theta = (X^T X)^{-1} X^T y$ These components—**model, loss, and optimizer**—are the fundamental building blocks of most supervised machine learning. -You now understand the engine that powers a significant portion of data science and machine learning. You've built the foundation. \ No newline at end of file +You now understand the engine that powers a significant portion of data science and machine learning. You've built the foundation.
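**One Last Check in Code**

To close the loop, here is a compact sketch that verifies the Normal Equation result with NumPy (an extra dependency we are assuming only for this check; everything above needs nothing but plain Python):

```python
import numpy as np

xs = np.array([1, 2, 3, 4, 5], dtype=float)
ys = np.array([2, 4, 5, 4, 5], dtype=float)

# Design matrix X: a column of ones (for the bias b) next to the inputs
X = np.column_stack([np.ones_like(xs), xs])

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ ys
b, w = theta
print(b, w)  # ~2.2, 0.6: the same optimum Gradient Descent walks toward
```

In practice, `np.linalg.solve(X.T @ X, X.T @ ys)` (or `np.linalg.lstsq`) is preferred over an explicit matrix inverse for numerical stability, but the direct translation above mirrors the formula exactly.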