In many applications, we would like to find a best-fit line. Suppose we are given data points $\{(x_i, y_i)\}_{i=1}^N = \{(x_1, y_1),(x_2, y_2),\dots, (x_N, y_N)\}$, where $N\geq 2$ and the $x_i$'s are distinct.
We wish to fit the data with a straight line $y = kx$. There are infinitely many choices of straight lines; the best choice is the one that best fits the given data.
Mathematically, we hope to find $k$ so that the following total error is minimized (as compared to other values of $k$): $$J(k) = \sum_{i=1}^N (y_i - kx_i)^2.$$
By expansion, we have
$$ J(k) = \sum_{i=1}^N \big(y_i^2 - 2kx_iy_i + k^2x_i^2 \big) = \Big(\sum_{i=1}^N x_i^2 \Big)k^2 - 2 \Big(\sum_{i=1}^N x_iy_i \Big)k + \sum_{i=1}^N y_i^2. $$
Notice that $J(k)$ is a quadratic polynomial in $k$. We define
$$ a = \sum_{i=1}^N x_i^2, \quad b = 2\sum_{i=1}^N x_iy_i, \quad c = \sum_{i=1}^N y_i^2; $$
note that they are just constants based on the given data! We can then write \begin{equation}J(k) = ak^2 - bk + c\end{equation}
Obviously, $$\frac{dJ}{dk}=0 \;\;\;\Leftrightarrow\;\;\; 2ak - b = 0 \;\;\;\Leftrightarrow\;\;\; k=\frac{b}{2a}$$
Since $a = \sum_{i=1}^N x_i^2$ and at least one of the $x_i$ is non-zero (the $x_i$'s are distinct and $N \geq 2$), we have $a > 0$ and
$$ \frac{d^2J}{dk^2} = 2a > 0. $$
By the Second Derivative Test, the error $J(k)$ is minimized at
$$ k_m = \frac{b}{2a} = \frac{\sum_{i=1}^N x_iy_i}{\sum_{i=1}^N x_i^2}. $$
The line defined by $y = k_mx$ would best fit the given data points. Let's test our theory with a concrete example!
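The closed-form slope can be sanity-checked in code. The sketch below is ours (the helper name `best_slope` is not from the text); it implements exactly the formula for $k_m$ above:

```python
def best_slope(xs, ys):
    """Least-squares slope k_m for the model y = k*x (no intercept).

    k_m = sum(x_i * y_i) / sum(x_i ** 2), obtained by minimizing
    J(k) = sum((y_i - k * x_i) ** 2) over k.
    """
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

# Points lying exactly on y = 3x recover k_m = 3 exactly.
print(best_slope([1, 2, 3], [3, 6, 9]))  # -> 3.0
```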
Given the following data
$x$ | $y$ |
---|---|
1 | 2.34 |
1.5 | 2.68 |
2 | 3.86 |
2.5 | 5.88 |
3 | 6.19 |
3.5 | 7.41 |
4 | 7.31 |
4.5 | 8.36 |
5 | 9.86 |
5.5 | 11.47 |
By the above analysis,
$$ J(k) = ak^2 - bk + c = \Big(\sum_{i=1}^{10} x_i^2\Big)k^2 - \Big(2\sum_{i=1}^{10}x_iy_i\Big)k + \sum_{i=1}^{10}y_i^2 $$
$$ = \Big(1^2+1.5^2+\cdots+5.5^2\Big)k^2 - 2\Big((1)(2.34)+(1.5)(2.68)+\cdots+(5.5)(11.47)\Big)k + \Big(2.34^2+2.68^2+\cdots+11.47^2\Big) $$
$$ = 126.25k^2 - 505.06k + 507.46. $$
We compute
$$ k_m = \frac{\sum_{i=1}^{10} x_iy_i}{\sum_{i=1}^{10}x_i^2} = \frac{(1)(2.34) + (1.5)(2.68) + \cdots + (5.5)(11.47)}{1^2 + 1.5^2 + \cdots + 5.5^2} = 2.00.$$
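The sums above are tedious by hand; a short script of our own reproduces the coefficients $a$, $b$, $c$ and the optimal slope from the tabulated data:

```python
xs = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5]
ys = [2.34, 2.68, 3.86, 5.88, 6.19, 7.41, 7.31, 8.36, 9.86, 11.47]

a = sum(x * x for x in xs)                   # coefficient of k^2
b = 2 * sum(x * y for x, y in zip(xs, ys))   # coefficient of -k
c = sum(y * y for y in ys)                   # constant term
k_m = b / (2 * a)

print(round(a, 2), round(b, 2), round(c, 2))  # 126.25 505.06 507.46
print(round(k_m, 2))                          # 2.0
```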
Compare this choice of slope with other values of $k$. What do you observe?
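One way to explore how $J(k)$ behaves away from the optimal slope is to evaluate it at nearby values; below is a minimal Python sketch of ours using the same data:

```python
xs = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5]
ys = [2.34, 2.68, 3.86, 5.88, 6.19, 7.41, 7.31, 8.36, 9.86, 11.47]

def J(k):
    """Total squared error for the line y = k*x on the data above."""
    return sum((y - k * x) ** 2 for x, y in zip(xs, ys))

k_m = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
for k in (k_m - 0.5, k_m - 0.1, k_m, k_m + 0.1, k_m + 0.5):
    print(f"k = {k:.2f},  J(k) = {J(k):.2f}")
```

Since $J$ is an upward-opening parabola in $k$, the printed errors grow as $k$ moves away from $k_m$ in either direction.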
Suppose that we would like to estimate the gravitational acceleration $g$ by conducting a simple experiment. We allow a mass to free fall from a certain height.
We then measure the displacement $s$, which should follow the equation \begin{equation} s = \frac{1}{2}gt^2 \end{equation} where $t$ is the time elapsed. Based on the measurements
$t$ (s) | $s$ (m) |
---|---|
2 | 7.63 |
3 | 56.16 |
4 | 66.48 |
5 | 134.62 |
6 | 188.61 |
7 | 228.35 |
8 | 325.91 |
9 | 387.67 |
we would like to estimate $g$.
We can apply a similar argument as before with the total error defined by
$$ Q(g) = \sum_{i=1}^8 \Big(s_i - \frac{1}{2}gt_i^2 \Big)^2 = \sum_{i=1}^8 \Big(s_i^2 - gs_it_i^2 + \frac{1}{4}g^2t_i^4 \Big) $$
$$ = \Big(\frac{1}{4}\sum_{i=1}^8 t_i^4\Big)g^2 - \Big(\sum_{i=1}^8 s_it_i^2\Big)g + \sum_{i=1}^8 s_i^2 = 3833g^2 - 75203.76g + 369977.11. $$
The estimated value of $g$ is
$$ g^* = \frac{\sum_{i=1}^8 \big(\frac{1}{2}t_i^2\big)s_i}{\sum_{i=1}^8 \big(\frac{1}{2}t_i^2\big)^2} = \frac{2\sum_{i=1}^8 t_i^2s_i}{\sum_{i=1}^8 t_i^4} = \frac{2\big[(2^2)(7.63)+(3^2)(56.16)+\cdots+(9^2)(387.67)\big]}{2^4+3^4+\cdots+9^4} = 9.81.$$
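The same computation can be checked in a few lines; this is our sketch, reusing the measurements from the table above:

```python
ts = [2, 3, 4, 5, 6, 7, 8, 9]
ss = [7.63, 56.16, 66.48, 134.62, 188.61, 228.35, 325.91, 387.67]

# g* = 2 * sum(t_i^2 * s_i) / sum(t_i^4), from minimizing Q(g).
num = 2 * sum(t**2 * s for t, s in zip(ts, ss))
den = sum(t**4 for t in ts)
g_star = num / den

print(round(g_star, 2))  # 9.81
```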
Observe that we are fitting some data points with a straight line passing through the origin. Can we adjust the intercept?
Of course! Try to formulate the problem where we consider fitting the data points with $y = x + c$, and solve for the best value of $c$.
How about a generic straight line $y = kx + c$?
That will involve multivariable calculus!
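As a preview of that multivariable problem, here is a sketch (ours, not part of the derivation above) of the standard closed-form least-squares solution for $y = kx + c$, applied to the first data set; the formulas come from setting both partial derivatives of the total error to zero:

```python
xs = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5]
ys = [2.34, 2.68, 3.86, 5.88, 6.19, 7.41, 7.31, 8.36, 9.86, 11.47]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Normal-equation solution for y = k*x + c:
#   k = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  c = y_bar - k * x_bar
k = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
c = y_bar - k * x_bar

print(round(k, 3), round(c, 3))
```

Note that the fitted slope is slightly below the $k_m = 2.00$ found earlier, because the intercept is now free to absorb part of the error.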