Motivation

In many applications, we want to find the straight line that best fits a set of data.

Suppose we are given data points $\{(x_i, y_i)\}_{i=1}^N = \{(x_1, y_1),(x_2, y_2),\dots, (x_N, y_N)\}$, where $N\geq 2$ and the $x_i$ are distinct.

We wish to fit the data with a straight line through the origin, $y = kx$. There are infinitely many such lines, one for each slope $k$. The best choice is the line that best fits the given data.

Mathematically, we seek the value of $k$ that minimizes the following total error, the sum of the squared vertical deviations of the data points from the line: $$J(k) = \sum_{i=1}^N (y_i - kx_i)^2.$$

Learning Outcomes

  • To encourage students to appreciate the application of differentiation and minimization
  • To motivate students to further investigate the topic of multivariable calculus

Finding '$k$' by Minimization

By expansion, we have

$$ \begin{aligned} J(k) &= \sum_{i=1}^N \big(y_i^2 - 2kx_iy_i + k^2x_i^2 \big) \\ &= \Big(\sum_{i=1}^N x_i^2 \Big)k^2 - 2 \Big(\sum_{i=1}^N x_iy_i \Big)k + \sum_{i=1}^N y_i^2. \end{aligned} $$

Notice that $J(k)$ is a quadratic polynomial in $k$. We define $$ a = \sum_{i=1}^N x_i^2, \;\; b = 2\sum_{i=1}^Nx_iy_i, \;\; c=\sum_{i=1}^Ny_i^2. $$ Note that these are just constants determined by the given data! We can then write \begin{equation}J(k) = ak^2 - bk + c.\end{equation}

Differentiating with respect to $k$ and setting the derivative to zero gives $$\frac{dJ}{dk}=0 \;\;\;\Leftrightarrow\;\;\; 2ak - b = 0 \;\;\;\Leftrightarrow\;\;\; k=\frac{b}{2a}.$$

Since the $x_i$ are distinct and $N \geq 2$, at least one $x_i$ is non-zero, so $a = \sum_{i=1}^N x_i^2 > 0$. Hence

$$ \frac{d^2J}{dk^2} = 2a \;> 0 $$

By the Second Derivative Test, and because $J$ is an upward-opening parabola, the error $J(k)$ attains its global minimum at

$$ k_m = \frac{b}{2a} = \frac{\sum_{i=1}^N x_iy_i}{\sum_{i=1}^Nx_i^2} $$
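The closed-form slope can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `best_slope` and the toy data are hypothetical and not part of the original note. The toy points lie exactly on $y = 3x$, so the formula should recover a slope of 3.

```python
def best_slope(xs, ys):
    """Return k_m = sum(x_i * y_i) / sum(x_i ** 2) for the model y = k * x."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        # a = sum of x_i^2 must be positive for the formula to make sense
        raise ValueError("need at least one non-zero x value")
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy data lying exactly on y = 3x.
toy_xs = [1.0, 2.0, 4.0]
toy_ys = [3.0, 6.0, 12.0]
k_toy = best_slope(toy_xs, toy_ys)
print(k_toy)
```

The guard on `sxx` mirrors the requirement $a > 0$ used in the Second Derivative Test.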

The line $y=k_mx$ is therefore the best fit to the given data points, in the sense that it minimizes $J$. Let's test our theory with a concrete example!

Example

Consider the following data:

$x$ $y$
1 2.34
1.5 2.68
2 3.86
2.5 5.88
3 6.19
3.5 7.41
4 7.31
4.5 8.36
5 9.86
5.5 11.47

By the above analysis,

$$ \begin{aligned} J(k) &= ak^2 - bk + c \\ &= \Big(\sum_{i=1}^{10} x_i^2\Big)k^2 - \Big(2\sum_{i=1}^{10}x_iy_i\Big)k + \sum_{i=1}^{10}y_i^2 \\ &= \Big(1^2+1.5^2+\dots+5.5^2\Big)k^2 - 2\Big((1)(2.34)+(1.5)(2.68)+\dots+(5.5)(11.47)\Big)k \\ &\quad + \Big(2.34^2+2.68^2+\dots+11.47^2\Big) \\ &= 126.25k^2 - 505.06k + 507.46. \end{aligned} $$

We compute $$ k_m = \frac{\sum_{i=1}^{10} x_iy_i}{\sum_{i=1}^{10}x_i^2} = \frac{(1)(2.34) + (1.5)(2.68) + \dots + (5.5)(11.47)}{1^2 + 1.5^2 + \dots + 5.5^2} \approx 2.00,$$ so the best-fit line through the origin is approximately $y = 2.00x$.
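The arithmetic in this example can be checked with a short Python sketch. The variable names are illustrative choices; the data are copied from the table above, and `a`, `b`, `c` follow the definitions given earlier.

```python
# Data from the example table.
xs = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5]
ys = [2.34, 2.68, 3.86, 5.88, 6.19, 7.41, 7.31, 8.36, 9.86, 11.47]

# Coefficients of J(k) = a*k^2 - b*k + c.
a = sum(x * x for x in xs)
b = 2 * sum(x * y for x, y in zip(xs, ys))
c = sum(y * y for y in ys)

# Minimizer of the quadratic.
k_m = b / (2 * a)

print(f"J(k) = {a:.2f} k^2 - {b:.2f} k + {c:.2f}")
print(f"k_m = {k_m:.2f}")
```

Rounded to two decimal places, the printed coefficients should agree with the values $126.25$, $505.06$ and $507.46$ quoted above.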


Compare this choice of slope with other values. What do you observe?
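One way to make this comparison is to evaluate the total error at a few nearby slopes. The sketch below is only an illustration: the helper name `total_error` and the trial slopes are arbitrary choices, not part of the original note.

```python
# Data from the example table.
xs = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5]
ys = [2.34, 2.68, 3.86, 5.88, 6.19, 7.41, 7.31, 8.36, 9.86, 11.47]


def total_error(k):
    """J(k): sum of squared vertical deviations from the line y = k * x."""
    return sum((y - k * x) ** 2 for x, y in zip(xs, ys))


# Minimizing slope from the closed-form expression.
k_m = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Arbitrary trial slopes on either side of k_m.
other_slopes = (1.5, 1.9, 2.1, 2.5)

print(f"k = {k_m:.4f} (k_m)  J = {total_error(k_m):.4f}")
for k in other_slopes:
    print(f"k = {k:.4f}        J = {total_error(k):.4f}")
```

Because $J(k) - J(k_m) = a(k - k_m)^2$, every other slope gives a strictly larger error, and the gap grows quadratically with the distance from $k_m$.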


Application

Suppose that we would like to estimate the gravitational acceleration $g$ by conducting a simple experiment. We allow a mass to free fall from a certain height.

We then measure the displacement $s$, which should satisfy \begin{equation} s = \frac{1}{2}gt^2, \end{equation} where $t$ is the time elapsed. Based on the measurements

$t$ $s$
2 7.63
3 56.16
4 66.48
5 134.62
6 188.61
7 228.35
8 325.91
9 387.67

we would like to estimate $g$.

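We can apply the same argument as before, with $\frac{1}{2}t_i^2$ playing the role of $x_i$, $s_i$ the role of $y_i$, and $g$ the role of the slope $k$. The total error is defined by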

$$ \begin{aligned} Q(g) &= \sum_{i=1}^8 \Big(s_i - \frac{1}{2}gt_i^2 \Big)^2 \\ &= \sum_{i=1}^8 \Big(s_i^2 - gs_it_i^2 + \frac{1}{4}g^2t_i^4 \Big) \\ &= \Big(\frac{1}{4}\sum_{i=1}^8t_i^4\Big)g^2 - \Big(\sum_{i=1}^8s_it_i^2\Big)g + \sum_{i=1}^8s_i^2 \\ &= 3833g^2 - 75203.76g + 369977.11. \end{aligned} $$

The estimated value of $g$ is $$ g^* = \frac{\sum_{i=1}^8 (\frac{1}{2}t_i^2)(s_i)}{\sum_{i=1}^8 (\frac{1}{2}t_i^2)^2} = \frac{2\sum_{i=1}^8 t_i^2s_i}{\sum_{i=1}^8 t_i^4} = \frac{2[(2^2)(7.63)+(3^2)(56.16)+\dots+(9^2)(387.67)]}{2^4+3^4+\dots+9^4} \approx 9.81.$$
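The same estimate can be reproduced numerically. The sketch below assumes nothing beyond the table above; the substitution `u = t**2 / 2` and the variable names are illustrative choices, so that the model $s = gu$ has the same form as $y = kx$.

```python
# Measurements from the table: elapsed time t and displacement s.
ts = [2, 3, 4, 5, 6, 7, 8, 9]
ss = [7.63, 56.16, 66.48, 134.62, 188.61, 228.35, 325.91, 387.67]

# Substitute u = t^2 / 2 so that s = g * u is a line through the origin.
us = [0.5 * t * t for t in ts]

# Same closed form as k_m, with u in place of x and s in place of y.
g_star = sum(u * s for u, s in zip(us, ss)) / sum(u * u for u in us)

print(f"g* = {g_star:.2f}")
```

The individual measurements are quite noisy, yet the fitted value still rounds to $9.81$.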

Questions to Ponder

  • Observe that we are fitting some data points with a straight line passing through the origin. Can we adjust the intercept?

    Of course! Try formulating the problem of fitting the data points with $y = x + c$, and solve for the best value of $c$.

  • How about a generic straight line $y = kx + c$?

    That will involve multivariable calculus!
