January 15, 2025

Introduction to Method of Least Squares


In this post, we provide a short introduction to the method of least squares. This method serves as the basis of many estimation and control algorithms. Consequently, every engineer, as well as other scientists, should have a working knowledge and basic understanding of this important method. In our next post, we will explain how to implement this method in C++ and Python. Also, in our future posts, we will explain how to recursively implement the least-squares method. A YouTube video accompanying this post is given below.

In the sequel, we present the problem formulation. Let us assume that we want to estimate a set of scalar variables: x_{1},x_{2},\ldots, x_{n}. However, all we have are the noisy observed (measured) variables y_{1},y_{2},\ldots, y_{s}, where in the general case s>n (basic engineering intuition tells us that we need to have more measurements than unknowns, that is the main reason why s>n).

Let us assume that the observed variables are related to the set of scalar variables that we want to estimate via the following linear model

(1)   \begin{align*}y_{i}=\sum_{j=1}^{n} L_{i,j} x_{j}+v_{i}, \;\; i=1,2,\ldots, s\end{align*}

where L_{i,j}\in \mathbb{R} are scalars and v_{i}\in \mathbb{R} is the measurement noise. Here it should be emphasized that the noise term is not known. All we can know are the statistical properties of the noise that can be estimated from the data. We assume that the measurement noise is zero mean and that

(2)   \begin{align*}E[v_{i}^{2}]=\sigma_{i}^{2}\end{align*}

where E[\cdot] is mathematical expectation and \sigma_{i}^{2} is a variance (\sigma_{i} is a standard deviation) . Furthermore, we assume that v_{i} and v_{j} are statistically independent for i\ne j.

The set of equations in (1) can be compactly written as follows

(3)   \begin{align*}\underbrace{\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{s}  \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix}L_{11} & L_{12}& \ldots & L_{1n}  \\ L_{21} & L_{22}& \ldots & L_{2n}   \\ \vdots & \vdots & \ldots & \vdots  \\ L_{s1} & L_{s2} & \ldots & L_{sn}\end{bmatrix}}_{L} \underbrace{\begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n}   \end{bmatrix}}_{\mathbf{x}}+\underbrace{\begin{bmatrix} v_{1} \\ v_{2} \\ \vdots \\ v_{s} \end{bmatrix}}_{\mathbf{v}}\end{align*}

where \mathbf{y},\mathbf{v} \in \mathbb{R}^{s} and L\in \mathbb{R}^{s\times n}. In (3) and throughout this post, vector quantities are denoted using a bold font. The last equation can be compactly written as follows:

(4)   \begin{align*}\mathbf{y}=L\mathbf{x}+\mathbf{v}\end{align*}

In order to formulate the least-squares problem, we need to quantify how accurate is our estimate of scalar variables. Arguably, the most natural way of quantifying the accuracy is to introduce the measurement error

(5)   \begin{align*}e_{i} = y_{i}-\sum_{j=1}^{n} L_{i,j} x_{j}, \;\; i=1,2,\ldots, s\end{align*}

Using (4), the last equation can be written compactly as follows

(6)   \begin{align*}\boldsymbol{e}=\mathbf{y}-L\mathbf{x}\end{align*}

where

(7)   \begin{align*}e=\begin{bmatrix}e_{1} \\ e_{2} \\ \vdots \\ e_{s}  \end{bmatrix}\end{align*}

The main idea of the least-squares method is to penalize the weighted sum of squares of the errors. This weighted sum is called a cost function. In the case of the least-squares method, the cost function has the following form

(8)   \begin{align*}J=\frac{e_{1}^{2}}{\sigma_{1}^{2}}+\frac{e_{2}^{2}}{\sigma_{2}^{2}}+\ldots+\frac{e_{s}^{2}}{\sigma_{s}^{2}}\end{align*}

where e_{i} is defined in (5). In (8), we are dividing the measurement errors by the noise variance. This idea is deeply rooted in engineering intuition and common sense. Namely, if a noise v_{i} is significant, then its variance will be large. Consequently, we are less confident that the measurement y_{i} will contain useful information. In other words, we want to put less emphasis on the squared measurement error e^{2}_{i} in the cost function. We can do that by dividing the square of the measurement error by the variance. If the variance has a large value, then the term e^{2}_{i}/\sigma_{i}^{2} will be relatively small. On the other hand, if the noise in y_{i} is not significant, then the term e^{2}_{i}/\sigma_{i}^{2} will accordingly increase its influence in the overall sum that is represented by J.

From (6)-(7), it follows that the the cost function (8) can be written as

(9)   \begin{align*}J=\mathbf{e}^{T}W\mathbf{e}=\Big(\mathbf{y}-L\mathbf{x} \Big)^{T} W \Big(\mathbf{y}-L\mathbf{x} \Big) \end{align*}

where W is a diagonal weighting matrix defined as follows

(10)   \begin{align*}W=\begin{bmatrix}\frac{1}{\sigma_{1}^{2}} & 0 & 0 & \ldots & 0 \\0 & \frac{1}{\sigma_{2}^{2}} &  0 & \ldots & 0 & \\  &   & & \ddots & \\ 0 & 0 & \ldots & 0 & \frac{1}{\sigma_{s}^{2}} \end{bmatrix}\end{align*}

The least-squares problem can be formally represented as follows:

(11)   \begin{align*}\text{min}_{\mathbf{x}} \;\; \mathbf{e}^{T}W\mathbf{e} \end{align*}

In order to find the minimum of this cost function, we need to compute its gradient and set it to zero. By solving this equation, which is often referred to as the normal equation, we can find an optimal value of \mathbf{x}. To compute the gradient, we use the denominator layout notation. In this notation, the gradient is a column vector, and more background information about this notation can be found on its Wikipedia page. To find the gradient of the cost function we use two rules that are outlined in equations (69) and (81) in “The Matrix Cookbook” (2012 version). First, let us expand the expression for J in (9). We have

(12)   \begin{align*}J=\mathbf{y}^{T}W\mathbf{y}-\mathbf{y}^{T}WL\mathbf{x}-\mathbf{x}^{T}L^{T}W\mathbf{y}+\mathbf{x}^{T}L^{T}WL\mathbf{x}\end{align*}

Next, we need to compute the gradient of J, that is denoted by \nabla J. Since the gradient is a linear operator (gradient of a sum of elements is equal to the sum of gradients of elements), we need to compute the gradients of every term in (12). The gradients of the following two terms are computed using the equation (69) in “The Matrix Cookbook” (2012 version):

(13)   \begin{align*}\nabla ( \mathbf{y}^{T}WL\mathbf{x} )= L^{T}W\mathbf{y} \\\nabla (\mathbf{x}^{T}L^{T}W\mathbf{y} )= L^{T}W\mathbf{y}\end{align*}


where we have used the fact that the matrix W is symmetric. The gradient of the following term is computed using the equation (81) in “The Matrix Cookbook” (2012 version):

(14)   \begin{align*}\nabla (\mathbf{x}^{T}L^{T}WL\mathbf{x}) = 2 L^{T}WL\mathbf{x}\end{align*}

Finally, using (13) and (14), we can express the gradient of J as follows:

(15)   \begin{align*}\nabla J =-2L^{T}W\mathbf{y}+2L^{T}WL\mathbf{x}\end{align*}

By setting the gradient to zero, we obtain the system of normal equations:

(16)   \begin{align*}-L^{T}W\mathbf{y}+L^{T}WL\mathbf{x}=0\end{align*}

Under the assumption that the matrix L^{T}WL is invertible, we obtain the solution of the least-squares problem

(17)   \begin{align*}\hat{x} = (L^{T}WL)^{-1}L^{T}W\mathbf{y} \end{align*}

where (L^{T}WL)^{-1} denotes the matrix inverse of L^{T}WL, and the “hat” notation above \mathbf{x} is used to denote the solution of the least-squares problem.

In our next post, we are going to explain how to implement the least-squares method in Python and C++.