January 22, 2025

Introduction to Kalman Filter: Derivation of the Recursive Least Squares Method


In this post, we derive equations describing the recursive least squares method. The motivation for creating this tutorial, comes from the fact that these equations are used in the derivation of the Kalman filter equations. Apart from this, the recursive least squares method is the basis of many other algorithms, such as adaptive control algorithms. Consequently, it is of paramount importance to properly understand the recursive least-squares method. The derivation of the Kalman filter equations on the basis of the recursive least-squares equations is arguably much simpler and easier to understand than the derivation based on other methods or approaches.

The Python implementation of the derived least-squares method is given here.

The YouTube videos accompanying this post are given below.

Motivational Example

To motivate the need for the recursive least squares method, we consider the following example. We assume that a vehicle is moving from a starting position y_{0}, with an initial velocity, denoted by v_{0}, and with an acceleration denoted by a. Our goal is to estimate, y_{0}, v_{0}, and a, by only measuring position values y(t) obtained by a distance sensor. Here, t denotes the time variable. From the basic kinematics equations that relate the position, velocity, and acceleration, we know that the following model fully kinematically describes the motion of the car

(1)   \begin{align*}y(t)=y_{0}+v_{0}t+\frac{1}{2}at^{2}\end{align*}


We assume that we are able to measure the position at the discrete-time instants t= 0,\Delta t, 2\Delta t, 3\Delta t,\ldots, where \Delta t is the sampling period of the sensor. Next, let y(k\Delta t):=y_{k}. Our measurement equation is then

(2)   \begin{align*}y_{k}=\begin{bmatrix} 1 & k\Delta t &  \frac{1}{2} (k\Delta t)^{2}  \end{bmatrix}\begin{bmatrix} y_{0} \\ v_{0} \\ a  \end{bmatrix}+v_{k}\end{align*}

where v_{k} is the measurement noise. Clearly, our C_{k} matrix is then

(3)   \begin{align*}C_{k}=\begin{bmatrix} 1 & k\Delta t &  \frac{1}{2} (k\Delta t)^{2}    \end{bmatrix}\end{align*}

And our vector \mathbf{x} is then

(4)   \begin{align*}\mathbf{x}=\begin{bmatrix} y_{0} \\ v_{0} \\ a  \end{bmatrix}\end{align*}

The goal of the recursive least squares method is to estimate the vector \mathbf{x} recursively on the basis of the sequentially obtained data \mathbf{y}_{0},\mathbf{y}_{1},\mathbf{y}_{2},\ldots. That is, we start from some initial guess of \mathbf{x}, denoted by \mathbf{x}_{0}. When the initial measurement sample \mathbf{y}_{0} arrives, we should update the initial estimate. Then, when the next sample \mathbf{y}_{1} arrives, we should use this sample to update the previously computed estimate. This procedure is repeated every time a new measurement sample arrives.

Issues with the ordinary least squares method

In our previous post which can be found here, we derived the solution to the least squares problem. Here for completeness of this tutorial, we briefly revise the ordinary least squares method.

Consider the following system of linear equations

(5)   \begin{align*}y_{i}=\sum_{j=1}^{n} C_{i,j} x_{j}+v_{i}, \;\; i=1,2,\ldots, s\end{align*}

where x_{j}, j=1,2,\ldots, n are real scalar variables that we want to determine (unknowns), C_{i,j}\in \mathbb{R} are scalars and v_{i}\in \mathbb{R} is the measurement noise. The noise term is not known. However, we assume that we know the statistical properties of the noise that can be estimated from the data. We assume that the measurement noise is zero mean and that

(6)   \begin{align*}E[v_{i}^{2}]=\sigma_{i}^{2}\end{align*}

where E[\cdot] is an expectation operator and \sigma_{i}^{2} is a variance (\sigma_{i} is a standard deviation) . Furthermore, we assume that v_{i} and v_{j} are statistically independent for all i\ne j.

The equations in (5) can be compactly written as follows

(7)   \begin{align*}\underbrace{\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{s}  \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix}C_{11} & C_{12}& \ldots & C_{1n}  \\ C_{21} & C_{22}& \ldots & C_{2n}   \\ \vdots & \vdots & \ldots & \vdots  \\ C_{s1} & C_{s2} & \ldots & C_{sn}\end{bmatrix}}_{C} \underbrace{\begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n}   \end{bmatrix}}_{\mathbf{x}}+\underbrace{\begin{bmatrix} v_{1} \\ v_{2} \\ \vdots \\ v_{s} \end{bmatrix}}_{\mathbf{v}}\end{align*}

where \mathbf{y},\mathbf{v} \in \mathbb{R}^{s} and C\in \mathbb{R}^{s\times n}. In (7) and throughout this post, vector quantities are denoted using a bold font. The last equation can be compactly written as follows:

(8)   \begin{align*}\mathbf{y}=C\mathbf{x}+\mathbf{v}\end{align*}

Usually, in the least squares estimation problems n\ll s, that is, the number of measurements s is usually much larger than the number of variables to be estimated. Under the assumption that the matrix C has a full column rank and under the assumption that s\ge n, the least squares solution is found by solving the following optimization problem

(9)   \begin{align*}\min_{\mathbf{x}} \left\|\mathbf{y}-C\mathbf{x} \right\|_{2}^{2}\end{align*}

There are a number of computationally efficient approaches for solving (9). Here for brevity and simplicity, we use the solution that is not the most efficient one to be computed. This solution is given by the following equation

(10)   \begin{align*}\hat{\mathbf{x}}=(C^{T}C)^{-1}C^{T}\mathbf{y}\end{align*}

There are at least two issues with the formulation and solution of the least squares problem:

  1. The least-squares problem formulation and its solutions assume that all measurements are available at a certain time instant. However, this often might not be the case in the practice. Also, we would like to update the solution as new measurements arrive in order to have an adaptive approach for updating the solution.
  2. When the number of measurements s is extremely large, the solutions of the least squares problem are difficult to compute. That is, the computational and memory complexities of the solution are very high.

These facts motivate the development of the recursive least squares that is presented in the sequel.

Recursive Least Squares Method – Complete Derivation From Scratch

Let us assume that at the discrete-time instant k, the set of measurements grouped in the vector \mathbf{y}_{k}\in \mathbb{R}^{l} becomes available. Let us assume that these measurements are related to the vector \mathbf{x} that we want to estimate by the following equation

(11)   \begin{align*}\mathbf{y}_{k}=C_{k}\mathbf{x}+\mathbf{v}_{k}\end{align*}

where C_{k}\in \mathbb{R}^{l\times n}, and \mathbf{v}_{k}\in \mathbb{R}^{l} is the measurement noise vector. We assume that the covariance of the measurement noise is given by

(12)   \begin{align*}E[\mathbf{v}_{k}\mathbf{v}^{T}_{k}]=R_{k}\end{align*}

and we assume that the mean of \mathbf{v}_{k} is zero:

(13)   \begin{align*}E[\mathbf{v}_{k}]=0\end{align*}

The recursive least-squares method that we derive in this section, has the following form:

(14)   \begin{align*}\hat{\mathbf{x}}_{k}= \hat{\mathbf{x}}_{k-1}+K_{k}(\mathbf{y}_{k}-C_{k}\hat{\mathbf{x}}_{k-1})\end{align*}

where \hat{\mathbf{x}}_{k} and \hat{\mathbf{x}}_{k-1} are the estimates of the vector \mathbf{x} at the discrete-time instants k and k-1, and K_{k}\in \mathbb{R}^{n\times l} is the gain matrix that we need to determine.

The equation (14) updates the estimate of \mathbf{x} at the time instant k on the basis of the estimate \hat{\mathbf{x}}_{k-1} at the previous time instant k-1 and on the basis of the measurement \mathbf{y}_{k} obtained at the time instant k, as well as on the basis of the gain matrix K_{k} computed at the time instant k.

Our goal is to derive the formulas for computing the gain matrix K_{k}.

First, we need to introduce new notation. We assume that the vector \hat{\mathbf{x}}_{k} is partitioned as follows

(15)   \begin{align*}\hat{\mathbf{x}}_{k}=\begin{bmatrix}\hat{x}_{1,k} \\ \hat{x}_{2,k} \\ \vdots \\ \hat{x}_{n,k}  \end{bmatrix}\end{align*}

This partitioning directly corresponds to the partitioning of the vector \mathbf{x}

(16)   \begin{align*}\mathbf{x}=\begin{bmatrix}x_{1} \\ x_{2} \\ \vdots \\ x_{n}  \end{bmatrix}\end{align*}

Next, we introduce the estimation error vector \boldsymbol{\varepsilon}_{k} and the following partitioning:

(17)   \begin{align*}\boldsymbol{\varepsilon}_{k}=\begin{bmatrix} \varepsilon_{1,k} \\ \varepsilon_{2,k} \\ \vdots \\ \varepsilon_{n,k}\end{bmatrix}=\mathbf{x}-\hat{\mathbf{x}}_{k}=\begin{bmatrix} x_{1}-\hat{x}_{1,k} \\ x_{2}-\hat{x}_{2,k} \\ \vdots \\ x_{n}-\hat{x}_{n,k} \end{bmatrix}\end{align*}

That is,

(18)   \begin{align*}\varepsilon_{i,k}= x_{i}-\hat{x}_{i,k},\;\; i=1,2,\ldots,n\end{align*}

The gain K_{k} is computed by minimizing the sum of variances of the estimation errors

(19)   \begin{align*}W_{k}=E[\varepsilon_{1,k}^{2}]+E[\varepsilon_{2,k}^{2}]+\ldots + E[\varepsilon_{n,k}^{2}]\end{align*}

Next, we show that this cost function, can be represented as follows:

(20)   \begin{align*}W_{k}=\text{tr}(P_{k})\end{align*}

where \text{tr}(\cdot) is the notation for the matrix trace, and P_{k} is the estimation error covariance matrix that is defined by

(21)   \begin{align*}P_{k}=E[\boldsymbol{\varepsilon}_{k}\boldsymbol{\varepsilon}^{T}_{k}]\end{align*}

That is, the gain matrix K_{k} is found by minimizing the trace of the estimation error covariance matrix.

To show this, let us first expand the following expression

(22)   \begin{align*}& \boldsymbol{\varepsilon}_{k}\boldsymbol{\varepsilon}^{T}_{k}  =\begin{bmatrix}  \varepsilon_{1,k} \\ \varepsilon_{2,k} \\ \vdots \\ \varepsilon_{n,k} \end{bmatrix} \begin{bmatrix}  \varepsilon_{1,k} & \varepsilon_{2,k} & \hdots & \varepsilon_{n,k} \end{bmatrix} \\ & = \begin{bmatrix} \varepsilon_{1,k}^{2} & \varepsilon_{1,k}\varepsilon_{2,k}  & \varepsilon_{1,k}\varepsilon_{3,k} & \ldots &  \varepsilon_{1,k}\varepsilon_{n,k} \\  \varepsilon_{1,k}\varepsilon_{2,k} & \varepsilon_{2,k}^{2}  & \varepsilon_{2,k}\varepsilon_{3,k} & \ldots &  \varepsilon_{2,k}\varepsilon_{n,k} \\ \vdots  & \vdots & \vdots & \hdots & \vdots \\ \varepsilon_{1,k}\varepsilon_{n,k} & \varepsilon_{2,k}\varepsilon_{n,k} & \varepsilon_{3,k}\varepsilon_{n,k} & \ldots &  \varepsilon_{n,k}^{2} \end{bmatrix}\end{align*}

From the last expression, we have

(23)   \begin{align*}P_{k}& = E[\boldsymbol{\varepsilon}_{k}\boldsymbol{\varepsilon}^{T}_{k}] \\&  = \begin{bmatrix} E[\varepsilon_{1,k}^{2}] & E[\varepsilon_{1,k}\varepsilon_{2,k}]  & E[\varepsilon_{1,k}\varepsilon_{3,k}] & \ldots &  E[\varepsilon_{1,k}\varepsilon_{n,k}] \\  E[\varepsilon_{1,k}\varepsilon_{2,k}] & E[\varepsilon_{2,k}^{2}]  & E[\varepsilon_{2,k}\varepsilon_{3,k}] & \ldots &  E[\varepsilon_{2,k}\varepsilon_{n,k}] \\ \vdots  & \vdots & \vdots & \hdots & \vdots \\ E[\varepsilon_{1,k}\varepsilon_{n,k}] & E[\varepsilon_{2,k}\varepsilon_{n,k}] &E[\varepsilon_{3,k}\varepsilon_{n,k}] & \ldots &  E[\varepsilon_{n,k}^{2}] \end{bmatrix}\end{align*}

Next, by taking the matrix trace of the last expression, we obtain

(24)   \begin{align*}\text{tr}(P_{k})=E[\varepsilon_{1,k}^{2}] +E[\varepsilon_{2,k}^{2}] +\ldots +E[\varepsilon_{n,k}^{2}] \end{align*}

and this is exactly the expression for the cost function W_{k} given by (19).

Next, we derive the expression for the time propagation of the estimation covariance matrix. By substituting the estimate propagation equation (14) and the measurement equation (11) in the expression for the estimation error, we obtain

(25)   \begin{align*}\boldsymbol{\varepsilon}_{k} & =\mathbf{x}-\hat{\mathbf{x}}_{k} \\& =\mathbf{x} -  \hat{\mathbf{x}}_{k-1}-K_{k}(\mathbf{y}_{k}-C_{k}\hat{\mathbf{x}}_{k-1}) \\& =\mathbf{x} -  \hat{\mathbf{x}}_{k-1}-K_{k}(C_{k}\mathbf{x}+\mathbf{v}_{k}-C_{k}\hat{\mathbf{x}}_{k-1}) \\&= (I-K_{k}C_{k})(\mathbf{x}-\hat{\mathbf{x}}_{k-1})-K_{k}\mathbf{v}_{k} \\& =(I-K_{k}C_{k})\boldsymbol{\varepsilon}_{k-1}-K_{k}\mathbf{v}_{k}\end{align*}

Next, we have

(26)   \begin{align*}\boldsymbol{\varepsilon}_{k}\boldsymbol{\varepsilon}^{T}_{k}& = ((I-K_{k}C_{k})\boldsymbol{\varepsilon}_{k-1}-K_{k}\mathbf{v}_{k})((I-K_{k}C_{k})\boldsymbol{\varepsilon}_{k-1}-K_{k}\mathbf{v}_{k})^{T} \\ & = ((I-K_{k}C_{k})\boldsymbol{\varepsilon}_{k-1}-K_{k}\mathbf{v}_{k})(\boldsymbol{\varepsilon}_{k-1}^{T}(I-K_{k}C_{k})^{T}-\mathbf{v}_{k}^{T}K_{k}^{T})   \\& = (I-K_{k}C_{k})\boldsymbol{\varepsilon}_{k-1}\boldsymbol{\varepsilon}^{T}_{k-1}(I-K_{k}C_{k})^{T}-(I-K_{k}C_{k})\boldsymbol{\varepsilon}^{T}_{k-1}\mathbf{v}_{k}^{T}K_{k}^{T} \\& -K_{k}\mathbf{v}_{k}\boldsymbol{\varepsilon}^{T}_{k-1}(I-K_{k}C_{k})^{T}+K_{k}\mathbf{v}_{k}\mathbf{v}_{k}^{T}K_{k}^{T}\end{align*}

Applying the expectation operator to (26), we obtain

(27)   \begin{align*}P_{k} & =E[\boldsymbol{\varepsilon}_{k}\boldsymbol{\varepsilon}^{T}_{k}] \\& = (I-K_{k}C_{k})E[\boldsymbol{\varepsilon}_{k-1}\boldsymbol{\varepsilon}^{T}_{k-1}](I-K_{k}C_{k})^{T}-(I-K_{k}C_{k})E[\boldsymbol{\varepsilon}_{k-1}\mathbf{v}_{k}^{T}]K_{k}^{T} \\& -K_{k}E[\mathbf{v}_{k}\boldsymbol{\varepsilon}^{T}_{k-1}](I-K_{k}C_{k})^{T}+K_{k}E[\mathbf{v}_{k}\mathbf{v}_{k}^{T}]K_{k}^{T}\end{align*}

We have

(28)   \begin{align*}P_{k-1}=E[\boldsymbol{\varepsilon}_{k-1}\boldsymbol{\varepsilon}^{T}_{k-1}]\end{align*}

Then, since \mathbf{v}_{k} and \boldsymbol{\varepsilon}_{k-1} are independent (measurement noise at the time instant k is independent from the estimation error at the time instant k-1), and due to the fact that \mathbf{v}_{k} is zero-mean:

(29)   \begin{align*}E[\boldsymbol{\varepsilon}_{k-1}\mathbf{v}_{k}^{T}]=E[\boldsymbol{\varepsilon}_{k-1}]E[\mathbf{v}_{k}^{T}] =0\end{align*}

(30)   \begin{align*}E[\mathbf{v}_{k}\boldsymbol{\varepsilon}^{T}_{k-1}]=E[\mathbf{v}_{k}]E[\boldsymbol{\varepsilon}^{T}_{k-1}] =0\end{align*}

(31)   \begin{align*}E[\mathbf{v}_{k}\mathbf{v}^{T}_{k}]=R_{k}\end{align*}

By substituting (28), (29), (30), and (31) in (27), we obtain an important expression for the propagation of the estimation error covariance matrix

(32)   \begin{align*}P_{k}=(I-K_{k}C_{k})P_{k-1}(I-K_{k}C_{k})^{T}+K_{k}R_{k}K_{k}^{T} \end{align*}

Let us expand the last expression

(33)   \begin{align*}P_{k}=P_{k-1}-P_{k-1}C_{k}^{T}K_{k}^{T}-K_{k}C_{k}P_{k-1}+K_{k}C_{k}P_{k-1}C_{k}^{T}K_{k}^{T}+K_{k}R_{k}K_{k}^{T}\end{align*}

By substituting (49) in (20), we obtain

(34)   \begin{align*}W_{k}& =\text{tr}(P_{k-1})-\text{tr}(P_{k-1}C_{k}^{T}K_{k}^{T}) \\& -\text{tr}(K_{k}C_{k}P_{k-1})+\text{tr}(K_{k}C_{k}P_{k-1}C_{k}^{T}K_{k}^{T})+\text{tr}(K_{k}R_{k}K_{k}^{T})\end{align*}

We find the gain matrix K_{k} by optimizing the cost function (34). That is, we find the value of K_{k} by solving the following equation

(35)   \begin{align*}\frac{\partial W_{k}}{\partial K_{k}} =0\end{align*}

To compute the derivatives of matrix traces, we use the formulas given in The Matrix Cookbook (Section 2.5). We use the following formulas

(36)   \begin{align*}\frac{\partial  \text{tr}(AX^{T}B)}{\partial X}=BA\end{align*}

(37)   \begin{align*}\frac{\partial \text{tr}(AXB)}{\partial X}=A^{T}B^{T}\end{align*}

(38)   \begin{align*}\frac{\partial \text{tr}(XBX^{T})}{\partial X}=XB^{T}+XB\end{align*}

We find derivatives of all terms in (34). Since P_{k-1} is constant, we have

(39)   \begin{align*}\frac{\partial \text{tr}(P_{k-1})}{\partial K_{k}}=0\end{align*}

Next, by using (36), we have

(40)   \begin{align*}\frac{\partial \text{tr}(P_{k-1}C_{k}^{T}K_{k}^{T}) }{\partial K_{k}}=P_{k-1}C_{k}^{T}\end{align*}

By using (37), we have

(41)   \begin{align*}\frac{\partial \text{tr}(K_{k}C_{k}P_{k-1}) }{\partial K_{k}}=P_{k-1}C_{k}^{T}\end{align*}

By using (38), we have

(42)   \begin{align*}\frac{\partial \text{tr}(K_{k}C_{k}P_{k-1}C_{k}^{T}K_{k}^{T}) }{\partial K_{k}}=2K_{k}C_{k}P_{k-1}C_{k}^{T}\end{align*}

(43)   \begin{align*}\frac{\partial \text{tr}(K_{k}R_{k}K_{k}^{T}) }{\partial K_{k}}=2K_{k}R_{k}\end{align*}

By substituting (39), (40), (41), (42), and (43):

(44)   \begin{align*}-2P_{k-1}C_{k}^{T}+2K_{k}C_{k}P_{k-1}C_{k}^{T}+2K_{k}R_{k}=0\end{align*}

From the last equation, we have

(45)   \begin{align*}K_{k}(R_{k}+C_{k}P_{k-1}C_{k}^{T})=P_{k-1}C_{k}^{T}\end{align*}

By solving the last equation we obtain the final expression for the gain matrix K_{k}

(46)   \begin{align*}K_{k}=P_{k-1}C_{k}^{T}(R_{k}+C_{k}P_{k-1}C_{k}^{T})^{-1}\end{align*}

Before we summarize the recursive least-squares method, we derive an alternative form of the estimation error covariance matrix propagation equation. First, we can write the gain matrix (46), as follows

(47)   \begin{align*}K_{k}=P_{k-1}C_{k}^{T}L_{k}^{-1}, \;\;\; L_{k}=R_{k}+C_{k}P_{k-1}C_{k}^{T}\end{align*}

Here, we should observe that the matrix L_{k} and its inverse L_{k}^{-1} are symmetric, that is

(48)   \begin{align*}L_{k}=L_{k}^{T},\;\; (L_{k}^{-1})^{T}=L_{k}^{-1}\end{align*}

We substitute (47) in (49), and as the result, we obtain

(49)   \begin{align*}P_{k}& =P_{k-1}-P_{k-1}C_{k}^{T}L_{k}^{-1}C_{k}P_{k-1}-P_{k-1}C_{k}^{T}L_{k}^{-1}C_{k}P_{k-1}\\&+P_{k-1}C_{k}^{T}L_{k}^{-1}C_{k}P_{k-1}C_{k}^{T}L_{k}^{-1}C_{k}P_{k-1}+P_{k-1}C_{k}^{T}L_{k}^{-1}R_{k}L_{k}^{-1}C_{k}P_{k-1}\end{align*}

By adding the second and third terms and by grouping the last two terms, we obtain

(50)   \begin{align*}P_{k}& =P_{k-1}-2P_{k-1}C_{k}^{T}L_{k}^{-1}C_{k}P_{k-1}\\&+P_{k-1}C_{k}^{T}L_{k}^{-1}\underbrace{(C_{k}P_{k-1}C_{k}^{T}+R_{k})}_{L_{k}}L_{k}^{-1}C_{k}P_{k-1}  \\& =P_{k-1}-2P_{k-1}C_{k}^{T}L_{k}^{-1}C_{k}P_{k-1}+P_{k-1}C_{k}^{T}L_{k}^{-1}C_{k}P_{k-1} \\& =P_{k-1}-\underbrace{P_{k-1}C_{k}^{T}L_{k}^{-1}}_{K_{k}}C_{k}P_{k-1} \\& =P_{k-1}-K_{k}C_{k}P_{k-1} \\& =(I-K_{k}C_{k})P_{k-1}\end{align*}

That is, an alternative expression for the propagation of the estimation error covariance matrix is given by

(51)   \begin{align*}P_{k}=(I-K_{k}C_{k})P_{k-1}\end{align*}

Consequently, the recursive least squares method consists of the following three equations

  1. Gain matrix update

    (52)   \begin{align*}K_{k}=P_{k-1}C_{k}^{T}(R_{k}+C_{k}P_{k-1}C_{k}^{T})^{-1}\end{align*}

  2. Estimate update

    (53)   \begin{align*}\hat{\mathbf{x}}_{k}= \hat{\mathbf{x}}_{k-1}+K_{k}(\mathbf{y}_{k}-C_{k}\hat{\mathbf{x}}_{k-1}) \end{align*}

  3. Propagation of the estimation error covariance matrix by using this equation

    (54)   \begin{align*}P_{k}=(I-K_{k}C_{k})P_{k-1}(I-K_{k}C_{k})^{T}+K_{k}R_{k}K_{k}^{T}\end{align*}


    or this equation

    (55)   \begin{align*}P_{k}=(I-K_{k}C_{k})P_{k-1}\end{align*}



    If no apriori knowledge of the estimate is known, then this method is initialized with the estimation error covariance matrix that is equal to P_{0} = c I, where c is a large number, and with a random value of \mathbf{x}_{0}.

The Python implementation of the derived least-squares method is given here.