Gradient Descent for solving linear problems

Linear problems of the form \mathbf{Ax} = \mathbf{b}, with \mathbf{x} the unknown solution, appear in many applications. In theory, one inverts the matrix \mathbf{A} to obtain \mathbf{x} = \mathbf{A}^{-1}\mathbf{b}. When \mathbf{A} is very small, for example a 3×3 matrix, the inverse can be computed explicitly whenever its determinant is non-zero. When the matrix is large, however, a direct inversion is in general too expensive and numerically unattractive, so other methods must be used.
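
As a small illustration (a minimal sketch, assuming NumPy is available; the matrix values are made up), a 3×3 system can be solved directly, and numpy.linalg.solve is generally preferred over forming the inverse explicitly:

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Direct solve (preferred): factorizes A instead of forming A^{-1}
x = np.linalg.solve(A, b)

# Explicit inverse: fine for a small, non-singular A, but wasteful in general
x_via_inverse = np.linalg.inv(A) @ b
```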

Since we are interested in \mathbf{x} rather than in the inverse of \mathbf{A}, we can instead search for \mathbf{x} by minimizing the quadratic function f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Ax} - \mathbf{x}^T\mathbf{b}, whose minimizer is exactly the solution \mathbf{x} of the linear problem. To see this, note that the minimum of f is located where its gradient vanishes, provided that \mathbf{A} is symmetric positive definite so that the minimizer is unique. Since the gradient of f is exactly \mathbf{Ax} - \mathbf{b}, the minimizer solves the linear problem.
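
Explicitly (a short check, using the symmetry of \mathbf{A}):

\nabla f(\mathbf{x}) = \frac{1}{2}(\mathbf{A} + \mathbf{A}^T)\mathbf{x} - \mathbf{b} = \mathbf{Ax} - \mathbf{b}

which is zero precisely when \mathbf{Ax} = \mathbf{b}.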

One method for solving this minimization problem is the Gradient Descent method. Note that Gradient Descent is generally used for non-linear problems, but it can also be used for linear ones. Although the method is not very efficient for solving linear problems, it gives good insight into related methods such as the Conjugate Gradient method.

Derivation of Gradient Descent for solving linear problems

As the name of the method suggests, the approximation of the solution \mathbf{x} descends along the negative gradient of f, until it reaches the minimum of the quadratic function. The gradient of f(\mathbf{x}) at \mathbf{x} is just \mathbf{Ax} - \mathbf{b}. Denote the descent direction at \mathbf{x}_i by \mathbf{p}_i; one takes a step from \mathbf{x}_i in direction \mathbf{p}_i until the minimum along that line is reached, i.e., \mathbf{x}_{i+1} = \mathbf{x}_i + \alpha \mathbf{p}_i, with \alpha the unknown step-size. At this line minimum, the gradient of f at \mathbf{x}_{i+1} and \mathbf{p}_i are perpendicular. Using this observation, the step-size \alpha can be computed as follows:

Update rule for the approximation, with \alpha the unknown step-size.

\mathbf{x}_{i+1} = \mathbf{x}_i + \alpha \mathbf{p}_i

Left-multiply by -\mathbf{A}.

-\mathbf{Ax}_{i+1} = -\mathbf{Ax}_i - \alpha\mathbf{Ap}_i

Add \mathbf{b} to both sides and write the residual as \mathbf{r} = \mathbf{b} - \mathbf{Ax}.

\mathbf{b}-\mathbf{Ax}_{i+1} = \mathbf{b}-\mathbf{Ax}_i - \alpha\mathbf{Ap}_i

\mathbf{r}_{i+1} = \mathbf{r}_i - \alpha \mathbf{Ap}_i

Note that the residual \mathbf{r} is the negative gradient of f. Since at the line minimum the gradient, and hence \mathbf{r}_{i+1}, is perpendicular to \mathbf{p}_i, \alpha can be computed by left-multiplying with \mathbf{p}_i^T:

\mathbf{p}_i^T\mathbf{r}_{i+1} = \mathbf{p}_i^T\mathbf{r}_i - \alpha \mathbf{p}_i^T\mathbf{Ap}_i

Since \mathbf{p}_i^T\mathbf{r}_{i+1} = 0 at the minimum, we have:

\alpha = \frac{\mathbf{p}_i^T\mathbf{r}_i}{\mathbf{p}_i^T\mathbf{Ap}_i}
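
As a small illustrative example (the numbers are made up): take \mathbf{A} the diagonal matrix with entries 2 and 1, \mathbf{b} = (2, 1)^T and \mathbf{x}_0 = (0, 0)^T. Then \mathbf{r}_0 = \mathbf{p}_0 = (2, 1)^T, \mathbf{Ap}_0 = (4, 1)^T and \alpha = 5/9, giving \mathbf{x}_1 = (10/9, 5/9)^T and \mathbf{r}_1 = (-2/9, 4/9)^T. Indeed \mathbf{p}_0^T\mathbf{r}_1 = 0, and \mathbf{x}_1 has moved towards the exact solution (1, 1)^T.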

Given \alpha, both \mathbf{x}_{i+1} and \mathbf{r}_{i+1} can be computed. In the Gradient Descent method, the next search direction \mathbf{p}_{i+1} is simply set to \mathbf{r}_{i+1}. By repeating this procedure, \mathbf{x}_i eventually converges to the global minimizer of f, and so a solution of the linear problem is obtained. When solving a non-linear problem, \alpha is instead chosen small enough to ensure that f(\mathbf{x}_{i+1}) < f(\mathbf{x}_i), and \mathbf{p} is set to -\nabla f(\mathbf{x}).
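
To make the whole iteration concrete, here is a minimal sketch in Python (assuming NumPy; the function name, tolerance and the small example system are illustrative and not part of the text above):

```python
import numpy as np

def gradient_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Solve A x = b for symmetric positive definite A by Gradient Descent."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                      # residual = negative gradient of f
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:    # stop when the gradient is (almost) zero
            break
        p = r                          # search direction: the residual itself
        Ap = A @ p
        alpha = (p @ r) / (p @ Ap)     # exact line search along p
        x = x + alpha * p              # update the approximation
        r = r - alpha * Ap             # update the residual without recomputing A x
    return x

# Small symmetric positive definite example
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = gradient_descent(A, b, x0=np.zeros(2))
print(x)                               # close to np.linalg.solve(A, b)
```

Note that each iteration needs only a single matrix-vector product with \mathbf{A}, which is also the main cost per iteration of the Conjugate Gradient method mentioned above.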