Most artificial intelligence today is implemented using some form of neural network. In my last two articles, I introduced neural networks and showed you how to build a neural network in Java. The power of a neural network derives largely from its capacity for deep learning, and that capacity is built on the concept and execution of backpropagation with gradient descent. I’ll conclude this short series of articles with a quick dive into backpropagation and gradient descent in Java.

## Backpropagation in machine learning

It’s been said that AI isn’t all that intelligent, that it is largely just backpropagation. So, what is this keystone of modern machine learning?

To understand backpropagation, you must first understand how a neural network works. Basically, a neural network is a directed graph of nodes called *neurons*. Each neuron has a specific structure: it takes inputs, multiplies each one by a weight, sums the results, adds a bias value, and runs the total through an activation function. Neurons feed their output into other neurons until the output neurons are reached. The output neurons produce the output of the network. (See “Styles of machine learning: Intro to neural networks” for a more complete introduction.)
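As a quick refresher, here is a minimal sketch of what a single neuron computes. The class and method names are hypothetical, and the sigmoid activation is an assumption made for illustration; other activation functions would fit the same structure.

```java
// Illustrative sketch only: NeuronSketch and fire() are hypothetical names.
class NeuronSketch {

    // Assumed activation function: the logistic sigmoid, which squashes any real number into (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Multiply each input by its weight, sum the products, add the bias, then apply the activation.
    static double fire(double[] inputs, double[] weights, double bias) {
        if (inputs.length != weights.length) {
            throw new IllegalArgumentException("Each input needs exactly one weight");
        }
        double sum = bias;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sigmoid(sum);
    }

    public static void main(String[] args) {
        double output = fire(new double[] {1.0, 0.0}, new double[] {0.5, -0.5}, -0.5);
        System.out.println("Neuron output: " + output);
    }
}
```

The weights and the bias are the only values that change during training; the inputs come from the data or from upstream neurons.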

I’ll assume from here that you understand how a network and its neurons are structured, including feedforward. The example and discussion will focus on backpropagation with gradient descent. Our neural network will have a single output node, two “hidden” nodes, and two input nodes. Using a relatively simple example will make it easier to see the math involved with the algorithm. Figure 1 shows a diagram of the example neural network.
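To make the shape of this network concrete, the following sketch wires up two inputs, two hidden neurons, and one output neuron, which gives nine adjustable values in total (six weights and three biases). The class name, field names, and starting values are all hypothetical, and sigmoid is again assumed as the activation function.

```java
// Illustrative sketch only: names and starting values are made up for this example.
class TinyNetworkSketch {

    // hiddenWeights[h][i] is the weight from input i to hidden neuron h.
    double[][] hiddenWeights = {{0.1, 0.2}, {0.3, 0.4}};
    double[] hiddenBiases = {0.0, 0.0};

    // outputWeights[h] is the weight from hidden neuron h to the single output neuron.
    double[] outputWeights = {0.5, 0.6};
    double outputBias = 0.0;

    // Assumed activation function.
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Feedforward: inputs -> hidden layer -> output neuron.
    double predict(double x1, double x2) {
        double[] hidden = new double[2];
        for (int h = 0; h < hidden.length; h++) {
            double sum = hiddenWeights[h][0] * x1 + hiddenWeights[h][1] * x2 + hiddenBiases[h];
            hidden[h] = sigmoid(sum);
        }
        double outputSum = outputWeights[0] * hidden[0] + outputWeights[1] * hidden[1] + outputBias;
        return sigmoid(outputSum);
    }

    public static void main(String[] args) {
        TinyNetworkSketch network = new TinyNetworkSketch();
        System.out.println("Prediction for (1, 0): " + network.predict(1.0, 0.0));
    }
}
```

Those nine values are the variables that backpropagation will adjust.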

The idea in backpropagation with gradient descent is to consider the entire network as a multivariate function that provides input to a loss function. The loss function calculates a number representing how well the network is performing by comparing the network’s output against known good results. The set of input data paired with known good results is called the training set. The loss function is designed so that its value increases as the network’s output moves further from the correct result.
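One common choice of loss function is mean squared error, sketched below. Treat it as an assumption for illustration rather than the only option; the class and method names are hypothetical.

```java
// Illustrative sketch only: mean squared error is one of several possible loss functions.
class LossSketch {

    // Average of the squared differences between network outputs and known good results.
    static double meanSquaredError(double[] predicted, double[] expected) {
        if (predicted.length != expected.length || predicted.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and the same length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double error = predicted[i] - expected[i];
            total += error * error;
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        double[] expected = {1.0, 0.0};
        System.out.println("Close guess: " + meanSquaredError(new double[] {0.9, 0.1}, expected));
        System.out.println("Poor guess:  " + meanSquaredError(new double[] {0.2, 0.8}, expected));
    }
}
```

The loss is zero when every prediction matches its known good result and grows as the predictions drift away from them.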

Gradient descent takes the loss function and uses partial derivatives to determine how much each variable (each weight and bias) in the network contributed to the loss value. The algorithm then moves backward through the network, visiting each variable and adjusting it to decrease the loss value.
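To see the mechanics in the smallest possible setting, here is a sketch of a gradient descent step for a single neuron with one input. It assumes a sigmoid activation and a squared-error loss, and every name in it, including the learning rate of 0.5, is hypothetical. A full network applies the same chain rule layer by layer, working backward from the output.

```java
// Illustrative sketch only: one neuron, one input, squared-error loss, sigmoid activation.
class GradientStepSketch {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Loss for a single training example: (output - target)^2.
    static double loss(double weight, double bias, double input, double target) {
        double output = sigmoid(weight * input + bias);
        return (output - target) * (output - target);
    }

    // Performs one update and returns {newWeight, newBias}.
    static double[] step(double weight, double bias, double input, double target, double learningRate) {
        double output = sigmoid(weight * input + bias);

        // Chain rule, working backward from the loss to each variable.
        double dLossDOutput = 2.0 * (output - target);   // derivative of (output - target)^2
        double dOutputDSum = output * (1.0 - output);    // derivative of the sigmoid
        double dLossDWeight = dLossDOutput * dOutputDSum * input;
        double dLossDBias = dLossDOutput * dOutputDSum;

        // Move each variable against its partial derivative to reduce the loss.
        return new double[] {
            weight - learningRate * dLossDWeight,
            bias - learningRate * dLossDBias
        };
    }

    public static void main(String[] args) {
        double weight = 0.5, bias = 0.0, input = 1.0, target = 1.0;
        for (int i = 0; i < 5; i++) {
            System.out.println("Step " + i + " loss: " + loss(weight, bias, input, target));
            double[] updated = step(weight, bias, input, target, 0.5);
            weight = updated[0];
            bias = updated[1];
        }
    }
}
```

Each partial derivative says how the loss would change if that variable were nudged upward, so subtracting a fraction of it moves the variable in the direction that lowers the loss.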