Neural networks can seem daunting, complicated, and impossible to explain. But in reality they are remarkably simple. In fact, they only ever require a single layer of neurons.
In my previous post about the basics of neural networks, I talked about how neurons compute values. They take a set of inputs, multiply each input value by a weight, and sum the terms. An activation function is then applied to the sum of products to yield the output value. That output value could be zero (i.e., the neuron did not activate), negative, or positive.
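To make that concrete, here's a minimal sketch of a single neuron in plain Python (the weights and the tanh activation are illustrative choices on my part, not taken from any particular library):

```python
import math

def neuron(inputs, weights):
    # Multiply each input value by its weight and sum the terms,
    # then apply an activation function to the sum of products.
    total = sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(total)  # the output can be negative, zero, or positive

# Example: three inputs, three weights.
print(neuron([0.5, -1.0, 2.0], [0.8, 0.2, -0.5]))  # a negative output
```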
In this post, we will go deeper down the rabbit hole. We will look at neuron layers, which layers are actually necessary for a network to function, and come to the stunning realization that all neural networks have only a single output.
Organizing Neurons into Layers
In most neural networks, we tend to organize neurons into layers. The reason for this comes from graph theory (as neural networks are little more than computational graphs). Each layer consists of nodes which are at the same depth in the graph, meaning it takes the same number of hops through other neurons to reach the input.
[Figure: Organizing the neurons of a neural network into layers, according to depth from the input. Layer 0 neurons are directly connected to the input, while layer 1 neurons are 1 hop deep into the network.]
In this example, there are two layers: an output layer, which consists of a single neuron, and a layer which transforms the input channels using only one term each.
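Here's a rough sketch of that two-layer arrangement in Python (the figure doesn't specify exact weights, so these values are made up; zeroing out all but one weight in layer 0 is just one way to model "only one term" per neuron):

```python
import math

def layer(inputs, weight_rows):
    """One layer: each row of weights defines one neuron at this depth."""
    return [math.tanh(sum(x * w for x, w in zip(inputs, row)))
            for row in weight_rows]

inputs = [0.2, 0.7]                    # the raw input channels
hidden = layer(inputs, [[1.0, 0.0],    # layer 0: each neuron keeps only one
                        [0.0, 1.0]])   # input term (the other is zeroed)
output = layer(hidden, [[0.6, -0.4]])  # layer 1: a single output neuron,
print(output)                          # one hop deeper into the network
```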
The Only Neuron (Layer) You Will Ever Need
In reality, layer 0 isn't necessary to call this a neural network. In fact, the only layer you ever need for something to be called a neural network is the output layer.
It gets even simpler than this. Not only is the output layer the only layer you need to call something a neural network, it also happens that for classification and regression solvers, the output layer of any neural network is always a single neuron.
Yep. Only a single neuron. In reality, no neural network doing classification (labeling something) or regression (predicting something) ever has more than a single output neuron.
"That's not true!" you might say to yourself. "Networks to solve MNIST have 10 output neurons, one for each digit!" Well, yes. And no. In fact MNIST solvers are not a single neural network which classifies images of handwritten digits into 0 though 9. It is TEN neural networks, each of which independently classifies a single digit, 0 through 9.
Let's Do the Math
Let's take a look at what the output layer of a neural network actually does. Suppose we have a network with 3 output neurons, each taking as input the outputs of all the neurons in the previous layer (a common way of performing final classification).
[Figure: The output layer of a neural network with 3 classification groups.]
Each node in this classification layer takes as input all of the outputs of the previous layer, as is common for fully-connected classifiers. Crucially, the values transmitted from the previous layer to each neuron of the output layer are exactly the same. Because of this, we can say that each neuron in the output layer receives the output of the previous layer, which we will call $latex f(i)$, where $latex i$ is the input vector and $latex f(i)$ emits a vector.
If the output of the neural network up to this point is $latex f(i)$, then each neuron in the output layer is a function on $latex f(i)$ that we will call $latex g_0$, $latex g_1$, and $latex g_2$. So to recap:
$latex g_0(f(i))$ is the value of output 0.
$latex g_1(f(i))$ is the value of output 1.
$latex g_2(f(i))$ is the value of output 2.
None of these functions depend on one another. So, what we have is not 1 neural network with 3 output classifiers, but in fact 3 neural networks, each consisting of a single output classifier. Each of these neural networks is a function composition sharing $latex f(i)$.
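In code, that decomposition might look like the sketch below, where `f` stands in for everything before the output layer and each `g_k` is an independent output neuron (all names and weights here are invented for illustration):

```python
import math

def f(i):
    """Shared trunk: everything before the output layer, emitting a vector."""
    return [math.tanh(x) for x in i]  # a stand-in for the real earlier layers

def make_output_neuron(weights):
    """Each g_k is just a single neuron applied to f(i)."""
    return lambda features: sum(x * w for x, w in zip(features, weights))

g0 = make_output_neuron([0.5, -0.2, 0.1])
g1 = make_output_neuron([-0.3, 0.8, 0.4])
g2 = make_output_neuron([0.2, 0.2, -0.6])

i = [1.0, -0.5, 0.25]
shared = f(i)  # computed once; every output network shares it
print(g0(shared), g1(shared), g2(shared))  # three independent outputs
```

Notice that you could delete $latex g_1$ and $latex g_2$ entirely and $latex g_0(f(i))$ would be unchanged, which is exactly what it means for them to be independent networks.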
Training Multi-classifier Networks
When you train a multi-classifier network (an MNIST solver, for example), you are training all of the independent neural networks simultaneously. Since each network performs binary classification, it's pretty simple: one of the networks should return true (1), while the others return false (0). All you do is optimize toward that goal. Most of the weight optimization will occur in the output layer, as that is the layer where the highest degree of differentiation is required (remember that the networks share all of the earlier weight values, so changing something in an earlier layer affects all of the networks).
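As a sketch of what "optimize toward that goal" means, here's one simple way to score the three output networks against a one-hot target (squared error is just one convenient choice of loss, not the only one):

```python
def loss(outputs, label, num_classes=3):
    """Summed per-network squared error against a one-hot target."""
    targets = [1.0 if k == label else 0.0 for k in range(num_classes)]
    # Each term scores one independent output network; summing them means
    # a single training step nudges all of the networks simultaneously.
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))

print(loss([0.9, 0.2, 0.1], label=0))  # network 0 should fire; the rest stay off
```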
It's not always one network true and the rest false, though. You could be building a multi-label model, where it's acceptable for more than one neural network to return true. You could be building a multi-regression model, where each neural network predicts a different thing (e.g., tomorrow's temperature, cloud cover, and chance of rain). You can get more complicated, but it becomes more difficult to optimize (i.e., train), as the difference between the expected output values of the neural networks becomes less black and white.
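Structurally, the only change for these cases is the target vector: it may contain more than one 1, or arbitrary real values. A quick sketch, with invented example values:

```python
def loss_multi(outputs, targets):
    """The same summed per-network loss, but targets need not be one-hot."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))

# Multi-label: classes 0 and 1 both apply to the same input.
print(loss_multi([0.9, 0.8, 0.1], targets=[1.0, 1.0, 0.0]))

# Multi-regression: each network predicts a different real-valued quantity
# (e.g., scaled temperature, cloud cover, and chance of rain).
print(loss_multi([0.6, 0.4, 0.2], targets=[0.55, 0.50, 0.10]))
```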
I hope this post helps you understand how neural networks produce the answers that they do, and how multi-class networks are actually multiple neural networks being simultaneously trained.