Neural network dense layers (or fully connected layers) are the foundation of nearly all neural networks. If you look closely at almost any topology, somewhere there is a dense layer lurking.
This post will cover the history behind dense layers, what they are used for, and how to use them by walking through the "Hello, World!" of neural networks: digit classification.
The Problem with the Perceptron
Check out my previous post, What are Neural Networks?
Neural networks come in many different variations these days, from convolutional and recurrent, to homogeneous and heterogeneous, to linear and branching.
But the original neural network was a single neuron: the perceptron. Perceptrons showed some promise, but came up short when attempting to handle even the simplest logical operations: a single perceptron famously cannot represent XOR. Perceptrons simply didn't have enough complexity to approximate many of the functions that neural networks can approximate today.
The solution was to add more neurons. The question was how. Should neurons be placed in series, building a deeper network of single neurons; in parallel, creating a wider network; or both?
The next iteration of neural networks did both. By adding width, the network could approximate more functions simultaneously, expanding the solution space. By adding depth, the network could combine those parallel approximations to make more informed decisions. The multi-layer perceptron, or MLP, was born.
Multi-layer Perceptrons
*An example of a Multi-layer Perceptron*
The MLP used a layer of neurons, each of which took input from every component of the input. Each neuron was itself a perceptron, and each perceptron in this layer fed its result into every perceptron in the next layer. In Keras, and many other frameworks, this layer type is referred to as the dense (or fully connected) layer.
Neural network dense layers connect each neuron in one layer to every neuron in the next layer. This gives the most expressive mapping possible for a given layer width. It also means there are a lot of parameters to tune, so training very wide and very deep dense networks is computationally expensive.
But for limited function approximations in a limited input space, it was an ideal system. One of the first uses of this type of network was digit identification for ZIP codes. Yann LeCun (then at Bell Labs) developed an MLP-based model to identify digits from low-resolution black and white images (like those that would have been available from digital cameras of the late 1980s).
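To make that parameter count concrete, here is a minimal NumPy sketch (an illustration only, not how Keras implements it) of what a single dense layer computes, using the same 784-input, 512-neuron shape that appears later in this post:

import numpy as np

n_inputs, n_neurons = 784, 512

# One weight for every input-to-neuron connection, plus one bias per neuron
W = 0.01 * np.random.randn(n_inputs, n_neurons)
b = np.zeros(n_neurons)

def dense_forward(x):
    # A dense layer is a matrix multiply, a bias add, and an activation (ReLU here)
    return np.maximum(0, x @ W + b)

x = np.random.rand(n_inputs)   # a flattened 28x28 image
y = dense_forward(x)           # 512 activations, one per neuron

# Fully connecting 784 inputs to 512 neurons already costs a lot of parameters:
print(W.size + b.size)         # 784 * 512 + 512 = 401,920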
A Digit Classifier with Neural Network Dense Layers
We'll be using Keras to build a digit classifier based on neural network dense layers. Keras is a high-level abstraction for designing neural networks in a layer-wise fashion.
Keras also has a set of convenient dataset loader functions to download common datasets. For this example, we'll be using the MNIST dataset, which consists of examples of handwritten digits as 28x28 grayscale images.
For this post, I'll actually be stepping through Keras' mnist_mlp.py example code on GitHub. It's a great resource if you are just learning to use Keras for constructing neural networks, so I recommend checking it out.
Providing Inputs to the Network
Each digit is a 2-dimensional image of 28x28 pixels, or 784 pixels in total. Each pixel will be an input to the network, provided as an unrolled 1-dimensional array (or tensor).
Keras provides an MNIST dataset loader which downloads the dataset of handwritten digits and separates it into training and testing sets, each with images and ground-truth labels.
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
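# Download the MNIST dataset (if it isn't already cached) and split it into training and testing sets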
(x_train, y_train), (x_test, y_test) = mnist.load_data()
Preparing the Dataset for Training
Although the dataset is easily loaded, the images have to be reshaped into 1-dimensional pixel arrays. In addition, the grayscale images represent each pixel as an integer intensity from 0 (black) to 255 (white), but training is easier with values between 0 and 1. Large input values can destabilize training: neurons become over-saturated and error information cannot be properly backpropagated (gradients can explode or vanish). To fix this, we simply divide each pixel value by 255, normalizing to real values between 0 and 1.
We also need to convert the training and testing labels (the y parts) to categorical, one-hot vectors: each image belongs to one (and only one) of 10 categories, one for each digit (0 through 9).
Read my post on Output Layers
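# Training settings: batch size, number of classes, and number of epochs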
batch_size = 128
num_classes = 10
epochs = 20
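# Flatten each 28x28 image into a 784-element vector and scale pixel values from [0, 255] to [0, 1]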
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
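# One-hot encode the integer labels, e.g. the label 5 becomes [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]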
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
Constructing the Network
The construction of the actual neural network requires remarkably few lines of code. This particular network topology consists of only a few layers.
- A Dense layer of 512 neurons which accepts 784 inputs (the input image)
- A Dropout layer, which randomly drops 20% of activations during training to help prevent overfitting to the training data
- A second Dense layer of 512 neurons
- A second Dropout layer
- A third Dense layer of 10 neurons with a softmax activation, which will provide the final classification
model = Sequential()
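# Two hidden dense layers of 512 ReLU neurons, each followed by 20% dropout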
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
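# Output layer: 10 neurons with softmax, giving a probability for each digit class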
model.add(Dense(num_classes, activation='softmax'))
model.summary()
Training the Network
It's finally time to train the network to identify handwritten digits. The first task is to compile the neural network description, at which point we provide a loss function (in this case, the built-in categorical_crossentropy), an optimizer (RMSprop), and a metric to report (accuracy).
Next, the model is fit to the training data and validated against the test data (this example reuses the test set as validation data for simplicity). Once the fit is complete (in this case we train for 20 epochs, or 20 passes through the training data), we evaluate the model to determine its accuracy on the testing data; this simple network should reach roughly 98% test accuracy.
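# Configure training: cross-entropy loss, the RMSprop optimizer, and accuracy as the reported metric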
model.compile(loss='categorical_crossentropy',
optimizer=RMSprop(),
metrics=['accuracy'])
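# Train for 20 epochs, reporting performance on the test set after each epoch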
history = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
validation_data=(x_test, y_test))
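# Evaluate the trained model: score[0] is the test loss, score[1] the test accuracy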
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
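As a quick follow-up (this isn't part of the Keras example), you can inspect an individual prediction. The trained model outputs a 10-element probability vector for each image, and the index of the largest value is the predicted digit:

import numpy as np

# Predict class probabilities for the first test image; shape is (1, 10)
probabilities = model.predict(x_test[:1])
predicted_digit = np.argmax(probabilities[0])
actual_digit = np.argmax(y_test[0])   # labels were one-hot encoded earlier
print('Predicted:', predicted_digit, 'Actual:', actual_digit)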
Dense Layers are Just the Beginning
In the evolution of neural networks, dense layers were the first to appear after the original perceptron. However, a veritable zoo of new neurons, layers, and topologies has been developed in the decades since the MLP was first shown to be a universal function approximator. In future posts I'll cover how these advancements have improved the capabilities and reduced the training time of deeper, more complex neural networks that can translate languages, identify diseases, make complex predictions, and drive automobiles.