Genetic algorithms are a great tool for optimizing all sorts of things, from path planning and scheduling to neural network architectures.
In this post, I'm going to talk about the first step in designing a genetic algorithm: Encoding Individuals. As I mentioned in my post on genetic algorithms basics, genetic algorithms borrow on the concept of phenotypic expression, which is basically the idea that specific sequences of DNA can be mapped to particular traits which are expressed in the individual. For a genetic algorithm, this means knowing what problem you want to solve and developing a succinct way of expressing solutions to that problem as chains of values.
Before we get into the details, many will probably ask "Individuals? Don't you mean chromosomes?" The genetic algorithms literature will often refer to the members of a population as chromosomes. Although this is common terminology, the term individuals is also appropriate, and is in fact the term John Holland originally used in his 1973 SIAM article. Personally, I think using the term individual makes the metaphor easier to follow. It also helps to avoid awkward conversations with biologists and geneticists about genetic algorithms being a metaphor for mathematical optimization, and not a reflection of biological reality.
Once you've decided to write a genetic algorithm, you need to start by figuring out how each potential solution should be encoded.
For some problems, such as path planning, this is simple. As an example, let's consider the traveling salesperson problem (TSP). The problem being solved is how to travel to each of a list of cities and return to your home city, while traveling the shortest distance possible.
This example has a relatively simple encoding. If you enumerate each of the destination cities, then each individual in your population is a list of k non-repeating numbers, where k is the number of cities.
For other problems, the encoding may be more complex. Let's say that you want to use a genetic algorithm to design a neural network. If your network is a multilayer perceptron, you might encode each individual as the number of neurons in each layer - simple enough! If you want to make something more complex, like a heterogeneous network of many different types of layers, then you may need to have an enumeration for the type of neurons in each layer, plus the number of neurons in the layer. In this case, each trait (a layer in the network) is made of 2 elements: the type enumeration and the number of elements.
This can be extended to all sorts of problems. Each trait might consist of many elements, like degrees of freedom in an actuated robot arm or multi-dimensional vectors. The number of traits in an individual can also vary. In these types of encodings (called variable-length encodings), the problem looks less like a permutation and more like a full combination, where solutions may consist of any number of elements of the given trait set.
It's important to remember when designing an encoding that a genetic algorithm is a program running on a computer, and it should therefore be designed with performance on the computer in mind. Performance optimization is a topic that could consume a hundred different blog sites, but in terms of genetic algorithms it is important to remember the following tips.
This is the #1 thing you need to do. Linked lists may provide a convenient means of expressing variable-length encodings, and for writing elegant functions for crossover and mutation, but from a performance perspective they are slow. Try to use contiguous memory allocation (like C arrays) whenever possible.
The alphabet of your encoding is the number of different values that each component of each trait can assume. If we look at DNA as an example, the alphabet is of size 4 (A/G/C/T). For the TSP, the size of the alphabet is the number of cities to be visited. For more complex representations (like real-value numbers), the alphabet can become irrationally large if you allow double-precision values. Perhaps a single or half precision would be just as effective or, better still, and integer alphabet would suffice.
For computer scientists, there is a constant pull to abstract something to the point where it is elegantly represented. Even after years of working on high performance computing codes, I still sometimes fall into this trap. Unfortunately, converting that elegant abstraction back into a form that can be tested can become computationally expensive. Remember that you must be able to express each individual in your population each generation, in order to determine its fitness for mating and survival. If this is a more than constant-time operation, then you can end up spending a lot of cycles doing very little actual work.
I hope that you have a better understanding of how to encode individuals for a genetic algorithm. The process of determining an encoding is one of the most important aspects of the algorithm, and so it should be the part of the problem that takes up the largest fraction of a developer's time. The rest of the operations will be straightforward once you decide on a simple, expressive, and performant encoding.
Please feel free to leave comments below or reach out to me on Twitter or LinkedIn!
In this post, I'm going to talk about the first step in designing a genetic algorithm: Encoding Individuals. As I mentioned in my post on genetic algorithms basics, genetic algorithms borrow on the concept of phenotypic expression, which is basically the idea that specific sequences of DNA can be mapped to particular traits which are expressed in the individual. For a genetic algorithm, this means knowing what problem you want to solve and developing a succinct way of expressing solutions to that problem as chains of values.
Individuals or Chromosomes?
Before we get into the details, many will probably ask "Individuals? Don't you mean chromosomes?" The genetic algorithms literature will often refer to the members of a population as chromosomes. Although this is common terminology, the term individuals is also appropriate, and is in fact the term John Holland originally used in his 1973 SIAM article. Personally, I think using the term individual makes the metaphor easier to follow. It also helps to avoid awkward conversations with biologists and geneticists about genetic algorithms being a metaphor for mathematical optimization, and not a reflection of biological reality.
Encoding Strategies
Once you've decided to write a genetic algorithm, you need to start by figuring out how each potential solution should be encoded.
For some problems, such as path planning, this is simple. As an example, let's consider the traveling salesperson problem (TSP). The problem being solved is how to travel to each of a list of cities and return to your home city, while traveling the shortest distance possible.
This example has a relatively simple encoding. If you enumerate each of the destination cities, then each individual in your population is a list of k non-repeating numbers, where k is the number of cities.
Traits
For other problems, the encoding may be more complex. Let's say that you want to use a genetic algorithm to design a neural network. If your network is a multilayer perceptron, you might encode each individual as the number of neurons in each layer - simple enough! If you want to make something more complex, like a heterogeneous network of many different types of layers, then you may need to have an enumeration for the type of neurons in each layer, plus the number of neurons in the layer. In this case, each trait (a layer in the network) is made of 2 elements: the type enumeration and the number of elements.
This can be extended to all sorts of problems. Each trait might consist of many elements, like degrees of freedom in an actuated robot arm or multi-dimensional vectors. The number of traits in an individual can also vary. In these types of encodings (called variable-length encodings), the problem looks less like a permutation and more like a full combination, where solutions may consist of any number of elements of the given trait set.
Performant Encoding
It's important to remember when designing an encoding that a genetic algorithm is a program running on a computer, and it should therefore be designed with performance on the computer in mind. Performance optimization is a topic that could consume a hundred different blog sites, but in terms of genetic algorithms it is important to remember the following tips.
Use Arrays
This is the #1 thing you need to do. Linked lists may provide a convenient means of expressing variable-length encodings, and for writing elegant functions for crossover and mutation, but from a performance perspective they are slow. Try to use contiguous memory allocation (like C arrays) whenever possible.
Use a Simple Alphabet
The alphabet of your encoding is the number of different values that each component of each trait can assume. If we look at DNA as an example, the alphabet is of size 4 (A/G/C/T). For the TSP, the size of the alphabet is the number of cities to be visited. For more complex representations (like real-value numbers), the alphabet can become irrationally large if you allow double-precision values. Perhaps a single or half precision would be just as effective or, better still, and integer alphabet would suffice.
Use an Encoding that can Quickly Express the Individual
For computer scientists, there is a constant pull to abstract something to the point where it is elegantly represented. Even after years of working on high performance computing codes, I still sometimes fall into this trap. Unfortunately, converting that elegant abstraction back into a form that can be tested can become computationally expensive. Remember that you must be able to express each individual in your population each generation, in order to determine its fitness for mating and survival. If this is a more than constant-time operation, then you can end up spending a lot of cycles doing very little actual work.
Conclusion
I hope that you have a better understanding of how to encode individuals for a genetic algorithm. The process of determining an encoding is one of the most important aspects of the algorithm, and so it should be the part of the problem that takes up the largest fraction of a developer's time. The rest of the operations will be straightforward once you decide on a simple, expressive, and performant encoding.
Please feel free to leave comments below or reach out to me on Twitter or LinkedIn!