It's one of those things that might have an obvious answer if you have ever written scientific software for a vector machine. For everyone else, it's something you probably never even thought about: should I write my code with arrays of structures (or classes), or with structures (or classes) of arrays? Read on to see how both approaches perform, and what kind of readability you can expect from each.
The Performance Perspective
If you commonly write code in C or Fortran, you are typically writing:
- scientific software
- latency sensitive software
- performance critical software
In these cases, you probably know which way to go: structures of arrays. The reason is that in array-oriented code - where your most common operation is traversal over arrays - a structure of arrays will give you better prefetcher and cache performance than an array of structures.
Why is this the answer? Well, it comes down to the concept of unit stride. Stride is the pattern in which you access elements of an array. Do you access every 2nd element? Every 4th? Every 13th? Each of these has different performance characteristics, but the highest-performance option is accessing every element in order (a stride of 1, or "unit stride"). The reason is that most modern processors look a lot like the supercomputer processors of decades past: they have vector processing units (VPUs). VPUs perform the same operation on multiple elements of an array simultaneously, a programming model called Single Instruction / Multiple Data (SIMD).
Example
Let's look at an example of array of structures vs. structure of arrays. In our example, we have coordinates in 3-dimensional space (x,y,z), and we have a bunch of them - perhaps for storing particle data. Most modern programmers will take the object-oriented approach to this problem: create an object (or structure) which contains the data for each particle. Then have a list of different particles. It makes logical sense. Start with the smallest abstraction (a particle), and work your way up to the highest abstraction (a set of particles).
The problem with this approach is that it can fragment the memory space. In the figure below, I've color-coded the two different encoding options. x-values are blue, y-values are orange, and z-values are green:

You can see that the array of structures has a lot of color variation. That's because each particle is stored as an independent unit. If, for example, you wanted to compare all of the particles' x-positions, you would have to hop through the array of structures to find them. This requires more of the most expensive operation you can perform in code, short of writing to disk or using the network card: reading from main memory.
We tend to think of memory as fast hardware. It's true that it's faster than requesting data over the network, or even from the hard drive. But from the CPU's perspective, memory is slow. Very slow, in fact. For modern processors, it can take 350+ clock cycles to load a value from main memory. What can you do in 350 clock cycles? If you do it right, you can do 350 integer or floating-point additions, or 350 multiplications. On a processor with a Fused Multiply-Add (FMA) unit, you could have done 300+ a*x+b operations. That's a lot of math.
So we want to read from memory as little as possible, and utilize the even-faster-than-main-memory memory on the CPU: the cache. The cache is an ultra-fast memory which resides on the processor. And it helps to mask some of the latency of reading from main memory.
How Cache Reads from Memory
Most modern CPUs have byte-addressable memory - meaning each byte of data in memory has its own address. This makes abstraction simple and writing code easier. But it's slow. So the cache doesn't work that way. Caches load cache lines - fixed-size, aligned blocks - from main memory into cache memory. The cache-line size is a design parameter of the CPU, so you need to check for your specific processor. But for most modern CPUs today, the cache line is going to be 64 bytes wide.
Example
What does this mean for you? Well, the cache doesn't load individual addresses - it loads lines. So if you pick an address in the middle or at the end of a line, the whole line has to be loaded into the cache before you can use it. Let's go back to our example, and assume that a cache line holds 8 elements.

So if we want to look at all of the x-positions for our particles, the array of structures requires 3 loads from main memory (each region between the "Cache Line Boundary" lines contains blue cells). The structure of arrays? A single load. That's a two-thirds reduction in memory loads, which is always a good thing.
That's fine, but what if I wanted to use all of the position coordinates? Don't I still have to perform 3 loads? Yes, you do. But the question is: how many streams of loads can I do simultaneously?
Yep. Simultaneously. Modern CPUs can fetch multiple cache lines from memory at the same time (so long as there is no associativity conflict). The catch? The streams have to be referenced through different variables in your code. In our array of structures, this is not possible: there is only 1 array variable. In the structure of arrays example, there are 3 array variables, so the 3 loads can proceed simultaneously. Net effect: the latency of 1 load from main memory.
The Readability Perspective
What about readability? Surely the array of structures approach is easier to read, right? As it turns out, it's not that different.
Let's look at a C example. The array of structures would be coded in the following fashion:
[sourcecode language="C"]
typedef struct {
    int x, y, z;
} particle;

particle particles[8];
[/sourcecode]
Simple enough. How difficult to read would the structure of arrays version be? Not that difficult:
[sourcecode language="C"]
struct {
    int x[8];
    int y[8];
    int z[8];
} particles;
[/sourcecode]
Not too bad. What about usage within the code? Let's take the average in each dimension to find the virtual center of the particles in the array of structures example:
[sourcecode language="C"]
float x_avg = 0.0;
float y_avg = 0.0;
float z_avg = 0.0;
for (int i = 0; i < 8; i++) {
    x_avg += particles[i].x;
    y_avg += particles[i].y;
    z_avg += particles[i].z;
}
x_avg /= 8.0;
y_avg /= 8.0;
z_avg /= 8.0;
[/sourcecode]
The structure of arrays example is remarkably similar:
[sourcecode language="C"]
float x_avg = 0.0;
float y_avg = 0.0;
float z_avg = 0.0;
for (int i = 0; i < 8; i++) {
    x_avg += particles.x[i];
    y_avg += particles.y[i];
    z_avg += particles.z[i];
}
x_avg /= 8.0;
y_avg /= 8.0;
z_avg /= 8.0;
[/sourcecode]
The readability is not an issue, but the performance differences can be sizable (my friends at Intel have done several detailed write-ups on this subject, one in blog form and one as a case study). Plus, this is the type of thing compilers generally cannot optimize automatically: changing the data layout is up to the programmer.
For large codes, translating from array of structures to structure of arrays can be a daunting task, but in the end it might be worth it - especially if your code could benefit from a significant speed boost (and, frankly, whose couldn't?). If you are just starting out, remember to use the structure of arrays approach from the beginning. It's easier to start with the faster layout than to try to retrofit it later once you have something working.
As always, please feel free to start a conversation in the comments section below. Also, don't forget to subscribe over on the right, and connect with me on Twitter and LinkedIn.