Last week I posted about Arrays of Structures vs. Structures of Arrays, and the response was pretty great. I did get some feedback that I had left out one important concept:
It might also be worth considering arrays of structures of arrays when the attributes are always accessed at the same time. Otherwise there might be overloading of one attribute and cache misses for the others.
— ClĂ©ment Foyer (@FoyerClement) April 17, 2018
Consequently, before you even start changing the data layout of your code, you should think about adding some abstraction so you can switch between AoS/SoA/AoSoA at compile time whilst maintaining the same access semantics.
— Ioan Hadade (@IoanHadade) April 16, 2018
Arrays of Structures of Arrays (AoSoAs) are an interesting mixture of Arrays of Structures (AoSs) and Structures of Arrays (SoAs). The layout tiles your memory accesses, giving you multiple contiguous streams that can be read simultaneously. Here's a quick visual primer:
[caption id="attachment_331" align="alignnone" width="840"]
Array of Structures vs Structure of Arrays vs Array of Structures of Arrays[/caption]
I've made the AoSoA example in this picture part of a contiguous memory segment. In C, this is what you get with an array whose size is fixed at compile time, whether it lives on the stack or in static storage. If you allocate on the heap at runtime, the layout can differ slightly, but the key point is that the index of traversal should always walk contiguous memory. In an AoSoA layout, the major traversal index is the inner-most array, which is one-dimensional and therefore always contiguous.
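For the heap case, a single allocation keeps the whole AoSoA in one contiguous block. Here's a minimal sketch; the point_tile type and N_TILES constant are placeholder names of mine, not something from the picture:
[sourcecode language="C"]
#include <stdlib.h>

#define N_TILES 3

/* Hypothetical tile type: eight elements of each attribute. */
typedef struct {
    int x[8];
    int y[8];
    int z[8];
} point_tile;

int main(void) {
    /* One malloc for all tiles keeps the entire AoSoA in a single
       contiguous block, just like the stack/static version. */
    point_tile *tiles = malloc(N_TILES * sizeof *tiles);
    if (tiles == NULL)
        return 1;
    /* ... access as tiles[i].x[j], exactly like a static array ... */
    free(tiles);
    return 0;
}
[/sourcecode]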
What should the dimensions of this tiling look like? Well, there are some general rules of thumb:
You are right. Also, in my experience, starting with tile size = SIMD length is a good enough option, although the nice thing is you can actually use an auto-tuner once you've designed your code well enough to handle such abstractions.
— Ioan Hadade (@IoanHadade) April 16, 2018
This is a very important observation. The smallest tile size should be the width of the Single-Instruction, Multiple-Data (SIMD) unit in the processor you are optimizing for. For older Intel processors with only the 64-bit MMX extensions, that width was 2 (counting 32-bit elements). The 128-bit SSE registers brought it to 4, the 256-bit AVX/AVX2 registers bring it to 8, and AVX-512 pushes it to 16. The second thing to consider is the length of a cache line. As I talked about last time, computers read memory in cache lines, not bytes or words. So whenever possible, pick a tile length that is a common multiple of the SIMD width and the number of elements in a cache line (on modern Intel processors, the 64-byte cache line is already a multiple of the SIMD width).
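If you want the tile width to track the target architecture automatically, the compiler's predefined macros can select it at build time. A minimal sketch, assuming GCC/Clang-style predefines and 32-bit elements (TILE_WIDTH is a name I made up, not a standard macro):
[sourcecode language="C"]
/* Select a tile width matching the widest SIMD extension enabled at
   compile time (GCC/Clang predefines; widths in 32-bit elements).
   16 elements also exactly fills a 64-byte cache line. */
#if defined(__AVX512F__)
#define TILE_WIDTH 16  /* 512-bit registers */
#elif defined(__AVX__)
#define TILE_WIDTH 8   /* 256-bit registers */
#elif defined(__SSE__)
#define TILE_WIDTH 4   /* 128-bit registers */
#else
#define TILE_WIDTH 2   /* 64-bit MMX-era fallback */
#endif
[/sourcecode]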
The same Intel blog and white paper I linked to last time also contain a section on AoSoAs, which I recommend everyone take a look at.
Coding AoSoAs
The way you code AoSoAs isn't very different from AoSs or SoAs. Here is a definition that builds the same layout as in the picture:
[sourcecode language="C"]
/* One tile: eight elements of each attribute, stored back-to-back. */
typedef struct {
    int x[8];
    int y[8];
    int z[8];
} tile;

/* Three tiles, so 24 elements per attribute in total. */
tile tiles[3];
[/sourcecode]
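As a quick sanity check (my addition, using the standard offsetof macro), you can print the offsets and confirm that each attribute's eight elements sit back-to-back and the tiles follow one another contiguously:
[sourcecode language="C"]
#include <stddef.h>
#include <stdio.h>

int main(void) {
    /* With 4-byte ints: x at offset 0, y at 32, z at 64, and each
       tile spans 96 contiguous bytes, so the three-tile array
       occupies 288 bytes in one block. */
    printf("x: %zu  y: %zu  z: %zu  sizeof(tile): %zu\n",
           offsetof(tile, x), offsetof(tile, y),
           offsetof(tile, z), sizeof(tile));
    return 0;
}
[/sourcecode]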
The definition is not very different, and usage within code is very similar as well:
[sourcecode language="C"]
float x_avg = 0.0;
float y_avg = 0.0;
float z_avg = 0.0;
/* Outer loop walks the tiles; inner loop walks within a tile. */
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 8; j++) {
        x_avg += tiles[i].x[j];
        y_avg += tiles[i].y[j];
        z_avg += tiles[i].z[j];
    }
}
/* 3 tiles * 8 elements = 24 values per attribute. */
x_avg /= 24.0;
y_avg /= 24.0;
z_avg /= 24.0;
[/sourcecode]
This loop can be unrolled and vectorized, just like with SoAs. But it also has the advantage of being parallelizable across the exterior dimension while maintaining full-width SIMD operations and full cache-line access:
[sourcecode language="C"]
float x_avg = 0.0;
float y_avg = 0.0;
float z_avg = 0.0;
/* Each thread sums whole tiles; OpenMP combines the partial sums. */
#pragma omp parallel for reduction(+:x_avg,y_avg,z_avg)
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 8; j++) {
        x_avg += tiles[i].x[j];
        y_avg += tiles[i].y[j];
        z_avg += tiles[i].z[j];
    }
}
x_avg /= 24.0;
y_avg /= 24.0;
z_avg /= 24.0;
[/sourcecode]
Now we have code that is parallelized in the non-SIMD dimension (try to always parallelize a level above the SIMD operation level). The tiles are processed in parallel, and each thread's local copies of x_avg, y_avg, and z_avg are then summed by the reduction so the final averages can be calculated. The result is efficient use of the SIMD math units, good cache use, and shared-memory parallelism, all in one code!
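If you'd rather not lean entirely on the auto-vectorizer, you can make the SIMD level explicit too. Here's a sketch using OpenMP 4.0's simd construct (my addition; whether it beats plain auto-vectorization depends on your compiler):
[sourcecode language="C"]
/* Threads split the tiles between them; each thread's inner loop is
   explicitly marked as a SIMD loop over the contiguous 8-element
   attribute arrays. */
#pragma omp parallel for reduction(+:x_avg,y_avg,z_avg)
for (int i = 0; i < 3; i++) {
    #pragma omp simd reduction(+:x_avg,y_avg,z_avg)
    for (int j = 0; j < 8; j++) {
        x_avg += tiles[i].x[j];
        y_avg += tiles[i].y[j];
        z_avg += tiles[i].z[j];
    }
}
[/sourcecode]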
As always, please feel free to start a conversation in the comments section below. Also, don't forget to subscribe over on the right, and connect with me on Twitter and LinkedIn.