array
C++ library for multi-dimensional arrays
About

This library provides a multidimensional array class for C++, with the following design goals:

The library uses some ideas established in other existing projects, such as numpy, Halide, and Eigen. Array shapes are specified as a list of N dimensions, where each dimension has parameters such as an extent and a stride. Array references and objects use shape objects to map N-dimensional indices to a flat index. N-dimensional indices are mapped to flat offsets with the following formula:

1 flat_offset = (x0 - min0)*stride0 + (x1 - min1)*stride1 + ... + (xN - minN)*strideN

where:

Arrays efficiently support advanced manipulations like cropping, slicing, and splitting arrays and loops, all while preserving compile-time constant parameters when possible. Although it is a heavily templated library, incorrect usage generates informative and helpful error messages. Typically, an issue will result in only one error message, located at the site of the problem in user code. This is something we test for.

Many other libraries offering multi-dimensional arrays or tensors allow compile-time constant shapes. However, most if not all of them only allow either all of the shape parameters to be compile-time constant, or none of them. This is really limiting; often only a few key parameters of a shape need to be compile-time constant for performance, while other dimensions need flexibility to accommodate runtime-valued shape parameters. Some examples of this are:

Some other features of the library are:

For more detailed documentation, see the generated documentation.

Usage

Shapes

The basic types provided by the library are:

To define an array, define a shape type, and use it to define an array object:

1 {c++}
2  using my_3d_shape_type = shape<dim<>, dim<>, dim<>>;
3  constexpr int width = 16;
4  constexpr int height = 10;
5  constexpr int depth = 3;
6  my_3d_shape_type my_3d_shape(width, height, depth);
7  array<int, my_3d_shape_type> my_array(my_3d_shape);

General shapes and arrays like this have the following built-in aliases:

Access and iteration

Accessing array or array_ref is done via operator(...) and operator[index_type]. There are both variadic and index_type overloads of operator(). index_type is a specialization of std::tuple defined by shape (and array and array_ref), e.g. my_3d_shape_type::index_type.

1 {c++}
2  for (int z = 0; z < depth; z++) {
3  for (int y = 0; y < height; y++) {
4  for (int x = 0; x < width; x++) {
5  // Variadic version:
6  my_array(x, y, z) = 5;
7  // Or the index_type version:
8  my_array[{x, y, z}] = 5;
9  }
10  }
11  }

array::for_each_value and array_ref::for_each_value calls a function with a reference to each value in the array.

1 {c++}
2  my_array.for_each_value([](int& value) {
3  value = 5;
4  });

for_all_indices is a free function taking a shape object and a function to call with every index in the shape. for_each_index is similar, calling a free function with the index as an instance of the index type my_3d_shape_type::index_type.

1 {c++}
2  for_all_indices(my_3d_shape, [&](int x, int y, int z) {
3  my_array(x, y, z) = 5;
4  });
5  for_each_index(my_3d_shape, [&](my_3d_shape_type::index_type i) {
6  my_array[i] = 5;
7  });

The order in which each of for_each_value, for_each_index, and for_all_indices execute their traversal over the shape is defined by shape_traits<Shape>. The default implementation of shape_traits<Shape>::for_each_index iterates over the innermost dimension as the innermost loop, and proceeds in order to the outermost dimension.

1 {c++}
2  my_3d_shape_type my_shape(2, 2, 2);
3  for_all_indices(my_shape, [](int x, int y, int z) {
4  std::cout << x << ", " << y << ", " << z << std::endl;
5  });
6  // Output:
7  // 0, 0, 0
8  // 1, 0, 0
9  // 0, 1, 0
10  // 1, 1, 0
11  // 0, 0, 1
12  // 1, 0, 1
13  // 0, 1, 1
14  // 1, 1, 1

The default implementation of shape_traits<Shape>::for_each_value iterates over a dynamically optimized shape. The order will vary depending on the properties of the shape.

There are overloads of for_all_indices and for_each_index accepting a permutation to indicate the loop order. In this example, the permutation <2, 0, 1> iterates over the z dimension as the innermost loop, then x, then y.

1 {c++}
2  for_all_indices<2, 0, 1>(my_shape, [](int x, int y, int z) {
3  std::cout << x << ", " << y << ", " << z << std::endl;
4  });
5  // Output:
6  // 0, 0, 0
7  // 0, 0, 1
8  // 1, 0, 0
9  // 1, 0, 1
10  // 0, 1, 0
11  // 0, 1, 1
12  // 1, 1, 0
13  // 1, 1, 1

Compile-time constant shapes

In the previous examples, no array parameters are compile time constants, so all of these accesses and loops expand to a flat_offset expression where the mins, extents, and strides are runtime variables. This can prevent the compiler from generating efficient code. For example, the compiler may be able to auto-vectorize these loops, but if the stride of the dimension accessed by the vectorized loop is a runtime variable, the compiler will have to generate gathers and scatters instead of vector load and store instructions, even if the stride is one at runtime.

To avoid this, we need to make array parameters compile time constants. However, while making array parameters compile time constants helps the compiler generate efficient code, it also makes the program less flexible.

This library helps balance this tradeoff by enabling any of the array parameters to be compile time constants, but not require it. Which parameters should be made into compile time constants will vary depending on the use case. A common case is to make the innermost dimension have stride 1:

1 {c++}
2  using my_dense_3d_shape_type = shape<
3  dim</*Min=*/dynamic, /*Extent=*/dynamic, /*Stride=*/1>,
4  dim<>,
5  dim<>>;
6  array<char, my_dense_3d_shape_type> my_dense_array({16, 3, 3});
7  for (auto x : my_dense_array.x()) {
8  // The compiler knows that each loop iteration accesses
9  // elements that are contiguous in memory for contiguous x.
10  my_dense_array(x, y, z) = 0;
11  }

A dimension with unknown min and extent, and stride 1, is common enough that it has a built-in alias dense_dim<>, and shapes with a dense first dimension are common enough that shapes and arrays have the following built-in aliases:

There are other common examples that are easy to support. A very common array is an image where 3-channel RGB or 4-channel RGBA pixels are stored together in a 'chunky' format.

1 {c++}
2 template <int Channels, int XStride = Channels>
3 using chunky_image_shape = shape<
4  strided_dim</*Stride=*/XStride>,
5  dim<>,
6  dense_dim</*Min=*/0, /*Extent=*/Channels>>;
7 array<uint8_t, chunky_image_shape<3>> my_chunky_image({1920, 1080, {}});

strided_dim<> is another alias for dim<> where the min and extent are unknown, and the stride may be a compile-time constant. image.h is a small helper library of typical image shape and object types defined using arrays, including chunky_image_shape.

Another common example is matrices indexed (row, column) with the column dimension stored densely:

1 {c++}
2  using matrix_shape = shape<dim<>, dense_dim<>>;
3  array<double, matrix_shape> my_matrix({10, 4});
4  for (auto i : my_matrix.i()) {
5  for (auto j : my_matrix.j()) {
6  // This loop ordering is efficient for this type.
7  my_matrix(i, j) = 0.0;
8  }
9  }

There are also many use cases for matrices with small constant sizes. This library provides auto_allocator<T, N>, an std::allocator compatible allocator that only allocates buffers of N small fixed sized objects with automatic storage. This makes it possible to define a small matrix type that will not use any dynamic memory allocation:

1 {c++}
2 template <int M, int N>
3 using small_matrix_shape = shape<
4  dim<0, M>,
5  dense_dim<0, N>>;
6 template <typename T, int M, int N>
7 using small_matrix = array<T, small_matrix_shape<M, N>, auto_allocator<T, M*N>>;
8 small_matrix<float, 4, 4> my_small_matrix;
9 // my_small_matrix is only one fixed size allocation, no new/delete calls
10 // happen. sizeof(small_matrix) = sizeof(float) * 4 * 4 + (overhead)

matrix.h is a small helper library of typical matrix shape and object types defined using arrays, including the examples above.

Slicing, cropping, and splitting

Shapes and arrays can be sliced and cropped using interval<Min, Extent> objects, which are similar to dim<>s. They can have either a compile-time constant or runtime valued min and extent. range(begin, end) is a helper functions to construct an interval.

1 {c++}
2  // Slicing
3  array_ref_of_rank<int, 2> channel1 = my_array(_, _, 1);
4  array_ref_of_rank<int, 1> row4_channel2 = my_array(_, 4, 2);
5 
6  // Cropping
7  array_ref_of_rank<int, 3> top_left = my_array(interval<>{0, 2}, interval<>{0, 4}, _);
8  array_ref_of_rank<int, 2> center_channel0 = my_array(interval<>{1, 2}, interval<>{2, 4}, 0);

The _ or all constants are placeholders indicating the entire dimension should be preserved. Dimensions that are sliced are removed from the shape of the array.

When iterating a dim, it is possible to split it first by either a compile-time constant or a runtime-valued split factor. A split dim produces an iterator range that produces interval<> objects. This allows easy tiling of algorithms:

1 {c++}
2  constexpr index_t x_split_factor = 3;
3  const index_t y_split_factor = 5;
4  for (auto yo : split(my_array.y(), y_split_factor)) {
5  for (auto xo : split<x_split_factor>(my_array.x())) {
6  auto tile = my_array(xo, yo, _);
7  for (auto x : tile.x()) {
8  // The compiler knows this loop has a fixed extent x_split_factor!
9  tile(x, y, z) = x;
10  }
11  }
12  }

Both loops have extents that are not divided by their split factors. To avoid generating an array_ref referencing data out of bounds of the original array, the split iterators modify the last iteration. The behavior of each kind of split is different:

Compile-time constant split factors produce ranges with compile-time extents, and shapes and arrays cropped with these ranges will have a corresponding dim<> with a compile-time constant extent. This allows potentially significant optimizations to be expressed relatively easily!

Einstein reductions

The ein_reduce.h header provides Einstein notation reductions and summation helpers, similar to np.einsum or tf.einsum. These are zero-cost abstractions for describing loops that allow expressing a wide variety of array operations. Einstein notation expression operands are constructed using the ein<i, j, ...>(x) helper function, where x can be any callable object, including an array<> or array_ref<>. i, j, ... are constexpr integers indicating which dimensions of the reduction operation are used to evaluate x. Therefore, the number of arguments of x must match the number of dimensions provided to ein. Operands can be combined into larger expressions using typical binary operators.

Einstein notation expressions can be evaluated using one of the following functions:

Here are some examples using these reduction operations to compute summations:

1 {c++}
2  // Name the dimensions we use in Einstein reductions.
3  enum { i = 0, j = 1, k = 2, l = 3 };
4 
5  // Dot product dot1 = dot2 = x.y:
6  vector<float> x({10});
7  vector<float> y({10});
8  float dot1 = make_ein_sum<float>(ein<i>(x) * ein<i>(y));
9  float dot2 = 0.0f;
10  ein_reduce(ein(dot2) += ein<i>(x) * ein<i>(y));
11 
12  // Matrix multiply C1 = C2 = A*B:
13  matrix<float> A({10, 10});
14  matrix<float> B({10, 15});
15  matrix<float> C1({10, 15});
16  fill(C1, 0.0f);
17  ein_reduce(ein<i, j>(C1) += ein<i, k>(A) * ein<k, j>(B));
18  auto C2 = make_ein_sum<float, i, j>(ein<i, k>(A) * ein<k, j>(B));

We can use arbitrary functions as expression operands:

1 {c++}
2  // Cross product array crosses_n = x_n x y_n:
3  using vector_array = array<float, shape<dim<0, 3>, dense_dim<>>>;
4  vector_array xs({3, 100});
5  vector_array ys({3, 100});
6  vector_array crosses({3, 100});
7  auto epsilon3 = [](int i, int j, int k) { return sgn(j - i) * sgn(k - i) * sgn(k - j); };
8  ein_reduce(ein<i, l>(crosses) += ein<i, j, k>(epsilon3) * ein<j, l>(xs) * ein<k, l>(ys));

These operations generally produce loop nests that are as readily optimized by the compiler as hand-written loops. In this example, crosses, xs, and ys have shape shape<dim<0, 3>, dense_dim<>>, so the compiler will see small constant loops and likely be able to optimize this to similar efficiency as hand-written code, by unrolling and evaluating the function at compile time. The compiler also should be able to efficiently vectorize the l dimension of the ein_reduce, because that dimension has a constant stride 1.

The expression can be another kind of reduction, or not a reduction at all:

1 {c++}
2  // Matrix transpose AT = A^T:
3  matrix<float> AT({10, 10});
4  ein_reduce(ein<i, j>(AT) = ein<j, i>(A));
5 
6  // Maximum of each x-y plane of a 3D volume:
7  dense_array<float, 3> volume({8, 12, 20});
8  dense_array<float, 1> max_xy({20});
9  auto r = ein<k>(max_xy);
10  ein_reduce(r = max(r, ein<i, j, k>(volume)));

Reductions can have a mix of result and operand types:

1 {c++}
2  // Compute X1 = X2 = DFT[x]:
3  using complex = std::complex<float>;
4  dense_array<complex, 2> W({10, 10});
5  for_all_indices(W.shape(), [&](int j, int k) {
6  W(j, k) = exp(-2.0f * pi * complex(0, 1) * (static_cast<float>(j * k) / 10));
7  });
8  // Using `make_ein_sum`, returning the result:
9  auto X1 = make_ein_sum<complex, j>(ein<j, k>(W) * ein<k>(x));
10  // Using `ein_reduce`, computing the result in place:
11  vector<complex> X2({10}, 0.0f);
12  ein_reduce(ein<j>(X2) += ein<j, k>(W) * ein<k>(x));

These reductions also compose well with loop transformations like split and array operations like slicing and cropping. For example, a matrix multiplication can be tiled like so:

1 {c++}
2  // Adjust this depending on the target architecture. For AVX2, vectors are 256-bit.
3  constexpr index_t vector_size = 32 / sizeof(float);
4 
5  // We want the tiles to be big without spilling the accumulators to the stack.
6  constexpr index_t tile_rows = 4;
7  constexpr index_t tile_cols = vector_size * 3;
8 
9  for (auto io : split<tile_rows>(C.i())) {
10  for (auto jo : split<tile_cols>(C.j())) {
11  auto C_ijo = C(io, jo);
12  fill(C_ijo, 0.0f);
13  ein_reduce(ein<i, j>(C_ijo) += ein<i, k>(A) * ein<k, j>(B));
14  }
15  }

This generates the following machine code(*) for the inner loop using clang 18 with -O2 -ffast-math:

1 vbroadcastss (%rsi,%rdi,4), %ymm12
2 vmovups -64(%r12,%r15,4), %ymm13
3 vmovups -32(%r12,%r15,4), %ymm14
4 vmovups (%r12,%r15,4), %ymm15
5 addq %rbx, %r15
6 vfmadd231ps %ymm12, %ymm13, %ymm11
7 vfmadd231ps %ymm12, %ymm14, %ymm10
8 vfmadd231ps %ymm12, %ymm15, %ymm9
9 vbroadcastss (%r8,%rdi,4), %ymm12
10 vfmadd231ps %ymm12, %ymm13, %ymm8
11 vfmadd231ps %ymm12, %ymm14, %ymm7
12 vfmadd231ps %ymm12, %ymm15, %ymm6
13 vbroadcastss (%r10,%rdi,4), %ymm12
14 vfmadd231ps %ymm12, %ymm13, %ymm5
15 vfmadd231ps %ymm12, %ymm14, %ymm4
16 vfmadd231ps %ymm12, %ymm15, %ymm3
17 vbroadcastss (%rdx,%rdi,4), %ymm12
18 incq %rdi
19 vfmadd231ps %ymm13, %ymm12, %ymm2
20 vfmadd231ps %ymm14, %ymm12, %ymm1
21 vfmadd231ps %ymm12, %ymm15, %ymm0
22 cmpq %rdi, %r13
23 jne .LBB8_12

This is 40-50x faster than a naive C implementation of nested loops on my machine, and it should be within a factor of 2 of the peak possible performance. A similar example that is only a little more complicated achieves around 90% of the peak possible performance.

(*) Unfortunately, this doesn't generate performant code currently and requires a few tweaks to work around an issue in LLVM. See the matrix example for the code that produces the above assembly. To summarise, it is currently necessary to perform the accumulation into a temporary buffer instead of accumulating directly into the output.

CUDA support

Most of the functions in this library are marked with __device__, enabling them to be used in CUDA code. This includes array_ref<T, Shape> and most of its helper functions. The exceptions to this are functions and classes that allocate memory, primarily array<T, Shape, Alloc>.

Try it on Compiler Explorer

This library is available on Compiler Explorer as Array.