array
C++ library for multi-dimensional arrays
|
This library provides a multidimensional array class for C++, with the following design goals:
The library uses some ideas established in other existing projects, such as numpy, Halide, and Eigen. Array shapes are specified as a list of N dimensions, where each dimension has parameters such as an extent and a stride. Array references and objects use shape objects to map N-dimensional indices to a flat index. N-dimensional indices are mapped to flat offsets with the following formula:
where:
xN
are the indices in each dimension.minN
are the mins in each dimension. The min is the value of the first in-range index in this dimension (the max is minN + extentN - 1
).strideN
are the distances in the flat offsets between elements in each dimension.Arrays efficiently support advanced manipulations like cropping, slicing, and splitting arrays and loops, all while preserving compile-time constant parameters when possible. Although it is a heavily templated library, incorrect usage generates informative and helpful error messages. Typically, an issue will result in only one error message, located at the site of the problem in user code. This is something we test for.
Many other libraries offering multi-dimensional arrays or tensors allow compile-time constant shapes. However, most if not all of them only allow either all of the shape parameters to be compile-time constant, or none of them. This is really limiting; often only a few key parameters of a shape need to be compile-time constant for performance, while other dimensions need flexibility to accommodate runtime-valued shape parameters. Some examples of this are:
Some other features of the library are:
__device__
functions.For more detailed documentation, see the generated documentation.
The basic types provided by the library are:
dim<Min, Extent, Stride>
, a description of a single dimension. The template parameters specify a compile-time constant min, extent, or stride, or are dynamic
(meaning unknown) and are specified at runtime.shape<Dim0, Dim1, ...>
, a description of multiple dimensions. Dim0
is referred to as the innermost dimension.array<T, Shape, Allocator>
, a container following the conventions of std::vector
where possible. This container manages the allocation of a buffer associated with a Shape
.array_ref<T, Shape>
, a wrapper for addressing existing memory with a shape Shape
.To define an array, define a shape type, and use it to define an array object:
General shapes and arrays like this have the following built-in aliases:
shape_of_rank<N>
, an N-dimensional shape.array_ref_of_rank<T, N>
and array_of_rank<T, N, Allocator>
, N-dimensional arrays with a shape of shape_of_rank<N>
.Accessing array
or array_ref
is done via operator(...)
and operator[index_type]
. There are both variadic and index_type
overloads of operator()
. index_type
is a specialization of std::tuple
defined by shape
(and array
and array_ref
), e.g. my_3d_shape_type::index_type
.
array::for_each_value
and array_ref::for_each_value
calls a function with a reference to each value in the array.
for_all_indices
is a free function taking a shape object and a function to call with every index in the shape. for_each_index
is similar, calling a free function with the index as an instance of the index type my_3d_shape_type::index_type
.
The order in which each of for_each_value
, for_each_index
, and for_all_indices
execute their traversal over the shape is defined by shape_traits<Shape>
. The default implementation of shape_traits<Shape>::for_each_index
iterates over the innermost dimension as the innermost loop, and proceeds in order to the outermost dimension.
The default implementation of shape_traits<Shape>::for_each_value
iterates over a dynamically optimized shape. The order will vary depending on the properties of the shape.
There are overloads of for_all_indices
and for_each_index
accepting a permutation to indicate the loop order. In this example, the permutation <2, 0, 1>
iterates over the z
dimension as the innermost loop, then x
, then y
.
In the previous examples, no array parameters are compile time constants, so all of these accesses and loops expand to a flat_offset
expression where the mins, extents, and strides are runtime variables. This can prevent the compiler from generating efficient code. For example, the compiler may be able to auto-vectorize these loops, but if the stride of the dimension accessed by the vectorized loop is a runtime variable, the compiler will have to generate gathers and scatters instead of vector load and store instructions, even if the stride is one at runtime.
To avoid this, we need to make array parameters compile time constants. However, while making array parameters compile time constants helps the compiler generate efficient code, it also makes the program less flexible.
This library helps balance this tradeoff by enabling any of the array parameters to be compile time constants, but not require it. Which parameters should be made into compile time constants will vary depending on the use case. A common case is to make the innermost dimension have stride 1:
A dimension with unknown min and extent, and stride 1, is common enough that it has a built-in alias dense_dim<>
, and shapes with a dense first dimension are common enough that shapes and arrays have the following built-in aliases:
dense_shape<N>
, an N-dimensional shape with the first dimension being dense.dense_array_ref<T, N>
and dense_array<T, N, Allocator>
, N-dimensional arrays with a shape of dense_shape<N>
.There are other common examples that are easy to support. A very common array is an image where 3-channel RGB or 4-channel RGBA pixels are stored together in a 'chunky' format.
strided_dim<>
is another alias for dim<>
where the min and extent are unknown, and the stride may be a compile-time constant. image.h
is a small helper library of typical image shape and object types defined using arrays, including chunky_image_shape
.
Another common example is matrices indexed (row, column)
with the column dimension stored densely:
There are also many use cases for matrices with small constant sizes. This library provides auto_allocator<T, N>
, an std::allocator
compatible allocator that only allocates buffers of N
small fixed sized objects with automatic storage. This makes it possible to define a small matrix type that will not use any dynamic memory allocation:
matrix.h
is a small helper library of typical matrix shape and object types defined using arrays, including the examples above.
Shapes and arrays can be sliced and cropped using interval<Min, Extent>
objects, which are similar to dim<>
s. They can have either a compile-time constant or runtime valued min and extent. range(begin, end)
is a helper functions to construct an interval
.
The _
or all
constants are placeholders indicating the entire dimension should be preserved. Dimensions that are sliced are removed from the shape of the array.
When iterating a dim
, it is possible to split
it first by either a compile-time constant or a runtime-valued split factor. A split dim
produces an iterator range that produces interval<>
objects. This allows easy tiling of algorithms:
Both loops have extents that are not divided by their split factors. To avoid generating an array_ref
referencing data out of bounds of the original array, the split iterators modify the last iteration. The behavior of each kind of split is different:
yo
can vary, it is reduced on the last iteration. This strategy can accommodate dimensions of any extent.xo
must be a constant, the last iteration will be shifted to overlap the previous iteration. This strategy requires the extent of the dimension being split is greater than the split factor (but not a multiple!)Compile-time constant split factors produce ranges with compile-time extents, and shapes and arrays cropped with these ranges will have a corresponding dim<>
with a compile-time constant extent. This allows potentially significant optimizations to be expressed relatively easily!
The ein_reduce.h
header provides Einstein notation reductions and summation helpers, similar to np.einsum or tf.einsum. These are zero-cost abstractions for describing loops that allow expressing a wide variety of array operations. Einstein notation expression operands are constructed using the ein<i, j, ...>(x)
helper function, where x
can be any callable object, including an array<>
or array_ref<>
. i, j, ...
are constexpr
integers indicating which dimensions of the reduction operation are used to evaluate x
. Therefore, the number of arguments of x
must match the number of dimensions provided to ein
. Operands can be combined into larger expressions using typical binary operators.
Einstein notation expressions can be evaluated using one of the following functions:
ein_reduce(expression)
, evaluate an arbitrary Einstein notation expression
.lhs = make_ein_sum<T, i, j, ...>(rhs)
, evaluate the summation ein<i, j, ...>(lhs) += rhs
, and return lhs
. The shape of lhs
is inferred from the expression.Here are some examples using these reduction operations to compute summations:
We can use arbitrary functions as expression operands:
These operations generally produce loop nests that are as readily optimized by the compiler as hand-written loops. In this example, crosses
, xs
, and ys
have shape shape<dim<0, 3>, dense_dim<>>
, so the compiler will see small constant loops and likely be able to optimize this to similar efficiency as hand-written code, by unrolling and evaluating the function at compile time. The compiler also should be able to efficiently vectorize the l
dimension of the ein_reduce
, because that dimension has a constant stride 1.
The expression can be another kind of reduction, or not a reduction at all:
Reductions can have a mix of result and operand types:
These reductions also compose well with loop transformations like split
and array operations like slicing and cropping. For example, a matrix multiplication can be tiled like so:
This generates the following machine code(*) for the inner loop using clang 18 with -O2 -ffast-math:
This is 40-50x faster than a naive C implementation of nested loops on my machine, and it should be within a factor of 2 of the peak possible performance. A similar example that is only a little more complicated achieves around 90% of the peak possible performance.
(*) Unfortunately, this doesn't generate performant code currently and requires a few tweaks to work around an issue in LLVM. See the matrix example for the code that produces the above assembly. To summarise, it is currently necessary to perform the accumulation into a temporary buffer instead of accumulating directly into the output.
Most of the functions in this library are marked with __device__
, enabling them to be used in CUDA code. This includes array_ref<T, Shape>
and most of its helper functions. The exceptions to this are functions and classes that allocate memory, primarily array<T, Shape, Alloc>
.
This library is available on Compiler Explorer as Array
.