VCTR
An expression template provides one or more ways to compute a certain (mathematical) expression. The most basic expression template implementation for a unary expression, that is, an expression which transforms exactly one source vector into one destination vector, must look like this:
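Independent of the library's own base classes and helper types, the core idea can be sketched in plain C++. This is a minimal illustration, not the library's implementation: `SimpleVector`, `PlusOne` and `assign` are hypothetical names, and the two template parameters (an extent and the source type) are an assumption about the class template signature.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for a vector type; not the library's Vector class.
template <class T, std::size_t n>
struct SimpleVector
{
    using value_type = T;

    std::array<T, n> values {};

    constexpr std::size_t size() const { return n; }
    constexpr T operator[] (std::size_t i) const { return values[i]; }
};

// Hypothetical unary expression: adds 1 to every source element, lazily.
// The extent parameter is unused here and only mirrors the assumed signature.
template <std::size_t extent, class SrcType>
class PlusOne
{
public:
    using value_type = typename SrcType::value_type;

    constexpr explicit PlusOne (SrcType s) : src (std::move (s)) {}

    // The expression has as many elements as its source.
    constexpr std::size_t size() const { return src.size(); }

    // Nothing is computed until an element is requested.
    constexpr value_type operator[] (std::size_t i) const { return src[i] + value_type (1); }

private:
    SrcType src;
};

// Evaluates an expression element by element into a destination vector.
template <class T, std::size_t n, class Expression>
void assign (SimpleVector<T, n>& dst, const Expression& e)
{
    assert (e.size() == n);

    for (std::size_t i = 0; i < n; ++i)
        dst.values[i] = e[i];
}
```

The expression object only stores its source. The actual work happens in `operator[]`, which is called when the expression is assigned to a destination.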
Reduction expressions, which are discussed separately below, are the only exception to the minimum requirements outlined above.
Every kind of expression has to take at least two template arguments:
Since all the member functions described above need to be implemented for every expression, there is a macro that replaces all the repetitive boilerplate code. With it, the example above looks like this:
This should be the basic starting point for every expression template.
In order to use the operator<< chaining syntax, a unary expression has to be wrapped into an ExpressionChainBuilder instance. An expression chain builder holds all information about how to build a chain of expressions without actually building one right away. It supplies overloads for operator<< that return different objects, based on what they are called on.
Pure ExpressionChainBuilder instances have no members and should be constexpr. Simply declare it like this at the end of your implementation:
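To illustrate the idea of a stateless builder, here is a simplified sketch that does not use the library's classes. `ChainBuilder`, `Doubled`, `Source` and `doubled` are hypothetical names. The sketch covers only the operator<< overload that is applied to a source, not the overload that combines two builders.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical source type standing in for a vector with three elements.
struct Source
{
    static constexpr std::size_t extent = 3;

    std::array<int, extent> values {};

    constexpr std::size_t size() const { return extent; }
    constexpr int operator[] (std::size_t i) const { return values[i]; }
};

// Hypothetical unary expression that doubles each source element.
template <std::size_t sourceExtent, class SrcType>
struct Doubled
{
    static constexpr std::size_t extent = sourceExtent;

    SrcType src;

    constexpr std::size_t size() const { return src.size(); }
    constexpr int operator[] (std::size_t i) const { return 2 * src[i]; }
};

// A builder without members: it only encodes which expression to create.
template <template <std::size_t, class> class ExpressionTemplate>
struct ChainBuilder
{
    // Applying the builder to a source creates the actual expression.
    template <class SrcType>
    constexpr auto operator<< (const SrcType& src) const
    {
        return ExpressionTemplate<SrcType::extent, SrcType> { src };
    }
};

// The stateless constexpr instance that user code writes in front of operator<<.
inline constexpr ChainBuilder<Doubled> doubled {};
```

With these definitions, `doubled << v` yields a `Doubled` expression over `v`, and `doubled << (doubled << v)` nests two of them.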
Some expressions need runtime values to work, for example the clamp expression, which needs to know the limit values. Since the actual expression instance does not yet exist while the expression chain builder is used to set up an expression chain, the generalised ExpressionChainBuilderWithRuntimeArgs class allows storing one or more runtime arguments, which are applied to each expression once the expression chain is set up. Expressions that need runtime arguments don't expose a constexpr ExpressionChainBuilder instance but a free function that takes the arguments and returns an expression chain builder instance that stores them. This is done with the makeExpressionChainBuilderWithRuntimeArgs function:
When the expression instance is created, the chain builder has to apply the arguments to the expression. Therefore, we need to add an applyRuntimeArgs member function to the expression template, which accepts the same number of arguments as passed to the expression chain builder instance and applies them to the actual expression:
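The interplay between a builder that stores arguments and an expression that receives them later can be sketched as follows. This is an illustration under assumptions, not the library's code: `ChainBuilderWithArgs`, `Clamped`, `Source` and `clampTo` are hypothetical names, and the extent template parameter is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical source type standing in for a vector.
struct Source
{
    std::vector<int> values;

    std::size_t size() const { return values.size(); }
    int operator[] (std::size_t i) const { return values[i]; }
};

// Hypothetical clamp expression that receives its limits after construction.
template <class SrcType>
struct Clamped
{
    SrcType src;
    int lo = 0;
    int hi = 0;

    // Accepts exactly the arguments that were handed to the builder.
    void applyRuntimeArgs (int newLo, int newHi) { lo = newLo; hi = newHi; }

    std::size_t size() const { return src.size(); }
    int operator[] (std::size_t i) const { return std::clamp (src[i], lo, hi); }
};

// A builder that stores copies of the runtime arguments.
template <template <class> class ExpressionTemplate, class... RuntimeArgs>
struct ChainBuilderWithArgs
{
    std::tuple<RuntimeArgs...> args;

    // Creates the expression first, then applies the stored arguments to it.
    template <class SrcType>
    auto operator<< (const SrcType& src) const
    {
        ExpressionTemplate<SrcType> expression { src };
        std::apply ([&expression] (const auto&... a) { expression.applyRuntimeArgs (a...); }, args);
        return expression;
    }
};

// The free function exposed instead of a constexpr builder instance.
inline auto clampTo (int lo, int hi)
{
    assert (lo <= hi);
    return ChainBuilderWithArgs<Clamped, int, int> { std::make_tuple (lo, hi) };
}
```

In this sketch, `clampTo (-1, 6) << v` creates a `Clamped` expression whose limits are set by `applyRuntimeArgs`. The same builder object can be applied to several sources.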
The VCTR_COMMON_UNARY_EXPRESSION_MEMBERS macro generates an iterateOverRuntimeArgChain function, which automatically takes care of calling applyRuntimeArgs in case it exists.
The values are passed by copy since one expression chain builder instance could create multiple expression instances. The arguments should therefore be cheap to copy if possible. Otherwise, there are no limits to the type and number of arguments.
An alternative way to pass values to an expression is to use compile-time constants. To do so, you can add further std::integral_constant-like template arguments to the expression:
The corresponding expression chain builder is then a variable template:
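As a rough sketch of this pattern outside of the library, the following example passes a factor as a std::integral_constant type and exposes the builder as a variable template. `ScaledBy`, `ConstantChainBuilder`, `Source` and `scaledBy` are hypothetical names, and the extent template parameter is omitted for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical source type standing in for a vector.
struct Source
{
    std::array<int, 3> values {};

    constexpr std::size_t size() const { return values.size(); }
    constexpr int operator[] (std::size_t i) const { return values[i]; }
};

// Hypothetical expression that multiplies by a compile-time constant.
// Factor is expected to be a std::integral_constant-like type.
template <class SrcType, class Factor>
struct ScaledBy
{
    SrcType src;

    constexpr std::size_t size() const { return src.size(); }
    constexpr int operator[] (std::size_t i) const { return src[i] * Factor::value; }
};

// Stateless builder that carries the constant in its type.
template <class Factor>
struct ConstantChainBuilder
{
    template <class SrcType>
    constexpr auto operator<< (const SrcType& src) const
    {
        return ScaledBy<SrcType, Factor> { src };
    }
};

// Variable template: every factor gets its own stateless builder instance.
template <int factor>
inline constexpr ConstantChainBuilder<std::integral_constant<int, factor>> scaledBy {};
```

User code would then write `scaledBy<4> << v`. No runtime state is stored, since the constant is part of the expression type.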
The types of expressions described above are unary expressions, which means that they transform a single source vector or expression. Operations such as additions and multiplications need multiple operands. They should take two source types as template arguments and two constructor arguments accordingly. The VCTR_COMMON_BINARY_VEC_VEC_EXPRESSION_MEMBERS and VCTR_COMMON_BINARY_SINGLE_VEC_EXPRESSION_MEMBERS macros should be chosen to generate the boilerplate code in this case. Those expressions are exposed via free functions or operator overloads. This is what e.g. the vector addition implementation looks like:
Note that we have to take care that sizes match when combining multiple sources through an expression. This can be done by the assertCommonSize helper function, which performs a compile-time check in case both expressions have a compile-time defined extent or a run-time check in case one of them or both have a dynamic extent.
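The following sketch shows a binary expression with such a size check in plain C++. It is not the library's implementation: `checkCommonSize`, `Added`, `add`, `FixedSource`, `DynamicSource` and `dynamicExtent` are hypothetical names, and the helper only imitates the behaviour described for assertCommonSize.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Marker for a size that is only known at run time, similar to std::dynamic_extent.
inline constexpr std::size_t dynamicExtent = std::numeric_limits<std::size_t>::max();

// Hypothetical source with a size known at compile time.
struct FixedSource
{
    static constexpr std::size_t extent = 3;
    std::array<int, 3> values {};
    std::size_t size() const { return values.size(); }
    int operator[] (std::size_t i) const { return values[i]; }
};

// Hypothetical source with a size known only at run time.
struct DynamicSource
{
    static constexpr std::size_t extent = dynamicExtent;
    std::vector<int> values;
    std::size_t size() const { return values.size(); }
    int operator[] (std::size_t i) const { return values[i]; }
};

// Compile-time check if both extents are static, run-time check otherwise.
template <class A, class B>
void checkCommonSize (const A& a, const B& b)
{
    if constexpr (A::extent != dynamicExtent && B::extent != dynamicExtent)
    {
        static_assert (A::extent == B::extent, "source sizes must match");
    }
    else
    {
        assert (a.size() == b.size());
    }
}

// Hypothetical binary expression: two source types, two constructor arguments.
template <class SrcA, class SrcB>
struct Added
{
    Added (const SrcA& a, const SrcB& b) : srcA (a), srcB (b) { checkCommonSize (a, b); }

    std::size_t size() const { return srcA.size(); }
    int operator[] (std::size_t i) const { return srcA[i] + srcB[i]; }

    SrcA srcA;
    SrcB srcB;
};

// Binary expressions are exposed through a free function (or an operator overload).
template <class SrcA, class SrcB>
auto add (const SrcA& a, const SrcB& b)
{
    return Added<SrcA, SrcB> (a, b);
}
```

Combining two `FixedSource` operands with different extents would fail to compile here, while a mismatch that involves a `DynamicSource` is only caught by the assertion at run time.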
Besides the per-element evaluation through operator[], the expression can implement various other ways to evaluate the expression:
- AVXRegister<value_type> getAVX (size_t i) const for suitable types. Make sure to require at least archX64 && has::getAVX<SrcType>.
- SSERegister<value_type> getSSE (size_t i) const for suitable types. Make sure to require at least archX64 && has::getSSE<SrcType>.
- NeonRegister<value_type> getNeon (size_t i) const for suitable types. Make sure to require at least archARM && has::getNeon<SrcType>.
Intel architecture specific implementations have to be prefixed with the VCTR_TARGET (<arch>) macro to instruct the compiler to deliberately generate instructions for that instruction set, no matter what compiler flags are set. The calling side performs a runtime check to find out whether the corresponding functions are available. Valid values for <arch> are `"avx"`, `"avx2"` and `"sse4.1"`.
In case implementations are only available or make sense for specific constraints, you can constrain them and possibly add multiple implementations for different types, e.g. like this:
Here we see some explicit overloads for floating point values, int32 values and unsigned integers. For int64, there is no straightforward abs function in SSE 4.1, so we don't implement it. Calling abs on an int64 vector might then fall back to the default operator[] implementation. When assigning the expression, the implementation tries its best to choose the most promising strategy for the given architecture it runs on, so it's a good idea to implement multiple possibilities.
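The mechanism of constrained overloads with an element-wise fallback can be sketched without any SIMD intrinsics. In this illustration, a hypothetical four-lane `Block` stands in for a SIMD register, `getBlock` stands in for functions like getSSE, and std::enable_if_t is used in place of concepts so that the example also compiles as C++17. `Abs`, `HasGetBlock` and `evaluate` are hypothetical names and do not mirror the library's implementation.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical four-lane "register"; a real implementation would wrap SIMD intrinsics.
template <class T>
using Block = std::array<T, 4>;

template <class T>
struct Abs
{
    using value_type = T;

    const std::vector<T>& src;

    std::size_t size() const { return src.size(); }

    // Default element-wise evaluation, available for every type.
    T operator[] (std::size_t i) const { return src[i] < T (0) ? T (-src[i]) : src[i]; }

    // Block-wise overload, only available for floating point values.
    template <class U = T, std::enable_if_t<std::is_floating_point_v<U>, int> = 0>
    Block<T> getBlock (std::size_t i) const
    {
        Block<T> block {};
        for (std::size_t lane = 0; lane < 4; ++lane)
            block[lane] = std::fabs (src[i + lane]);
        return block;
    }

    // Block-wise overload, only available for 32 bit signed integers.
    template <class U = T, std::enable_if_t<std::is_same_v<U, std::int32_t>, int> = 0>
    Block<T> getBlock (std::size_t i) const
    {
        Block<T> block {};
        for (std::size_t lane = 0; lane < 4; ++lane)
            block[lane] = std::abs (src[i + lane]);
        return block;
    }
};

// Detects whether an expression offers a usable getBlock overload.
template <class E, class = void>
struct HasGetBlock : std::false_type {};

template <class E>
struct HasGetBlock<E, std::void_t<decltype (std::declval<const E&>().getBlock (std::size_t {}))>> : std::true_type {};

// Chooses the block-wise strategy where available, then handles the rest per element.
template <class Expression>
std::vector<typename Expression::value_type> evaluate (const Expression& e)
{
    std::vector<typename Expression::value_type> result (e.size());
    std::size_t i = 0;

    if constexpr (HasGetBlock<Expression>::value)
    {
        for (; i + 4 <= e.size(); i += 4)
        {
            const auto block = e.getBlock (i);
            for (std::size_t lane = 0; lane < 4; ++lane)
                result[i + lane] = block[lane];
        }
    }

    for (; i < e.size(); ++i)
        result[i] = e[i];

    return result;
}
```

For `Abs<std::int64_t>`, no `getBlock` overload is viable, so `evaluate` silently uses `operator[]` for all elements.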
In many cases, highly optimized vector operation libraries like Intel IPP or Apple Accelerate outperform our handwritten SIMD code. We can also use them to execute our expressions. To do so, we need to implement evalNextVectorOpInExpressionChain. Let's have a look at the Abs template again to see how it's used:
evalNextVectorOpInExpressionChain takes a destination memory location as argument and returns a source memory location. The returned pointer is the memory that the next expression should read from. The argument is the memory that we write our expression result to. When we assign an expression to a Vector, it will pass the vector's storage as argument to evalNextVectorOpInExpressionChain. This way we are able to write the expression result directly into the destination memory. A usual expression template should return the destination memory, making chained expressions work in place on the destination memory. VctrBase is the only class that returns its data pointer from evalNextVectorOpInExpressionChain, so if the source is a vector, the first expression will perform an out-of-place operation from the source memory into the destination memory.
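The pointer hand-over described above can be sketched as follows. The function name is taken from the text, but its exact signature is an assumption. `VectorView`, `Abs`, `Squared`, `vectorAbs` and `vectorSquare` are hypothetical, with the two free functions standing in for routines of a vector operation library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for functions of a vector operation library.
inline void vectorAbs (const float* src, float* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = std::fabs (src[i]);
}

inline void vectorSquare (const float* src, float* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * src[i];
}

// Hypothetical leaf of the chain: a view of vector memory.
struct VectorView
{
    const float* data;
    std::size_t numElements;

    std::size_t size() const { return numElements; }

    // A vector ignores the destination and hands out its own memory.
    const float* evalNextVectorOpInExpressionChain (float*) const { return data; }
};

template <class SrcType>
struct Abs
{
    SrcType src;

    std::size_t size() const { return src.size(); }

    const float* evalNextVectorOpInExpressionChain (float* dst) const
    {
        assert (dst != nullptr);
        // Evaluate everything upstream first, then read from wherever it ended up.
        const float* in = src.evalNextVectorOpInExpressionChain (dst);
        vectorAbs (in, dst, size());
        return dst; // the next expression reads from the destination memory
    }
};

template <class SrcType>
struct Squared
{
    SrcType src;

    std::size_t size() const { return src.size(); }

    const float* evalNextVectorOpInExpressionChain (float* dst) const
    {
        assert (dst != nullptr);
        const float* in = src.evalNextVectorOpInExpressionChain (dst);
        vectorSquare (in, dst, size()); // in == dst here when src is an expression
        return dst;
    }
};
```

For a chain `Squared<Abs<VectorView>>`, `Abs` works out of place from the source memory into the destination, and `Squared` then works in place on the destination. The source memory stays untouched.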
As we write to the destination memory directly, there can be cases where we need the destination vector as a source vector while evaluating the expression. Take this one as an example: `a = a + vctr::abs (b);`

With the implementation strategy described above, the template would first call `evalNextVectorOpInExpressionChain` on the `abs (b)` expression, take the memory of `b` as source and write the result into the memory of `a`. Then it would perform the addition between the result of that computation and `a`. But wait, at that point, we have already replaced the value of `a` with the result of `abs (b)`. We have an aliasing problem here. To overcome this problem, we need `isNotAliased`. It takes the destination memory as argument and returns false in case it detects a case of aliasing. As aliasing is only a problem of binary expressions, unary expressions should simply forward the `isNotAliased` call as a `constexpr` function. Binary expressions should do a check like this:
isNotAliased is also implemented in the VCTR_COMMON_..._EXPRESSION_MEMBERS macros, so you should not need to take care of that yourself.
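To make the aliasing rule more tangible, here is a sketch of one plausible check in plain C++. It is an assumption about how such a check could look, not the code generated by the macros. `VectorView`, `Abs`, `Added` and the `isExpression` flag are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical leaf: a view of vector memory.
struct VectorView
{
    static constexpr bool isExpression = false;

    const float* data;
    std::size_t numElements;

    // A single vector can safely be read and overwritten in place (assumption).
    bool isNotAliased (const void*) const { return true; }
};

// Unary expressions simply forward the question to their source.
template <class SrcType>
struct Abs
{
    static constexpr bool isExpression = true;

    SrcType src;

    bool isNotAliased (const void* dst) const { return src.isNotAliased (dst); }
};

// Binary expressions have to look at both operands.
template <class SrcA, class SrcB>
struct Added
{
    static constexpr bool isExpression = true;

    SrcA srcA;
    SrcB srcB;

    bool isNotAliased (const void* dst) const
    {
        // The expression operand is written to dst first, so the vector
        // operand must not live in that same memory.
        if constexpr (! SrcA::isExpression && SrcB::isExpression)
            return srcA.data != dst && srcB.isNotAliased (dst);
        else if constexpr (SrcA::isExpression && ! SrcB::isExpression)
            return srcB.data != dst && srcA.isNotAliased (dst);
        else
            return srcA.isNotAliased (dst) && srcB.isNotAliased (dst);
    }
};
```

For `a = a + abs (b)`, the check returns false because the vector operand `a` shares its memory with the destination. For `c = a + abs (b)`, it returns true.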
Evaluating a binary expression with two expressions as sources can never work with our evalNextVectorOpInExpressionChain implementation strategy, because it would need an intermediate buffer, which we want to avoid at all costs. Binary expressions should therefore always be constrained by the is::suitableForBinaryEvalVectorOp concept.
While the expressions discussed above are used to transform a source vector into a destination vector, reduction expressions reduce a source vector into a single reduction result value. Examples are a vector sum or finding the maximum value in a vector. They use the same class template signature and also define an ExpressionChainBuilder instance, but their member functions look different. Furthermore, an expression chain terminated by a reduction expression is not evaluated lazily when assigned to a destination but is evaluated right away and returns the computed single value.
The most basic reduction expression template implementation must look like this:
The calling code will create a std::array<value_type, 1> variable which is initialised to reductionResultInitValue. The single array value is then passed to reduceElementWise, which is called in a loop over all source elements. After all elements have been processed, the array is passed to finalizeReduction, which is expected to perform final computations. While in this case the array will only hold a single element, it can hold multiple sub results in other scenarios discussed below.
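The calling sequence described above can be sketched like this. The member names are taken from the text, while their exact signatures, the `Max` reduction and the `reduce` driver function are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reduction that finds the maximum value.
template <class T>
struct Max
{
    // The value the result starts with before any element has been seen.
    static constexpr T reductionResultInitValue = std::numeric_limits<T>::lowest();

    // Called once per source element; updates the running result in place.
    static void reduceElementWise (T& result, T element) { result = std::max (result, element); }

    // Called once at the end; with a single sub result there is nothing left to combine.
    static T finalizeReduction (const std::array<T, 1>& results) { return results[0]; }
};

// Sketch of what the calling code does with a reduction.
template <template <class> class Reduction, class T>
T reduce (const std::vector<T>& src)
{
    std::array<T, 1> result { Reduction<T>::reductionResultInitValue };

    for (const T& element : src)
        Reduction<T>::reduceElementWise (result[0], element);

    return Reduction<T>::finalizeReduction (result);
}
```

Reducing an empty source returns the initial value in this sketch, which is why the initial value has to be chosen to match the reduction.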
When platform specific vector operations should be used, the required signature is
This usually requires that the source type supplies direct access to the data() pointer. It is likely not suitable when chained expressions are used, as there is no scratch buffer to write the previous expression results to. It is better to constrain your implementation accordingly, e.g. using the vctr::has::data concept.
When SIMD operations should be used, the required signature is
They basically work the same as the element-wise implementations, but they evaluate a whole SIMD register at a time. Possible residual elements are then evaluated using a scalar loop in the calling code. This leads to a SIMD register and a single scalar value as sub-results, which are passed to finalizeReduction for a final reduction step.
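The split into a register-wise main loop, a scalar residual loop and a final combination step can be sketched without real SIMD intrinsics. In this illustration, a hypothetical four-lane `Register` array stands in for a SIMD register. `Sum`, `reduceRegister` and `reduceSum` are hypothetical names, and the signatures are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical four-lane "register"; a real implementation would use SIMD types.
using Register = std::array<float, 4>;

struct Sum
{
    static constexpr float reductionResultInitValue = 0.0f;

    // Scalar step, used for the residual elements.
    static void reduceElementWise (float& result, float element) { result += element; }

    // Register step: accumulates four lanes at a time.
    static void reduceRegister (Register& result, const Register& elements)
    {
        for (std::size_t lane = 0; lane < result.size(); ++lane)
            result[lane] += elements[lane];
    }

    // Combines the register sub-result and the scalar sub-result.
    static float finalizeReduction (const Register& simdResult, float scalarResult)
    {
        float sum = scalarResult;
        for (float lane : simdResult)
            sum += lane;
        return sum;
    }
};

// Sketch of the calling code: main loop per register, residual loop per element.
inline float reduceSum (const std::vector<float>& src)
{
    Register simdResult;
    simdResult.fill (Sum::reductionResultInitValue);
    float scalarResult = Sum::reductionResultInitValue;

    std::size_t i = 0;
    for (; i + 4 <= src.size(); i += 4)
    {
        const Register block { src[i], src[i + 1], src[i + 2], src[i + 3] };
        Sum::reduceRegister (simdResult, block);
    }

    for (; i < src.size(); ++i)
        Sum::reduceElementWise (scalarResult, src[i]);

    return Sum::finalizeReduction (simdResult, scalarResult);
}
```

For a source with ten elements, the first eight are processed in two register steps and the last two in the scalar loop, before both sub-results are combined.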