VCTR
An expression template contains one or more possibilities to compute a certain (mathematical) expression. The most basic expression template implementation for a unary expression – that is, an expression which transforms exactly one source vector into one destination vector – looks like this:
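A sketch of such a minimal implementation, with <tt>someComputation</tt> standing in for the actual per-element operation and with the exact set of required members simplified:
@icode {C++}
template <size_t extent, class SrcType>
class MyExpression : ExpressionTemplateBase
{
public:
    using value_type = ValueType<SrcType>;

    constexpr MyExpression (SrcType s) : src (std::move (s)) {}

    constexpr size_t size() const { return src.size(); }

    constexpr bool isNotAliased (const void* dst) const { return src.isNotAliased (dst); }

    VCTR_FORCEDINLINE constexpr value_type operator[] (size_t i) const
    {
        return someComputation (src[i]);
    }

private:
    SrcType src;
};
@endicode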
The only exception to the minimum requirements outlined above are reduction expressions which are discussed separately below.
Every kind of expression has to take at least two template arguments: a <tt>size_t extent</tt> argument, describing the compile-time extent of the source, and a <tt>SrcType</tt> argument, describing the type of the source vector or expression it transforms.
Since all the member functions described above need to be implemented for every expression, there is a macro that replaces all the repetitive boilerplate code. With that, the example above looks like this:
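A sketch of how this could look, again with <tt>someComputation</tt> standing in for the actual per-element operation:
@icode {C++}
template <size_t extent, class SrcType>
class MyExpression : ExpressionTemplateBase
{
public:
    using value_type = ValueType<SrcType>;

    VCTR_COMMON_UNARY_EXPRESSION_MEMBERS (MyExpression, src)

    VCTR_FORCEDINLINE constexpr value_type operator[] (size_t i) const
    {
        return someComputation (src[i]);
    }
};
@endicode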
This should be the basic starting point for every expression template.
In order to use the operator<< chaining syntax, a unary expression has to be wrapped into an ExpressionChainBuilder instance. An expression chain builder holds all information about how to build a chain of expressions without actually building one right away. It supplies overloads for operator<< that return different objects, based on what they are called on.
Pure <tt>ExpressionChainBuilder</tt> instances have no members and should be constexpr. Simply declare one like this at the end of your implementation:
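For an expression class named <tt>MyExpression</tt>, this could look like:
@icode {C++}
constexpr inline ExpressionChainBuilder<MyExpression> myExpression;
@endicode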
Some expressions need runtime values to work, for example the clamp expression, which needs to know the limit values. Since the actual expression instance does not yet exist while the expression chain builder is used to set up an expression chain, the generalised <tt>ExpressionChainBuilderWithRuntimeArgs</tt> class allows storing one or more runtime arguments, which are applied to each expression once the expression chain is set up. Expressions that need runtime arguments don't expose a constexpr <tt>ExpressionChainBuilder</tt> instance but a free function that takes the arguments and returns an expression chain builder instance that stores them. This is done with the <tt>makeExpressionChainBuilderWithRuntimeArgs</tt> function:
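A sketch along the lines of the clamp expression mentioned above; the exact signature is an assumption here:
@icode {C++}
template <class T>
constexpr auto clamp (T lowerBound, T upperBound)
{
    return makeExpressionChainBuilderWithRuntimeArgs<Clamp> (lowerBound, upperBound);
}
@endicode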
When the expression instance is created, the chain builder has to apply the argument to the expression. Therefore, we need to add an applyRuntimeArgs member function to the expression template, which accepts the same number of arguments as passed to the expression chain builder instance and applies them to the actual expression:
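For the clamp example, this could be sketched like this, assuming hypothetical <tt>lowerBound</tt> and <tt>upperBound</tt> members that hold the limit values:
@icode {C++}
VCTR_FORCEDINLINE constexpr void applyRuntimeArgs (value_type newLowerBound, value_type newUpperBound)
{
    lowerBound = newLowerBound;
    upperBound = newUpperBound;
}
@endicode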
The <tt>VCTR_COMMON_UNARY_EXPRESSION_MEMBERS</tt> macro automatically takes care of calling that function from within the automatically generated <tt>iterateOverRuntimeArgChain</tt> function in case it exists.
The values are passed by copy since one expression chain builder instance could create multiple expression instances. Therefore, the arguments should be cheap to copy if possible. Otherwise, there are no limits on the type and number of arguments.
An alternative approach to passing values to an expression is compile-time constants. To do so, you can add further <tt>std::integral_constant</tt>-like template arguments to the expression.
The corresponding expression chain builder is then a variable template:
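A sketch, using a hypothetical expression that multiplies every element by a compile-time constant; it assumes that <tt>ExpressionChainBuilder</tt> accepts such constants as additional template arguments:
@icode {C++}
template <size_t extent, class SrcType, class ConstantType>
class MultiplyByConstant : ExpressionTemplateBase
{
public:
    using value_type = ValueType<SrcType>;

    VCTR_COMMON_UNARY_EXPRESSION_MEMBERS (MultiplyByConstant, src)

    VCTR_FORCEDINLINE constexpr value_type operator[] (size_t i) const
    {
        return src[i] * ConstantType::value;
    }
};

// The variable template chain builder. Instantiating it as e.g.
// multiplyByConstant<4> yields a chain builder that instantiates the
// expression above with std::integral_constant<int, 4>.
template <auto constant>
constexpr inline ExpressionChainBuilder<MultiplyByConstant, std::integral_constant<decltype (constant), constant>> multiplyByConstant;
@endicode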
The expressions described above are unary expressions, which means that they transform a single source vector or expression. Operations like additions and multiplications need multiple operands. Such expressions should take two source types as template arguments and two constructor arguments accordingly. The <tt>VCTR_COMMON_BINARY_VEC_VEC_EXPRESSION_MEMBERS</tt> and <tt>VCTR_COMMON_BINARY_SINGLE_VEC_EXPRESSION_MEMBERS</tt> macros should be chosen to generate the boilerplate code in this case. Those expressions are exposed via free functions or operator overloads. This is how e.g. the vector addition implementation looks:
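A sketch of how this could look; the constraints on the operands and the <tt>commonExtent</tt> helper that computes the shared compile-time extent are hypothetical simplifications here:
@icode {C++}
template <size_t extent, class SrcAType, class SrcBType>
class Add : ExpressionTemplateBase
{
public:
    using value_type = std::common_type_t<ValueType<SrcAType>, ValueType<SrcBType>>;

    VCTR_COMMON_BINARY_VEC_VEC_EXPRESSION_MEMBERS (Add, srcA, srcB)

    VCTR_FORCEDINLINE constexpr value_type operator[] (size_t i) const
    {
        return srcA[i] + srcB[i];
    }
};

// commonExtent is a hypothetical helper that computes the compile-time
// extent shared by both sources.
template <class SrcAType, class SrcBType>
constexpr auto operator+ (SrcAType&& a, SrcBType&& b)
{
    assertCommonSize (a, b);
    return Add<commonExtent<SrcAType, SrcBType>, SrcAType, SrcBType> (std::forward<SrcAType> (a), std::forward<SrcBType> (b));
}
@endicode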
Note that we have to take care that sizes match when combining multiple sources through an expression. This can be done with the <tt>assertCommonSize</tt> helper function, which performs a compile-time check in case both sources have a compile-time defined extent, or a run-time check in case one or both of them have a dynamic extent.
Besides the per-element evaluation through <tt>operator[]</tt>, an expression can implement various other ways to evaluate itself:

- <tt>AVXRegister\<value_type\> getAVX (size_t i) const</tt> for suitable types. Make sure to require at least <tt>archX64 && has::getAVX\<SrcType\></tt>.
- <tt>SSERegister\<value_type\> getSSE (size_t i) const</tt> for suitable types. Make sure to require at least <tt>archX64 && has::getSSE\<SrcType\></tt>.
- <tt>NeonRegister\<value_type\> getNeon (size_t i) const</tt> for suitable types. Make sure to require at least <tt>archARM && has::getNeon\<SrcType\></tt>.

Intel architecture specific implementations have to be prefixed with the <tt>VCTR_TARGET (\<arch\>)</tt> macro to instruct the compiler to deliberately generate instructions for that instruction set, no matter what compiler flags are set. The calling side will do a runtime check whether the corresponding instruction set is available. Valid values for <tt>\<arch\></tt> are <tt>"avx"</tt>, <tt>"fma"</tt>, <tt>"avx2"</tt> and <tt>"sse4.1"</tt>.
In case implementations are only available or make sense under specific constraints, you can constrain them and possibly add multiple implementations for different types, e.g. like this:
@icode {C++}
VCTR_FORCEDINLINE VCTR_TARGET ("sse4.1") SSERegister<value_type> getSSE (size_t i) const
requires (archX64 && has::getSSE<SrcType> && is::realFloatNumber<SrcElementType>)
{
static const auto sseSignBit = SSESrcType::broadcast (SrcElementType (-0.0));
return SSERetType::bitwiseAndNot (src.getSSE (i), sseSignBit);
}
VCTR_FORCEDINLINE VCTR_TARGET ("sse4.1") SSERegister<value_type> getSSE (size_t i) const
requires (archX64 && has::getSSE<SrcType> && std::same_as<int32_t, value_type>)
{
return SSERetType::abs (src.getSSE (i));
}
VCTR_FORCEDINLINE VCTR_TARGET ("sse4.1") SSERegister<value_type> getSSE (size_t i) const
requires (archX64 && has::getSSE<SrcType> && is::unsignedIntNumber<value_type>)
{
return src.getSSE (i); // unsigned integers are always positive
}
@endicode
Here we see some explicit overloads for floating point values, int32 values and unsigned integers. For int64, there is no straightforward abs function in SSE 4.1, so we don't implement it. Calling abs on an int64 vector might then fall back to the default <tt>operator[]</tt> implementation. When assigning the expression, the implementation tries its best to choose the most promising strategy for the architecture it runs on, so it's a good idea to implement multiple possibilities.
Some SIMD-based evaluations can gain performance by storing constants to a SIMD register once before looping over the registers. These temporary registers are managed as private <tt>mutable</tt> member variables in the expression class. They are mutable since expressions are usually passed as const reference to the destination container that evaluates them, and they are used as a temporary working buffer only. Since multithreaded expression evaluation is not supported, this is safe. To avoid declaring individual variables per register type that will never be used simultaneously, we can use the <tt>SIMDRegisterUnion</tt> union template, which contains a Neon, an AVX and an SSE register.
To initialize the values in the register before the SIMD evaluation starts, the expression has to expose a
<tt>prepare\<arch\>Evaluation</tt> function with <tt>\<arch\></tt> being one of <tt>Neon</tt>, <tt>AVX</tt> and <tt>SSE</tt> for every evaluation function
that it implements. Even if the expression does not make use of that feature, it has to forward those functions to
the source expressions. To avoid a lot of boilerplate code, the
<tt>VCTR_FORWARD_PREPARE_SIMD_EVALUATION_UNARY_EXPRESSION_MEMBER_FUNCTIONS</tt> and
<tt>VCTR_FORWARD_PREPARE_SIMD_EVALUATION_BINARY_EXPRESSION_MEMBER_FUNCTIONS</tt> macros can be used. A manual implementation
can look like this, using as an example an expression that adds a single scalar value to the source:
@icode {C++}
public:
//...
// AVX Implementation
VCTR_FORCEDINLINE VCTR_TARGET ("avx") void prepareAVXEvaluation() const
requires has::prepareAVXEvaluation<SrcType>
{
src.prepareAVXEvaluation();
singleSIMD.avx = Expression::AVX::broadcast (single);
}
VCTR_FORCEDINLINE VCTR_TARGET ("fma") AVXRegister<value_type> getAVX (size_t i) const
requires (archX64 && has::getAVX<SrcType> && Expression::allElementTypesSame && Expression::CommonElement::isRealFloat)
{
return Expression::AVX::add (singleSIMD.avx, src.getAVX (i));
}
VCTR_FORCEDINLINE VCTR_TARGET ("avx2") AVXRegister<value_type> getAVX (size_t i) const
requires (archX64 && has::getAVX<SrcType> && Expression::allElementTypesSame && Expression::CommonElement::isInt)
{
return Expression::AVX::add (singleSIMD.avx, src.getAVX (i));
}
// SSE Implementation
VCTR_FORCEDINLINE VCTR_TARGET ("sse4.1") void prepareSSEEvaluation() const
requires has::prepareSSEEvaluation<SrcType>
{
src.prepareSSEEvaluation();
singleSIMD.sse = Expression::SSE::broadcast (single);
}
VCTR_FORCEDINLINE VCTR_TARGET ("sse4.1") SSERegister<value_type> getSSE (size_t i) const
requires (archX64 && has::getSSE<SrcType> && Expression::allElementTypesSame)
{
return Expression::SSE::add (singleSIMD.sse, src.getSSE (i));
}
private:
mutable SIMDRegisterUnion<Expression> singleSIMD {};
@endicode
Note: Don't forget the default initialization braces for your <tt>SIMDRegisterUnion</tt> member(s), otherwise the expression
class won't work in a constexpr context.
@section autotoc_md4 Platform Specific Vector Operations
In many cases, highly optimized vector operation libraries like Intel IPP or Apple Accelerate outperform our
handwritten SIMD code. We can also use them to execute our expressions. To do so, we need to implement
<tt>evalNextVectorOpInExpressionChain</tt>. Let's have a look at the <tt>Abs</tt> template again to see how it's used:
@icode {C++}
VCTR_FORCEDINLINE const value_type* evalNextVectorOpInExpressionChain (value_type* dst) const
requires (platformApple && has::evalNextVectorOpInExpressionChain<SrcType, value_type> && is::realFloatNumber<value_type>)
{
AccelerateRetType::abs (src.evalNextVectorOpInExpressionChain (dst), dst, int (size()));
return dst;
}
@endicode
<tt>evalNextVectorOpInExpressionChain</tt> takes a destination memory location as argument and returns a source memory
location. The returned pointer is the memory that the next expression should read from. The argument is the memory that
we write our expression result to. When we assign an expression to a Vector, it will pass the vector's storage as argument to <tt>evalNextVectorOpInExpressionChain</tt>. This way, we are able to write the expression result directly into the destination memory. A usual expression template should return the destination memory, making chained expressions work in
place on the destination memory. <tt>VctrBase</tt> is the only class that returns its <tt>data</tt> pointer from
<tt>evalNextVectorOpInExpressionChain</tt>, so if the source is a vector, the first expression will perform an out-of-place
operation from the source memory into the destination memory.
As we write to the destination memory directly, there can be cases where we need the destination vector as a source
vector while evaluating the expression. Take this one as an example:
@icode {C++}
a = a + vctr::abs (b);
@endicode
With the implementation strategy described above, the template would first call <tt>evalNextVectorOpInExpressionChain</tt> on the <tt>abs (b)</tt> expression, take the memory of <tt>b</tt> as source and write the result into the memory of <tt>a</tt>. Then it would perform the addition between the result of that computation and <tt>a</tt>. But wait, at that point, we already replaced the value of <tt>a</tt> with the result of <tt>abs (b)</tt>. We have an aliasing problem here. To overcome this problem, we need <tt>isNotAliased</tt>. It takes the destination memory as argument and returns false in case it detects a case of aliasing. As aliasing is only a problem of binary expressions, unary expressions should simply forward the <tt>isNotAliased</tt> call as a <tt>constexpr</tt> function. Binary expressions should do a check like this:
@icode {C++}
constexpr bool isNotAliased (const void* dst) const
{
if constexpr (is::expression<SrcAType> && is::anyVctr<SrcBType>)
{
return dst != srcB.data();
}
if constexpr (is::anyVctr<SrcAType> && is::expression<SrcBType>)
{
return dst != srcA.data();
}
return true;
}
@endicode
<tt>isNotAliased</tt> is also implemented in the <tt>VCTR_COMMON_..._EXPRESSION_MEMBERS</tt> macros, so you should not need to take
care of that yourself.
Evaluating a binary expression with two expressions as sources can never work with our <tt>evalNextVectorOpInExpressionChain</tt> implementation strategy, as it would need an intermediate buffer, which we want to avoid at all costs. Therefore, binary expressions should always be constrained by the <tt>is::suitableForBinaryEvalVectorOp</tt> concept.
@section autotoc_md5 Reduction expressions
While the expressions discussed above are used to transform a source vector into a destination vector, reduction expressions reduce a source vector into a single reduction result value. Examples are a vector sum or finding the maximum value in a vector. They use the same class template signature and also define an <tt>ExpressionChainBuilder</tt> instance, but their member functions look different. Furthermore, an expression chain terminated by a reduction expression is not evaluated lazily when assigned to a destination but is evaluated right away and returns the computed single value.
The most basic reduction expression template implementation looks like this:
@icode {C++}
template <size_t extent, class SrcType>
class MyExpression : ExpressionTemplateBase
{
public:
using value_type = ValueType<SrcType>;
VCTR_COMMON_UNARY_EXPRESSION_MEMBERS (MyExpression, src)
static constexpr value_type reductionResultInitValue = std::numeric_limits<value_type>::min();
VCTR_FORCEDINLINE constexpr void reduceElementWise (value_type& result, size_t i) const
{
result = someComputation (result, src[i]);
}
template <size_t n>
VCTR_FORCEDINLINE static constexpr value_type finalizeReduction (const std::array<value_type, n>& subResults)
{
return finalComputationOnSubResults (subResults);
}
};
@endicode
The calling code will create a <tt>std::array\<value_type, 1\></tt> variable which is initialised to <tt>reductionResultInitValue</tt>.
The single array value is then passed to <tt>reduceElementWise</tt> which is called in a loop over all source elements. After
all elements have been processed, the array is passed to <tt>finalizeReduction</tt> which is expected to perform final
computations. While in this case the array will only hold a single element, it can hold multiple sub results in other
scenarios discussed below.
When platform specific vector operations should be used, the required signature is
@icode {c++}
value_type reduceVectorOp() const
@endicode
This usually requires that the source type supplies direct access to its <tt>data()</tt> pointer, and it is likely not suitable when chained expressions are used, as there is no scratch buffer to write the previous expression results to. It is better to constrain your implementation accordingly, using e.g. the <tt>vctr::has::data</tt> concept.
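A sketch for a maximum-finding reduction on Apple platforms; the <tt>AccelerateRetType::max</tt> helper is assumed for illustration and may be spelled differently in the actual wrapper classes:
@icode {C++}
VCTR_FORCEDINLINE value_type reduceVectorOp() const
requires (platformApple && has::data<SrcType> && is::realFloatNumber<value_type>)
{
    return AccelerateRetType::max (src.data(), int (size()));
}
@endicode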
When SIMD operations should be used, the required signatures are
@icode {c++}
VCTR_FORCEDINLINE void reduceNeonRegisterWise (NeonRegister<value_type>& result, size_t i) const;
VCTR_FORCEDINLINE VCTR_TARGET ("fma") void reduceAVXRegisterWise (AVXRegister<value_type>& result, size_t i) const;
VCTR_FORCEDINLINE VCTR_TARGET ("avx2") void reduceAVXRegisterWise (AVXRegister<value_type>& result, size_t i) const;
VCTR_FORCEDINLINE VCTR_TARGET ("sse4.1") void reduceSSERegisterWise (SSERegister<value_type>& result, size_t i) const;
@endicode
They basically work the same as the element-wise implementations, but they evaluate a whole SIMD register at a time. Possible residual elements are then evaluated using a scalar loop in the calling code. This leads to a SIMD register and a single scalar value as sub-results, which are passed to <tt>finalizeReduction</tt> for a last final reduction step.