.. _dsgn:
Design
======
.. _dsgn_explain_deps:
Dependencies
------------
*Feinsum*'s design is closely tied with :mod:`loopy` and OpenCL [OpenCL_2011]_.
Loopy provides a language for specifying code-transformations based on the
Polyhedral model and OpenCL-specification to support various platforms (CPUs,
GPUs, DSPs, etc.) at the same time keeping *feinsum*'s code maintainable by
avoiding separate code-paths for CUDA, ISPC, etc. with virtually no performance
gains. (see `POCL `_)
.. _dsgn_batched_einsum_defn:
What is a “Batched Einstein Summation”?
---------------------------------------
Batched Einstein Summations (einsum) represents a computational primitive to
serve as an abstraction between a *performance engineer* and a *compiler*. The
compiler could lower the program as a composition of certain computational
primitives and enjoy the benefits of code-transformations on those primitives
prescribed by a performance engineers. Some examples of such computational
primitives for similar purposes are: `BLAS `_,
`LAPACK `_, `DNN
`_, etc.
**Definition**: A Batched Einstein Summation (:math:`\mathcal{B}_m^k`) describes a computation
with :math:`k` multi-dimensional arrays :math:`A_1, A_2 \ldots, A_k` as the
inputs and :math:`m`-multi-dimensional arrays :math:`R_1, R_2, \ldots, R_m` as
the outputs. The outputs are the results of an Einstein summation operation
with :math:`n` operands. It is defined by a 4-tuple :math:`(\mathcal{E},
\mathcal{A}_{m \times n}, \mathcal{T}, \mathcal{S})`:
#. :math:`\mathcal{E}` is the indexing expression used in defining an Einstein summation
with :math:`n` multi-dimensional arrays as the operands. We borrow the
grammar of :math:`\mathcal{E}` from ``numpy.einsum``. Each :math:`R_1,
\ldots R_m` is a result of an E -einstein summation with :math:`n` operands.
#. :math:`\mathcal{A}_{m \times n}` is the argument matrix where each
:math:`\mathcal{A}_{i j} \subseteq \{ A_{1,} \ldots, A_k \}` is a set of
array names. The :math:`j`-th operand of the einstein summation computing
:math:`R_i` is the term :math:`\prod_{a \in \mathcal{A}_{i j}}
a [\mathcal{E}_{j, 1}, \mathcal{E}_{j, 2}, \ldots]` where,
:math:`\mathcal{E}_{j, 1}, \mathcal{E}_{j, 2}, \ldots` are the indices for the
:math:`j`-th operand of :math:`\mathcal{E}`.
#. :math:`\mathcal{T}` is a mapping from the input arrays to their
numeric data types in the computation.
#. :math:`\mathcal{S}` is a mapping from the input arrays to their
shapes. A shape is a tuple of non-negative integers or :math:`\infty`.
:math:`\infty` is a symbol for a non-negative integer which is much greater
than the cache-size of the target hardware.
No two arrays in the set :math:`\{ A_1, A_{2,} \ldots ., A_k, R_1, R_2, \ldots
., R_m \}` alias each other.
In :mod:`feinsum`, we use a slightly modified definition of *Batched-einsum* to
be able to be able to apply the transformation knowledge to a wider-class of
expressions without a noticeable increase in complexity on the performance
engineer's behalf. The transformations are recorded on
:mod:`loopy` kernels such that the :math:`j`-th operand of the einsum computing
:math:`R_i` is of the form :math:`\rho_{ij}(a_1[\mathcal{E}_{j, 1},
\mathcal{E}_{j, 2}, \ldots], a_2[\mathcal{E}_{j, 1}, \mathcal{E}_{j, 2},
\ldots], \ldots)`, where, :math:`\mathcal{A}_{i j} = \{a_1, a_2, \ldots\}` and
:math:`\rho_{ij}` is a function of the form
:math:`\rho_{ij}:\mathcal{T}[a_1]\times\mathcal{T}[a_2]\times\ldots\mapsto
\mathcal{C}\left(\mathcal{T}[a_1]\times\mathcal{T}[a_2]\times\ldots\right)`,
where, :math:`\mathcal{C}` is a mapping on the numeric data-types as provided
by :func:`numpy.find_common_type`.
The transformation writer prescribes the transformation space without making
any assumptions about :math:`rho_{ij}` i.e. the prescribed transformation space
has to be only a function of :math:`(\mathcal{E}, \mathcal{A}_{m \times n},
\mathcal{T}, \mathcal{S})` and NOT :math:`\rho_{m \times n}`.
One might note that the Batched-einsum expressions are a sub-class of the
modified Batched-einsum expressions by simply using :math:`\rho_{ij}(x_1, x_2,
\ldots) = x_1\cdot x_2\cdot\ldots`. Throughout this manual we will use
Batched-einsum and modified Batched-einsum interchangeably.
.. _dsgn_loopy_grammar:
Grammar of :mod:`loopy` kernels
-------------------------------
We rely on a grammar of :mod:`loopy` that defines a Batched-einsum expression
to serve as a contract between the transformation implementer and the
optimizing compiler that would reuse the transformations. We also point out
that a grammar of loopy kernels indirectly also specifies a *schedule* as per
the code-generation implementation in :mod:`loopy`.
We first describe the grammar of a loopy kernel that corresponds to a Batched
einsum kernel with a single output. Extending this grammar to a Batched-einsum
with more than one outputs is trivial and is covered via examples.
#. The einsum expression is matched to a single instruction inside the kernel.
#. The instruction must contain at most one sum-reduction expression in its
RHS. This includes the predicates guarding the instruction's execution.
#. The multiplicative terms within the expression over which sum-reduction
is being performed are inferred as the operand expression of the einsum.
#. The loopy kernel's domain formed by the loops nesting outside the
instruction along with the reduction inames' domains must correspond to
that of a hypercube.
#. The #dimensions of the instruction's assignee array must be equal to
the loop nest dimension surrounding the instruction.
#. The instruction must write to an array (in the global address space)
such that all the iname dependencies of the instruction must appear
in the indexing expression of the assignee array of the instruction.
#. Any substitution invocation in the instruction's RHS is considered as
an argument of the einsum expression.
#. Similar to the constraints on indexing into the assignee array the
arguments to the substitution invocations must be just inames (either
reduction inames or the inames corresponding to the instruction's loop
nest).
#. The einsum is constructed as:
- The indexing expression :math:`\mathcal{E}`'s output indices are obtained
by reading the assignee array's indexing inames.
- The indexing expression :math:`\mathcal{E}`'s input operands' indices are obtained
by gathering the substitution rule invocation's arguments.
- The numeric data-type of a substitution rule is inferred by calling
:func:`loopy.infer_unknown_types` on the substitution rule's expression.
- The shapes of the input operands are inferred from the loopy kernel's domains.
With these rules we can infer an einsum expression from a :mod:`loopy` kernel.
Inferring a batched einsum expression is simply applying the above rules to
a collection of instructions in a loopy kernel. We rely on a canonicalization
routine (see :func:`~feinsum.canonicalization.canonicalize_einsum`) to ensure that the parsing
for the above grammar of expressions is deterministic.
We note that :mod:`feinsum` does not impose any constraints on the substitution
rule's expression. We leave that upon the compiler to ensure that the rule's RHS
has memory access pattern close to that of a multi-dimensional array with similar
stride patterns.
.. _dsgn_why_perf_engg:
Why keep a performance engineer in the loop?
--------------------------------------------
As of this writing, limited solutions are available that unify the heuristics
needed to generate roofline-performing code for a single Einstein summation.
Approaches that rely on an auto-tuning phase have been proposed for generating
optimized device codes for a sub-class of Einstein-summations and for
a particular architecture, for eg. Cogent [Kim_2019]_ generates optimized GPU
kernels for tensor-contractions.
Hence, *feinsum* provides abstractions to develop (and implement)
code-transformations for certain sub-classes of Batched-einsums.