.. _omatadd_batch:

omatadd_batch
==============

Computes a group of out-of-place scaled matrix additions using general
matrices.


Description
***********

The ``omatadd_batch`` routines perform a series of out-of-place scaled
matrix additions. They are similar to the ``omatadd`` routines, but
operate on a group of matrices rather than a single set of operands.

The matrices are always in a strided format for this API. The
operation is defined as:

.. code-block::

   for i = 0 ... batch_size - 1
       A is a matrix at offset i * stride_a in a
       B is a matrix at offset i * stride_b in b
       C is a matrix at offset i * stride_c in c
       C = alpha * op(A) + beta * op(B)
   end for

where:

- ``op(X)`` is one of ``op(X) = X``, ``op(X) = X'``, or
  ``op(X) = conjg(X')``
- ``alpha`` and ``beta`` are scalars
- ``A``, ``B``, and ``C`` are matrices, with ``op(A)``, ``op(B)``, and
  ``C`` each having ``m`` rows and ``n`` columns
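
The loop above can be sketched as a plain C++ reference routine. This is
only an illustration of the arithmetic, not the library implementation. It
assumes column major storage and real element types, so the conjugate
transpose coincides with the plain transpose. The names ``Op``, ``op_elem``,
and ``omatadd_batch_ref`` are hypothetical and are not part of the API.

.. code-block:: cpp

   #include <cassert>
   #include <cstdint>

   // Hypothetical stand-in for the library's transpose enum (real types only).
   enum class Op { NoTrans, Trans };

   // Element (r, c) of op(X), where X is stored in column major layout
   // with leading dimension ld.
   template <typename T>
   T op_elem(const T *x, std::int64_t ld, Op op, std::int64_t r, std::int64_t c) {
       return op == Op::NoTrans ? x[r + c * ld] : x[c + r * ld];
   }

   // Reference loop: C_i = alpha * op(A_i) + beta * op(B_i) for each i.
   template <typename T>
   void omatadd_batch_ref(Op transa, Op transb, std::int64_t m, std::int64_t n,
                          T alpha, const T *a, std::int64_t lda, std::int64_t stride_a,
                          T beta, const T *b, std::int64_t ldb, std::int64_t stride_b,
                          T *c, std::int64_t ldc, std::int64_t stride_c,
                          std::int64_t batch_size) {
       assert(m >= 0 && n >= 0 && ldc >= m);
       for (std::int64_t i = 0; i < batch_size; ++i) {
           const T *ai = a + i * stride_a;  // start of matrix A_i
           const T *bi = b + i * stride_b;  // start of matrix B_i
           T *ci = c + i * stride_c;        // start of matrix C_i
           for (std::int64_t col = 0; col < n; ++col)
               for (std::int64_t row = 0; row < m; ++row)
                   ci[row + col * ldc] = alpha * op_elem(ai, lda, transa, row, col)
                                       + beta * op_elem(bi, ldb, transb, row, col);
       }
   }

A routine of this kind can serve as a slow but simple cross-check for results
on small inputs.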

The API is available with USM pointers or buffer arguments for the
input and output arrays.

The input buffers or arrays ``a`` and ``b`` contain all the input
matrices, and the single output buffer or array ``c`` contains all the
output matrices. The locations of the individual matrices within the
buffer or array are given by stride lengths, while the number of
matrices is given by the ``batch_size`` parameter.
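
As an illustration of this addressing scheme, the following sketch computes
the position of a single element inside the flat buffer or array, assuming
column major storage. The helper name ``batch_index`` is hypothetical and is
not part of the API.

.. code-block:: cpp

   #include <cassert>
   #include <cstdint>

   // Index of element (row, col) of matrix i in a strided batch stored in
   // column major layout with leading dimension ld and batch stride `stride`.
   inline std::int64_t batch_index(std::int64_t i, std::int64_t row, std::int64_t col,
                                   std::int64_t ld, std::int64_t stride) {
       assert(i >= 0 && row >= 0 && col >= 0 && row < ld);
       return i * stride + row + col * ld;
   }

For row major storage the roles of ``row`` and ``col`` swap, so the index
becomes ``i * stride + row * ld + col``.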

API
***

Syntax
------

USM arrays:

.. code-block::

   event omatadd_batch(queue &queue,
       transpose transa,
       transpose transb,
       std::int64_t m,
       std::int64_t n,
       T alpha,
       const T *a,
       std::int64_t lda,
       std::int64_t stride_a,
       T beta,
       T *b,
       std::int64_t ldb,
       std::int64_t stride_b,
       T *c,
       std::int64_t ldc,
       std::int64_t stride_c,
       std::int64_t batch_size,
       const std::vector<event> &dependencies = {});

Buffer arrays:

.. code-block::

   void omatadd_batch(queue &queue, transpose transa,
                      transpose transb,
                      std::int64_t m, std::int64_t n,
                      T alpha, cl::sycl::buffer<T, 1> &a,
                      std::int64_t lda, std::int64_t stride_a,
                      T beta, cl::sycl::buffer<T, 1> &b,
                      std::int64_t ldb, std::int64_t stride_b,
                      cl::sycl::buffer<T, 1> &c, std::int64_t ldc,
                      std::int64_t stride_c,
                      std::int64_t batch_size);

``omatadd_batch`` supports the following precisions and devices:

.. list-table::
   :header-rows: 1

   * -  T
     -  Devices Supported
   * -  ``float``
     -  Host, CPU, and GPU
   * -  ``double``
     -  Host, CPU, and GPU
   * -  ``std::complex<float>``
     -  Host, CPU, and GPU
   * -  ``std::complex<double>``
     -  Host, CPU, and GPU

Input Parameters
----------------

transa
   Specifies ``op(A)``, the transposition operation applied to the
   matrices A.

transb
   Specifies ``op(B)``, the transposition operation applied to the
   matrices B.

m
   Number of rows for the result matrix C. Must be at least zero.

n
   Number of columns for the result matrix C. Must be at least zero.

alpha
   Scaling factor for the matrices A.

a
   Buffer or array holding the input matrices A. Must have size at least
   ``stride_a*batch_size``.

lda
   Leading dimension of the A matrices. If matrices are stored using
   column major layout, ``lda`` must be at least ``m`` if A is not
   transposed or at least ``n`` if A is transposed. If matrices are stored
   using row major layout, ``lda`` must be at least ``n`` if A is not
   transposed or at least ``m`` if A is transposed. Must be positive.

stride_a
   Stride between the different A matrices. If matrices are stored using
   column major layout, ``stride_a`` must be at least ``lda*n`` if A is not
   transposed or at least ``lda*m`` if A is transposed. If matrices are
   stored using row major layout, ``stride_a`` must be at least ``lda*m``
   if A is not transposed or at least ``lda*n`` if A is transposed.

beta
   Scaling factor for the matrices B.

b
   Buffer or array holding the input matrices B. Must have size at least
   ``stride_b*batch_size``.

ldb
   Leading dimension of the B matrices. If matrices are stored using
   column major layout, ``ldb`` must be at least ``m`` if B is not
   transposed or at least ``n`` if B is transposed. If matrices are stored
   using row major layout, ``ldb`` must be at least ``n`` if B is not
   transposed or at least ``m`` if B is transposed. Must be positive.

stride_b
   Stride between the different B matrices. If matrices are stored using
   column major layout, ``stride_b`` must be at least ``ldb*n`` if B is not
   transposed or at least ``ldb*m`` if B is transposed. If matrices are
   stored using row major layout, ``stride_b`` must be at least ``ldb*m``
   if B is not transposed or at least ``ldb*n`` if B is transposed.

ldc
   Leading dimension of the C matrices. If matrices are stored using
   column major layout, ``ldc`` must be at least ``m``. If matrices are
   stored using row major layout, ``ldc`` must be at least ``n``. Must be
   positive.

stride_c
   Stride between the different C matrices. If matrices are stored using
   column major layout, ``stride_c`` must be at least ``ldc*n``. If matrices
   are stored using row major layout, ``stride_c`` must be at least
   ``ldc*m``.

batch_size
   Number of matrix additions to perform, that is, the number of matrices
   stored in each of ``a``, ``b``, and ``c``.

dependencies
   List of events to wait for before starting computation, if any.
   If omitted, defaults to no dependencies.
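
The leading dimension and stride rules above can be collected into a small
helper, sketched below. The ``Layout`` enum and the function names
``min_ld`` and ``min_stride`` are hypothetical and exist only for this
illustration. Here ``transposed`` stands for either the transpose or the
conjugate transpose setting; for the C matrices, pass ``false``.

.. code-block:: cpp

   #include <cassert>
   #include <cstdint>

   // Hypothetical layout selector used only in this sketch.
   enum class Layout { ColMajor, RowMajor };

   // Smallest valid leading dimension for an operand X where op(X) is m-by-n.
   inline std::int64_t min_ld(Layout layout, bool transposed,
                              std::int64_t m, std::int64_t n) {
       // X is stored as m-by-n when not transposed, n-by-m when transposed.
       const std::int64_t rows = transposed ? n : m;
       const std::int64_t cols = transposed ? m : n;
       const std::int64_t ld = (layout == Layout::ColMajor) ? rows : cols;
       return ld > 0 ? ld : 1;  // the leading dimension must be positive
   }

   // Smallest valid stride between consecutive matrices for a given ld.
   inline std::int64_t min_stride(Layout layout, bool transposed, std::int64_t ld,
                                  std::int64_t m, std::int64_t n) {
       assert(ld >= min_ld(layout, transposed, m, n));
       const std::int64_t rows = transposed ? n : m;
       const std::int64_t cols = transposed ? m : n;
       // Column major stores `cols` columns of length ld;
       // row major stores `rows` rows of length ld.
       return (layout == Layout::ColMajor) ? ld * cols : ld * rows;
   }

Multiplying the chosen stride by ``batch_size`` then gives the minimum number
of elements the corresponding buffer or array must hold.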


Output Parameters
-----------------

c
   Output buffer or array, overwritten by the results of the ``batch_size``
   matrix additions of the form ``alpha*op(A) + beta*op(B)``. Must have
   size at least ``stride_c*batch_size``.