.. _omatcopy_batch:

omatcopy_batch
==============

Computes a group of out-of-place scaled matrix transpose or copy operations
using general matrices.


.. contents::
    :local:
    :depth: 1

Description
***********

The ``omatcopy_batch`` routines perform a series of out-of-place scaled
matrix copies or transpositions. They are similar to the ``omatcopy``
routines, but the ``omatcopy_batch`` routines perform matrix operations with
a group of matrices.

The operation for the strided API is defined as:

.. code-block::

   for i = 0 … batch_size - 1
       A and B are matrices at offset i * stride_a in a and i * stride_b in b
       B = alpha * op(A)
   end for

The operation for the group API is defined as:

.. code-block::

   idx = 0
   for i = 0 … group_count - 1
       trans, m, n, alpha, lda, ldb and group_size at position i in their respective arrays
       for j = 0 … group_size - 1
           A and B are matrices at position idx in their respective arrays
           B = alpha * op(A)
           idx := idx + 1
       end for
   end for

where:

- ``op(X)`` is one of ``op(X) = X``, ``op(X) = X'``, or
  ``op(X) = conjg(X')``
- ``alpha`` is a scalar
- A and B are matrices

The strided API is available with USM pointers or buffer arguments for the
input and output arrays, while the group API is available only with USM
pointers.

For the strided API, the single input buffer or array contains all the input
matrices, and the single output buffer or array contains all the output
matrices. The locations of the individual matrices within the buffer or
array are given by stride lengths, while the number of matrices is given by
the ``batch_size`` parameter.
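
The strided addressing described above can be sketched as a plain host-side loop. This is an illustrative reference, not the library implementation: the ``Op`` enum and the ``omatcopy_batch_ref`` name are hypothetical, column major storage is assumed, and the conjugate transpose case is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the library's transpose enum (conjugate omitted).
enum class Op { NoTrans, Trans };

// Reference loop for the strided semantics, column major storage:
// element (r, c) of a matrix with leading dimension ld lives at r + c * ld.
template <typename T>
void omatcopy_batch_ref(Op trans, std::int64_t m, std::int64_t n, T alpha,
                        const T *a, std::int64_t lda, std::int64_t stride_a,
                        T *b, std::int64_t ldb, std::int64_t stride_b,
                        std::int64_t batch_size) {
    assert(lda >= m && ldb >= (trans == Op::NoTrans ? m : n));
    for (std::int64_t i = 0; i < batch_size; ++i) {
        const T *A = a + i * stride_a;  // i-th input matrix
        T *B = b + i * stride_b;        // i-th output matrix
        for (std::int64_t c = 0; c < n; ++c) {
            for (std::int64_t r = 0; r < m; ++r) {
                const T v = alpha * A[r + c * lda];
                if (trans == Op::NoTrans)
                    B[r + c * ldb] = v;  // B(r, c) = alpha * A(r, c)
                else
                    B[c + r * ldb] = v;  // B(c, r) = alpha * A(r, c)
            }
        }
    }
}
```

For example, two 2-by-3 matrices packed back to back with ``lda = 2`` and ``stride_a = 6`` transpose into two 3-by-2 matrices with ``ldb = 3`` and ``stride_b = 6``.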

For the group API, the matrices are given by arrays of pointers. A and B
represent matrices stored at the addresses pointed to by ``a_array`` and
``b_array``, respectively. The number of entries in ``a_array`` and
``b_array`` is ``total_batch_count``, the sum of all the ``group_size``
entries.
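
The relationship between the per-group parameter arrays and the flat pointer arrays can be sketched as follows. This is a minimal host-side illustration under stated assumptions, not the library implementation: ``Op``, ``total_batch_count`` (as a function) and ``omatcopy_batch_group_ref`` are hypothetical names, storage is assumed to be column major, and the conjugate transpose case is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the library's transpose enum (conjugate omitted).
enum class Op { NoTrans, Trans };

// total_batch_count is the sum of all group_size entries.
inline std::int64_t total_batch_count(const std::vector<std::int64_t> &group_size) {
    return std::accumulate(group_size.begin(), group_size.end(), std::int64_t{0});
}

// Reference loop for the group semantics, column major storage.
template <typename T>
void omatcopy_batch_group_ref(const Op *trans_array,
                              const std::int64_t *m_array,
                              const std::int64_t *n_array,
                              const T *alpha_array, const T **a_array,
                              const std::int64_t *lda_array, T **b_array,
                              const std::int64_t *ldb_array,
                              std::int64_t group_count,
                              const std::int64_t *group_size) {
    assert(group_count >= 0);
    std::int64_t idx = 0;  // running index into a_array and b_array
    for (std::int64_t i = 0; i < group_count; ++i) {
        // Parameters shared by every matrix in group i.
        const std::int64_t m = m_array[i], n = n_array[i];
        const std::int64_t lda = lda_array[i], ldb = ldb_array[i];
        for (std::int64_t j = 0; j < group_size[i]; ++j, ++idx) {
            const T *A = a_array[idx];
            T *B = b_array[idx];
            for (std::int64_t c = 0; c < n; ++c) {
                for (std::int64_t r = 0; r < m; ++r) {
                    const T v = alpha_array[i] * A[r + c * lda];
                    if (trans_array[i] == Op::NoTrans)
                        B[r + c * ldb] = v;
                    else
                        B[c + r * ldb] = v;
                }
            }
        }
    }
}
```

Note that the per-group arrays are indexed by the group number ``i``, while the pointer arrays are indexed by the running counter ``idx``, which never resets between groups.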

API
***

Syntax
------

**Strided API**

USM arrays:

.. code-block::

   event omatcopy_batch(queue &queue,
       transpose trans,
       std::int64_t m,
       std::int64_t n,
       T alpha,
       const T *a,
       std::int64_t lda,
       std::int64_t stride_a,
       T *b,
       std::int64_t ldb,
       std::int64_t stride_b,
       std::int64_t batch_size,
       const vector_class<event> &dependencies = {});

Buffer arrays:

.. code-block::

   void omatcopy_batch(queue &queue, transpose trans,
                       std::int64_t m, std::int64_t n,
                       T alpha, cl::sycl::buffer<T, 1> &a,
                       std::int64_t lda, std::int64_t stride_a,
                       cl::sycl::buffer<T, 1> &b, std::int64_t ldb,
                       std::int64_t stride_b, std::int64_t batch_size);

**Group API**

.. code-block::

   event omatcopy_batch(queue &queue, const transpose *trans_array,
                        const std::int64_t *m_array,
                        const std::int64_t *n_array,
                        const T *alpha_array, const T **a_array,
                        const std::int64_t *lda_array, T **b_array,
                        const std::int64_t *ldb_array,
                        std::int64_t group_count,
                        const std::int64_t *group_size,
                        const vector_class<event> &dependencies = {});

``omatcopy_batch`` supports the following precisions and devices:

.. list-table::
   :header-rows: 1

   * -  T
     -  Devices Supported
   * -  ``float``
     -  Host, CPU, and GPU
   * -  ``double``
     -  Host, CPU, and GPU
   * -  ``std::complex<float>``
     -  Host, CPU, and GPU
   * -  ``std::complex<double>``
     -  Host, CPU, and GPU

Input Parameters
----------------

**Strided API**

trans
   Specifies ``op(A)``, the transposition operation applied to the
   matrices A.

m
   Number of rows for each matrix A. Must be at least zero.

n
   Number of columns for each matrix A. Must be at least zero.

alpha
   Scaling factor for the matrix transposition or copy.

a
   Buffer or array holding the input matrices A. Must have size at least
   ``stride_a*batch_size``.

lda
   Leading dimension of the A matrices. If matrices are stored using
   column major layout, ``lda`` must be at least ``m``. If matrices are
   stored using row major layout, ``lda`` must be at least ``n``. Must be
   positive.

stride_a
   Stride between the different A matrices. If matrices are stored using
   column major layout, ``stride_a`` must be at least ``lda*n``. If matrices
   are stored using row major layout, ``stride_a`` must be at least
   ``lda*m``.

b
   Buffer or array holding the output matrices B. Must have size at least
   ``stride_b*batch_size``.

ldb
   Leading dimension of the B matrices. If matrices are stored using
   column major layout, ``ldb`` must be at least ``m`` if B is not
   transposed or ``n`` if B is transposed. If matrices are stored using
   row major layout, ``ldb`` must be at least ``n`` if B is not transposed
   or at least ``m`` if B is transposed. Must be positive.

stride_b
   Stride between the different B matrices. If matrices are stored using
   column major layout, ``stride_b`` must be at least ``ldb*n`` if B is not
   transposed or at least ``ldb*m`` if B is transposed. If matrices are
   stored using row major layout, ``stride_b`` must be at least ``ldb*m``
   if B is not transposed or at least ``ldb*n`` if B is transposed.

batch_size
   Specifies the number of matrices to transpose or copy.

dependencies
   List of events to wait for before starting computation, if any.
   If omitted, defaults to no dependencies.
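
The leading-dimension and stride constraints listed above for the strided API can be collected into a few helper functions. The sketch below is illustrative only: the ``Layout`` and ``Op`` enums and all function names are hypothetical and not part of the library, and conjugate transpose is treated like ``Trans`` because it changes the shape of B in the same way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical enums used only for this illustration.
enum class Layout { ColMajor, RowMajor };
enum class Op { NoTrans, Trans };

// Smallest valid lda for an m-by-n matrix A (must also be positive).
inline std::int64_t min_lda(Layout layout, std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);
    return std::max<std::int64_t>(1, layout == Layout::ColMajor ? m : n);
}

// Smallest valid ldb; B is m-by-n without transposition, n-by-m with it.
inline std::int64_t min_ldb(Layout layout, Op trans, std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);
    const std::int64_t rows = (trans == Op::NoTrans) ? m : n;
    const std::int64_t cols = (trans == Op::NoTrans) ? n : m;
    return std::max<std::int64_t>(1, layout == Layout::ColMajor ? rows : cols);
}

// Smallest valid stride_a for the chosen lda.
inline std::int64_t min_stride_a(Layout layout, std::int64_t lda,
                                 std::int64_t m, std::int64_t n) {
    return lda * (layout == Layout::ColMajor ? n : m);
}

// Smallest valid stride_b for the chosen ldb.
inline std::int64_t min_stride_b(Layout layout, Op trans, std::int64_t ldb,
                                 std::int64_t m, std::int64_t n) {
    const std::int64_t rows = (trans == Op::NoTrans) ? m : n;
    const std::int64_t cols = (trans == Op::NoTrans) ? n : m;
    return ldb * (layout == Layout::ColMajor ? cols : rows);
}
```

The minimum strides depend on the leading dimension actually passed, not on its minimum: padding ``lda`` or ``ldb`` raises the corresponding stride bound accordingly.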

**Group API**

trans_array
   Array of size ``group_count``. Each element ``i`` in the array specifies
   ``op(A)``, the transposition operation applied to the matrices A in
   group ``i``.

m_array
   Array of size ``group_count``. The element ``m_array[i]`` is the number
   of rows of the A matrices in group ``i``. Each must be at least zero.

n_array
   Array of size ``group_count``. The element ``n_array[i]`` is the number
   of columns of the A matrices in group ``i``. Each must be at least zero.

alpha_array
   Array of size ``group_count`` containing scaling factors for the
   operation.

a_array
   Array of size ``total_batch_count`` of pointers to A matrices. If
   matrices are stored in column major layout, the array allocated for each
   A matrix of the group ``i`` must be of size at least
   ``lda_array[i] * n_array[i]``. If matrices are stored in row major
   layout, the array allocated for each A matrix of the group ``i`` must be
   of size at least ``lda_array[i]*m_array[i]``.

lda_array
   Array of size ``group_count`` of leading dimension of the A matrices. If
   matrices are stored using column major layout, ``lda_array[i]`` must be
   at least ``m_array[i]``. If matrices are stored using row major layout,
   ``lda_array[i]`` must be at least ``n_array[i]``. Each must be positive.

b_array
   Array of size ``total_batch_count`` of pointers used to store B matrices.
   If matrices are stored using column major layout, the array allocated
   for each B matrix of the group ``i`` must be of size at least
   ``ldb_array[i] * n_array[i]`` if B is not transposed or
   ``ldb_array[i]*m_array[i]`` if B is transposed. If matrices are stored
   using row major layout, the array allocated for each B matrix of the
   group ``i`` must be of size at least ``ldb_array[i] * m_array[i]`` if B
   is not transposed or ``ldb_array[i]*n_array[i]`` if B is transposed.

ldb_array
   Array of size ``group_count`` of leading dimension of the B matrices. If
   matrices are stored using column major layout, ``ldb_array[i]`` must be
   at least ``m_array[i]`` if B is not transposed or at least ``n_array[i]``
   if B is transposed. If matrices are stored using row major layout,
   ``ldb_array[i]`` must be at least ``n_array[i]`` if B is not transposed
   or at least ``m_array[i]`` if B is transposed. Each must be positive.

group_count
   Number of groups. Must be at least 0.

group_size
   Array of size ``group_count``. The element ``group_size[i]`` is the number
   of matrices in the group ``i``. Each element in ``group_size`` must be
   at least 0.

dependencies
   List of events to wait for before starting computation, if any. If
   omitted, defaults to no dependencies.
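
Before calling the group API, it can be useful to check the per-group constraints listed above on the host. The following is a minimal sketch under stated assumptions: ``check_group_args`` and the ``Op`` enum are hypothetical and not part of the library, and only the column major constraints are checked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the library's transpose enum (conjugate omitted).
enum class Op { NoTrans, Trans };

// Returns total_batch_count if every group satisfies the column major
// constraints, or -1 if any constraint is violated.
inline std::int64_t check_group_args(const Op *trans_array,
                                     const std::int64_t *m_array,
                                     const std::int64_t *n_array,
                                     const std::int64_t *lda_array,
                                     const std::int64_t *ldb_array,
                                     std::int64_t group_count,
                                     const std::int64_t *group_size) {
    if (group_count < 0) return -1;
    assert(group_count == 0 || group_size != nullptr);
    std::int64_t total = 0;
    for (std::int64_t i = 0; i < group_count; ++i) {
        const std::int64_t m = m_array[i], n = n_array[i];
        if (m < 0 || n < 0 || group_size[i] < 0) return -1;
        // B has m rows without transposition and n rows with it.
        const std::int64_t b_rows = (trans_array[i] == Op::NoTrans) ? m : n;
        if (lda_array[i] < 1 || lda_array[i] < m) return -1;
        if (ldb_array[i] < 1 || ldb_array[i] < b_rows) return -1;
        total += group_size[i];
    }
    return total;
}
```

The returned total is the number of pointers that ``a_array`` and ``b_array`` must each hold.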

Output Parameters
-----------------

**Strided API**

b
   Output buffer or array, overwritten by ``batch_size`` matrix transpose or copy
   operations of the form ``alpha*op(A)``.

**Group API**

b_array
   Output array of pointers to B matrices, overwritten by
   ``total_batch_count`` matrix transpose or copy operations of the form
   ``alpha*op(A)``.