.. _onemkl_blas_gemm_batch:

gemm_batch
==========

Computes groups of matrix-matrix product with general matrices.

Description
***********

The ``gemm_batch`` routines are batched versions of :ref:`onemkl_blas_gemm`, 
performing multiple ``gemm`` operations in a single call. Each ``gemm`` operation performs 
a matrix-matrix product with general matrices.

``gemm_batch`` supports the following precisions:

.. list-table::
   :header-rows: 1

   * -  T
   * -  ``sycl::half``
   * -  ``float``
   * -  ``double``
   * -  ``std::complex<float>``
   * -  ``std::complex<double>``


gemm_batch (Buffer Version)
---------------------------
Buffer version of ``gemm_batch`` supports only strided API.

**Strided API**
---------------

Strided API operation is defined as:

.. code-block::

      for i = 0 … batch_size – 1
          A, B and C are matrices at offset i * stridea, i * strideb, i * stridec in a, b and c.
          C = alpha * op(A) * op(B) + beta * C
      end for

where:

- op(``X``) is one of op(``X``) = ``X``, or op(``X``) = ``X``\ :sup:`T`, or op(``X``) = ``X``\ :sup:`H`

- ``alpha`` and ``beta`` are scalars

- ``A``, ``B``, and ``C`` are matrices

- op(``A``) is ``m`` x ``k``, op(``B``) is ``k`` x ``n``, and ``C`` is ``m`` x ``n``

For strided API, ``a``, ``b`` and ``c`` buffers contains all the input matrices. The stride between matrices is given by the stride parameters. Total number of matrices in ``a``, ``b`` and ``c`` buffers is given by ``batch_size`` parameter.

Syntax
------

.. code-block:: cpp

   namespace oneapi::mkl::blas::column_major {
       void gemm_batch(sycl::queue &queue,
                       oneapi::mkl::transpose transa,
                       oneapi::mkl::transpose transb,
                       std::int64_t m,
                       std::int64_t n,
                       std::int64_t k,
                       T alpha,
                       sycl::buffer<T,1> &a,
                       std::int64_t lda,
                       std::int64_t stridea,
                       sycl::buffer<T,1> &b,
                       std::int64_t ldb,
                       std::int64_t strideb,
                       T beta,
                       sycl::buffer<T,1> &c,
                       std::int64_t ldc,
                       std::int64_t stridec,
                       std::int64_t batch_size)
   }

.. code-block:: cpp

   namespace oneapi::mkl::blas::row_major {
       void gemm_batch(sycl::queue &queue,
                       oneapi::mkl::transpose transa,
                       oneapi::mkl::transpose transb,
                       std::int64_t m,
                       std::int64_t n,
                       std::int64_t k,
                       T alpha,
                       sycl::buffer<T,1> &a,
                       std::int64_t lda,
                       std::int64_t stridea,
                       sycl::buffer<T,1> &b,
                       std::int64_t ldb,
                       std::int64_t strideb,
                       T beta,
                       sycl::buffer<T,1> &c,
                       std::int64_t ldc,
                       std::int64_t stridec,
                       std::int64_t batch_size)
   }


Input Parameters
----------------

queue
   The queue where the routine should be executed.

transa
   Specifies op(``A``), transposition operation applied to matrices ``A``. See :ref:`data-types` for more details.

transb
   Specifies op(``B``), transposition operation applied to matrices ``B``. See :ref:`data-types` for more details.

m
   Number of rows of matrices op(``A``) and matrices ``C``. Must be at least zero.

n
   Number of columns of matrices op(``B``) and matrices ``C``. Must be at least zero.

k
   Number of columns of matrices op(``A``) and rows of matrices op(``B``). Must be at least zero.

alpha
   Scaling factor for matrix-matrix products.

a
   Buffer holding input matrices ``A``. Size of the buffer must be at least ``stridea`` * ``batch_size``.

lda
   Leading dimension of matrices ``A``. Must be positive. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``m``
        - Must be at least ``k``
      * - Row major
        - Must be at least ``k``
        - Must be at least ``m``

stridea
   Stride between two consecutive ``A`` matrices. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``lda`` * ``k``
        - Must be at least ``lda`` * ``m``
      * - Row major
        - Must be at least ``lda`` * ``m``
        - Must be at least ``lda`` * ``k``

b
   Buffer holding input matrices ``B``. Size of the buffer must be at least ``strideb`` * ``batch_size``.


ldb
   Leading dimension of matrices ``B``. Must be positive. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``k``
        - Must be at least ``n``
      * - Row major
        - Must be at least ``n``
        - Must be at least ``k``

strideb
   Stride between two consecutive ``B`` matrices. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``ldb`` * ``n``
        - Must be at least ``ldb`` * ``k``
      * - Row major
        - Must be at least ``ldb`` * ``k``
        - Must be at least ``ldb`` * ``n``

beta
   Scaling factor for matrices ``C``.

c
   Buffer holding input/output matrices ``C``. Size of the buffer must be at least ``stridec`` * ``batch_size``.

ldc
   Leading dimension of matrices ``C``. Must be positive. 

   .. list-table::
  
      * - Column major
        - Must be at least ``m``
      * - Row major
        - Must be at least ``n``

stridec
   Stride between two consecutive ``C`` matrices. 

   .. list-table::
  
      * - Column major
        - Must be at least ``ldc`` * ``n``
      * - Row major
        - Must be at least ``ldc`` * ``m``

batch_size
   Specifies the number of matrix multiply operations to perform. Must be at least zero. 

Output Parameters
-----------------

c
   Output buffer overwritten by ``batch_size`` ``gemm`` operations of the form ``alpha`` * op(``A``) * op(``B``) + ``beta`` * ``C``.

.. note::
   If ``beta`` = 0, matrices ``C`` do not need to be initialized before calling ``gemm_batch``.


gemm_batch (USM Version)
------------------------

USM version of ``gemm_batch`` supports group API and strided API.

**Group API**
-------------

Group API operation is defined as:

.. code-block::

      idx = 0
      for i = 0 … group_count – 1
          for j = 0 … group_size – 1
              A, B, and C are matrices in a[idx], b[idx] and c[idx]
              C = alpha[i] * op(A) * op(B) + beta[i] * C
              idx = idx + 1
          end for
      end for

where:

-  op(``X``) is one of op(``X``) = ``X``, or op(``X``) = ``X``\ :sup:`T`, or op(``X``) = ``X``\ :sup:`H`

-  ``alpha`` and ``beta`` are scalars

-  ``A``, ``B``, and ``C`` are matrices

- op(``A``) is ``m`` x ``k``, op(``B``) is ``k`` x ``n``, and ``C`` is ``m`` x ``n``

For group API, ``a``, ``b`` and ``c`` arrays contain the pointers for all the input matrices. 
The total number of matrices in ``a``, ``b`` and ``c`` are given by: 

.. math::

      total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i]    

Group API supports both pointer and span inputs.

The advantage of using span instead of pointer is that the sizes of
the array can vary and the size of the span can be queried at
runtime. For each ``gemm`` parameter, except the output matrices, span
can be of size 1, the number of groups or the total batch size. For
output matrices, to ensure all computation are independent, size
of the span must be the total batch size.

Depending on the size of spans, each parameter for the ``gemm`` computation is used as follows:

.. list-table::

   * - span size = 1
     - Parameter is reused for all ``gemm`` computations
   * - span size = ``group_count``
     - Parameter is reused for all ``gemm`` within a group. Each group will have a different value for this parameter.  This is same as ``gemm_batch`` group API with pointers.
   * - span size = ``total_batch_count``
     - Each ``gemm`` computation use a different value for this parameter.

Syntax
------

.. code-block:: cpp

   namespace oneapi::mkl::blas::column_major {
       sycl::event gemm_batch(sycl::queue &queue,
                              oneapi::mkl::transpose *transa,
                              oneapi::mkl::transpose *transb,
                              std::int64_t *m,
                              std::int64_t *n,
                              std::int64_t *k,
                              T *alpha,
                              const T **a,
                              std::int64_t *lda,
                              const T **b,
                              std::int64_t *ldb,
                              T *beta,
                              T **c,
                              std::int64_t *ldc,
                              std::int64_t group_count,
                              std::int64_t *group_size,
                              const std::vector<sycl::event> &dependencies = {})

       sycl::event gemm_batch(sycl::queue &queue,
                              const sycl::span<oneapi::mkl::transpose> &transa,
                              const sycl::span<oneapi::mkl::transpose> &transb,
                              const sycl::span<std::int64_t> &m,
                              const sycl::span<std::int64_t> &n,
                              const sycl::span<std::int64_t> &k,
                              const sycl::span<std::int64_t> &alpha,
                              const sycl::span<const T*> &a,
                              const sycl::span<std::int64_t> &lda,
                              const sycl::span<const T*> &b,
                              const sycl::span<std::int64_t> &ldb,
                              const sycl::span<T> &beta,
                              sycl::span<T*> &c,
                              const sycl::span<std::int64_t> &ldc,
                              size_t group_count,
                              const sycl::span<size_t> &group_sizes,
                              const std::vector<sycl::event> &dependencies = {})
   }

.. code-block:: cpp

   namespace oneapi::mkl::blas::row_major {
       sycl::event gemm_batch(sycl::queue &queue,
                              oneapi::mkl::transpose *transa,
                              oneapi::mkl::transpose *transb,
                              std::int64_t *m,
                              std::int64_t *n,
                              std::int64_t *k,
                              T *alpha,
                              const T **a,
                              std::int64_t *lda,
                              const T **b,
                              std::int64_t *ldb,
                              T *beta,
                              T **c,
                              std::int64_t *ldc,
                              std::int64_t group_count,
                              std::int64_t *group_size,
                              const std::vector<sycl::event> &dependencies = {})

       sycl::event gemm_batch(sycl::queue &queue,
                              const sycl::span<oneapi::mkl::transpose> &transa,
                              const sycl::span<oneapi::mkl::transpose> &transb,
                              const sycl::span<std::int64_t> &m,
                              const sycl::span<std::int64_t> &n,
                              const sycl::span<std::int64_t> &k,
                              const sycl::span<std::int64_t> &alpha,
                              const sycl::span<const T*> &a,
                              const sycl::span<std::int64_t> &lda,
                              const sycl::span<const T*> &b,
                              const sycl::span<std::int64_t> &ldb,
                              const sycl::span<T> &beta,
                              sycl::span<T*> &c,
                              const sycl::span<std::int64_t> &ldc,
                              size_t group_count,
                              const sycl::span<size_t> &group_sizes,
                              const std::vector<sycl::event> &dependencies = {})
   }

Input Parameters
----------------

queue
   The queue where the routine should be executed.

transa
   Array or span of ``group_count`` ``oneapi::mkl::transpose`` values. ``transa[i]`` specifies op(``A``), the transposition operation applied to matrices ``A`` in group ``i``. See :ref:`data-types` for more details.

transb
   Array or span of ``group_count`` ``oneapi::mkl::transpose`` values. ``transb[i]`` specifies op(``B``), the transposition operation applied to matrices ``B`` in group ``i``. See :ref:`data-types` for more details.

m
   Array or span of ``group_count`` integers. ``m[i]`` specifies number of rows of matrices op(``A``) and matrices ``C`` in group ``i``. All entries must be at least zero.

n
   Array or span of ``group_count`` integers. ``n[i]`` specifies number of columns of matrices op(``B``) and matrices ``C`` in group ``i``. All entries must be at least zero.

k
   Array or span of ``group_count`` integers. ``k[i]`` specifies number of columns of matrices op(``A``) and rows of matrices op(``B``) in group ``i``. All entries must be at least zero.

alpha
   Array or span of ``group_count`` scalar elements.  ``alpha[i]`` specifies scaling factor for matrix-matrix products in group ``i``.

a
   Array of ``total_batch_count`` pointers for input matrices ``A`` or span of ``total_batch_count`` input matrices ``A``. See :ref:`matrix-storage` for more details.

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Size of array ``A[i]`` must be at least ``lda[i]`` * ``k[i]``
        - Size of array ``A[i]`` must be at least ``lda[i]`` * ``m[i]``
      * - Row major
        - Size of array ``A[i]`` must be at least ``lda[i]`` * ``m[i]``
        - Size of array ``A[i]`` must be at least ``lda[i]`` * ``k[i]``

lda
   Array or span of ``group_count`` integers. ``lda[i]`` specifies leading dimension of matrices ``A`` in group ``i``. Must be positive. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``m[i]``
        - Must be at least ``k[i]``
      * - Row major
        - Must be at least ``k[i]``
        - Must be at least ``m[i]``

b
   Array of ``total_batch_count`` pointers for input matrices ``B`` or span of ``total_batch_count`` input matrices ``B``. See :ref:`matrix-storage` for more details.

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Size of array ``B[i]`` must be at least ``ldb[i]`` * ``n[i]``
        - Size of array ``B[i]`` must be at least ``ldb[i]`` * ``k[i]``
      * - Row major
        - Size of array ``B[i]`` must be at least ``ldb[i]`` * ``k[i]``
        - Size of array ``B[i]`` must be at least ``ldb[i]`` * ``n[i]``

ldb
   Array or span of ``group_count`` integers. ``ldb[i]`` specifies leading dimension of matrices ``B`` in group ``i``. Must be positive. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``k[i]``
        - Must be at least ``n[i]``
      * - Row major
        - Must be at least ``n[i]``
        - Must be at least ``k[i]``

beta
   Array or span of ``group_count`` scalar elements.  ``beta[i]`` specifies scaling factor for matrices ``C`` in group ``i``.

c
   Array of ``total_batch_count`` pointers for input/output matrices ``C`` or span of ``total_batch_count`` input/output matrices ``C``. See :ref:`matrix-storage` for more details.

   .. list-table::
  
      * - Column major
        - Size of array ``C[i]`` must be at least ``ldc[i]`` * ``n[i]``
      * - Row major
        - Size of array ``C[i]`` must be at least ``ldc[i]`` * ``m[i]``

ldc
   Array or span of ``group_count`` integers. ``ldc[i]`` specifies leading dimension of matrices ``C`` in group ``i``. Must be positive. 

   .. list-table::
  
      * - Column major
        - Must be at least ``m[i]``
      * - Row major
        - Must be at least ``n[i]``

group_count
   Number of groups. Must be at least zero.

group_size
   Array or span of ``group_count`` integers. ``group_size[i]`` specifies the number of ``gemm`` operations in group ``i``. Each element in ``group_size`` must be at least zero.

dependencies
   List of events to wait for before starting computation, if any. If omitted, defaults to no dependencies.

Output Parameters
-----------------

c
   Array of pointers to output matrices ``C`` overwritten by ``total_batch_count`` ``gemm`` operations of the form ``alpha`` * op(``A``) * op(``B``) + ``beta`` * ``C``.

.. note::
   If ``beta`` = 0, matrices ``C`` do not need to be initialized before calling ``gemm_batch``.

Return Values
-------------

Output event to wait on to ensure computation is complete.


Examples
********

An example of how to use USM version of ``gemm_batch`` can be found in |O-MKL| installation directory, under:

.. code-block::

   examples/dpcpp/blas/source/gemm_batch.cpp


**Strided API**
---------------

Strided API operation is defined as:

.. code-block::

      for i = 0 … batch_size – 1
          A, B and C are matrices at offset i * stridea, i * strideb, i * stridec in a, b and c.
          C = alpha * op(A) * op(B) + beta * C
      end for

where:

-  op(``X``) is one of op(``X``) = ``X``, or op(``X``) = ``X``\ :sup:`T`, or op(``X``) = ``X``\ :sup:`H`

-  ``alpha`` and ``beta`` are scalars

-  ``A``, ``B``, and ``C`` are matrices

- op(``A``) is ``m`` x ``k``, op(``B``) is ``k`` x ``n``, and ``C`` is ``m`` x ``n``

For strided API, ``a``, ``b`` and ``c`` arrays contains all the input matrices. The stride between matrices is either given by the stride parameters. Total number of matrices in ``a``, ``b`` and ``c`` arrays is given by ``batch_size`` parameter.

Syntax
------

.. code-block:: cpp

   namespace oneapi::mkl::blas::column_major {
       sycl::event gemm_batch(sycl::queue &queue,
                              oneapi::mkl::transpose transa,
                              oneapi::mkl::transpose transb,
                              std::int64_t m,
                              std::int64_t n,
                              std::int64_t k,
                              T alpha,
                              const T *a,
                              std::int64_t lda,
                              std::int64_t stridea,
                              const T *b,
                              std::int64_t ldb,
                              std::int64_t strideb,
                              T beta,
                              T *c,
                              std::int64_t ldc,
                              std::int64_t stridec,
                              std::int64_t batch_size,
                              const std::vector<sycl::event> &dependencies = {})
   }

.. code-block:: cpp

   namespace oneapi::mkl::blas::row_major {
       sycl::event gemm_batch(sycl::queue &queue,
                              oneapi::mkl::transpose transa,
                              oneapi::mkl::transpose transb,
                              std::int64_t m,
                              std::int64_t n,
                              std::int64_t k,
                              T alpha,
                              const T *a,
                              std::int64_t lda,
                              std::int64_t stridea,
                              const T *b,
                              std::int64_t ldb,
                              std::int64_t strideb,
                              T beta,
                              T *c,
                              std::int64_t ldc,
                              std::int64_t stridec,
                              std::int64_t batch_size,
                              const std::vector<sycl::event> &dependencies = {})
   }

Input Parameters
----------------

queue
   The queue where the routine should be executed.

transa
   Specifies op(``A``), transposition operation applied to matrices ``A``. See :ref:`data-types` for more details.

transb
   Specifies op(``B``), transposition operation applied to matrices ``B``. See :ref:`data-types` for more details.

m
   Number of rows of matrices op(``A``) and matrices ``C``. Must be at least zero.

n
   Number of columns of matrices op(``B``) and matrices ``C``. Must be at least zero.

k
   Number of columns of matrices op(``A``) and rows of matrices op(``B``). Must be at least zero.

alpha
   Scaling factor for matrix-matrix products.

a
   Pointer to input matrices ``A``. Size of the array must be at least ``stridea`` * ``batch_size``.

lda
   Leading dimension of matrices ``A``. Must be positive. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``m``
        - Must be at least ``k``
      * - Row major
        - Must be at least ``k``
        - Must be at least ``m``

stridea
   Stride between two consecutive ``A`` matrices. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``lda`` * ``k``
        - Must be at least ``lda`` * ``m``
      * - Row major
        - Must be at least ``lda`` * ``m``
        - Must be at least ``lda`` * ``k``

b
   Pointer to input matrices ``B``. Size of the array must be at least ``strideb`` * ``batch_size``.


ldb
   Leading dimension of matrices ``B``. Must be positive. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``k``
        - Must be at least ``n``
      * - Row major
        - Must be at least ``n``
        - Must be at least ``k``

strideb
   Stride between two consecutive ``B`` matrices. 

   .. list-table::
      :header-rows: 1
  
      * -
        - ``transa`` = ``transpose::nontrans``
        - ``transa`` = ``transpose::trans`` or ``trans`` = ``transpose::conjtrans``
      * - Column major
        - Must be at least ``ldb`` * ``n``
        - Must be at least ``ldb`` * ``k``
      * - Row major
        - Must be at least ``ldb`` * ``k``
        - Must be at least ``ldb`` * ``n``

beta
   Scaling factor for matrices ``C``.

c
   Pointer to input/output matrices ``C``. Size of the array must be at least ``stridec`` * ``batch_size``.

ldc
   Leading dimension of matrices ``C``. Must be positive. 

   .. list-table::
  
      * - Column major
        - Must be at least ``m``
      * - Row major
        - Must be at least ``n``

stridec
   Stride between two consecutive ``C`` matrices. 

   .. list-table::
  
      * - Column major
        - Must be at least ``ldc`` * ``n``
      * - Row major
        - Must be at least ``ldc`` * ``m``

batch_size
   Specifies the number of matrix multiply operations to perform. Must be at least zero. 

dependencies
   List of events to wait for before starting computation, if any. If omitted, defaults to no dependencies.

Output Parameters
-----------------

c
   Pointer to output matrices ``C`` overwritten by ``batch_size`` ``gemm`` operations of the form ``alpha`` * op(``A``) * op(``B``) + ``beta`` * ``C``.

.. note::
   If ``beta`` = 0, matrices ``C`` do not need to be initialized before calling ``gemm_batch``.

Return Values
-------------

Output event to wait on to ensure computation is complete.