.. _introduction-to-the-intel-oneapi-math-kernel-library-onemkl-blas-and-lapack-with-dpcpp:

Introduction to the |IONE-MKL| BLAS and LAPACK with DPC++
=============================================================================================

This guide provides an overview of the |IONE-MKL| BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) application programming interfaces for the Data Parallel C++ (DPC++) language. It is aimed at users who have some prior experience with the standard BLAS and LAPACK APIs.

In general, the DPC++ APIs for BLAS and LAPACK are similar to standard BLAS and LAPACK, sharing the same routine names and argument orders. Unlike standard routines, DPC++ routines are designed to run asynchronously on a compute device (CPU or GPU), and typically use device memory for inputs and outputs. To support this functionality, the data types of many arguments have changed, and each routine takes an additional argument (a DPC++ queue), which specifies where the routine should be executed. There are several smaller API changes, which are detailed below.

In |O-MKL|, all DPC++ routines and associated data types belong to the ``oneapi::mkl`` namespace. CPU-based |O-MKL| routines are still available via the C interface (which uses the global namespace). Additionally, each BLAS routine is available in the ``oneapi::mkl::blas``, ``oneapi::mkl::blas::column_major``, and ``oneapi::mkl::blas::row_major`` namespaces. Column major layout is assumed by default for all BLAS functions in the ``oneapi::mkl::blas`` namespace; the functions in the ``oneapi::mkl::blas::column_major`` namespace can likewise be used when matrices are stored in column major layout. To store matrices in row major layout, the BLAS functions in the ``oneapi::mkl::blas::row_major`` namespace must be used. For example, ``oneapi::mkl::blas::gemm`` is the DPC++ routine for matrix multiplication with column major storage, while ``sgemm``, ``dgemm``, ``cgemm``, and ``zgemm`` (or their ``cblas_``-prefixed counterparts) are the traditional CPU-based versions. Currently, the LAPACK DPC++ APIs do not support matrices stored in row major layout.

Differences between Standard BLAS/LAPACK and DPC++ |O-MKL| APIs
***************************************************************

Naming (BLAS Only)
------------------

DPC++ BLAS APIs are templated on precision. For example, unlike the standard BLAS API, which has four different routines for GEMM computation named by precision (``sgemm``, ``dgemm``, ``cgemm``, and ``zgemm``), DPC++ BLAS has a single entry point for GEMM computation, named ``gemm``, accepting the ``float``, ``double``, ``half``, ``bfloat16``, ``std::complex<float>``, and ``std::complex<double>`` data types.

References
----------

All DPC++ objects (buffers and queues) are passed by reference, rather than by pointer. Other parameters are typically passed by value.

Queues
------

Every DPC++ BLAS and LAPACK routine has an extra parameter at the beginning: a DPC++ queue (type ``queue&``), to which computational tasks are submitted. A queue can be associated with the host device, a CPU device, or a GPU device. In the pre-alpha release of DPC++, the host, CPU, and GPU devices are supported for all BLAS functions. Refer to the :ref:`overview-of-intel-oneapi-math-kernel-library-onemkl-lapack-for-dpcpp` documentation for a complete list of supported LAPACK functions on host, CPU, and GPU devices.
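For illustration, a queue targeting a particular device can be constructed from one of the standard SYCL device selectors. The following is a minimal sketch; the variable names are illustrative and not part of the |O-MKL| API:

.. code-block::

   #include <CL/sycl.hpp>

   using namespace cl::sycl;

   // Construct a queue from a device selector; cpu_selector and host_selector
   // can be used in the same way to target the CPU or the host device.
   gpu_selector selector;
   queue Q(selector);

   // Q is then passed as the first argument to every DPC++ BLAS/LAPACK routine.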
Vector and Matrix Types
-----------------------

DPC++ uses buffers to store data on a device and to share data between devices and the host. Accordingly, vector and matrix inputs to DPC++ BLAS and LAPACK routines are DPC++ buffer types. Currently, all buffers must be one-dimensional, but you may use DPC++'s ``buffer::reinterpret()`` member function to convert a higher-dimensional buffer to a one-dimensional one.

For example, the ``gemv`` routine takes a matrix ``A`` and vectors ``x`` and ``y``. For the real double precision case, each of these parameters has the following type:

* ``double*`` in standard BLAS
* ``buffer<double,1>&`` in DPC++ BLAS

Scalars
-------

Scalar inputs are passed by value for all BLAS functions.

Complex Numbers
---------------

In DPC++, complex numbers are represented with C++ ``std::complex`` types. For instance, ``MKL_Complex8`` can be replaced by ``std::complex<float>``. This is true for scalar, vector, and matrix arguments. For instance, a double precision complex vector would have the type ``buffer<std::complex<double>,1>``.

Return Values
-------------

Some BLAS and LAPACK routines (``dot``, ``nrm2``, ``lapy2``, ``asum``, ``iamax``) return a scalar result as their return value. In DPC++, to support asynchronous computation, these routines take an additional buffer argument at the end of the argument list. The result value is stored in this buffer when the computation completes. These routines, like the other DPC++ routines, have a return type of ``void``.

Computation Options (Character Parameters)
------------------------------------------

Standard BLAS and LAPACK use special alphabetic characters to control operations: transposition of matrices, storage of symmetric and triangular matrices, and so on. In DPC++, these special characters are replaced by scoped enum types for extra type safety. For example, the BLAS matrix-vector multiplication routine ``dgemv`` takes a character argument ``trans``, which can be ``N`` or ``T``, specifying whether the input matrix ``A`` should be transposed before multiplication. In DPC++, ``trans`` is a member of the scoped enum type ``oneapi::mkl::transpose``. You may use the traditional character-based names ``oneapi::mkl::transpose::N`` and ``oneapi::mkl::transpose::T``, or the equivalent, more descriptive names ``oneapi::mkl::transpose::nontrans`` and ``oneapi::mkl::transpose::trans``. See :ref:`data-types` for more information on the new types.

Matrix Layout (Row Major and Column Major)
******************************************

The standard BLAS and LAPACK APIs require a Fortran layout for matrices (column major), where matrices are stored column by column in memory and the entries in each column are stored in consecutive memory locations. |O-MKL| for DPC++ likewise assumes this matrix layout by default. Row major layout is not supported directly by the LAPACK DPC++ APIs, but you may still use them by treating row major matrices as transposed column major matrices; for BLAS, the ``oneapi::mkl::blas::row_major`` namespace described above can be used instead.

Example for BLAS
----------------

Below is a short excerpt of a program calling standard BLAS ``dgemm``:

.. code-block::

   double *A = …, *B = …, *C = …;
   double alpha = 2.0, beta = 3.0;
   int m = 16, n = 20, k = 24;
   int lda = m, ldb = n, ldc = m;

   dgemm("N", "T", &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);

The DPC++ equivalent of this excerpt would be as follows:

.. code-block::

   using namespace cl::sycl;
   using namespace oneapi::mkl;

   queue Q(…);
   buffer<double,1> A = …, B = …, C = …;
   int m = 16, n = 20, k = 24;
   int lda = m, ldb = n, ldc = m;

   blas::gemm(Q, transpose::N, transpose::T, m, n, k, 2.0, A, lda, B, ldb, 3.0, C, ldc);
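The ``gemm`` call above executes asynchronously, and the buffers in the excerpt are elided. The following is a minimal sketch of how they might be set up and how the result might be read back, continuing the DPC++ excerpt; the host arrays ``a_host``, ``b_host``, and ``c_host`` are hypothetical:

.. code-block::

   // Hypothetical host storage, sized for column major layout:
   // A is m x k (lda = m), B is n x k (ldb = n, transposed by gemm),
   // and C is m x n (ldc = m).
   std::vector<double> a_host(lda * k), b_host(ldb * k), c_host(ldc * n);

   buffer<double,1> A(a_host.data(), range<1>(a_host.size()));
   buffer<double,1> B(b_host.data(), range<1>(b_host.size()));
   buffer<double,1> C(c_host.data(), range<1>(c_host.size()));

   blas::gemm(Q, transpose::N, transpose::T, m, n, k, 2.0, A, lda, B, ldb, 3.0, C, ldc);

   // gemm runs asynchronously; constructing a host accessor waits for the
   // computation to finish and makes the result visible on the host.
   auto c_acc = C.get_access<access::mode::read>();
   double c00 = c_acc[0];   // the (0,0) element of C in column major storage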
Example for LAPACK
------------------

Below is a short excerpt of a program calling the standard LAPACK routine ``dgetrf`` (LU factorization):

.. code-block::

   double *A = …;
   MKL_INT m = 16, n = 20;
   MKL_INT lda = m;
   MKL_INT ipiv[m];
   MKL_INT info;

   dgetrf(&m, &n, A, &lda, ipiv, &info);

The DPC++ equivalent of this excerpt would be as follows:

.. code-block::

   using namespace cl::sycl;
   using namespace oneapi::mkl;

   queue Q(…);
   buffer<double,1> A = …;
   int64_t m = 16, n = 20;
   int64_t lda = m;
   buffer<int64_t,1> ipiv(range<1>(m));
   buffer<int64_t,1> info(range<1>(1));

   dgetrf(Q, m, n, A, lda, ipiv, info);
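As with the BLAS example, the DPC++ ``dgetrf`` call executes asynchronously. The following is a minimal sketch of reading the results back on the host, assuming the one-element ``info`` buffer shown above; the accessor names are illustrative:

.. code-block::

   // Constructing host accessors waits for dgetrf to complete and copies the
   // data back: A holds the L and U factors, ipiv the pivot indices, and
   // info the status (0 indicates success).
   auto a_acc    = A.get_access<access::mode::read>();
   auto ipiv_acc = ipiv.get_access<access::mode::read>();
   auto info_acc = info.get_access<access::mode::read>();

   if (info_acc[0] == 0) {
       // The factorization succeeded; a_acc and ipiv_acc can be used on the host.
   }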