
Linking an external application with Chameleon libraries

Compilation and linking with the Chameleon libraries have been tested with the GNU compiler suite (gcc/gfortran) and the Intel compiler suite (icc/ifort).

Flags required

The compiler and linker flags necessary to build an application using Chameleon are provided through the pkg-config mechanism.

export PKG_CONFIG_PATH=/home/jdoe/install/chameleon/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --cflags chameleon
pkg-config --libs chameleon
pkg-config --libs --static chameleon

The required .pc files are located in the lib/pkgconfig sub-directory of your Chameleon install directory.
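
For example, assuming a single-file application main.c, a build driven by pkg-config could look like:

gcc $(pkg-config --cflags chameleon) -c main.c
gcc main.o -o main $(pkg-config --libs chameleon)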

Static linking in C

Let's imagine you have a file main.c that you want to link with the Chameleon static libraries, and let /home/yourname/install/chameleon be the Chameleon install directory, containing the sub-directories include/ and lib/. Your compilation command with the gcc compiler could be:

gcc -I/home/yourname/install/chameleon/include -o main.o -c main.c

Now if you want to link your application with Chameleon static libraries, you could do:

gcc main.o -o main                                         \
/home/yourname/install/chameleon/lib/libchameleon.a        \
/home/yourname/install/chameleon/lib/libchameleon_starpu.a \
/home/yourname/install/chameleon/lib/libcoreblas.a         \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64           \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

As you can see in this example, we also link with some dynamic libraries: starpu-1.2, the Intel MKL libraries (for BLAS/LAPACK/CBLAS/LAPACKE), pthread, m (math) and rt. These libraries depend on the configuration of your Chameleon build. You can find these dependencies in the .pc files we generate during compilation, which are installed in the lib/pkgconfig sub-directory of your Chameleon install directory. Note also that you may need to specify where to find these libraries with the -L option of your compiler/linker.

Before running your program, make sure that the paths of all shared libraries your executable depends on are known. Run ldd main to check. If some shared library paths are missing, append them to the LD_LIBRARY_PATH environment variable (DYLD_LIBRARY_PATH on Mac).
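
For example, with the install path used above:

ldd main
export LD_LIBRARY_PATH=/home/yourname/install/chameleon/lib:$LD_LIBRARY_PATH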

Dynamic linking in C

For dynamic linking (which requires building Chameleon with the CMake option BUILD_SHARED_LIBS=ON), the procedure is similar to the static case, but instead of giving the paths to the static libraries you indicate the library directory with the -L option and the library names with the -l option, like this:

gcc main.o -o main \
-L/home/yourname/install/chameleon/lib \
-lchameleon -lchameleon_starpu -lcoreblas \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

Note that updating your LD_LIBRARY_PATH environment variable (DYLD_LIBRARY_PATH on Mac) with the path of the libraries may be required before executing:

export LD_LIBRARY_PATH=path/to/libs:path/to/chameleon/lib

Using Chameleon executables

Chameleon provides several test executables that are compiled and linked with Chameleon's dependencies. Instructions about the arguments to give to the executables are accessible through the option -[-]help or -[-]h. These binaries are separated into three categories and can be found in three different directories:

  • example: contains examples of API usage; more specifically, the sub-directory lapack_to_morse/ provides a tutorial that explains how to use Chameleon functionalities starting from a full LAPACK code, see Tutorial LAPACK to Chameleon
  • testing: contains testing drivers to check numerical correctness of Chameleon linear algebra routines with a wide range of parameters
    ./testing/stesting 4 1 LANGE 600 100 700
        

    The first two arguments are the number of cores and GPUs to use. The third one is the name of the algorithm to test. The other arguments depend on the algorithm; here they stand for the number of rows, the number of columns and the leading dimension of the problem.

    The names of the algorithms available for testing are:

    • LANGE: norms of matrices: Infinity, One, Max, Frobenius
    • GEMM: general matrix-matrix multiply
    • HEMM: hermitian matrix-matrix multiply
    • HERK: hermitian matrix-matrix rank k update
    • HER2K: hermitian matrix-matrix rank 2k update
    • SYMM: symmetric matrix-matrix multiply
    • SYRK: symmetric matrix-matrix rank k update
    • SYR2K: symmetric matrix-matrix rank 2k update
    • PEMV: matrix-vector multiply with pentadiagonal matrix
    • TRMM: triangular matrix-matrix multiply
    • TRSM: triangular solve, multiple rhs
    • POSV: solve linear systems with symmetric positive-definite matrix
    • GESV_INCPIV: solve linear systems with general matrix
    • GELS: linear least squares with general matrix
    • GELS_HQR: gels with hierarchical tree
    • GELS_SYSTOLIC: gels with systolic tree
  • timing: contains timing drivers to assess the performance of Chameleon routines. There are two sets of executables: those that do not use the tile interface and those that do (with _tile in the name of the executable). Executables without the tile interface allocate data following LAPACK conventions, and these data can be given as arguments to Chameleon routines as you would do with LAPACK. Executables with the tile interface generate the data directly in the format Chameleon tile algorithms use to submit tasks to the runtime system. Executables with the tile interface should be more performant because no data copy from the LAPACK matrix layout to the tile matrix layout is necessary (an invocation of the tile variant is sketched after the list of timing algorithms below). Calling example:
    ./timing/time_dpotrf --n_range=1000:10000:1000 --nb=320
                         --threads=9 --gpus=3
                         --nowarmup
        

    List of main options that can be used in timing:

    • --help: Show usage
    • Machine parameters
      • -t x, --threads=x: Number of CPU workers (default: automatic detection through runtime)
      • -g x, --gpus=x: Number of GPU workers (default: 0)
      • -P x, --P=x: Rows (P) in the PxQ process grid (default: 1)
      • --nocpu: All GPU kernels are exclusively executed on GPUs
    • Matrix parameters
      • -m x, --m=x, --M=x: Dimension (M) of the matrices (default: N)
      • -n x, --n=x, --N=x: Dimension (N) of the matrices
      • -N R, --n_range=R: Range of N values to time with R=Start:Stop:Step (default: 500:5000:500)
      • -k x, --k=x, --K=x, --nrhs=x: Dimension (K) of the matrices or number of right-hand sides (default: 1). This is useful for GEMM algorithms (k is the shared dimension and must be defined >1 to consider matrices and not vectors)
      • -b x, --nb=x: NB size. (default: 320)
      • -i x, --ib=x: IB size. (default: 32)
    • Check/prints
      • --niter=X: Number of iterations performed for each test (default: 1)
      • -W, --nowarning: Do not show warnings
      • -w, --nowarmup: Skip the warmup run used to pre-load libraries
      • -c, --check: Check result
      • -C, --inc: Check on inverse
      • --mode=x: Change the xLATMS matrix generation mode for SVD/EVD (default: 4). It must be between 0 and 20 inclusive.
    • Profiling parameters
      • -T, --trace: Enable trace generation
      • --progress: Display progress indicator
      • -d, --dag: Enable DAG generation. Generates a dot_dag_file.dot.
      • -p, --profile: Print profiling information
    • HQR parameters
      • -a x, --qr_a=x, --rhblk=x: Define the size of the local TS trees in the Householder reduction trees for QR and LQ factorization; x is the size of each subdomain (default: -1)
      • -l x, --llvl=x: Tree used for low level reduction inside nodes (default: -1)
      • -L x, --hlvl=x: Tree used for high level reduction between nodes, only if P > 1 (default: -1). Possible values are -1: Automatic, 0: Flat, 1: Greedy, 2: Fibonacci, 3: Binary, 4: Replicated greedy.
      • -D, --domino: Enable the domino between upper and lower trees
    • Advanced options
      • --nobigmat: Disable single large matrix allocation for multiple tiled allocations
      • -s, --sync: Enable synchronous calls in wrapper functions such as POTRI
      • -o, --ooc: Enable out-of-core (available only with StarPU)
      • -G, --gemm3m: Use gemm3m complex method
      • --bound: Compare result to area bound

    List of timing algorithms available:

    • LANGE: norms of matrices
    • GEMM: general matrix-matrix multiply
    • TRSM: triangular solve
    • POTRF: Cholesky factorization with a symmetric positive-definite matrix
    • POTRI: Cholesky inversion
    • POSV: solve linear systems with symmetric positive-definite matrix
    • GETRF_NOPIV: LU factorization of a general matrix using the tile LU algorithm without row pivoting
    • GESV_NOPIV: solve linear system for a general matrix using the tile LU algorithm without row pivoting
    • GETRF_INCPIV: LU factorization of a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
    • GESV_INCPIV: solve linear system for a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
    • GEQRF: QR factorization of a general matrix
    • GELQF: LQ factorization of a general matrix
    • GEQRF_HQR: GEQRF with hierarchical tree
    • GEQRS: solve linear systems using a QR factorization
    • GELS: solves overdetermined or underdetermined linear systems involving a general matrix using the QR or the LQ factorization
    • GESVD: general matrix singular value decomposition
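
    As sketched above, the tile variant of the Cholesky timing driver could be invoked like this (the executable name time_dpotrf_tile is inferred from the _tile naming convention described above and may differ in your build):

    ./timing/time_dpotrf_tile --n_range=1000:10000:1000 --nb=320 --threads=9 --gpus=3 --check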

Execution trace using StarPU

<sec:trace>

StarPU can generate its own trace log files if it is compiled with the --with-fxt option at the configure step (you may have to specify the directory where you installed FxT by giving --with-fxt=... instead of --with-fxt alone). By doing so, traces are generated after each execution of a program which uses StarPU, in the directory pointed to by the STARPU_FXT_PREFIX environment variable.

export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/

When executing a ./timing/... Chameleon program, if tracing has been enabled (StarPU compiled with FxT and -DCHAMELEON_ENABLE_TRACING=ON), you can give the option --trace to tell the program to generate trace log files.

Finally, to generate the trace file, which can be opened with the ViTE program, you can use the starpu_fxt_tool executable of StarPU. This tool should be in $STARPU_INSTALL_REPOSITORY/bin. You can use it to generate the trace file like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename

There is one file per MPI process (prof_filename_0, prof_filename_1, …). To generate a trace of an MPI program you can call it like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename*

The trace file will be named paje.trace (use the -o option to specify an output name). Alternatively, for non-MPI executions (only one process and one profiling file), you can set the environment variable STARPU_GENERATE_TRACE=1 to automatically generate the paje trace file.
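
Putting the pieces together, a complete tracing session could look like this sketch (prof_filename_0 stands for the profiling file actually produced in STARPU_FXT_PREFIX, whose exact name depends on your StarPU configuration; vite is assumed to be installed):

export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/
./timing/time_dpotrf --n_range=2000:2000:500 --trace
path/to/your/install/starpu/bin/starpu_fxt_tool -i /home/jdoe/fxt_files/prof_filename_0
vite paje.trace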

Use simulation mode with StarPU-SimGrid

<sec:simu>

Simulation mode can be activated by setting the CMake option CHAMELEON_SIMULATION to ON. This mode allows you to simulate the execution of algorithms with StarPU compiled with SimGrid. To do so, we provide some perfmodels in the simucore/perfmodels/ directory of the Chameleon sources. To use these perfmodels, set your STARPU_HOME environment variable to path/to/your/chameleon_sources/simucore/perfmodels. Finally, set your STARPU_HOSTNAME environment variable to the name of the machine to simulate, for example: STARPU_HOSTNAME=mirage. Note that only POTRF kernels with block sizes of 320 or 960 (single and double precision) on the mirage and sirocco machines are available for now. The database of models is subject to change.
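
For example, a simulated run could be set up like this (a sketch using the timing driver described earlier; paths are illustrative):

export STARPU_HOME=path/to/your/chameleon_sources/simucore/perfmodels
export STARPU_HOSTNAME=mirage
./timing/time_dpotrf --n_range=3200:3200:3200 --nb=320 --threads=8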

Chameleon API

Chameleon provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations. Real arithmetic and complex arithmetic are supported in both single precision and double precision. The linear algebra routines are of the following form:

MORSE_name[_Tile[_Async]]
  • all user routines are prefixed with MORSE
  • in the pattern MORSE_name[_Tile[_Async]], name follows the BLAS/LAPACK naming scheme for algorithms (e.g. sgemm for general matrix-matrix multiply simple precision)
  • Chameleon provides three interface levels
    • MORSE_name: the simplest interface, very close to CBLAS and LAPACKE; matrices are given following the LAPACK data layout (1-D array column-major). It involves copies of data from the LAPACK layout to the tile layout and back (to update the LAPACK data), see Step1.
    • MORSE_name_Tile: the tile interface avoids copies between the LAPACK and tile layouts. It is the standard interface of Chameleon and should achieve better performance than the previous, simplest interface. The data are given through a specific structure called a descriptor, see Step2.
    • MORSE_name_Tile_Async: similar to the tile interface, it avoids the synchronization barrier normally called between Tile routines. At the end of an Async function, completion of the tasks is not guaranteed and the data are not necessarily up-to-date. To ensure that all tasks have been executed, a synchronization function has to be called after the sequence of Async functions, see Step4.

MORSE routine calls have to be preceded by

MORSE_Init( NCPU, NGPU );

to initialize MORSE and the runtime system and followed by

MORSE_Finalize();

to free some data and finalize the runtime and/or MPI.

Tutorial LAPACK to Chameleon

<sec:tuto>

This tutorial is dedicated to the usage of the Chameleon API. The idea is to start from a simple code and explain, step by step, how to use the Chameleon routines. The first step is a full BLAS/LAPACK code with no dependency on Chameleon, a code that most users should easily understand. Then, the different interfaces Chameleon provides are exposed, from the simplest API (step1) to more complicated ones (until step4). The way some important parameters are set is discussed in step5. step6 is an example of distributed computation with MPI. Finally, step7 shows how to let Chameleon initialize the user's data (matrices/vectors) in parallel.

Source files can be found in the example/lapack_to_morse/ directory. If the CMake option CHAMELEON_ENABLE_EXAMPLE is ON then the source files are compiled with the project libraries. The arithmetic precision is double. To execute a step X, enter the following command:

./stepX --option1 --option2 ...

Instructions about the arguments to give to the executables are accessible through the option -[-]help or -[-]h. Note that options have default values.

For all steps, the program solves a linear system $Ax=B$. The matrix values are randomly generated but ensure that the matrix $A$ is symmetric positive definite, so that $A$ can be factorized in an $LL^T$ form using the Cholesky factorization.

The different steps of the tutorial are:

  • Step0: a simple Cholesky example using the C interface of BLAS/LAPACK
  • Step1: introduces the LAPACK equivalent interface of Chameleon
  • Step2: introduces the tile interface
  • Step3: indicates how to give your own tile matrix to Chameleon
  • Step4: introduces the tile async interface
  • Step5: shows how to set some important parameters
  • Step6: introduces how to benefit from MPI in Chameleon
  • Step7: introduces how to let Chameleon initialize the user’s matrix data

Step0

The C interfaces of BLAS and LAPACK, that is, CBLAS and LAPACKE, are used to solve the system. The size of the system (matrix) and the number of right-hand sides can be given as arguments to the executable (be careful not to give huge numbers if you do not have an infinite amount of RAM!). As for every step, the correctness of the solution is checked by calculating the norm $||Ax-B||/(||A|| ||x||+||B||)$. The time spent in factorization+solve is recorded and, because we know exactly the number of operations of these algorithms, we deduce the number of operations processed per second (in GFlops/s). The important part of the code that solves the problem is:

/* Cholesky factorization:
 * A is replaced by its Cholesky factor U or L depending on uplo (here 'U') */
LAPACKE_dpotrf( LAPACK_COL_MAJOR, 'U', N, A, N );
/* Solve:
 * B is stored in X on entry, X contains the result on exit.
 * Forward ...
 */
cblas_dtrsm(
    CblasColMajor,
    CblasLeft,
    CblasUpper,
    CblasConjTrans,
    CblasNonUnit,
    N, NRHS, 1.0, A, N, X, N);
/* ... and back substitution */
cblas_dtrsm(
    CblasColMajor,
    CblasLeft,
    CblasUpper,
    CblasNoTrans,
    CblasNonUnit,
    N, NRHS, 1.0, A, N, X, N);

Step1

<sec:tuto_step1>

This step introduces the simplest Chameleon interface, which is equivalent to CBLAS/LAPACKE. The code is very similar to step0 but instead of calling CBLAS/LAPACKE functions, we call the equivalent Chameleon functions. The solving code becomes:

/* Factorization: */
MORSE_dpotrf( UPLO, N, A, N );
/* Solve: */
MORSE_dpotrs(UPLO, N, NRHS, A, N, X, N);

The API is almost the same so that it is easy to use for beginners. It is important to keep in mind that before any call to MORSE routines, MORSE_Init has to be invoked to initialize MORSE and the runtime system. Example:

MORSE_Init( NCPU, NGPU );

After all MORSE calls have been done, a call to MORSE_Finalize is required to free some data and finalize the runtime and/or MPI.

MORSE_Finalize();

We use the MORSE routines with the LAPACK interface, which means the routines accept the same matrix format as LAPACK (1-D array column-major). Note that the matrix is copied into our own tile structures, see details about this format in Tile Data Layout. This means some overhead can come from these copies.
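
To make the whole workflow concrete, here is a minimal, self-contained sketch in the spirit of step1 (it assumes the MORSE API is exposed by a morse.h header and that MorseUpper is the uplo constant; error checking and the residual check are omitted):

#include <stdlib.h>
#include <morse.h>

int main(void) {
    int N = 1000, NRHS = 1, i, j;
    double *A = malloc((size_t)N * N * sizeof(double));
    double *X = malloc((size_t)N * NRHS * sizeof(double));

    /* Build a diagonally dominant symmetric matrix so that A is SPD */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            A[(size_t)j * N + i] = (i == j) ? (double)N : 0.5 / (1 + abs(i - j));
    for (i = 0; i < N * NRHS; i++)
        X[i] = 1.0;                   /* right-hand side B, overwritten by x */

    MORSE_Init( 4, 0 );                              /* 4 CPU workers, no GPU */
    MORSE_dpotrf( MorseUpper, N, A, N );             /* A = U^T * U */
    MORSE_dpotrs( MorseUpper, N, NRHS, A, N, X, N ); /* solve A x = B */
    MORSE_Finalize();

    free(A); free(X);
    return 0;
}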

Step2

<sec:tuto_step2>

This program is a copy of step1 but instead of using the LAPACK interface, which leads to copying the LAPACK matrices inside the MORSE routines, we use the tile interface. We still use the standard matrix format, but we will see how to use this matrix to create a MORSE descriptor, a structure wrapping the data on which we want to apply sequential task-based algorithms. The solving code becomes:

/* Factorization: */
MORSE_dpotrf_Tile( UPLO, descA );
/* Solve: */
MORSE_dpotrs_Tile( UPLO, descA, descX );

To use the tile interface, a specific structure MORSE_desc_t must be created. This can be achieved in several ways.

  1. Use the existing function MORSE_Desc_Create: the matrix data are considered contiguous in memory, as in PLASMA (Tile Data Layout).
  2. Use the existing function MORSE_Desc_Create_OOC: the matrix data are allocated on demand in memory, tile by tile, and possibly pushed to disk if they do not fit in memory.
  3. Use the existing function MORSE_Desc_Create_User: it is more flexible than Desc_Create because you can provide your own way to access the tile data, so that your tiles can be allocated wherever you want in memory, see the next paragraph Step3.
  4. Create your own function to fill the descriptor. If you understand well the meaning of each item of MORSE_desc_t, you should be able to fill the structure correctly.

In Step2, we use the first way to create the descriptor:

MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
                  NB, NB, NB*NB, N, N,
                  0, 0, N, N,
                  1, 1);
  • descA is the descriptor to create.
  • The second argument is a pointer to existing data. The existing data must follow the LAPACK/PLASMA matrix layout Tile Data Layout (1-D array column-major) if MORSE_Desc_Create is used to create the descriptor. The MORSE_Desc_Create_User function can be used if your data are organized differently; this is discussed in the next paragraph Step3. Giving a NULL pointer means you let the function allocate the memory space. You then need to copy your data into the memory allocated by Desc_Create. This can be done with
    MORSE_Lapack_to_Tile(A, N, descA);
        
  • The third argument of Desc_Create is the datatype (used for memory allocation).
  • The fourth to eighth arguments are, respectively, the number of rows (NB) and columns (NB) in each tile, the total number of values in a tile (NB*NB), and the number of rows (N) and columns (N) of the entire matrix.
  • The ninth to twelfth arguments are, respectively, the beginning row (0) and column (0) indexes of the submatrix, and the number of rows (N) and columns (N) of the submatrix. These arguments are specific and used in precise cases. If you do not consider submatrices, just use 0, 0, NROWS, NCOLS.
  • The last two arguments are the parameters of the 2-D block-cyclic distribution grid, see ScaLAPACK. To be able to use another data distribution over the nodes, the MORSE_Desc_Create_User function should be used.
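
Putting these calls together, a typical round trip between a LAPACK-layout array and a descriptor could look like this sketch (all the routines used here are documented later in this guide):

MORSE_desc_t *descA = NULL;
/* wrap an N-by-N double matrix into NB-by-NB tiles, no submatrix, 1x1 grid */
MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
                  NB, NB, NB*NB, N, N,
                  0, 0, N, N, 1, 1);
MORSE_Lapack_to_Tile(A, N, descA);        /* copy LAPACK data into the tiles */
MORSE_dpotrf_Tile(MorseUpper, descA);     /* compute on the tile structure */
MORSE_Tile_to_Lapack(descA, A, N);        /* copy the factor back */
MORSE_Desc_Destroy(&descA);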

Step3

<sec:tuto_step3>

This program makes use of the same interface as Step2 (the tile interface) but does not allocate LAPACK matrices anymore, so that no copy between the LAPACK matrix layout and the tile matrix layout is necessary to call the MORSE routines. To generate random right-hand sides you can use:

/* Allocate memory and initialize descriptor B */
MORSE_Desc_Create(&descB,  NULL, MorseRealDouble,
                  NB, NB,  NB*NB, N, NRHS,
                  0, 0, N, NRHS, 1, 1);
/* generate RHS with random values */
MORSE_dplrnt_Tile( descB, 5673 );

The other important point is that it is possible to create a descriptor, the structure necessary to call MORSE efficiently, by giving your own pointers to tiles if your matrix is not organized as a 1-D array column-major. This can be achieved with the MORSE_Desc_Create_User routine. Here is an example:

MORSE_Desc_Create_User(&descA, matA, MorseRealDouble,
                       NB, NB, NB*NB, N, N,
                       0, 0, N, N, 1, 1,
                       user_getaddr_arrayofpointers,
                       user_getblkldd_arrayofpointers,
                       user_getrankof_zero);

The first arguments are the same as for the MORSE_Desc_Create routine. The following arguments allow you to give pointers to the functions that manage the access to tiles from the structure given as second argument. Here, for example, matA is an array containing the addresses of the tiles, see the function allocate_tile_matrix defined in step3.h. The three functions you have to define for Desc_Create_User are:

  • a function that returns the address of tile $A(m,n)$, m and n standing for the indexes of the tile in the global matrix. Let's consider a $4 \times 4$ matrix with tile size $2 \times 2$; the matrix contains four tiles of indexes: $A(m=0,n=0)$, $A(m=0,n=1)$, $A(m=1,n=0)$, $A(m=1,n=1)$
  • a function that returns the leading dimension of tile $A(m,*)$
  • a function that returns MPI rank of tile $A(m,n)$

Examples of these functions are visible in step3.h. Note that the way we define these functions is related to the tile matrix format and to the data distribution considered. This example should not be used with MPI since all tiles are assigned to process 0, which means a large amount of data would potentially be transferred between nodes.
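
For illustration, here is a hedged sketch of what such callbacks can look like (the real versions live in step3.h; the mat, lmt and mb descriptor fields and the array-of-pointers tile storage are assumptions of this sketch):

/* Return the address of tile A(m,n); tiles are assumed to be stored as
 * an array of pointers in column-major tile order in desc->mat. */
static void *user_getaddr_arrayofpointers(const MORSE_desc_t *desc, int m, int n)
{
    double **tiles = (double **)desc->mat;   /* matA given at creation */
    return (void *)tiles[(size_t)n * desc->lmt + m];
}

/* Every tile is assumed allocated with leading dimension NB (= desc->mb). */
static int user_getblkldd_arrayofpointers(const MORSE_desc_t *desc, int m)
{
    (void)m;
    return desc->mb;
}

/* All tiles live on MPI process 0: fine sequentially, bad with MPI. */
static int user_getrankof_zero(const MORSE_desc_t *desc, int m, int n)
{
    (void)desc; (void)m; (void)n;
    return 0;
}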

Step4

<sec:tuto_step4>

This program is a copy of step2 but instead of using the tile interface, it uses the tile async interface. The goal is to exhibit the runtime synchronization barriers. Keep in mind that when the tile interface is used, as in MORSE_dpotrf_Tile, a synchronization function, waiting for the actual execution and termination of all tasks, is called to ensure the proper completion of the algorithm (i.e. the data are up-to-date). The code shows how to exploit the async interface to pipeline subsequent algorithms so that fewer synchronizations are done. The code becomes:

/* Morse structure containing parameters and a structure to interact with
 * the Runtime system */
MORSE_context_t *morse;
/* MORSE sequence uniquely identifies a set of asynchronous function calls
 * sharing common exception handling */
MORSE_sequence_t *sequence = NULL;
/* MORSE request uniquely identifies each asynchronous function call */
MORSE_request_t request = MORSE_REQUEST_INITIALIZER;
int status;

...

morse_sequence_create(morse, &sequence);

/* Factorization: */
MORSE_dpotrf_Tile_Async( UPLO, descA, sequence, &request );

/* Solve: */
MORSE_dpotrs_Tile_Async( UPLO, descA, descX, sequence, &request);

/* Synchronization barrier (the runtime ensures that all submitted tasks
 * have been terminated) */
RUNTIME_barrier(morse);
/* Ensure that all data processed on the GPUs we depend on are back
 * in main memory */
RUNTIME_desc_getoncpu(descA);
RUNTIME_desc_getoncpu(descX);

status = sequence->status;

Here the sequence of dpotrf and dpotrs algorithms is processed without synchronization, so that some tasks of dpotrf and dpotrs can be executed concurrently, which could increase performance. The async interface is very similar to the tile one. It is only necessary to give two new objects, MORSE_sequence_t and MORSE_request_t, used to handle asynchronous function calls.

[figure: potri_async.png]

Step5

<sec:tuto_step5>

Step5 shows how to set some important parameters. This program is a copy of Step4 but some additional parameters are given by the user. The parameters that can be set are:

  • number of Threads
  • number of GPUs

    The number of workers can be given as arguments to the executable with the --threads= and --gpus= options. It is important to notice that we assign one thread per GPU to optimize data transfers between main memory and device memory. The numbers of workers of each type, CPU and CUDA, must be given at MORSE_Init.

    if ( iparam[IPARAM_THRDNBR] == -1 ) {
        get_thread_count( &(iparam[IPARAM_THRDNBR]) );
        /* reserve one thread per cuda device to optimize memory transfers */
        iparam[IPARAM_THRDNBR] -= iparam[IPARAM_NCUDAS];
    }
    NCPU = iparam[IPARAM_THRDNBR];
    NGPU = iparam[IPARAM_NCUDAS];
    /* initialize MORSE with main parameters */
    MORSE_Init( NCPU, NGPU );
        
  • matrix size
  • number of right-hand sides
  • block (tile) size

    The problem size is given with the --n= and --nrhs= options. The tile size is given with the --nb= option. These parameters are required to create descriptors. The tile size NB is a key parameter for performance since it defines the granularity of tasks. If NB is too large compared to N, there are few tasks to schedule; if the number of workers is large, this limits parallelism. On the contrary, if NB is too small (i.e. many small tasks), workers might not be fed correctly and the runtime system's operations could represent a substantial overhead. A trade-off has to be found depending on many parameters: problem size, algorithm (which drives the data dependencies), architecture (number of workers, workers speed, workers uniformity, memory bus speed). By default it is set to 128. Do not hesitate to play with this parameter and compare performance on your machine.

  • inner-blocking size

    The inner-blocking size is given with the --ib= option. This parameter is used by kernels (optimized algorithms applied on tiles) to perform subsequent operations with a data block size that fits the cache of the workers. The NB and IB parameters can be set with the MORSE_Set function:

    MORSE_Set(MORSE_TILE_SIZE,        iparam[IPARAM_NB] );
    MORSE_Set(MORSE_INNER_BLOCK_SIZE, iparam[IPARAM_IB] );
        

Step6

<sec:tuto_step6>

This program is a copy of Step5 with some additional parameters to be set for the data distribution. To use this program properly, MORSE must use the StarPU runtime system and the MPI option must be activated at configure time. The data distribution used here is 2-D block-cyclic, see for example ScaLAPACK for an explanation. The user can enter the parameters of the distribution grid at execution time with the --p= option. Example using OpenMPI on four nodes with one process per node:

mpirun -np 4 ./step6 --n=10000 --nb=320 --ib=64 --threads=8 --gpus=2 --p=2

In this program we use the tile data layout from PLASMA so that the call

MORSE_Desc_Create_User(&descA, NULL, MorseRealDouble,
                       NB, NB, NB*NB, N, N,
                       0, 0, N, N,
                       GRID_P, GRID_Q,
                       morse_getaddr_ccrb,
                       morse_getblkldd_ccrb,
                       morse_getrankof_2d);

is equivalent to the following call

MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
                  NB, NB, NB*NB, N, N,
                  0, 0, N, N,
                  GRID_P, GRID_Q);

the functions morse_getaddr_ccrb, morse_getblkldd_ccrb and morse_getrankof_2d being the ones used internally by Desc_Create. It is interesting to notice that the code is almost the same as Step5. The only additional information to give is the way tiles are distributed, through the third function given to MORSE_Desc_Create_User. Here, because we have only experimented with a 2-D block-cyclic distribution, the P and Q parameters appear in the interface of Desc_Create, but they make sense only for a 2-D block-cyclic distribution, i.e. when the morse_getrankof_2d function is used. Of course the interface could be used with other distributions, the two parameters then no longer describing a 2-D block-cyclic grid but another distribution.

Step7

<sec:tuto_step7>

This program is a copy of step6 with some additional calls to build a matrix from within Chameleon using a function provided by the user. This can be seen as a replacement for functions like MORSE_dplgsy_Tile(), which can be used to fill the matrix with random data, MORSE_dLapack_to_Tile(), which fills the matrix with data stored in a LAPACK-like buffer, or MORSE_Desc_Create_User(), which can be used to describe an arbitrary tile matrix structure. In this example, the build callback functions are just wrappers around CORE_xxx() functions, so the output of the program step7 should be exactly similar to that of step6. The difference is that the function used to fill the tiles is provided by the user, and therefore this approach is much more flexible.

The new function to understand is MORSE_dbuild_Tile, e.g.

struct data_pl data_A={(double)N, 51, N};
MORSE_dbuild_Tile(MorseUpperLower, descA, (void*)&data_A, Morse_build_callback_plgsy);

The idea here is to let Chameleon fill the matrix data in a task-based fashion (in parallel) by using a function given by the user. First, the user should define whether all the blocks must be entirely filled or just the upper/lower part, with e.g. MorseUpperLower. We still rely on the same structure MORSE_desc_t, which must be initialized with the proper parameters, by calling for example MORSE_Desc_Create. Then, an opaque pointer is used to let the user pass some extra data to the function. The last parameter is the pointer to the user's function.

List of available routines

Linear Algebra routines

We list the linear algebra routines of the form MORSE_name[_Tile[_Async]] (name follows the LAPACK naming scheme, see http://www.netlib.org/lapack/lug/node24.html) that can be used with the Chameleon library. For details about these functions please refer to the doxygen documentation. name can be one of the following:

  • BLAS 2/3 routines
    • gemm: matrix matrix multiply and addition
    • hemm: gemm with A Hermitian
    • herk: rank k operations with A Hermitian
    • her2k: rank 2k operations with A Hermitian
    • lauum: computes the product U * U’ or L’ * L, where the triangular factor U or L is stored in the upper or lower triangular part of the array A
    • symm: gemm with A symmetric
    • syrk: rank k operations with A symmetric
    • syr2k: rank 2k with A symmetric
    • trmm: gemm with A triangular
  • Triangular solving routines
    • trsm: computes triangular solve
    • trsmpl: performs the forward substitution step of solving a system of linear equations after the tile LU factorization of the matrix
    • trsmrv:
    • trtri: computes the inverse of a complex upper or lower triangular matrix A
  • LL’ (Cholesky) routines
    • posv: linear systems solving using Cholesky factorization
    • potrf: Cholesky factorization
    • potri: computes the inverse of a complex Hermitian positive definite matrix A using the Cholesky factorization A
    • potrimm:
    • potrs: linear systems solving using existing Cholesky factorization
    • sysv: linear systems solving using Cholesky decomposition with A symmetric
    • sytrf: Cholesky decomposition with A symmetric
    • sytrs: linear systems solving using existing Cholesky decomposition with A symmetric
  • LU routines
    • gesv_incpiv: linear systems solving with LU factorization and partial pivoting
    • gesv_nopiv: linear systems solving with LU factorization and without pivoting
    • getrf_incpiv: LU factorization with partial pivoting
    • getrf_nopiv: LU factorization without pivoting
    • getrs_incpiv: linear systems solving using existing LU factorization with partial pivoting
    • getrs_nopiv: linear systems solving using existing LU factorization without pivoting
  • QR/LQ routines
    • gelqf: LQ factorization
    • gelqf_param: gelqf with hqr
    • gelqs: computes a minimum-norm solution min || A*X - B || using the LQ factorization
    • gelqs_param: gelqs with hqr
    • gels: Uses QR or LQ factorization to solve an overdetermined or underdetermined linear system with a full rank matrix
    • gels_param: gels with hqr
    • geqrf: QR factorization
    • geqrf_param: geqrf with hqr
    • geqrs: computes a minimum-norm solution min || A*X - B || using the QR factorization
    • hetrd: reduces a complex Hermitian matrix A to real symmetric tridiagonal form S
    • geqrs_param: geqrs with hqr
    • tpgqrt: generates a partial Q matrix formed with a blocked QR factorization of a “triangular-pentagonal” matrix C, which is composed of an unused triangular block and a pentagonal block V, using the compact representation for Q. See tpqrt to generate V
    • tpqrt: computes a blocked QR factorization of a “triangular-pentagonal” matrix C, which is composed of a triangular block A and a pentagonal block B, using the compact representation for Q
    • unglq: generates an M-by-N matrix Q with orthonormal rows, which is defined as the first M rows of a product of the elementary reflectors returned by MORSE_zgelqf
    • unglq_param: unglq with hqr
    • ungqr: generates an M-by-N matrix Q with orthonormal columns, which is defined as the first N columns of a product of the elementary reflectors returned by MORSE_zgeqrf
    • ungqr_param: ungqr with hqr
    • unmlq: overwrites C with Q*C or C*Q, or the equivalent operations with Q transposed or conjugated (see the doxygen documentation)
    • unmlq_param: unmlq with hqr
    • unmqr: similar to unmlq (see doxygen documentation)
    • unmqr_param: unmqr with hqr
  • EVD/SVD
    • gesvd: singular value decomposition
    • heevd: eigenvalues/eigenvectors computation with A Hermitian
  • Extra routines
    • Norms
      • lange: computes norm of a matrix (Max, One, Inf, Frobenius)
      • lanhe: lange with A Hermitian
      • lansy: lange with A symmetric
      • lantr: lange with A triangular
    • Random matrices generation
      • plghe: generates a random Hermitian matrix
      • plgsy: generates a random symmetric matrix
      • plrnt: generates a random matrix
    • Others
      • geadd: general matrix matrix addition
      • lacpy: copy matrix into another
      • lascal: scales a matrix
      • laset: copy the triangular part of a matrix into another, set a value for the diagonal and off-diagonal part
      • tradd: trapezoidal matrices addition

Options routines

Enable MORSE feature.

int MORSE_Enable  (MORSE_enum option);

Feature to be enabled:

  • MORSE_WARNINGS: printing of warning messages,
  • MORSE_AUTOTUNING: autotuning for tile size and inner block size,
  • MORSE_PROFILING_MODE: activate kernels profiling,
  • MORSE_PROGRESS: to print a progress status,
  • MORSE_GEMM3M: to enable the use of the gemm3m BLAS function.

Disable MORSE feature.

int MORSE_Disable (MORSE_enum option);

Symmetric to MORSE_Enable.

Set MORSE parameter.

int MORSE_Set     (MORSE_enum param, int  value);

Parameters to be set:

  • MORSE_TILE_SIZE: size of the matrix tiles,
  • MORSE_INNER_BLOCK_SIZE: size of the tile inner blocks,
  • MORSE_HOUSEHOLDER_MODE: type of Householder trees (FLAT or TREE),
  • MORSE_HOUSEHOLDER_SIZE: size of the groups in Householder trees,
  • MORSE_TRANSLATION_MODE: related to MORSE_Lapack_to_Tile, see ztile.c.

Get value of MORSE parameter.

int MORSE_Get     (MORSE_enum param, int *value);
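
For example (values are illustrative):

MORSE_Enable(MORSE_PROFILING_MODE);      /* activate kernels profiling */
MORSE_Set(MORSE_TILE_SIZE, 320);         /* NB */
MORSE_Set(MORSE_INNER_BLOCK_SIZE, 32);   /* IB */

int nb;
MORSE_Get(MORSE_TILE_SIZE, &nb);         /* nb now contains 320 */
MORSE_Disable(MORSE_PROFILING_MODE);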

Auxiliary routines

Reports MORSE version number.

int MORSE_Version        (int *ver_major, int *ver_minor, int *ver_micro);

Initialize MORSE: initialize some parameters, initialize the runtime and/or MPI.

int MORSE_Init           (int nworkers, int ncudas);

Finalize MORSE: free some data and finalize the runtime and/or MPI.

int MORSE_Finalize       (void);

Suspend the MORSE runtime's polling for new tasks, to avoid useless CPU consumption when no tasks have to be executed by the MORSE runtime system.

int MORSE_Pause          (void);

Symmetrical call to MORSE_Pause, used to resume the workers polling for new tasks.

int MORSE_Resume         (void);

Return the MPI rank of the calling process.

int MORSE_My_Mpi_Rank    (void);

Return the size of the distributed computation.

int MORSE_Comm_size( int *size )

Return the rank of the distributed computation.

int MORSE_Comm_rank( int *rank )

Prepare the distributed processes for computation.

int MORSE_Distributed_start(void)

Clean the distributed processes after computation.

int MORSE_Distributed_stop(void)

Return the number of CPU workers initialized by the runtime.

int MORSE_GetThreadNbr()

Conversion from LAPACK layout to tile layout.

int MORSE_Lapack_to_Tile (void *Af77, int LDA, MORSE_desc_t *A);

Conversion from tile layout to LAPACK layout.

int MORSE_Tile_to_Lapack (MORSE_desc_t *A, void *Af77, int LDA);

Descriptor routines

Create matrix descriptor, internal function.

int MORSE_Desc_Create(MORSE_desc_t **desc, void *mat, MORSE_enum dtyp,
                      int mb, int nb, int bsiz, int lm, int ln,
                      int i, int j, int m, int n, int p, int q);

Create matrix descriptor, user function.

int MORSE_Desc_Create_User(MORSE_desc_t **desc, void *mat, MORSE_enum dtyp,
                           int mb, int nb, int bsiz, int lm, int ln,
                           int i, int j, int m, int n, int p, int q,
                           void* (*get_blkaddr)( const MORSE_desc_t*, int, int),
                           int (*get_blkldd)( const MORSE_desc_t*, int ),
                           int (*get_rankof)( const MORSE_desc_t*, int, int ));

Create matrix descriptor for tiled matrix which may not fit memory.

int MORSE_Desc_Create_OOC(MORSE_desc_t **descptr, MORSE_enum dtyp, int mb, int nb, int bsiz,
                          int lm, int ln, int i, int j, int m, int n, int p, int q);

User’s function version of MORSE_Desc_Create_OOC.

int MORSE_Desc_Create_OOC_User(MORSE_desc_t **descptr, MORSE_enum dtyp, int mb, int nb, int bsiz,
                               int lm, int ln, int i, int j, int m, int n, int p, int q,
                               int (*get_rankof)( const MORSE_desc_t*, int, int ));

Destroys matrix descriptor.

int MORSE_Desc_Destroy (MORSE_desc_t **desc);

Ensures that all data of the descriptor are up-to-date.

int MORSE_Desc_Acquire (MORSE_desc_t  *desc);

Release the data of the descriptor acquired by the application. Should be called if MORSE_Desc_Acquire has been called on the descriptor and you no longer need to access its data.

int MORSE_Desc_Release (MORSE_desc_t  *desc);

Ensure that all data are up-to-date in main memory (even if some tasks have been processed on GPUs).

int MORSE_Desc_Flush(MORSE_desc_t  *desc, MORSE_sequence_t *sequence);

Set the sizes for the MPI tags. Default value: tag_width=31, tag_sep=24, meaning that the MPI tag is stored in 31 bits, with 24 bits for the tile tag and 7 for the descriptor. This function must be called before any descriptor creation.

void MORSE_user_tag_size(int user_tag_width, int user_tag_sep);

Sequences routines

Create a sequence.

int MORSE_Sequence_Create  (MORSE_sequence_t **sequence);

Destroy a sequence.

int MORSE_Sequence_Destroy (MORSE_sequence_t *sequence);

Wait for the completion of a sequence.

int MORSE_Sequence_Wait    (MORSE_sequence_t *sequence);

Terminate a sequence.

int MORSE_Sequence_Flush(MORSE_sequence_t *sequence, MORSE_request_t *request)
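
For example, the sequence routines combine with the Async interface from Step4 like this (a sketch assuming descA is an initialized descriptor):

MORSE_sequence_t *sequence = NULL;
MORSE_request_t request = MORSE_REQUEST_INITIALIZER;

MORSE_Sequence_Create(&sequence);
MORSE_dpotrf_Tile_Async(MorseUpper, descA, sequence, &request);
MORSE_Sequence_Wait(sequence);    /* barrier: all submitted tasks have completed */
MORSE_Sequence_Destroy(sequence);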