
Interpreting the output of mpisee-through


Welcome to the mpisee wiki!

This example shows how to interpret the output of mpisee-through.

We profile a 2D stencil communication pattern for which we show two implementations.

Essential mpisee-through commands

  • The default query displays all data in each communicator by summarizing across ranks:
    mpisee-through.py -i /path/to/mpisee_profile.db
        
  • Display all data by separating ranks:
    mpisee-through.py -i /path/to/mpisee_profile.db -a
        
  • Display data for collective MPI operations only:
    mpisee-through.py -i /path/to/mpisee_profile.db -c
        
  • Display data for point-to-point MPI operations only:
    mpisee-through.py -i /path/to/mpisee_profile.db -p
        
  • The following switches can be combined with any of the above options (a combined example is shown after this list):
    • Display data for specific MPI ranks, e.g., ranks 0 and 12: -r 0,12.
    • Display data for a specific buffer range, e.g., 0-1024: -b 0:1024.
    • Display data for MPI operations within a specific time range, e.g., from 0.02 to 0.8 seconds: -t 0.02:0.8.
  • Use the -h switch for a complete list of options.
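For example, the filtering switches can be combined with a point-to-point query. Assuming the same placeholder database path as above, the following invocation would display only the point-to-point operations issued by ranks 0 and 12 between 0.02 and 0.8 seconds:

    mpisee-through.py -i /path/to/mpisee_profile.db -p -r 0,12 -t 0.02:0.8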

Example analysis using a 2D stencil communication pattern

Create a Cartesian communicator with six MPI processes arranged as follows:

3 4 5
0 1 2

Each process sends five integers to each of its neighbors. We use two implementations to perform this communication: MPI_Neighbor_alltoallv and MPI_Sendrecv.

MPI_Neighbor_alltoallv implementation

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define DIMS 2

int main(int argc, char* argv[])
{
    int rank, size, i, j, counts;
    int dims[DIMS], periods[DIMS], reorder = 1;
    int coords[DIMS];
    MPI_Comm comm_2d;
    int ndims;
    int src, dst;
    int to_print = 0; /* rank whose received values are printed (default 0) */
    if (argc > 1)
        to_print = atoi(argv[1]);

    // MPI Initialization
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (size != 6) {
        printf("This application is meant to be run with 6 MPI processes.\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
        return -1;
    }
    dims[0] = 2; // y Dimension
    dims[1] = 3; // x Dimension
    periods[0] = periods[1] = 0;

    // Create the Cartesian topology
    MPI_Cart_create(MPI_COMM_WORLD, DIMS, dims, periods, reorder, &comm_2d);
    MPI_Comm_rank(comm_2d, &rank);
    MPI_Cart_coords(comm_2d, rank, DIMS, coords);
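    /* PMPI_Cartdim_get is the name-shifted (profiling) entry point of MPI_Cartdim_get;
       it queries the number of Cartesian dimensions while bypassing PMPI-based tools
       such as mpisee. */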
    PMPI_Cartdim_get(comm_2d, &ndims);
    counts = 5;
    int sendcounts[2*ndims];
    int sendbuf[2*ndims*counts];
    int recvbuf[2*ndims*counts];
    int recvcounts[2*ndims];
    int sdispls[2*ndims];
    int rdispls[2*ndims];
    for (i = 0; i < 2*ndims; ++i) {
        /* block i of the send/receive buffers corresponds to the i-th Cartesian neighbor */
        for (j = 0; j < counts; ++j) {
            sendbuf[i*counts+j] = rank + i*10 + j;
        }
        for (j = 0; j < counts; ++j) {
            recvbuf[i*counts+j] = -1;
        }
        sendcounts[i] = counts;
        recvcounts[i] = counts;
        sdispls[i] = i*counts;
        rdispls[i] = i*counts;
    }


    /* For a Cartesian communicator, the buffer blocks are ordered per dimension:
       first the negative-direction neighbor, then the positive-direction one. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_INT, recvbuf, recvcounts, rdispls, MPI_INT, comm_2d);

    if (rank == to_print) {
        for (i = 0; i <ndims; ++i) {
            MPI_Cart_shift(comm_2d, i, 1, &src, &dst);
            for (j = 0; j < recvcounts[DIMS * i]; ++j) {
                if (src != MPI_PROC_NULL) {
                    printf("Rank %d Neighbor %d %d\n", rank, src, recvbuf[rdispls[DIMS * i] + j]);
                }
            }
            MPI_Cart_shift(comm_2d, i, -1, &src, &dst);
            for (j = 0; j < recvcounts[DIMS * i+1]; ++j) {
                if (src != MPI_PROC_NULL) {
                    printf("Rank %d Neighbor %d %d\n", rank, src, recvbuf[rdispls[DIMS * i+1] + j]);
                }
            }
        }
    }


    MPI_Comm_free(&comm_2d);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
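To generate the profile database, compile and run the program with 6 processes. The commands below are only a sketch: mpisee instruments the run through the PMPI interface, and the preload mechanism and library path shown here are assumptions (consult the mpisee documentation for the exact setup); stencil_neighbor.c is a placeholder file name, and the trailing 4 is the rank whose received values are printed.

    mpicc -o stencil_neighbor stencil_neighbor.c
    LD_PRELOAD=/path/to/libmpisee.so mpirun -np 6 ./stencil_neighbor 4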

Default query for analysis

To analyze the profile of the above code, run mpisee-through.py -i /path/to/mpisee_profile.db, which outputs the following:

Comm Name   Processes           Comm Size   MPI Operation       Min Buffer  Max Buffer  Calls       Max Time(s)  Avg Time(s)  Total Volume(Bytes)
a0.1        [0,1,...,5]         6           Neighbor_alltoallv  0           128         1           0.001509     0.001500     280

The above output shows the communicator named a0.1, which is the Cartesian communicator of size 6. The processes participating in this communicator have MPI ranks 0-5, and they perform one MPI_Neighbor_alltoallv with buffer sizes between 0 and 128 bytes. This query accumulates the bytes sent by all processes in the last column (Total Volume); in total, the processes send 280 bytes. Finally, we can see the maximum and average time these processes take to perform the collective.
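For reference, assuming a 4-byte MPI_INT, the 280 bytes break down as follows:

    4 corner ranks (0, 2, 3, 5): 2 neighbors * 5 ints * 4 bytes = 40 bytes each, i.e., 160 bytes
    2 middle ranks (1, 4):       3 neighbors * 5 ints * 4 bytes = 60 bytes each, i.e., 120 bytes
    Total:                       160 + 120 = 280 bytes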

Per-process analysis

Running mpisee-through.py -i /path/to/mpisee_profile.db -a outputs the following:

Comm Name   Processes           Comm Size   Rank      MPI Operation       Min Buffer  Max Buffer  Calls       Time(s)      Volume(Bytes)  
a0.1        [0,1,...,5]         6           5         Neighbor_alltoallv  0           128         1           0.001509     40
a0.1        [0,1,...,5]         6           4         Neighbor_alltoallv  0           128         1           0.001506     60
a0.1        [0,1,...,5]         6           2         Neighbor_alltoallv  0           128         1           0.001505     40
a0.1        [0,1,...,5]         6           1         Neighbor_alltoallv  0           128         1           0.001502     60
a0.1        [0,1,...,5]         6           0         Neighbor_alltoallv  0           128         1           0.001496     40
a0.1        [0,1,...,5]         6           3         Neighbor_alltoallv  0           128         1           0.001483     40

This output shows the bytes sent by each process when calling MPI_Neighbor_alltoallv. Processes on the corners of the grid (MPI ranks 0, 2, 3, 5) have only two neighbors and send 5 integers to each of them: 2*5*4 = 40 bytes. Processes 1 and 4 have three neighbors and therefore send 3*5*4 = 60 bytes. The sum of the Volume column is 280, which equals the Total Volume in the previous output.

Analysis of specific MPI ranks

For example, to analyze the data of ranks 1 and 4, use -a -r 1,4:

Comm Name   Processes           Comm Size   Rank      MPI Operation       Min Buffer  Max Buffer  Calls       Time(s)      Volume(Bytes)  
a0.1        [0,1,...,5]         6           4         Neighbor_alltoallv  0           128         1           0.001506     60
a0.1        [0,1,...,5]         6           1         Neighbor_alltoallv  0           128         1           0.001502     60

MPI_Sendrecv implementation

Replace the call to MPI_Neighbor_alltoallv with the following:

for (i = 0; i < ndims; ++i) {
    MPI_Cart_shift(comm_2d, i, 1, &src, &dst);
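    /* Send the block destined for the positive-direction neighbor (dst) while
       receiving the block that arrives from the negative-direction neighbor (src);
       the second MPI_Cart_shift/MPI_Sendrecv pair below does the reverse. */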
    MPI_Sendrecv(&sendbuf[sdispls[DIMS * i+1]], sendcounts[DIMS * i+1], MPI_INT, dst, 0,
                 &recvbuf[rdispls[DIMS * i]], recvcounts[DIMS * i], MPI_INT, src, 0,
                 comm_2d, MPI_STATUS_IGNORE);
    MPI_Cart_shift(comm_2d, i, -1, &src, &dst);
    MPI_Sendrecv(&sendbuf[sdispls[DIMS * i]], sendcounts[DIMS * i], MPI_INT, dst, 0,
                 &recvbuf[rdispls[DIMS * i+1]], recvcounts[DIMS * i+1], MPI_INT, src, 0,
                 comm_2d, MPI_STATUS_IGNORE);

}

Default query for analysis

The default query for the MPI_Sendrecv version outputs the following:

Comm Name   Processes           Comm Size   MPI Operation       Min Buffer  Max Buffer  Calls       Max Time(s)  Avg Time(s)  Total Volume(Bytes)
a0.1        [0,1,...,5]         6           Sendrecv            0           128         12          0.000098     0.000088     280

Compared to the output of the MPI_Neighbor_alltoallv version, we now have 12 calls, since every process calls MPI_Sendrecv twice. Notice that for MPI_Neighbor_alltoallv the calls were not accumulated across processes because it is a collective operation. The total volume is the same.

Per-process analysis

Adding the -a switch shows the per-process breakdown:

Comm Name   Processes           Comm Size   Rank      MPI Operation       Min Buffer  Max Buffer  Calls       Time(s)      Volume(Bytes)  
a0.1        [0,1,...,5]         6           0         Sendrecv            0           128         2           0.000098     40
a0.1        [0,1,...,5]         6           2         Sendrecv            0           128         2           0.000095     40
a0.1        [0,1,...,5]         6           3         Sendrecv            0           128         2           0.000088     40
a0.1        [0,1,...,5]         6           4         Sendrecv            0           128         2           0.000086     60
a0.1        [0,1,...,5]         6           5         Sendrecv            0           128         2           0.000084     40
a0.1        [0,1,...,5]         6           1         Sendrecv            0           128         2           0.000080     60

We now see that each process calls MPI_Sendrecv twice, together with the corresponding volume. Notice that the time spent in MPI_Sendrecv is lower than the time spent in MPI_Neighbor_alltoallv.
