I am writing an MPI program in which the first instance acts as a master that sends messages to its workers and receives results from them.
The receive function does something like this:
struct result *check_for_message(void) {
    ...
    static unsigned int message_size;
    static char *buffer;
    static bool started_reception = false;
    static MPI_Request req;
    if (!started_reception) {
        MPI_Irecv(&message_size, 1, MPI_INT, MPI_ANY_SOURCE, SIZE_TAG,
                  MPI_COMM_WORLD, &req);
        started_reception = true;
    } else {
        int flag = 0;
        MPI_Status status;
        MPI_Test(&req, &flag, &status);
        if (flag == 1) {
            started_reception = false;
            buffer = calloc(message_size + 1, sizeof(char));
            DIE_IF_NULL(buffer); // printf + MPI_Finalize + exit
            MPI_Request content_req;
            MPI_Irecv(buffer, MAX_MSG_SIZE, MPI_CHAR, status.MPI_SOURCE, CONTENT_TAG,
                      MPI_COMM_WORLD, &content_req);
            MPI_Wait(&content_req, MPI_STATUS_IGNORE);
            ret = process_request(buffer);
            free(buffer);
        }
    }
    ...
}
The send function does something like this:
MPI_Request size_req;
MPI_Request content_req;
MPI_Isend(&size, 1, MPI_INT, dest, SIZE_TAG, MPI_COMM_WORLD, &size_req);
MPI_Wait(&size_req, MPI_STATUS_IGNORE);
MPI_Isend(buf, size, MPI_CHAR, dest, CONTENT_TAG, MPI_COMM_WORLD,
          &content_req);
MPI_Wait(&content_req, MPI_STATUS_IGNORE);
I noticed that if I remove the MPI_Wait from the sending function, execution often hangs, or a signal kills one of the instances (I can check the output again, but I think it was a SIGSEGV related to free).
With the MPI_Wait in place, it always seems to run perfectly. Could this be related to the order in which the two sends complete? Aren't they supposed to be delivered in order?
I run the program locally with -n 16, but I have also tested with -n 128. About 90% of the messages I send are longer than 50 chars, and some exceed 300 chars.
question from:
https://stackoverflow.com/questions/65865257/mpi-i-send-mpi-irecv-issue