AMS Monitoring and Benchmark #94

lpottier · 2025-02-21T20:33:58Z

This PR adds monitoring capabilities to AMS, especially for the RabbitMQ case:

On the AMSlib side: detailed monitoring data can now be output at the message level when using RabbitMQ
On the Python side: this PR extends the existing monitoring design to add fine-grained reporting per RabbitMQ message

This PR also fixes #83 and harmonizes the names of the RabbitMQ fields across the stack:

rabbitmq-outbound-queue -> rabbitmq-queue-physics
rabbitmq-exchange -> rabbitmq-exchange-training
rabbitmq-routing-key -> rabbitmq-key-training
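
For existing configuration files, the renaming above could be applied with a small migration helper. This is only a sketch: `migrate_rmq_config` and `RENAMED_KEYS` are hypothetical names, and it assumes the three fields live in a flat dictionary, which may not match the actual layout of the AMS JSON config.

```python
# Hypothetical helper, not part of AMS: rename legacy RabbitMQ config keys.
RENAMED_KEYS = {
    "rabbitmq-outbound-queue": "rabbitmq-queue-physics",
    "rabbitmq-exchange": "rabbitmq-exchange-training",
    "rabbitmq-routing-key": "rabbitmq-key-training",
}


def migrate_rmq_config(config: dict) -> dict:
    """Return a copy of `config` with legacy keys renamed; other keys are kept."""
    migrated = {}
    for key, value in config.items():
        new_key = RENAMED_KEYS.get(key, key)
        if new_key in migrated:
            # Both the old and the new name are present: refuse to guess.
            raise ValueError(f"duplicate entry for {new_key!r}")
        migrated[new_key] = value
    return migrated
```

Raising on a duplicate rather than silently picking one value keeps a half-migrated config from going unnoticed.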

Various optimizations:

The last 2 bytes of the AMSMessage header, previously used for padding, now encode the message ID (useful for end-to-end monitoring)
Moved from std::vector to std::list for the buffer of AMSMessage, since std::list incurs fewer move operations when deleting messages

Signed-off-by: Loic Pottier <[email protected]>


koparasy · 2025-02-21T20:47:35Z

src/AMSWorkflow/ams/rmq.py

@@ -56,9 +60,9 @@ def header_format(self) -> str:
        - 4 bytes are the number of elements in the message. Limit max: 2^32 - 1
        - 2 bytes are the input dimension. Limit max: 65535
        - 2 bytes are the output dimension. Limit max: 65535
-        - 2 bytes are for aligning memory to 8
+        - 2 bytes are the message ID given by AMSlib (local to each MPI rank). Limit max: 65535


What happens if the application sends more than the max messages?

Valid question. I should add a failsafe for that case.
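
One possible shape for such a failsafe, sketched in Python for illustration only (the actual counter lives in AMSlib, which is C++): wrap the ID back to 0 once it reaches the 2-byte limit and count the wraps, so that a consumer can at least detect reuse. `MessageIdCounter` is a hypothetical name, not something in the PR.

```python
MAX_MESSAGE_ID = 0xFFFF  # largest value that fits in the 2-byte header field


class MessageIdCounter:
    """Hypothetical per-rank counter; not the AMSlib implementation."""

    def __init__(self) -> None:
        self._next = 0
        self.wraps = 0  # number of times the ID wrapped back to 0

    def next_id(self) -> int:
        """Return the ID for the next message, wrapping after MAX_MESSAGE_ID."""
        current = self._next
        if current == MAX_MESSAGE_ID:
            self._next = 0
            self.wraps += 1
        else:
            self._next = current + 1
        return current
```

With wrap-around, IDs are unique only within a window of 65536 messages per rank, so any end-to-end matching would have to combine the ID with something else (for example a timestamp).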

koparasy · 2025-02-21T20:58:24Z

src/AMSlib/wf/rmqdb.cpp

+  current_offset += sizeof(uint16_t);
+  // Message ID (should be 2 bytes)
+  uint16_t new_message_id;
+  std::memcpy(&new_message_id, data_blob + current_offset, sizeof(uint16_t));


We should avoid memcpy and use it only if we need to (e.g. we get a blob of data). A couple of lines above we do this:

uint16_t new_domain_size = (reinterpret_cast<uint16_t*>(data_blob + current_offset))[0];

Good catch. I will fix that.
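
For comparison, on the Python side a field can be read at an offset without an intermediate copy using `struct.unpack_from`. The sketch below covers only the four trailing header fields quoted in the `header_format` docstring; the function name, the byte order, and the offset of these fields within the full header are assumptions.

```python
import struct

# Trailing header fields from the docstring: 4-byte element count,
# 2-byte input dim, 2-byte output dim, 2-byte message ID.
# "=" (native byte order, standard sizes, no padding) is an assumption.
TAIL_FORMAT = "=IHHH"
TAIL_SIZE = struct.calcsize(TAIL_FORMAT)  # 10 bytes


def parse_header_tail(blob: bytes, offset: int = 0) -> dict:
    """Decode the four trailing header fields starting at `offset`."""
    num_elements, input_dim, output_dim, message_id = struct.unpack_from(
        TAIL_FORMAT, blob, offset
    )
    return {
        "num_elements": num_elements,
        "input_dim": input_dim,
        "output_dim": output_dim,
        "message_id": message_id,
    }
```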

koparasy

@lpottier It is hard for me to understand what exactly we are trying to do here. As far as I understood the code we are trying to:

Have a performance benchmark. This follows the design of the example code. Correct?
You extend rmq to monitor performance. You monitor performance by attaching a timestamp and outputting it to a JSON file?
You extend rmq with additional Caliper capabilities. Are these matched with the JSON monitoring?
There is some locking/unlocking of mutexes. What is the reasoning behind this?
Some extensions modify the msg header. Is this necessary?

lpottier · 2025-02-21T23:04:43Z

@koparasy Yes, the PR is a bit all over the place. You understood correctly.

Yes, it does follow the example code with some extra options
Yes
I added some Caliper monitoring on the AMSlib side to take into account the time spent restarting the broker (if that happens)
The reasoning was that both publisher and consumer will output their data to the same JSON (one JSON per MPI rank), but I have not implemented the RabbitMQ monitoring on the consumer as we do not need it right now. The locks are useless now but will become useful later
It is necessary if we want to monitor end-to-end, but based on our discussion earlier this week we will not need it, so I can revert the code back to the original AMSMessage design.
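
The locking rationale in the fourth answer (publisher and consumer writing into one per-rank JSON record) can be sketched as follows. This is a Python illustration only, since the AMSlib code is C++; `SharedMonitorLog` and the `publisher`/`consumer` keys are hypothetical.

```python
import json
import threading


class SharedMonitorLog:
    """Hypothetical per-rank monitoring record shared by two threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records = {"publisher": [], "consumer": []}

    def record(self, role: str, entry: dict) -> None:
        """Append one monitoring entry for `role` under the lock."""
        with self._lock:
            self._records[role].append(entry)

    def to_json(self) -> str:
        """Serialize a consistent snapshot of all entries."""
        with self._lock:
            return json.dumps(self._records)
```

As long as only the publisher writes, the lock is never contended, which matches the remark that the locks are not needed yet.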

koparasy · 2025-02-24T16:06:54Z

src/AMSWorkflow/ams/stage.py

+        # FIXME: temporary solution to kill properly the stager when using srun
+        self.write_pid()
+
+    def write_pid(self):
+        """
+        Write the PID of the current process in
+        a file. Append it to a file if the file
+        exists (multiple stagers could be running).
+
+        This is useful to kill the stager.
+
+        FIXME: this solution is not very clean or elegant
+        and can be improved. The simulations side could send a message
+        on a specific queue to signify that no more data will arrive for example.
+        """
+
+        with open("ams-stagers.pid", 'a') as f:
+            f.write(f" {os.getpid()}")
+


I don't understand this code. Do you need the pid to do kill -SIG <signal> PID ?

Unfortunately, yes. I need the PID of the stager to send a signal when scheduling tasks with Slurm. That "fix" is only needed when using srun to start the stager, because srun wraps its target process and creates another process with another PID. If you kill the srun PID, the signal is not captured properly. I have to write the internal PID somewhere if we want to exit cleanly.
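
The consuming side of that PID file could look like the sketch below. The file name and the space-separated format come from `write_pid()` above; the helper names and the choice of `SIGINT` are assumptions, since the thread does not say which signal the stager handles.

```python
import os
import signal


def read_stager_pids(path: str = "ams-stagers.pid") -> list[int]:
    """Parse the whitespace-separated PIDs appended by write_pid()."""
    with open(path) as f:
        return [int(token) for token in f.read().split()]


def signal_stagers(path: str = "ams-stagers.pid", sig: int = signal.SIGINT) -> list[int]:
    """Send `sig` to every recorded PID; return the PIDs that were signalled."""
    signalled = []
    for pid in read_stager_pids(path):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            continue  # the process has already exited
        signalled.append(pid)
    return signalled
```

Because the file is opened in append mode and never cleaned up, stale PIDs from earlier runs may be present; the `ProcessLookupError` branch skips PIDs that no longer exist, but it cannot detect a PID that has since been reused by an unrelated process.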

koparasy · 2025-02-24T16:10:08Z

@lpottier I am trying to come up with a plan here to see what we need, what we can merge in, and what needs modification:

I think commit 7aef9b279ec1f1c08664859343ab0f89d195c3af is standalone and clean. So we can merge this as is? Do you have any objections?

koparasy · 2025-02-24T16:13:25Z

I would like to split this commit into 2 commits:

The changes that fix the naming.
The code that adds Caliper calls. You are adding some calls in other commits as well. Try to add all of the necessary calls in this commit. Effectively, from what I saw, you are bounding the time of storing a single invocation.

koparasy · 2025-02-24T16:19:38Z

Add the changes to pyproject.toml in a separate commit and make a PR; I will merge it ASAP.

koparasy · 2025-02-24T16:27:22Z

Next, I would like a commit that picks the python changes of the AMSMonitor (the ones that track the time required to write a message etc). This I believe requires the changes from:

koparasy · 2025-02-24T16:31:59Z

I think you made some additional changes in RMQ that are orthogonal to our monitoring. Is this correct? If so, we should probably make a separate PR for them as well.

koparasy · 2025-02-24T16:34:29Z

After these changes we will still need to address the monitoring and the benchmark. For monitoring, after the discussion we had with Timo, we do not need something so complicated. We will likely need to simplify.

The benchmark is likely to be correct as is. We may want to move it under tests. But let's leave that as a last step.

lpottier · 2025-02-25T23:44:13Z

@koparasy Okay sounds good, I will make the requested separate PRs.

I think commit 7aef9b279ec1f1c08664859343ab0f89d195c3af is standalone and clean. So we can merge this as is? Do you have any objections?

Yes this commit is standalone.

koparasy · 2025-02-25T23:45:15Z

Make it a separate PR and ping me.

lpottier · 2025-03-11T17:51:51Z

We can close this PR as we merged all the changes we wanted in different PRs.

lpottier added 8 commits February 21, 2025 12:07

Added infrastructure to build benchmarks

8eba33f

Signed-off-by: Loic Pottier <[email protected]>

Added some Caliper calls + fix #83

34d9c89

Changed several JSON fields in RMQ config:
- rabbitmq-outbound-queue -> rabbitmq-queue-physics
- rabbitmq-exchange -> rabbitmq-exchange-training
- rabbitmq-routing-key -> rabbitmq-key-training

Signed-off-by: Loic Pottier <[email protected]>

Added monitoring capabilities to AMSlib

06e7559

Signed-off-by: Loic Pottier <[email protected]>

Added various options to the python side

11d141a

Signed-off-by: Loic Pottier <[email protected]>

Added capabilities to AMSMonitor to handle list of metrics to track

195e6ba

Signed-off-by: Loic Pottier <[email protected]>

Added signal manager for Pipeline so we can exit in a cleaner manner

7aef9b2

Signed-off-by: Loic Pottier <[email protected]>

Added monitoring for the RMQ stage of the pipeline

e965aa3

Signed-off-by: Loic Pottier <[email protected]>

We now use the 2 bytes for padding in AMSMessage for sending the message ID from AMSlib

540fa13

Signed-off-by: Loic Pottier <[email protected]>

lpottier requested a review from koparasy February 21, 2025 20:34

lpottier changed the title ~~Benchmark DB~~ AMS Monitoring and Benchmark Feb 21, 2025

koparasy reviewed Feb 21, 2025

koparasy reviewed Feb 24, 2025

lpottier closed this Mar 11, 2025
