Integrating tracer into object lifecycles #403

Open: 8 commits to be merged into base branch staging

Conversation

aryanjassal
Member

Description

To gain observability into Polykey, we are integrating a tracer system similar to OpenTracing. This is currently just an experiment. The visualiser and the integration are both being tracked in this PR. As we are currently monkey-patching to get the spans integrated, that code cannot be shown in the PR. I might post comments here if I need feedback.

Issues Fixed

Tasks

  • 1. Get tracer patched into polykey as an experiment
  • 2. Add web-based visualiser

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@aryanjassal
Member Author

aryanjassal commented May 5, 2025

Interestingly, a lot of tests are failing. That shouldn't happen, as the patch should not interfere with how the code actually functions. This probably flew under the radar because I was focusing on getting the spans working rather than running the full test suite after each change. I'll have to dig into this.

On the other hand, I've made an index.html file as a starting point for your visualiser work. A span.jsonl file exists and will be updated on any significant format change. Make sure to commit whenever you finish a new feature. You can post progress reports or other comments in this PR itself. @Abby010

@aryanjassal aryanjassal force-pushed the feature-tracing-integration branch from 03ddbf3 to 7e3b0f0 Compare May 7, 2025 00:37
@CMCDragonkai
Member

CMCDragonkai commented May 7, 2025 via email

@aryanjassal
Member Author

I did some work on the visualiser, and this is what we came up with. Spans are coloured by whether they have a corresponding end event: the red lines have no end span event, and the blue ones do. I do notice something pretty interesting here: the destroy method on CreateDestroyStartStop classes is never called. I checked this in the logs, and they show, for example, the TaskManager only stopping and not being destroyed, even though it is marked with CreateDestroyStartStop and implements a destroy() method.

INFO:polykey.PolykeyAgent:Stopping TaskManager
INFO:polykey.PolykeyAgent:Stopping Processing
INFO:polykey.PolykeyAgent:Stopped Processing
INFO:polykey.PolykeyAgent:Stopping Tasks
INFO:polykey.PolykeyAgent:Stopped Tasks
INFO:polykey.PolykeyAgent:Stopped TaskManager
INFO:polykey.PolykeyAgent.Audit:Stopping Audit
INFO:polykey.PolykeyAgent.Audit:Stopped Audit
INFO:polykey.PolykeyAgent.DB:Stopping DB
INFO:polykey.PolykeyAgent.DB:Stopped DB
INFO:polykey.PolykeyAgent.KeyRing:Stopping KeyRing
INFO:polykey.PolykeyAgent.KeyRing:Stopped KeyRing
INFO:polykey.PolykeyAgent.Schema:Stopping Schema
INFO:polykey.PolykeyAgent.Schema:Stopped Schema
INFO:polykey.PolykeyAgent.WorkerManager:Destroying WorkerManager
INFO:polykey.PolykeyAgent.WorkerManager:Destroyed WorkerManager
INFO:polykey.PolykeyAgent.Status:Stopping Status
INFO:polykey.PolykeyAgent.Status:Writing Status to /home/aryanj/.local/share/polykey/status.json
INFO:polykey.PolykeyAgent.Status:Status is DEAD
INFO:polykey.PolykeyAgent:Stopped PolykeyAgent

image
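
The pairing check behind that colouring can be sketched roughly as follows. This is only an illustration: the `SpanEvent` shape (with `type`, `id`, and `name` fields) and both function names are assumptions, since the actual span.jsonl format is not shown in this PR.

```typescript
// Hypothetical event shape; the real span.jsonl fields may differ.
type SpanEvent = {
  type: "start" | "end";
  id: string;
  name: string;
};

// Returns the ids of spans that have a start event but no matching end event.
function findUnclosedSpans(events: SpanEvent[]): Set<string> {
  const open = new Set<string>();
  for (const event of events) {
    if (event.type === "start") {
      open.add(event.id);
    } else {
      open.delete(event.id);
    }
  }
  return open;
}

// Mirrors the colouring described above: red if unclosed, blue otherwise.
function spanColour(id: string, unclosed: Set<string>): "red" | "blue" {
  return unclosed.has(id) ? "red" : "blue";
}
```

With this approach, a span for a create call whose destroy never runs would stay in the unclosed set and be drawn in red.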

Another thing to note is that js-id cannot be used here, as it uses Node-specific libraries like crypto or perf. Thus, to extract the timestamp from the id, I had to implement the functions manually.
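
A manual extraction that runs in a browser could look something like the sketch below. The bit layout is an assumption, not something confirmed in this thread: it assumes the first 48 bits of the id hold a fixed-point Unix timestamp, with 36 bits of whole seconds followed by 12 bits of fractional seconds. The function name is hypothetical, and the layout should be checked against js-id before relying on it.

```typescript
// ASSUMED layout: bits 0-35 are Unix seconds, bits 36-47 are a 12-bit fraction.
// Returns the timestamp in seconds, including the fractional part.
function extractTimestampSeconds(id: Uint8Array): number {
  if (id.byteLength < 6) {
    throw new RangeError("id must be at least 6 bytes long");
  }
  // DataView reads big-endian by default, which matches a sortable byte order.
  const view = new DataView(id.buffer, id.byteOffset, id.byteLength);
  // 48 bits fit safely in a double, so plain number arithmetic is enough.
  const high48 = view.getUint16(0) * 2 ** 32 + view.getUint32(2);
  const seconds = Math.floor(high48 / 4096); // drop the low 12 bits
  const fraction = (high48 % 4096) / 4096; // low 12 bits as a fraction
  return seconds + fraction;
}
```

It uses only `Uint8Array` and `DataView`, so it needs no Node-specific modules.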

Member Author

Actually, after checking the Polykey source code, I can see that the destroy methods are called only when the Polykey agent is destroyed. In this case, how should I handle the create/destroy events in CreateDestroyStartStop when an object might not be destroyed in that run? Should I leave them in as-is, or not track the create/destroy events of CreateDestroyStartStop at all?

@CMCDragonkai
Member

Another thing to note is that js-id cannot be used here as it uses node-specific libraries like crypto or perf. Thus, to extract the timestamp from the id, I had to manually implement the functions.

There are browser analogues to crypto and perf. Now that we have ESM, we can do things like import adapters that switch implementations.

@CMCDragonkai
Member

Actually, after checking the Polykey source code, I can see that the destroy methods are called only when Polykey agent is destroyed. In this case, how should I handle the create/destroy events in CreateDestroyStartStop when it might not be destroyed in that run? Should I leave them in as-is or not track the create/destroy events of CreateDestroyStartStop?

Tracing should live beyond the agent itself. It exists both before and after the agent object.

Also, I want you to try vertical scrolling, not horizontal scrolling.

Plus I want you to create forking and merging lines.

@CMCDragonkai
Member

Also try projecting into logical/vector time rather than real time.
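
One simple reading of "logical time" is to replace each wall-clock timestamp with an ordinal tick, so that long idle gaps collapse and only the ordering of events remains. The sketch below does just that. It is not a vector clock, which would need per-process causality information that this thread does not describe. The `TimedEvent` shape and the function name are hypothetical.

```typescript
// Hypothetical event shape: a unique event id plus a real-time timestamp.
type TimedEvent = { id: string; timestamp: number };

// Maps each event id to an ordinal tick: the earliest timestamp gets 0,
// the next distinct timestamp gets 1, and so on. Events that share a
// timestamp share a tick, so simultaneous events stay side by side.
function projectToLogicalTime(events: TimedEvent[]): Map<string, number> {
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  const ticks = new Map<string, number>();
  let tick = -1;
  let previous: number | undefined;
  for (const event of sorted) {
    if (previous === undefined || event.timestamp !== previous) {
      tick += 1;
      previous = event.timestamp;
    }
    ticks.set(event.id, tick);
  }
  return ticks;
}
```

Plotting against these ticks instead of timestamps gives every distinct moment the same amount of space on the axis, whatever the real gap between events.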

Member Author

Over a 30-second run, the tracer system emitted around 30k lines of events, totalling around 700 KiB. We probably need to design a more efficient format to store the data in, most likely a binary one, as the data already lends itself to that. Most of the fields can easily be represented in binary or originate as binary arrays. For example, the idSortable is a binary ID format, and the type can be encoded in a single byte. The span name is the only field that cannot be compressed much further.

Moreover, I have encountered some interesting issues with the span emission. The parents are often malformed. For example, some QuicStreams and DBTransactions were recorded as being emitted directly by PolykeyAgent, when in reality they are emitted by, say, QUICClient or NodeManager or something else. I need to investigate why parent resolution is still unreliable and fix it. I need both quality data and a quality visualiser to work with the data properly.

@CMCDragonkai
Member

You can keep it as text for now to keep it simple. Log rotation can be used if it hits 1 MiB. Later, though, there is a more compressed binary JSON format we could use. In js-db we looked into using something like that before.
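
The rotation decision itself can be kept separate from any file handling. The sketch below tracks only byte counts and tells the caller which file index a line belongs in. The `RotationPolicy` class is hypothetical, and the 1 MiB default simply mirrors the threshold mentioned above.

```typescript
// Decides when to move to a new log file, based purely on byte counts.
class RotationPolicy {
  private readonly maxBytes: number;
  private bytesInCurrentFile = 0;
  private fileIndex = 0;

  constructor(maxBytes: number = 1024 * 1024) {
    this.maxBytes = maxBytes;
  }

  // Records a line of `lineBytes` bytes and returns the index of the file
  // it should be written to. A line that would push the current file past
  // the limit starts a new file instead (unless the file is still empty).
  place(lineBytes: number): number {
    const wouldOverflow = this.bytesInCurrentFile + lineBytes > this.maxBytes;
    if (this.bytesInCurrentFile > 0 && wouldOverflow) {
      this.fileIndex += 1;
      this.bytesInCurrentFile = 0;
    }
    this.bytesInCurrentFile += lineBytes;
    return this.fileIndex;
  }
}
```

The caller would measure each serialised line in bytes (for example with `TextEncoder` or, in Node, `Buffer.byteLength`) and open a new file whenever the returned index changes.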

@CMCDragonkai
Member

CMCDragonkai commented May 15, 2025

And it was not bson!!

@tegefaulkes
Contributor

We can keep it JSON for now, but simplifying the keys and deduplicating can help a lot here.

Use an enum for the keys: rather than repeating each key ("id", "start", "name", etc.), the enum maps them to 1, 2, 3. For deduplication, rather than repeating the ID over and over again, include a line that maps the ID and name to a number, and then reference this number whenever you need to know what it is.

You'll cut out 80% of the size doing this without the need for a binary format.
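
A rough sketch of both ideas together is shown below. The key codes, the `SpanEvent` shape, and the line layout are all assumptions made for illustration, and other fields such as timestamps or parents are left out. Each span gets one definition line carrying its long id and name, and every event afterwards refers to it by a small number.

```typescript
// Hypothetical verbose event shape; the real span.jsonl fields may differ.
type SpanEvent = { type: "start" | "end"; id: string; name: string };

// Short numeric codes standing in for the repeated key names.
const KEY = { ref: "1", id: "2", name: "3", type: "4" } as const;
// The event type also shrinks to a single digit.
const TYPE_CODE = { start: 0, end: 1 } as const;

// Converts verbose events into compact JSON lines.
function compactEvents(events: SpanEvent[]): string[] {
  const refs = new Map<string, number>();
  const lines: string[] = [];
  for (const event of events) {
    let ref = refs.get(event.id);
    if (ref === undefined) {
      ref = refs.size;
      refs.set(event.id, ref);
      // Definition line: emitted once per span, carries the long strings.
      lines.push(
        JSON.stringify({
          [KEY.ref]: ref,
          [KEY.id]: event.id,
          [KEY.name]: event.name,
        }),
      );
    }
    // Event line: only the small reference and the type code.
    lines.push(
      JSON.stringify({ [KEY.ref]: ref, [KEY.type]: TYPE_CODE[event.type] }),
    );
  }
  return lines;
}
```

A reader can tell the two kinds of line apart by whether the id key is present. How much this actually saves would need to be measured on real trace output.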

@CMCDragonkai
Member

Yes let's not overoptimize ahead of time. Just push it to the edge. Run on a beefier computer if necessary. Swap with Matthew?

@CMCDragonkai
Member

I want to see multi-gigabyte log files before you start doing any optimisation.

@CMCDragonkai
Member

Use stream consumption, obviously; don't load it all into memory. Or use memory mapping if you have a binary structure whose in-memory layout maps one-to-one to its disk layout. Use ChatGPT to help you understand how "BIG data is managed".
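
For a JSONL file, stream consumption mostly comes down to splitting incoming chunks into complete lines while holding on to only the unfinished tail. A minimal sketch, with a hypothetical `LineSplitter` class that works on string chunks from any source:

```typescript
// Splits a stream of text chunks into complete lines, keeping only the
// trailing partial line in memory between chunks.
class LineSplitter {
  private remainder = "";

  // Accepts the next chunk and returns every line completed by it.
  push(chunk: string): string[] {
    const parts = (this.remainder + chunk).split("\n");
    // The last part is an unfinished line, or "" if the chunk ended on "\n".
    this.remainder = parts.pop() ?? "";
    return parts;
  }

  // Call once at end of input: returns the final line if it had no newline.
  flush(): string[] {
    const rest = this.remainder;
    this.remainder = "";
    return rest === "" ? [] : [rest];
  }
}
```

Each returned line can be passed to `JSON.parse` and handled immediately, so memory use depends on the chunk size rather than the file size. In Node, the chunks could come from `fs.createReadStream` opened with a `utf8` encoding.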

4 participants