Integrating tracer into object lifecycles #403

aryanjassal · 2025-05-05T02:39:19Z

Description

To gain observability into Polykey, we are integrating a tracer system similar to OpenTracing. This is currently just an experiment. The visualiser and the integration are both being tracked in this PR. As we are currently monkey-patching Polykey to get the spans integrated, that code cannot be shown in the PR. I might make some comments if I need feedback.

Issues Fixed

Relates to Integrate high-level observability using Tracer #392

Tasks

1. Get tracer patched into polykey as an experiment
2. Add web-based visualiser

Final checklist

aryanjassal · 2025-05-05T02:47:29Z

Interestingly, a lot of tests are failing. That shouldn't happen, as the patch should not interfere with the actual functioning of the code. This probably flew under the radar because I was focusing on getting the spans working rather than running the full test suite after each change. I'll have to dig into this.

On the other hand, I've made an index.html file as a starting point for your visualiser work. A span.jsonl file exists and will be updated on any significant format change. Make sure to commit whenever you finish a new feature. You can post progress reports or other comments in this PR itself. @Abby010

CMCDragonkai · 2025-05-07T11:53:34Z

I'll need you to work on this yourself. Also consider backwards compatibility (or is it "side-compatibility"?) with OpenTracing. It may be possible to reuse their ecosystem.

aryanjassal · 2025-05-08T08:10:41Z

I did some work on the visualiser, and this is what we came up with. Spans are coloured by whether they have a corresponding end event: the red lines have no end span event, and the blue ones do. I do notice something pretty interesting here: the destroy method on CreateDestroyStartStop classes is never called. I checked the logs and they show the same thing. For example, the TaskManager is only stopped and never destroyed, even though it is marked with CreateDestroyStartStop and implements a destroy() method.

INFO:polykey.PolykeyAgent:Stopping TaskManager
INFO:polykey.PolykeyAgent:Stopping Processing
INFO:polykey.PolykeyAgent:Stopped Processing
INFO:polykey.PolykeyAgent:Stopping Tasks
INFO:polykey.PolykeyAgent:Stopped Tasks
INFO:polykey.PolykeyAgent:Stopped TaskManager
INFO:polykey.PolykeyAgent.Audit:Stopping Audit
INFO:polykey.PolykeyAgent.Audit:Stopped Audit
INFO:polykey.PolykeyAgent.DB:Stopping DB
INFO:polykey.PolykeyAgent.DB:Stopped DB
INFO:polykey.PolykeyAgent.KeyRing:Stopping KeyRing
INFO:polykey.PolykeyAgent.KeyRing:Stopped KeyRing
INFO:polykey.PolykeyAgent.Schema:Stopping Schema
INFO:polykey.PolykeyAgent.Schema:Stopped Schema
INFO:polykey.PolykeyAgent.WorkerManager:Destroying WorkerManager
INFO:polykey.PolykeyAgent.WorkerManager:Destroyed WorkerManager
INFO:polykey.PolykeyAgent.Status:Stopping Status
INFO:polykey.PolykeyAgent.Status:Writing Status to /home/aryanj/.local/share/polykey/status.json
INFO:polykey.PolykeyAgent.Status:Status is DEAD
INFO:polykey.PolykeyAgent:Stopped PolykeyAgent
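
The colouring rule described before the log excerpt amounts to pairing each start event with its end event and flagging whatever is left over. A minimal sketch, assuming each line of span.jsonl parses to an object with hypothetical `type`, `id`, and `name` fields (the real event format is not shown in this thread):

```typescript
// Hypothetical event shape; the actual span.jsonl format may differ.
type SpanEvent = { type: 'start' | 'end'; id: string; name: string };

type SpanStatus = { id: string; name: string; closed: boolean };

// Pair start and end events by id. Spans that never receive an end event
// stay open, which is the case the visualiser colours red.
function classifySpans(events: SpanEvent[]): SpanStatus[] {
  const spans = new Map<string, SpanStatus>();
  for (const event of events) {
    if (event.type === 'start') {
      spans.set(event.id, { id: event.id, name: event.name, closed: false });
    } else {
      const span = spans.get(event.id);
      // End events without a matching start are ignored in this sketch.
      if (span !== undefined) span.closed = true;
    }
  }
  return Array.from(spans.values());
}
```

With this shape, a TaskManager span that is created but never destroyed in a run simply comes back with `closed: false` rather than being dropped.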

Another thing to note is that js-id cannot be used in the web visualiser, as it uses Node-specific libraries like crypto or perf. Thus, to extract the timestamp from the id, I had to implement the functions manually.
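
For illustration, a browser-safe timestamp extraction could look roughly like the sketch below. The bit layout is an assumption (36 bits of whole seconds followed by a 12-bit fixed-point fraction at the front of the id) and should be checked against the js-id source; `extractTimestamp` is a hypothetical name, not a js-id function.

```typescript
// Assumed layout (verify against the js-id source before relying on it):
//   bits 0-35   whole seconds since the Unix epoch, big-endian
//   bits 36-47  sub-second fraction as a 12-bit fixed-point value
// Returns the timestamp in seconds.
function extractTimestamp(id: Uint8Array): number {
  if (id.byteLength < 6) {
    throw new RangeError('id must be at least 6 bytes long');
  }
  const view = new DataView(id.buffer, id.byteOffset, id.byteLength);
  // The first 48 bits fit in a double without loss of precision.
  const head = view.getUint32(0) * 0x10000 + view.getUint16(4);
  const seconds = Math.floor(head / 4096); // top 36 bits
  const fraction = head % 4096; // low 12 bits
  return seconds + fraction / 4096;
}
```

This only uses `DataView`, so it runs unchanged in the browser and in Node.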

aryanjassal · 2025-05-08T08:26:06Z

Actually, after checking the Polykey source code, I can see that the destroy methods are called only when the Polykey agent is destroyed. In this case, how should I handle the create/destroy events in CreateDestroyStartStop when an object might not be destroyed in that run? Should I leave them in as-is, or not track the create/destroy events of CreateDestroyStartStop at all?

CMCDragonkai · 2025-05-08T08:35:47Z

> Another thing to note is that js-id cannot be used here as it uses node-specific libraries like crypto or perf. Thus, to extract the timestamp from the id, I had to manually implement the functions.

There are browser analogues to crypto and perf. Now that we have ESM, we can do things like import adapters that switch implementations.
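
A minimal sketch of what such an adapter could look like, using plain dependency injection rather than any particular ESM mechanism. The `Platform` interface, `webPlatform`, and `randomHex` are hypothetical names for illustration; `crypto.getRandomValues` and `performance.now()` are standard Web APIs that recent Node.js releases also expose as globals.

```typescript
// Hypothetical interface covering the two platform features an id
// generator needs: random bytes and a clock.
interface Platform {
  randomBytes(length: number): Uint8Array;
  // Milliseconds since the Unix epoch.
  now(): number;
}

// Implementation backed by standard Web APIs only.
const webPlatform: Platform = {
  randomBytes: (length) =>
    globalThis.crypto.getRandomValues(new Uint8Array(length)),
  now: () => globalThis.performance.timeOrigin + globalThis.performance.now(),
};

// Code that needs randomness or time takes a Platform instead of
// importing Node-specific modules directly.
function randomHex(platform: Platform, length: number): string {
  return Array.from(platform.randomBytes(length), (byte) =>
    byte.toString(16).padStart(2, '0'),
  ).join('');
}
```

A Node-specific implementation of the same interface could then be swapped in at import time without touching the callers.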

CMCDragonkai · 2025-05-08T08:36:39Z

> Actually, after checking the Polykey source code, I can see that the destroy methods are called only when Polykey agent is destroyed. In this case, how should I handle the create/destroy events in CreateDestroyStartStop when it might not be destroyed in that run? Should I leave them in as-is or not track the create/destroy events of CreateDestroyStartStop?

Tracing should live beyond the agent itself. It exists before and after the agent object.

Also I want you to try doing vertical scroll. Not horizontal scroll.

Plus I want you to create forking and merging lines.

CMCDragonkai · 2025-05-08T08:37:15Z

Also try projecting into logical/vector time rather than real time.
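
As a rough illustration of the simplest form of this, the sketch below replaces each real timestamp with its rank among the distinct timestamps, so spacing on the axis depends on event order rather than duration. The `TimedEvent` shape is hypothetical, and this is plain ordinal time only; vector time would additionally need per-source counters, which are not attempted here.

```typescript
// Hypothetical event shape: an id plus a real-time timestamp in any
// consistent unit.
type TimedEvent = { id: string; timestamp: number };

// Map each event id to a logical tick: the rank of its timestamp among
// all distinct timestamps. Events with equal timestamps share a tick.
function toLogicalTime(events: TimedEvent[]): Map<string, number> {
  const distinct = Array.from(new Set(events.map((e) => e.timestamp))).sort(
    (a, b) => a - b,
  );
  const tickOf = new Map<number, number>();
  distinct.forEach((timestamp, tick) => tickOf.set(timestamp, tick));
  const ticks = new Map<string, number>();
  for (const event of events) {
    ticks.set(event.id, tickOf.get(event.timestamp) as number);
  }
  return ticks;
}
```

A long idle gap between two events collapses to a single tick, which keeps a vertical layout compact.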

aryanjassal · 2025-05-15T06:45:54Z

Over a 30-second run, the tracer system emitted around 30k lines of events, totalling around 700 KiB. We probably need a more efficient storage format, most likely a binary one, since the data already leans that way. Most of the fields can easily be represented in binary or originate as binary arrays. For example, the idSortable is a binary ID format, and the type can be encoded in a single byte. The span name is the only field which cannot be compressed much further.

Moreover, I have encountered some interesting issues with the span emission. The parents are often malformed. For example, some QuicStreams and DBTransactions were recorded as being emitted directly by PolykeyAgent, when in reality they are emitted by, say, QUICClient or NodeManager or something else. I need to investigate why parent resolution is still unreliable and fix it. I need both quality data and a quality visualiser to work with the data properly.
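
To make the binary idea concrete, here is a minimal sketch of one possible record layout. The layout itself, the 16-byte id length, and the `SpanRecord` field names are all assumptions for illustration, not the tracer's actual format.

```typescript
// Hypothetical record layout:
//   1 byte    event type
//   16 bytes  id (assumed to be a 16-byte binary id)
//   2 bytes   name length, big-endian
//   n bytes   span name as UTF-8
type SpanRecord = { type: number; id: Uint8Array; name: string };

function encodeRecord(record: SpanRecord): Uint8Array {
  if (record.id.byteLength !== 16) throw new RangeError('id must be 16 bytes');
  const name = new TextEncoder().encode(record.name);
  if (name.byteLength > 0xffff) throw new RangeError('name too long');
  const out = new Uint8Array(19 + name.byteLength);
  const view = new DataView(out.buffer);
  view.setUint8(0, record.type);
  out.set(record.id, 1);
  view.setUint16(17, name.byteLength);
  out.set(name, 19);
  return out;
}

function decodeRecord(bytes: Uint8Array): SpanRecord {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const nameLength = view.getUint16(17);
  return {
    type: view.getUint8(0),
    id: bytes.slice(1, 17),
    name: new TextDecoder().decode(bytes.subarray(19, 19 + nameLength)),
  };
}
```

In this layout the fixed part of every record is 19 bytes, and the span name is the only variable-length field.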

CMCDragonkai · 2025-05-15T08:50:55Z

You can keep it as text for now to keep it simple. Log rotation can be used if it hits 1 MiB. Later we can move to a more compact binary JSON format. In js-db we looked into using something like that before.

CMCDragonkai · 2025-05-15T08:51:03Z

And it was not bson!!

tegefaulkes · 2025-05-15T23:43:34Z

We can keep it JSON for now, but simplifying the keys and deduplicating can help a lot here.

Use an enum for the keys, so rather than each key being repeated as "id", "start", "name", etc., the enum maps them to 1, 2, 3. For deduplication, rather than repeating the ID over and over again, include a line that maps the ID and name to a number, and then reference this number wherever that span appears.

You'll cut out 80% of the size doing this without the need for a binary format.
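
A minimal sketch of that scheme, assuming a hypothetical `VerboseEvent` shape for the current output. A `const` object stands in for the enum here, and the key numbering is arbitrary.

```typescript
// Hypothetical verbose event, roughly as it might be emitted today.
type VerboseEvent = {
  id: string;
  name: string;
  type: 'start' | 'end';
  time: number;
};

// Numeric stand-ins for the repeated JSON keys.
const Key = { Define: 0, Ref: 1, Type: 2, Time: 3, Id: 4, Name: 5 } as const;

// Emits compact JSON lines. The first time an id is seen, a definition line
// maps the id and name to a small number; later lines only carry that number.
function compact(events: VerboseEvent[]): string[] {
  const refs = new Map<string, number>();
  const lines: string[] = [];
  for (const event of events) {
    let ref = refs.get(event.id);
    if (ref === undefined) {
      ref = refs.size;
      refs.set(event.id, ref);
      lines.push(
        JSON.stringify({
          [Key.Define]: ref,
          [Key.Id]: event.id,
          [Key.Name]: event.name,
        }),
      );
    }
    lines.push(
      JSON.stringify({
        [Key.Ref]: ref,
        [Key.Type]: event.type === 'start' ? 0 : 1,
        [Key.Time]: event.time,
      }),
    );
  }
  return lines;
}
```

The id and name are written once per span, so a span with a start and an end event costs one definition line plus two short reference lines.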

CMCDragonkai · 2025-05-18T04:22:03Z

Yes let's not overoptimize ahead of time. Just push it to the edge. Run on a beefier computer if necessary. Swap with Matthew?

CMCDragonkai · 2025-05-18T04:22:29Z

I want to see multi-gigabyte log files before you start doing any optimisation.

CMCDragonkai · 2025-05-18T04:22:59Z

Stream consumption obviously, don't load it all into memory. Or use memory mapping if you have a binary structure whose in-memory layout maps one-to-one to the on-disk layout. Use ChatGPT to help you understand how "BIG data is managed".
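
For the JSONL case, stream consumption can be as small as an incremental line splitter that holds only the unfinished tail between chunks. A minimal sketch, with `LineSplitter` as a hypothetical helper that assumes chunks arrive as already-decoded strings:

```typescript
// Incrementally splits incoming text chunks into complete lines, keeping
// only the unfinished tail in memory between chunks.
class LineSplitter {
  private tail = '';

  // Feed one chunk; returns the lines that this chunk completed.
  push(chunk: string): string[] {
    const parts = (this.tail + chunk).split('\n');
    // The last part is incomplete until the next newline arrives.
    this.tail = parts.pop() ?? '';
    return parts;
  }

  // Call once at end of input to get a final line with no trailing newline.
  flush(): string[] {
    const rest = this.tail;
    this.tail = '';
    return rest === '' ? [] : [rest];
  }
}
```

Each returned line can be parsed and handed to the visualiser immediately, so memory use is bounded by the longest single line rather than by the file size.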

aryanjassal assigned aryanjassal and Abby010 May 5, 2025

aryanjassal mentioned this pull request May 5, 2025

Visualise event-based spans for tracing MatrixAI/js-logger#52

Closed

wip: integrating tracer into object lifecycles

7e3b0f0

aryanjassal force-pushed the feature-tracing-integration branch from 03ddbf3 to 7e3b0f0 Compare May 7, 2025 00:37

chore: spans only log events in verbose mode

4e46a77

aryanjassal added 6 commits May 12, 2025 09:47

wip: span visualiser

8269547

chore: cleaned up webviz

bfc6c38

chore: better parent tracking in spans

082bb43

feat: added logical and time mode to vertical span rendering

23f5dd5

chore: cleaned up spacing for webviz

b069992

wip: added sibling-aware nesting

5131694
