-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix(next): Wrap all Random APIs with a safe runner #18700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(next): Wrap all Random APIs with a safe runner #18700
Conversation
size-limit report 📦
|
node-overhead report 🧳Note: This is a synthetic benchmark with a minimal express app and does not necessarily reflect the real-world performance impact in an application.
|
8138056 to
a5c2c86
Compare
chargome
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline I'm really happy that this indeed works! 🚀
I generally wanted to avoid having to adapt any core logic to accommodate for cacheComponents, but we could revert once Vercel maybe ships an escape hatch.
| * Which will generate and set a trace id in the propagation context, which should trigger the random API error if unpatched | ||
| * See: https://github.com/getsentry/sentry-javascript/issues/18392 | ||
| */ | ||
| export function generateMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we expand this test case, or better duplicate it and run some async computation within generateMetadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, I will add some logic there
c581f30 to
f379ed5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a safe runner mechanism to wrap random and time-based APIs (Date.now, Math.random, performance.now, crypto.*) to prevent errors in Next.js cache components. It adds a new ESLint rule to enforce consistent usage of these wrapped APIs across the codebase.
Key Changes:
- Implemented
runInRandomSafeContext()wrapper and safe helper functions (safeDateNow(),safeMathRandom()) in@sentry/core - Added Next.js-specific initialization that prepares AsyncLocalStorage snapshot for safe context execution
- Created ESLint rule
no-unsafe-random-apisto enforce usage of safe wrappers in designated packages - Updated all existing random/time API calls to use the safe wrappers across core, node, opentelemetry, and Next.js packages
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/utils/safeRandomGeneratorRunner.ts | Core implementation of safe runner mechanism with helper functions for Date.now() and Math.random() |
| packages/core/src/index.ts | Exports safe runner APIs with _INTERNAL prefix for use by other SDK packages |
| packages/core/src/utils/time.ts | Updates time-related functions to use runInRandomSafeContext wrapper |
| packages/core/src/utils/tracing.ts | Updates trace propagation to use safeMathRandom() |
| packages/core/src/scope.ts | Updates propagation context generation to use safeMathRandom() |
| packages/core/src/tracing/trace.ts | Updates trace initialization to use safeMathRandom() |
| packages/core/src/utils/misc.ts | Updates UUID generation to use safe wrappers |
| packages/core/src/utils/ratelimit.ts | Updates rate limiting to use safeDateNow() |
| packages/core/src/client.ts | Updates error sampling to use safeMathRandom() |
| packages/core/src/integrations/mcp-server/correlation.ts | Updates span storage to use safeDateNow() |
| packages/nextjs/src/server/prepareSafeIdGeneratorContext.ts | Next.js-specific implementation that prepares AsyncLocalStorage snapshot as safe context runner |
| packages/nextjs/src/server/index.ts | Calls prepareSafeIdGeneratorContext() at SDK initialization |
| packages/nextjs/src/common/pages-router-instrumentation/wrapApiHandlerWithSentryVercelCrons.ts | Updates cron monitoring to use _INTERNAL_safeDateNow() |
| packages/nextjs/src/config/withSentryConfig.ts | Adds eslint-disable comment for intentional Math.random() usage in tunnel route generation |
| packages/nextjs/src/config/polyfills/perf_hooks.js | Adds eslint-disable comment for polyfill code |
| packages/node-core/src/integrations/context.ts | Updates context integration to use _INTERNAL_safeDateNow() |
| packages/opentelemetry/src/spanExporter.ts | Updates span exporter timestamp handling to use _INTERNAL_safeDateNow() |
| packages/opentelemetry/src/sampler.ts | Updates sampling to use _INTERNAL_safeMathRandom() |
| packages/eslint-plugin-sdk/src/rules/no-unsafe-random-apis.js | New ESLint rule to enforce wrapping of random/time APIs |
| packages/eslint-plugin-sdk/src/index.js | Registers the new ESLint rule |
| packages/eslint-plugin-sdk/test/lib/rules/no-unsafe-random-apis.test.ts | Test suite for the new ESLint rule |
| packages/core/.eslintrc.js | Enables no-unsafe-random-apis rule with test file exceptions |
| packages/nextjs/.eslintrc.js | Enables no-unsafe-random-apis rule with test file exceptions |
| packages/node/.eslintrc.js | Enables no-unsafe-random-apis rule with test file exceptions |
| packages/node-core/.eslintrc.js | Enables no-unsafe-random-apis rule with test file exceptions |
| packages/opentelemetry/.eslintrc.js | Enables no-unsafe-random-apis rule with test file exceptions |
| packages/vercel-edge/.eslintrc.js | Enables no-unsafe-random-apis rule with test file exceptions |
| dev-packages/e2e-tests/test-applications/nextjs-16-cacheComponents/tests/cacheComponents.spec.ts | Adds E2E tests for metadata generation in cache components |
| dev-packages/e2e-tests/test-applications/nextjs-16-cacheComponents/app/metadata/page.tsx | Test page for synchronous metadata generation with cache components |
| dev-packages/e2e-tests/test-applications/nextjs-16-cacheComponents/app/metadata-async/page.tsx | Test page for asynchronous metadata generation with cache components |
| .size-limit.js | Updates bundle size limits to account for new safe runner code |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
02878ec to
f3c7b35
Compare
So the general approach seems sound but I worry about this kind of lint rule. Exiting out of the context to generate arbitrary random values is potentially wrong when prerendering with Cache Components in Next.js. The reason it makes sense for Span ids is that these ideally don't ever leak into the render context and thus cannot cause a prerender to observe anything non-deterministic. (They're side effects for telemetry only) This is actually not always true though b/c if you pass a Cache Function to startActiveSpan then the Span object will get passed into the Cache Function and you've just created a situation where the Cache Function will never be a cache hit b/c it always receives an argument with a random value. The best remediation for non-determinism is to guard it behind something async. Once you're past the first task of the prerender you can do all the non-deterministic stuff you want. By linting to ensure you bypass the deterministic check for sync APIs you may create a situation in the future where someone without understanding the full implications adds something that does leak into the render context or flows into a Cache Function and breaks something. The idea with erroring in these cases is it allows you to respond before you ship something broken to production but if you disable the ability to error by exiting the ALS it'll be much harder to detect that this has happened. I was originally thinking only the Span creation would be exempted. But it seems there are things in here for event processing and more. Are these all necessary? |
|
@gnoff Thanks for taking a look!
I thought it would be hard for future maintainers to know when they should exit the ALS as they may not be even working on Next.js related stuff.
Thanks for the explanation! I suppose then a better approach is to NOT patch the random value call sites like I did here and instead wrap the APIs that do need them. Like span ID, trace ID, timestamps for logs and spans, etc. Even if it is duplicated around the codebase, since in those cases we know for a fact it won't be used for a side-effect. Would that be a better approach? It would require implicit understanding how a certain API interacts with Next.js but would better match the intention of wrapping the non-side effect-y APIs.
I can revert them one by one to see which ones are necessary, from my tests I needed to patch only those initially:
It was a cat and mouse game, every time I patched one call, there was another which prompted me to see if eslint can catch those. I will try reverting and keep only the minimal changes and we can give it another look to see if it makes sense instead of sprinkling the wrapper everywhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gotta be honest, I find it pretty sad that the only way to fix this NextJS nonsense is to adjust our core package. This means that every other SDK, every framework, and platform, every browser/client-side framework, etc will have the added, unnecesssary overhead of doing this lookup. This is especially painful because generating such ids is a hot path in basically every telemetry data type we collect. Don't take this personally, I understand that it's somewhat necessary.
but we could revert once Vercel maybe ships an escape hatch.
I would love that but my fear is as soon as we merge this that there's less urgency to do so, meaning it'll likely not happen.
Technical question: How does this work with OTel instrumentation, for instance when an OTel span is creating a spanId? IIUC, NextJS patches its own span generation but does this also apply to e.g. db instrumentation?
EDIT: I realize this is formulated quite harshly but I'm really not happy about one framework forcing us to compromise in every other "innocent" SDK.
packages/core/src/scope.ts
Outdated
| this._propagationContext = { | ||
| traceId: generateTraceId(), | ||
| sampleRand: Math.random(), | ||
| traceId: withRandomSafeContext(generateTraceId), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Is the reason for not directly calling withRandomSafeContext within generateTraceId performance? No objections, just curious. I guess the obvious downside is that we miss occasions if we apply the wrapper on this higher level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally did that, across all random API calls but that's an overkill and as Josh said, it might actually "escape" user code that would break the cache components and wouldn't be discovered.
So after re-working it yesterday, I'm applying it selectively where I know for a fact that it gets called in the Next.js SDK codepath, it will be a cat and mouse game without something to enforce it like the eslint rule i had earlier, but I can revert to that if u think it is safer to do so and we don't mind the overkill.
lets bike shed it actually today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work with OTel instrumentation, for instance when an OTel span is creating a spanId?
In Next.js they patch the enire tracer: https://github.com/vercel/next.js/blob/canary/packages/next/src/server/lib/router-utils/instrumentation-node-extensions.ts#L51-L112
|
@logaretm we just discussed this offline: Could we maybe try to patch these parts of code/node in our nextjs package to keep all of this out of the original core package? |
|
@chargome I'm not sure how to do that, some of these are deep calls within the pipeline. Let's take it to bikeshedding today? |
d60840a to
bcd9397
Compare
2630a8d to
0805c12
Compare
Lms24
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving for the lack of a better alternative 🥲 Thanks a lot for experimenting with this and also for presenting and explaining the fix! I agree that the only way to avoid accidental breakage is to wrap the random APIs on the lowest level.
I hope NextJS folks are aware that they're putting a significant burden on the entire JS ecosystem with requiring library maintainers to avoid using perfectly fine JS APIs.
Currently the cache components feature in Next.js prevents us from using any random value APIs like:
Date.nowperformance.nowMath.randomcrypto.*We tried resolving this by patching several span methods, but then we have plenty of other instances where we use those APIs, like in trace propagation, timestamp generation for logs, and more.
Running around and patching them one by one in the Next.js SDK isn't a viable solution since most of those functionalities are strictly internal and cannot be patched from the outside, and adding escape hatches for each of them is not maintainable.
So I'm testing out the other way around, by hunting those APIs down and wrapping them with a safe runner that acts as an escape hatch. Some of the Vercel engineers suggested doing that, but we need to do it for
almost every call(see Josh comment below).The idea is an SDK can "turn on" the safe runner by injecting a global function that executes a callback and returns its results. I
How does this fix it for Next.js?
The Next.js SDK case, a safe runner would be an
AsyncLocalStoragesnapshot which is captured at the server runtime init, way before any rendering is done.I kept the API internal as much as possible to avoid users messing up with it, but the
@sentry/opentelemetrySDK also needed this functionality so I exported the API with_INTERNALprefix as we already do.I tested this in a simple Next.js app and it no longer errors out, and all current tests pass. I still need to take a look at the traces and see how would they look in cached component cases.
Charly is already working on this and may have a proper solution, but I thought to just see if we can ship a stopgap until then.
On the bright side, this seems to fix it as well for Webpack.
closes #18392
closes #18340