⚡ perf(link): skip quote roundtrip on safe URL paths #13985
Closed
gaborbernat wants to merge 1 commit into
gaborbernat added a commit to gaborbernat/pip that referenced this pull request May 6, 2026
The Simple-API JSON response from Warehouse hands out file URLs whose path is pure ASCII alphanumerics plus `_-.~/`, every character of which sits in `urllib.parse.quote`'s default-safe set. `_clean_url_path` still pays a full `urllib.parse.unquote` followed by `urllib.parse.quote` per link to guarantee idempotency, even though the round-trip cannot change anything on these inputs. With ~65000 links walked across an 8-pass cross-platform lock, that is ~6% of user-CPU time spent reproducing the input verbatim.

Guard the round-trip with a single negative-class `re.search`. When the path contains no character outside the always-safe alphabet, return it unchanged; otherwise fall through to the existing logic. The pre-check is ~250 ns and the work it skips averages ~900 ns on a real-world wheel link (4.7x faster on the fast path), so even paths that fall through pay only a small constant overhead.

A new parametrized test, `test_clean_url_path_idempotent_for_safe_paths`, asserts that the fast path is a true identity for the alphabets it claims to cover. The existing `test_clean_url_path` cases all carry at least one unsafe character and keep exercising the slow path.
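The guard described in the commit message can be sketched roughly as follows. This is a minimal illustration and not pip's code: `clean_url_path_sketch`, `roundtrip`, and `_UNSAFE_CHAR` are hypothetical names, and `roundtrip` is a simplified stand-in for the existing cleaning logic, which does more than a bare unquote/quote pair.

```python
import re
import urllib.parse

# Matches any character outside the alphabet the commit message treats as
# always safe: ASCII alphanumerics plus "_", "-", ".", "~" and "/".
# (Hypothetical name; the pattern in the actual patch may differ.)
_UNSAFE_CHAR = re.compile(r"[^A-Za-z0-9_.~/-]")


def roundtrip(path: str) -> str:
    """Simplified stand-in for the existing slow path (an assumption)."""
    return urllib.parse.quote(urllib.parse.unquote(path))


def clean_url_path_sketch(path: str) -> str:
    """Return safe paths unchanged; otherwise fall through to the round-trip."""
    if _UNSAFE_CHAR.search(path) is None:
        return path  # fast path: nothing the round-trip could change
    return roundtrip(path)


if __name__ == "__main__":
    safe = "/packages/12/34/567/somepackage-1.2.3-py3-none-any.whl"
    # The fast path must be a true identity and agree with the slow path.
    assert clean_url_path_sketch(safe) == safe == roundtrip(safe)
    # A space is outside the safe alphabet, so this takes the slow path.
    print(clean_url_path_sketch("/simple/some package/"))
```

Because the pattern is a negated character class, a single `search` answers "is there any unsafe character?" without scanning twice, and a `%` always forces the slow path, so already-escaped inputs are handled exactly as before.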
Author
Closing in favour of #13986, which short-circuits at the outer `_ensure_quoted_url` layer and subsumes this patch: when `_ensure_quoted_url` returns early on a clean URL, `_clean_url_path` is never entered, so the fast path here becomes dead code. The benchmarks confirm that the two cover the same set of links and that the wins do not stack.
The Simple-API JSON response from Warehouse hands out file URLs whose path is pure ASCII alphanumerics plus the always-safe `_-.~/` set, yet `_clean_url_path` still pays a full `urllib.parse.unquote` followed by `urllib.parse.quote` per link to guarantee idempotency. Walking ~65000 links across an 8-pass cross-platform lock spends roughly 6% of user-CPU time reproducing the input verbatim. ⚡

Guard the round-trip with a single negative-class `re.search` against the always-safe alphabet. When every character of the path passes, return it unchanged; otherwise fall through to the existing logic, byte-for-byte preserved. The pre-check is ~250 ns and the work it skips averages ~900 ns on a real-world wheel link, so the fast path is 4.7x faster per call and paths that fall through pay only a small constant overhead.

Function-level micro-bench against `/packages/12/34/567/somepackage-1.2.3-py3-none-any.whl`: `_clean_url_path` drops from 1144 ns/call to 244 ns/call, a 79% reduction in CPU time. End-to-end against a cross-platform lock pipeline iterating ~65000 links across 8 resolver passes (n=12 paired runs alternating between `HEAD` and `HEAD + this patch`): user-CPU mean falls 6.3% (10/12 paired runs faster) with stdev 2.5x lower under the patch.

No behaviour change for any URL containing a character outside `[A-Za-z0-9_./~-]`. Local-path inputs, `%`-escaped paths, and paths carrying `@` or other reserved characters all flow through the original cleaning logic untouched.
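A function-level measurement of this kind could be reproduced along the following lines. This is only a sketch under stated assumptions: `slow`, `guarded`, and `ns_per_call` are hypothetical helpers, `slow` is a bare unquote/quote round-trip rather than pip's `_clean_url_path`, and absolute timings will vary by machine and Python version, so they should not be expected to match the figures quoted above.

```python
import re
import timeit
import urllib.parse

# Negated class over the always-safe alphabet quoted in the description.
_UNSAFE_CHAR = re.compile(r"[^A-Za-z0-9_.~/-]")
PATH = "/packages/12/34/567/somepackage-1.2.3-py3-none-any.whl"


def slow(path: str) -> str:
    # Unconditional unquote/quote round-trip (simplified stand-in).
    return urllib.parse.quote(urllib.parse.unquote(path))


def guarded(path: str) -> str:
    # Same round-trip, but behind the negative-class pre-check.
    if _UNSAFE_CHAR.search(path) is None:
        return path
    return slow(path)


def ns_per_call(func, number: int = 100_000, repeat: int = 5) -> float:
    # Take the best of several batches, then convert seconds per batch
    # into nanoseconds per call.
    best = min(timeit.repeat(lambda: func(PATH), number=number, repeat=repeat))
    return best / number * 1e9


if __name__ == "__main__":
    before, after = ns_per_call(slow), ns_per_call(guarded)
    print(f"round-trip: {before:.0f} ns/call")
    print(f"guarded:    {after:.0f} ns/call")
    print(f"reduction:  {(before - after) / before:.0%}")
```

For reference, the reported figures are internally consistent: (1144 - 244) / 1144 is about 0.787, which rounds to the stated 79% reduction, and 1144 / 244 is about 4.69, which matches the stated 4.7x speed-up.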