Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making urlparse WHATWG conformant #88049

Open
orsenthil opened this issue Apr 18, 2021 · 8 comments
Open

Making urlparse WHATWG conformant #88049

orsenthil opened this issue Apr 18, 2021 · 8 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement

Comments

@orsenthil
Copy link
Member

BPO 43883
Nosy @gpshead, @orsenthil, @vstinner, @serhiy-storchaka, @mlissner, @tirkarthi

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = 'https://github.com/orsenthil'
closed_at = None
created_at = <Date 2021-04-18.19:43:37.075>
labels = ['type-bug']
title = 'Making urlparse WHATWG conformant'
updated_at = <Date 2021-05-05.01:34:26.506>
user = 'https://github.com/orsenthil'

bugs.python.org fields:

activity = <Date 2021-05-05.01:34:26.506>
actor = 'gregory.p.smith'
assignee = 'orsenthil'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2021-04-18.19:43:37.075>
creator = 'orsenthil'
dependencies = []
files = []
hgrepos = []
issue_num = 43883
keywords = []
message_count = 4.0
messages = ['391344', '391347', '391427', '392969']
nosy_count = 6.0
nosy_names = ['gregory.p.smith', 'orsenthil', 'vstinner', 'serhiy.storchaka', 'Mike.Lissner', 'xtreak']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43883'
versions = []

@orsenthil
Copy link
Member Author

Mike Lissner reported that a set test suites that exercise extreme conditions with URLs, but in conformance with url.spec.whatwg.org
was maintained here:

https://github.com/web-platform-tests/wpt/tree/77da471a234e03e65a22ee6df8ceff7aaba391f8/url

These test cases were used against urlparse and urljoin method.

https://gist.github.com/mlissner/4d2110d7083d74cff3893e261a801515

Quoting verbatim

The basic idea is to iterate over the test cases and try joining and parsing them. The script wound up messier than I wanted b/c there's a fair bit of normalization you have to do (e.g., the test cases expect blank paths to be '/', while urlparse returns an empty string), but you'll get the idea.

The bad news is that of the roughly 600 test cases fewer than half pass. Some more normalization would fix some more of this, and I don't imagine all of these have security concerns (I haven't thought through it, honestly, but there are issues with domain parsing too that look meddlesome). For now I've taken it as far as I can, and it should be a good start, I think.

The final numbers the script cranks out are:

Done. 231/586 successes. 1 skipped.

@orsenthil orsenthil self-assigned this Apr 18, 2021
@orsenthil orsenthil added the type-bug An unexpected behavior, bug, or error label Apr 18, 2021
@orsenthil orsenthil self-assigned this Apr 18, 2021
@orsenthil orsenthil added the type-bug An unexpected behavior, bug, or error label Apr 18, 2021
@serhiy-storchaka
Copy link
Member

It would be interesting to test also with the yarl module. It is based on urlparse and urljoin, but does extra normalization of %-encoding.

@vstinner
Copy link
Member

See also bpo-43882.

@gpshead
Copy link
Member

gpshead commented May 5, 2021

FWIW rather than implementing our own URL parsing at all... wrapping a library extracted from a compatible-license major browser (Chromium or Firefox) and keeping it updated would avoid disparities.

Unfortunately, I'm not sure how feasible this really is. Do all of the API surfaces we must support in the stdlib for compatibility's sake with urllib line up with such a browser core URL parsing library?

Something to ponder. Unlikely something we'll actually do.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@xmo-odoo
Copy link

xmo-odoo commented Jun 15, 2023

If urllib.parse is to be tested for whatwg conformance, shouldn't it be urlsplit? And maybe urlparse should finally be deprecated?

urlparse's behaviour was superseded by RFC 2396 back in 1998 (which is what resulted in the addition of urlsplit: #35466), the whatwg url standard is a descendant of that (itself intending to supersede RFC 3986). As of the url standard (and already RFC 3986) the entire concept of path segment parameter has been dropped.

By its very nature and purpose, urlparse can't pass the conformance suite, because any time it encounters a ; within a path segment it's going to move data out of the pathname which it should not. And the suite likely has multiple test cases where the path contains ; characters.

@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 26, 2023
@idahogray
Copy link

I'm not sure if this is the correct issue or not, but it seemed related. Chromium (and derivative browsers) recently implemented a change to URL parsing that broke a bunch of custom protocols. We have an application that uses the protocol pw:// which is broken by this change. We have a django application that was using the django URLValidator to validate those custom URLs.

The way I understand it, the correct format for custom protocols is to not have the //. I don't believe urlsplit supports these because it is looking for the // in order to extract the netloc. At least it is in 3.12, I haven't looked at 3.13.

Thanks and please let me know if this is the wrong issue or if you need more information.

@gpshead gpshead added the type-feature A feature request or enhancement label Nov 14, 2024
@gpshead
Copy link
Member

gpshead commented Nov 14, 2024

also @sethmlarson as FYI - it isn't clear we'd ever even be able to do what this issue suggests or that we should FWIW. there are multiple competing interests in the world who have valid needs to use URLs (URIs?) in different ways.

@sethmlarson
Copy link
Contributor

I don't think we should chase browsers, and therefore we shouldn't implement WHATWG. They are allowed to update their "fleet" globally on a whim, they primarily deal with users physically typing things into address bars, we have things called backwards compatibility guarantees. I think implementing RFC 3986 is appropriate for software like Python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

8 participants