Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frontend: parse every error message to utf-8, ignore non-parsable chars #3651

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nikromen
Copy link
Member

@nikromen nikromen commented Mar 3, 2025

If user sends to API some weird non-utf-8 character and we then
try to log this, the logging freaks out with logging error (see [1]).
[2] fixes this, but only for one endpoint and one type of user input,
but there are many ways how user can do this. On the other hand, [2]
caused bad formating and now we got messages like b'Project a/b does not exist'.

This patch tries to parse every error message we want to store, and maybe
later log it, to utf-8 and store it as a string. If some non-utf-8
character appears in the message output, its byte won't be pares and it
will be stored/printed as the string representation of that byte.

Example:

User calls https://copr.fedorainfracloud.org/coprs/g/some<nasty_characters>string

the string value in CoprHttpException.message will be some\x01\x02string

where the \x01\x02 is example of the <nasty_characters>

[1] - #3072
[2] - #3076

Fix #3644

Copy link

github-actions bot commented Mar 3, 2025

Pull Request validation

Failed

🔴 Review - Missing review from a member (2 required)

Success

🟢 CI - All checks have passed

@nikromen nikromen force-pushed the fix-bad-formatting branch from e338e9e to 3329aaf Compare March 3, 2025 11:10
@nikromen
Copy link
Member Author

nikromen commented Mar 3, 2025

I updated the parsing - it doesn't store the bytes, but it rather parses the string right away so we don't work with bytes. I also updated the commit message/PR description, PTAL

If user sends to API some weird non-utf-8 character and we then
try to log this, the logging freaks out with `logging error` (see [1]).
[2] fixes this, but only for one endpoint and one type of user input,
but there are many ways how user can do this. On the other hand, [2]
caused bad formating and now we got messages like `b'Project a/b does
not exist'`.

This patch tries to parse every error message we want to store, and maybe
later log it, to utf-8 and store it as a string. If some non-utf-8
character appears in the message output, its byte won't be pares and it
will be stored/printed as the string representation of that byte.

Example:

User calls https://copr.fedorainfracloud.org/coprs/g/some<nasty_characters>x02string

the string value in CoprHttpException.message will be some\x01\x02string

where the \x01\x02 is example of the <nasty_characters>

[1] - fedora-copr#3072
[2] - fedora-copr#3076

Fix fedora-copr#3644
@nikromen nikromen force-pushed the fix-bad-formatting branch from 043086c to 4c8b3ec Compare March 3, 2025 16:29
@nikromen nikromen changed the title frontend: remove unnecesary formating of error message to error handler frontend: parse every error message to utf-8, ignore non-parsable chars Mar 3, 2025
@nikromen nikromen marked this pull request as ready for review March 3, 2025 16:31
@FrostyX
Copy link
Member

FrostyX commented Mar 6, 2025

Strangely enough, when I use code from main and try this URL:

http://127.0.0.1:5000/coprs/frostyx/nvid…

The error on the page is:

Error 404: Page Not Found
b'Project frostyx/nvid\xc3\xa2\xe2\x82\xac\xc2\xa6 does not exist.' 

When I simply revert the code from PR #3076, I get:

Error 404: Page Not Found
Project frostyx/nvid… does not exist. 

and not the original #3072. So either we can simply revert it, or I need to find a better reproducer for the original issue.

@nikromen
Copy link
Member Author

aha, it is true that utf-8 characters like "…" work now, you are right.

I've been researching why this is, and I've concluded that we must have had the locale set to US.US-ASCII or something similar on the old F39 machines at the time. Now we have US.UTF-8. That's probably why this error used to occur, because logging uses by default io.text_encoding, which defaults to Python's UTF-8 mode and returns either locale or UTF-8 if the locale is not set. In that case it was trying to log to ASCII instead of UTF-8.

In any case, it just reduced the set of characters on which this error occurs. Non-UTF-8 characters (such as UTF-16) will fall with the same error.

So there are two possibilities:

what do you think about it?

@praiskup
Copy link
Member

The old reproducer was:

b'Project t0xic0der/nvid\xe2\x80\xa6 does not exist.'

That is in utf:

'Project t0xic0der/nvid… does not exist.'

This doesn't fail the logger on F41 (current prod). It is ugly, yes, the logging code still dumps the message using the \xe2\x80... instead of \u2026, not sure why (does logging want ascii? if yes, where/why? because they don't want to ignore the encoding errors?)

But at least it doesn't try to encode elpisis into ASCII.

When I simply revert the code from PR #3076, I get:

Yes, please, revert. I tested with str=$(dd if=/dev/urandom bs=8 count=1) and I got:

[remote 77.92.220.242:48662] ERROR:coprs:Response error: 404 Project frostyx/\xef\xbf\xbd|\xef\xbf\xbd\xef\xbf\xbd|\xef\xbf\xbdO\x02 does not exist.

We should be safe; no tracebacks. But if we miss something, we'll at least get the reproducer.

@nikromen
Copy link
Member Author

not sure why (does logging want ascii? if yes, where/why? because they don't want to ignore the encoding errors?)

I've concluded that we must have had the locale set to US.US-ASCII or something similar on the old F39 machines at the time. Now we have US.UTF-8. That's probably why this error used to occur, because logging uses by default io.text_encoding, which defaults to Python's UTF-8 mode and returns either locale or UTF-8 if the locale is not set. In that case it was trying to log to ASCII instead of UTF-8.

We should be safe; no tracebacks. But if we miss something, we'll at least get the reproducer.

this will still mean that we got logging error when someone sends e.g. utf32 only characters

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Message for 404 project does not exist is badly decoded
3 participants