frontend: parse every error message to utf-8, ignore non-parsable chars #3651

nikromen · 2025-03-03T10:31:37Z

If user sends to API some weird non-utf-8 character and we then
try to log this, the logging freaks out with logging error (see [1]).
[2] fixes this, but only for one endpoint and one type of user input,
but there are many ways how user can do this. On the other hand, [2]
caused bad formating and now we got messages like b'Project a/b does not exist'.

This patch tries to parse every error message we want to store, and maybe
later log it, to utf-8 and store it as a string. If some non-utf-8
character appears in the message output, its byte won't be pares and it
will be stored/printed as the string representation of that byte.

Example:

User calls https://copr.fedorainfracloud.org/coprs/g/some<nasty_characters>string

the string value in CoprHttpException.message will be some\x01\x02string

where the \x01\x02 is example of the <nasty_characters>

[1] - #3072
[2] - #3076

Fix #3644

github-actions · 2025-03-03T10:48:48Z

Pull Request validation

Failed

🔴 Review - Missing review from a member (2 required)

Success

🟢 CI - All checks have passed

nikromen · 2025-03-03T16:19:19Z

I updated the parsing - it doesn't store the bytes, but it rather parses the string right away so we don't work with bytes. I also updated the commit message/PR description, PTAL

If user sends to API some weird non-utf-8 character and we then try to log this, the logging freaks out with `logging error` (see [1]). [2] fixes this, but only for one endpoint and one type of user input, but there are many ways how user can do this. On the other hand, [2] caused bad formating and now we got messages like `b'Project a/b does not exist'`. This patch tries to parse every error message we want to store, and maybe later log it, to utf-8 and store it as a string. If some non-utf-8 character appears in the message output, its byte won't be pares and it will be stored/printed as the string representation of that byte. Example: User calls https://copr.fedorainfracloud.org/coprs/g/some<nasty_characters>x02string the string value in CoprHttpException.message will be some\x01\x02string where the \x01\x02 is example of the <nasty_characters> [1] - fedora-copr#3072 [2] - fedora-copr#3076 Fix fedora-copr#3644

FrostyX · 2025-03-06T11:23:36Z

Strangely enough, when I use code from main and try this URL:

http://127.0.0.1:5000/coprs/frostyx/nvidâ€¦

The error on the page is:

Error 404: Page Not Found
b'Project frostyx/nvid\xc3\xa2\xe2\x82\xac\xc2\xa6 does not exist.'

When I simply revert the code from PR #3076, I get:

Error 404: Page Not Found
Project frostyx/nvidâ€¦ does not exist.

and not the original #3072. So either we can simply revert it, or I need to find a better reproducer for the original issue.

nikromen · 2025-03-10T15:11:32Z

aha, it is true that utf-8 characters like "â€¦" work now, you are right.

I've been researching why this is, and I've concluded that we must have had the locale set to US.US-ASCII or something similar on the old F39 machines at the time. Now we have US.UTF-8. That's probably why this error used to occur, because logging uses by default io.text_encoding, which defaults to Python's UTF-8 mode and returns either locale or UTF-8 if the locale is not set. In that case it was trying to log to ASCII instead of UTF-8.

In any case, it just reduced the set of characters on which this error occurs. Non-UTF-8 characters (such as UTF-16) will fall with the same error.

So there are two possibilities:

accept this PR as a solution that reverses convert eror message to binary text before emitting #3076 while outputting all non-UTF-8 characters as a pure string of their byte value
remove the parsing part from this PR and just revert convert eror message to binary text before emitting #3076 . If someone will send us requests with non-utf8 characters and we get errors, then do the parsing afterwards.

what do you think about it?

praiskup · 2025-03-10T18:14:53Z

The old reproducer was:

b'Project t0xic0der/nvid\xe2\x80\xa6 does not exist.'

That is in utf:

'Project t0xic0der/nvid… does not exist.'

This doesn't fail the logger on F41 (current prod). It is ugly, yes, the logging code still dumps the message using the \xe2\x80... instead of \u2026, not sure why (does logging want ascii? if yes, where/why? because they don't want to ignore the encoding errors?)

But at least it doesn't try to encode elpisis into ASCII.

When I simply revert the code from PR #3076, I get:

Yes, please, revert. I tested with str=$(dd if=/dev/urandom bs=8 count=1) and I got:

[remote 77.92.220.242:48662] ERROR:coprs:Response error: 404 Project frostyx/\xef\xbf\xbd|\xef\xbf\xbd\xef\xbf\xbd|\xef\xbf\xbdO\x02 does not exist.

We should be safe; no tracebacks. But if we miss something, we'll at least get the reproducer.

nikromen · 2025-03-11T21:31:27Z

not sure why (does logging want ascii? if yes, where/why? because they don't want to ignore the encoding errors?)

I've concluded that we must have had the locale set to US.US-ASCII or something similar on the old F39 machines at the time. Now we have US.UTF-8. That's probably why this error used to occur, because logging uses by default io.text_encoding, which defaults to Python's UTF-8 mode and returns either locale or UTF-8 if the locale is not set. In that case it was trying to log to ASCII instead of UTF-8.

We should be safe; no tracebacks. But if we miss something, we'll at least get the reproducer.

this will still mean that we got logging error when someone sends e.g. utf32 only characters

nikromen marked this pull request as draft March 3, 2025 10:33

nikromen force-pushed the fix-bad-formatting branch from 059d40a to e338e9e Compare March 3, 2025 10:44

github-actions bot added pr/failing-ci pr/missing-review labels Mar 3, 2025

nikromen force-pushed the fix-bad-formatting branch from e338e9e to 3329aaf Compare March 3, 2025 11:10

nikromen force-pushed the fix-bad-formatting branch from 043086c to 4c8b3ec Compare March 3, 2025 16:29

nikromen changed the title ~~frontend: remove unnecesary formating of error message to error handler~~ frontend: parse every error message to utf-8, ignore non-parsable chars Mar 3, 2025

nikromen marked this pull request as ready for review March 3, 2025 16:31

github-actions bot removed the pr/failing-ci label Mar 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

frontend: parse every error message to utf-8, ignore non-parsable chars #3651

frontend: parse every error message to utf-8, ignore non-parsable chars #3651

nikromen commented Mar 3, 2025 •

edited

Loading

github-actions bot commented Mar 3, 2025 •

edited

Loading

nikromen commented Mar 3, 2025 •

edited

Loading

FrostyX commented Mar 6, 2025

nikromen commented Mar 10, 2025

praiskup commented Mar 10, 2025

nikromen commented Mar 11, 2025

frontend: parse every error message to utf-8, ignore non-parsable chars #3651

Are you sure you want to change the base?

frontend: parse every error message to utf-8, ignore non-parsable chars #3651

Conversation

nikromen commented Mar 3, 2025 • edited Loading

github-actions bot commented Mar 3, 2025 • edited Loading

Pull Request validation

Failed

Success

nikromen commented Mar 3, 2025 • edited Loading

FrostyX commented Mar 6, 2025

nikromen commented Mar 10, 2025

praiskup commented Mar 10, 2025

nikromen commented Mar 11, 2025

nikromen commented Mar 3, 2025 •

edited

Loading

github-actions bot commented Mar 3, 2025 •

edited

Loading

nikromen commented Mar 3, 2025 •

edited

Loading