@ManBearTM commented Jul 9, 2025

I took a stab at fixing #751 (and maybe the source of a few other issues). A similar fix could probably be made for UTF-16 files, but I opted not to look into that right now.

EDIT: I realize now that this fix may only apply to FileStreamer.

@dboskovic (Collaborator) commented

🤔 This logic makes sense but almost definitely adds a major performance hit. It seems the right solution would be to make sure the leftover bytes properly re-assemble into valid UTF-8 byte sequences. If you could add a regression test that validates this, I'll include the fix in my #1100 refactor.

@dboskovic added the v6-todo label ("This issue should be handled in the v6 release") on Jul 29, 2025
@ManBearTM (Author) commented Aug 13, 2025

Purely anecdotal, but I tested it with a 50 MB file and did not notice any performance hit. I can try to find the time to add tests, but I can't promise anything - life is a bit busy atm, sorry.

> It seems the right solution would be to make sure the leftover bytes properly re-assemble into valid UTF-8 byte sequences.

Yeah, this is what it does, but before decoding the bytes. I initially tried to implement a fix AFTER decoding the UTF-8 bytes, but the problem is that the byte "information" is lost once a broken sequence is decoded to "�", so there is no way to recover the original bytes from it, if that makes sense 🙂
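
For readers following along, here is a minimal sketch of the idea discussed above. It is not the PR's actual diff. All function names (`incompleteTailLength`, `decodeChunks`, `decodeNaively`, `decodeStreaming`) are hypothetical, and it assumes the chunks are available as raw `Uint8Array`s, which may not match how FileStreamer reads its input. It shows that decoding each chunk on its own produces "�" when a multi-byte character straddles a chunk boundary, while holding back the incomplete trailing bytes and prepending them to the next chunk does not.

```typescript
// Hypothetical helper: how many trailing bytes of `chunk` belong to a UTF-8
// sequence that is cut off at the end of the chunk (0 if the end is clean).
function incompleteTailLength(chunk: Uint8Array): number {
  // A UTF-8 sequence is at most 4 bytes, so an incomplete one is at most 3.
  const limit = Math.min(3, chunk.length);
  for (let back = 1; back <= limit; back++) {
    const byte = chunk[chunk.length - back];
    if ((byte & 0xc0) === 0x80) continue; // continuation byte, keep looking back
    let expected = 1; // ASCII or an invalid lead byte: nothing to hold back
    if ((byte & 0xe0) === 0xc0) expected = 2;
    else if ((byte & 0xf0) === 0xe0) expected = 3;
    else if ((byte & 0xf8) === 0xf0) expected = 4;
    return expected > back ? back : 0;
  }
  return 0;
}

// Reassemble bytes BEFORE decoding: hold back the incomplete tail of each
// chunk and prepend it to the next one.
function decodeChunks(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder("utf-8");
  let carry = new Uint8Array(0);
  let text = "";
  for (const chunk of chunks) {
    const joined = new Uint8Array(carry.length + chunk.length);
    joined.set(carry, 0);
    joined.set(chunk, carry.length);
    const cut = joined.length - incompleteTailLength(joined);
    text += decoder.decode(joined.subarray(0, cut));
    carry = joined.slice(cut);
  }
  // Anything still held back means the input itself was truncated.
  return text + decoder.decode(carry);
}

// Decoding every chunk independently: split characters become U+FFFD.
function decodeNaively(chunks: Uint8Array[]): string {
  return chunks.map((c) => new TextDecoder("utf-8").decode(c)).join("");
}

// Alternative: TextDecoder's built-in streaming mode buffers partial sequences.
function decodeStreaming(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder("utf-8");
  let text = "";
  for (const chunk of chunks) text += decoder.decode(chunk, { stream: true });
  return text + decoder.decode(); // flush
}

// "€" is E2 82 AC; split the input so the boundary falls inside it.
const bytes = new TextEncoder().encode("a€b");
const chunks = [bytes.slice(0, 2), bytes.slice(2)];
console.log(decodeChunks(chunks) === "a€b");
console.log(decodeStreaming(chunks) === "a€b");
console.log(decodeNaively(chunks).includes("\uFFFD"));
```

The look-back in `incompleteTailLength` inspects at most three bytes per chunk, so the per-chunk cost of the check itself is constant; the copy into `joined` is the part that scales with chunk size in this sketch. Whether `TextDecoder` with `{ stream: true }` could replace the manual carry-over in this code base is an open question that depends on how the chunks are read.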
