New flag to try known charsets when bad encoding is specified in HTTP or HTML #459

Merged: benoit74 merged 1 commit into main from accept_unknown_encodings on Mar 30, 2026

@benoit74 benoit74 commented Mar 30, 2026

Collaborator

This PR introduces a new --ignore-unknown-charsets flag, which allows warc2zim to ignore the charset declared in HTTP headers or HTML when decoding fails because that charset is unknown.

Currently, when this happens, the scraper simply halts, often over trivial mistakes such as the one in https://farm.openzim.org/pipeline/71c03b16-124d-4ee1-ac22-218c2a1392a1:

LookupError: unknown encoding: utf-4

When such an error occurs, it is probably worth trying to decode with the --charsets-to-try values instead.

Unknown encodings are stored and reported at the end of the run, giving a chance to fix them in subsequent invocations.
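
For illustration, the behavior described above could be sketched roughly as follows. This is not the code from this PR: `decode_with_fallback`, its parameters, and the `unknown_seen` set are hypothetical names chosen for the example. The only fact relied on is that Python raises `LookupError` when asked to decode with an encoding it does not know.

```python
def decode_with_fallback(
    content: bytes,
    declared: str,
    charsets_to_try: list[str],
    ignore_unknown: bool,
    unknown_seen: set[str],
) -> str:
    """Hypothetical sketch: decode `content`, falling back when `declared` is unknown."""
    try:
        # A decoding failure with a *known* charset (UnicodeDecodeError)
        # is deliberately not handled here and propagates to the caller.
        return content.decode(declared)
    except LookupError:
        # Python does not know this encoding name (e.g. "utf-4").
        if not ignore_unknown:
            raise  # behavior without the flag: the run is halted
        unknown_seen.add(declared)  # remembered so it can be reported at the end

    # Assumed to mirror the --charsets-to-try values, tried in order.
    for charset in charsets_to_try:
        try:
            return content.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError(f"could not decode content with any of {charsets_to_try}")


unknown: set[str] = set()
text = decode_with_fallback(
    "café".encode("utf-8"), "utf-4", ["utf-8", "iso-8859-1"], True, unknown
)
print(text, sorted(unknown))
```

Collecting the unknown names in a set rather than logging them one by one keeps the end-of-run report free of duplicates when the same bad charset appears on many pages.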

@benoit74 benoit74 self-assigned this Mar 30, 2026
@benoit74 benoit74 force-pushed the accept_unknown_encodings branch from b366764 to 9e5a0e9 Compare March 30, 2026 14:55
@benoit74 benoit74 merged commit ccd0f08 into main Mar 30, 2026
5 checks passed
@benoit74 benoit74 deleted the accept_unknown_encodings branch March 30, 2026 15:00
