Update tokenizer.py to fix erroneous Javanese language code #2669

KathyReid · 2025-10-08T11:44:49Z

The ISO 639-1 code for Javanese is `jv`, not `jw` as given here. It should be listed as `jv`.

This is a breaking change: anyone who calls `transcribe()` with `language='jw'` will get an error.

ryanheise · 2025-10-08T12:09:59Z

Unfortunately, you can't fix this simply by renaming the key in the dictionary. The model was trained to associate `jw` with Javanese speech and will only recognise Javanese speech with the `jw` code, so you need to pass in `jw` or `Javanese`. If you really wanted to correct for this, you could add a mapping from `jv` to `jw` in `TO_LANGUAGE_CODE`, although that's a hack, since that dictionary is meant to map language names to codes.
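To make the alias idea concrete, here is a rough, self-contained sketch. It does not import or modify the real `tokenizer.py`: the two dictionaries are toy stand-ins with only the entries needed, and `CODE_ALIASES` and `resolve_language` are hypothetical names invented for illustration. It keeps the alias in its own table rather than in `TO_LANGUAGE_CODE`, so the name-to-code dictionary keeps its original meaning.

```python
# Toy stand-ins for the lookup tables discussed above (not the real ones).
# Code -> language name, keeping the code the model was trained with.
LANGUAGES = {"en": "english", "jw": "javanese"}

# Language name -> code, derived from LANGUAGES.
TO_LANGUAGE_CODE = {name: code for code, name in LANGUAGES.items()}

# Hypothetical alias table: ISO 639-1 code -> code the model expects.
CODE_ALIASES = {"jv": "jw"}


def resolve_language(language: str) -> str:
    """Return the model's code for a language code, alias, or name."""
    key = language.lower()
    # Translate a corrected ISO code to the trained code, if one is registered.
    key = CODE_ALIASES.get(key, key)
    if key in LANGUAGES:
        return key
    if key in TO_LANGUAGE_CODE:
        return TO_LANGUAGE_CODE[key]
    raise ValueError(f"Unsupported language: {language}")


print(resolve_language("jv"))        # alias for the trained code
print(resolve_language("jw"))        # existing callers keep working
print(resolve_language("Javanese"))  # lookup by name
```

With this shape, `jv`, `jw`, and `Javanese` all resolve to the code the model was trained on, so existing callers that pass `jw` are not broken.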

Commit bd9d47a: Update tokenizer.py to fix erroneous Javanese language code
