Description
Detail
I noticed that whisper.cpp can now output full JSON, and its default output encoding is UTF-8.
When the output consists entirely of ASCII characters, it works perfectly.
But when the output contains CJK characters, a small issue arises.
Some UTF-8 background: UTF-8 is a variable-length character encoding in which ASCII characters are represented using one byte, while all other characters are encoded using 2 to 4 bytes. Most CJK characters take 3 bytes.
The encoding rules for UTF-8 are as follows:
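- Single-byte encoding ranges from 0x00 to 0x7F (corresponding to ASCII characters).
- Double-byte encoding ranges from 0xC2 80 to 0xDF BF.
- Triple-byte encoding ranges from 0xE0 A0 80 to 0xEF BF BF.
- Four-byte encoding ranges from 0xF0 90 80 80 to 0xF4 8F BF BF (this covers the rarer CJK extension characters).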
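To make the ranges above concrete, here is a small Python sketch that derives the expected sequence length from the first byte. The helper name `utf8_sequence_length` is made up for illustration and is not part of whisper.cpp.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with `lead` should have.

    Returns 0 if `lead` cannot start a sequence (hypothetical helper).
    """
    if lead <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3  # most CJK characters
    if 0xF0 <= lead <= 0xF4:
        return 4
    # Continuation byte (0x80-0xBF) or a byte that is never a valid lead.
    return 0


encoded = "中".encode("utf-8")
print(encoded.hex(" "), utf8_sequence_length(encoded[0]))  # e4 b8 ad 3
```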
In the full JSON output, many CJK characters are split across two tokens: the first token holds two bytes of the character and the second holds the remaining byte. Neither token is a valid UTF-8 sequence on its own, so the JSON file cannot be read with UTF-8 decoding.
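The effect of such a split can be reproduced in a few lines of Python. This is only an illustration of the byte-level problem; the character and the helper `is_valid_utf8` are chosen for the example and are not taken from the actual output file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


char_bytes = "中".encode("utf-8")  # three bytes: e4 b8 ad
first, second = char_bytes[:2], char_bytes[2:]  # the 2 + 1 split described above

print(is_valid_utf8(first))           # False: truncated sequence
print(is_valid_utf8(second))          # False: lone continuation byte
print(is_valid_utf8(first + second))  # True: the rejoined character
```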
Reproduce
I used the v1.5.4 Windows binary Release.
Here is the zipped wav sound file:
Using the command:
main.exe --model ../model/medium.bin --language zh -otxt -ojf test-zh.wav
A txt result and a json-full result are produced:
Possible solution
A possible solution is to check whether each token is a valid UTF-8 sequence and, if it is not, concatenate it with the following token(s) before writing to the JSON file.
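As a rough sketch of that idea (in Python rather than whisper.cpp's C++), an incremental decoder can buffer incomplete byte sequences until the rest of the character arrives. The function `merge_broken_tokens` and its input format, a list of raw token byte strings, are assumptions made for this example and do not reflect whisper.cpp's internal API.

```python
import codecs


def merge_broken_tokens(token_bytes):
    """Join tokens that split a multi-byte character (hypothetical helper).

    `token_bytes` is assumed to be a list of raw token texts as bytes.
    """
    # The incremental decoder keeps an incomplete trailing sequence in its
    # buffer and returns an empty string until the character is complete.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    merged = []
    for raw in token_bytes:
        text = decoder.decode(raw)
        if text:
            merged.append(text)
    # Flush anything still buffered; truly broken bytes become U+FFFD.
    tail = decoder.decode(b"", final=True)
    if tail:
        merged.append(tail)
    return merged


tokens = [b"\xe4\xb8", b"\xad", "文".encode("utf-8")]
print(merge_broken_tokens(tokens))  # ['中', '文']
```

A real fix would also need to decide how to combine the per-token metadata (timestamps, probabilities) of the merged tokens, which this sketch ignores.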