CJK character (3 bytes) is split into two tokens in JSON output #1798

Open
@HaujetZhao

Detail

I noticed that whisper.cpp can now output full JSON, and its default output encoding is UTF-8.

When the output consists only of ASCII characters, it works perfectly.

But when the output contains CJK characters, a small issue arises.

Some UTF-8 background: UTF-8 is a variable-length character encoding in which ASCII characters take one byte, while most CJK characters take 3 bytes (rarer ones outside the Basic Multilingual Plane take 4).

The encoding rules for UTF-8 are as follows:

  • Single-byte encoding ranges from 0x00 to 0x7F (corresponding to ASCII characters).
  • Double-byte encoding ranges from 0xC2 80 to 0xDF BF.
  • Triple-byte encoding ranges from 0xE0 A0 80 to 0xEF BF BF.
  • Quadruple-byte encoding ranges from 0xF0 90 80 80 to 0xF4 8F BF BF.
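
As a rough illustration of these rules, the sketch below derives the length of a UTF-8 sequence from its first byte. The function name `utf8_sequence_length` is made up for this example and is not part of whisper.cpp; the 4-byte case (lead bytes 0xF0 to 0xF4) is included for completeness.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the byte length of a UTF-8 sequence given its first byte.

    Returns 0 if the byte cannot start a sequence (for example a
    continuation byte in the range 0x80 to 0xBF).
    Hypothetical helper for illustration only.
    """
    if lead_byte <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # most CJK characters
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    return 0


encoded = "中".encode("utf-8")
print(encoded.hex(" "))                  # e4 b8 ad
print(utf8_sequence_length(encoded[0]))  # 3
```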

In the full JSON output, many CJK characters are split into two tokens: the first token holds two bytes and the second holds the remaining byte. Neither token is a valid UTF-8 sequence on its own, so the JSON file cannot be read with UTF-8 decoding.
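
The effect can be reproduced without whisper.cpp. The sketch below splits one 3-byte character the same way (2 bytes + 1 byte) and checks each fragment. The JSON layout in `doc` is a simplified, hypothetical stand-in and not the actual format that whisper.cpp writes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = "中".encode("utf-8")        # 3 bytes: e4 b8 ad
first, second = raw[:2], raw[2:]  # 2-byte fragment + 1-byte fragment

print(is_valid_utf8(first))           # False (truncated sequence)
print(is_valid_utf8(second))          # False (lone continuation byte)
print(is_valid_utf8(first + second))  # True

# Hypothetical layout: each fragment ends up in its own string field,
# so the file as a whole is no longer valid UTF-8.
doc = b'[{"text": "' + first + b'"}, {"text": "' + second + b'"}]'
print(is_valid_utf8(doc))             # False
```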

Reproduce

I used the v1.5.4 Windows binary release.

Here is the zipped wav sound file:

test-zh.wav.zip

Using the command:

main.exe --model ../model/medium.bin --language zh -otxt -ojf  test-zh.wav

This produces a txt file and a full JSON file.

Possible solution

A possible solution is to check whether each token is a valid UTF-8 sequence and, if it is not, concatenate the broken tokens before writing them to the JSON file.
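
A minimal sketch of that idea, written in Python for brevity rather than as a patch to whisper.cpp: token bytes are accumulated until the buffer decodes as UTF-8, and only then emitted. The function name `merge_split_utf8` and the representation of tokens as raw `bytes` are assumptions made for this example. Note that in this simple form a genuinely invalid byte would keep the buffer growing until the end of input, where it is flushed with replacement characters.

```python
from typing import Iterable, List


def merge_split_utf8(tokens: Iterable[bytes]) -> List[str]:
    """Join consecutive byte tokens until each piece is valid UTF-8.

    Hypothetical helper for illustration only.
    """
    merged: List[str] = []
    pending = b""
    for token in tokens:
        pending += token
        try:
            text = pending.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not decodable yet, wait for the next token
        merged.append(text)
        pending = b""
    if pending:
        # Leftover bytes never formed valid UTF-8; keep them visible
        # as U+FFFD replacement characters instead of dropping them.
        merged.append(pending.decode("utf-8", errors="replace"))
    return merged


tokens = [b"\xe4\xb8", b"\xad", "文".encode("utf-8")]
print(merge_split_utf8(tokens))  # ['中', '文']
```

If per-token metadata such as timestamps or probabilities has to be kept, the merged token would also need a rule for combining those fields, which this sketch does not address.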


Labels: bug, enhancement
