A CJK character (3 bytes in UTF-8) is split into two tokens in the JSON output.


## Detail

I noticed that whisper.cpp can now output full JSON, and its default output encoding is UTF-8.

When the output consists entirely of ASCII characters, it works perfectly.

But when the output contains CJK characters, a small issue arises.

A UTF-8 basic: UTF-8 is a variable-length character encoding in which ASCII characters take one byte and all other characters take 2 to 4 bytes. Most CJK characters take 3 bytes.

The encoding rules for UTF-8 are as follows:

- Single-byte encoding ranges from 0x00 to 0x7F (corresponding to ASCII characters).
- Double-byte encoding ranges from 0xC2 80 to 0xDF BF.
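- Triple-byte encoding ranges from 0xE0 A0 80 to 0xEF BF BF.
- Four-byte encoding ranges from 0xF0 90 80 80 to 0xF4 8F BF BF (rare CJK extension characters fall here).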
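
As an illustration of the ranges listed above, here is a minimal Python sketch that maps the first byte of a sequence to the expected sequence length. The function name `utf8_sequence_length` is made up for this example and is not part of whisper.cpp.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the total length of a UTF-8 sequence given its first byte.

    Returns 0 when the byte cannot start a sequence (for example a
    continuation byte in the range 0x80..0xBF).
    """
    if lead_byte <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # most CJK characters
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    return 0


encoded = "中".encode("utf-8")
print(encoded.hex(" "), utf8_sequence_length(encoded[0]))  # e4 b8 ad 3
```

A token that ends before the expected number of bytes has been seen carries an incomplete character.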

In the full JSON output, many CJK characters are split across two tokens: the first token holds two bytes and the second holds one byte. Neither token is a valid UTF-8 sequence on its own, so the JSON file cannot be read with UTF-8 decoding.

![image](https://github.com/ggerganov/whisper.cpp/assets/19181833/583308f7-6948-49b8-a59b-cb7ca62e5ccc)
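
The effect can be reproduced in isolation. The sketch below splits one 3-byte character into a 2-byte piece and a 1-byte piece, as described above, and checks each piece with Python's UTF-8 decoder. The character and the helper name `is_valid_utf8` are chosen only for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


char_bytes = "中".encode("utf-8")               # 3 bytes: e4 b8 ad
first, second = char_bytes[:2], char_bytes[2:]  # 2-byte piece + 1-byte piece

print(is_valid_utf8(first))           # False: truncated sequence
print(is_valid_utf8(second))          # False: lone continuation byte
print(is_valid_utf8(first + second))  # True: the pieces rejoin into one character
```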



## Reproduce 

I used the [v1.5.4 Windows binary Release](https://github.com/ggerganov/whisper.cpp/releases/download/v1.5.4/whisper-bin-x64.zip).  

Here is the zipped wav sound file: 

[test-zh.wav.zip](https://github.com/ggerganov/whisper.cpp/files/14013002/test-zh.wav.zip)

Using the command: 

```
main.exe --model ../model/medium.bin --language zh -otxt -ojf  test-zh.wav
```

A txt file and a full JSON file are produced:

- [test-zh.wav.txt](https://github.com/ggerganov/whisper.cpp/files/14013044/test-zh.wav.txt)
- [test-zh.wav.json](https://github.com/ggerganov/whisper.cpp/files/14013042/test-zh.wav.json)


## Possible solution 

A possible solution is to check whether each token is a valid UTF-8 sequence and, if it is not, concatenate it with the following token(s) before writing to the JSON file.
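
A rough Python sketch of that idea follows, assuming the tokens are available as raw byte strings. It is not the whisper.cpp implementation, and `merge_split_tokens` is a hypothetical name; how per-token metadata such as timestamps or probabilities should be combined is left open.

```python
def merge_split_tokens(tokens: list[bytes]) -> list[str]:
    """Concatenate consecutive tokens until the buffer decodes as UTF-8."""
    merged: list[str] = []
    pending = b""
    for token in tokens:
        pending += token
        try:
            merged.append(pending.decode("utf-8"))
            pending = b""
        except UnicodeDecodeError:
            # Not decodable yet: keep the bytes and wait for the next token.
            continue
    if pending:
        # Leftover bytes never formed valid UTF-8; substitute U+FFFD.
        merged.append(pending.decode("utf-8", errors="replace"))
    return merged


data = "你好".encode("utf-8")               # two 3-byte characters
tokens = [data[:2], data[2:3], data[3:]]   # first character split 2 + 1
print(merge_split_tokens(tokens))          # ['你', '好']
```

This simple version keeps buffering after any decode error, so a genuinely invalid byte would absorb the tokens that follow it. A real fix would probably cap the buffer at 4 bytes of incomplete data.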
