Map each code point of an ill-formed UTF-8 subsequence to a replacement character individually

As pointed out in https://github.com/Kotlin/kotlinx-io/pull/290#discussion_r1567268068, kotlinx-io handles different ill-formed UTF-8 subsequences inconsistently: sometimes a whole multi-code-point subsequence is replaced with a single replacement character, and sometimes each code point is replaced separately:
- `0xf0 0x89 0x89 <EOF>` -> `�`
- `0xf0 0x89 0x89 0x89 <EOF>` -> `�`
- `0xf0 0xf0 0xf0 <EOF>` -> `���`

The UTF-8 spec allows handling ill-formed sequences in any way we want, as long as errors are reported somehow. However, the current behavior is inconsistent, and it's hard to reason about how an arbitrary byte sequence will be converted.

We should improve the way ill-formed sequences are handled and stick to the approach adopted by other languages/libraries: treat every byte of an ill-formed subsequence as its own single-byte ill-formed subsequence, and replace each one with a separate replacement character.
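To make the proposed rule concrete, here is a minimal sketch in Python (not kotlinx-io code). The names `expected_length` and `decode_replacing_each_byte` are hypothetical, and the strict "one replacement character per offending byte" policy is one reading of the proposal rather than a settled design.

```python
REPLACEMENT = "\ufffd"

def expected_length(lead: int) -> int:
    """Length implied by a lead byte, or 0 if the byte cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation bytes, 0xC0, 0xC1, 0xF5..0xFF

def decode_replacing_each_byte(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per byte that is not part of a
    well-formed sequence (assumed policy, see the note above)."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        if n and len(chunk) == n:
            try:
                # Strict decoding rejects bad continuation bytes, overlong
                # forms, surrogates, and values above U+10FFFF.
                out.append(chunk.decode("utf-8"))
                i += n
                continue
            except UnicodeDecodeError:
                pass
        # Replace only the current byte and retry from the next one.
        out.append(REPLACEMENT)
        i += 1
    return "".join(out)

for raw in (b"\xf0\x89\x89", b"\xf0\x89\x89\x89", b"\xf0\xf0\xf0"):
    print(raw.hex(" "), "->", len(decode_replacing_each_byte(raw)), "replacement characters")
```

With this sketch, every input produces as many replacement characters as it has offending bytes. A truncated but otherwise valid prefix such as `0xe2 0x82 <EOF>` yields two replacement characters here; whether kotlinx-io should do the same is left open.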

This is how other languages already handle such input:
- Java:
```
jshell> new String(new byte[]{(byte)0xf0,(byte)0x89,(byte)0x89,(byte)0x89})
$5 ==> "����"
```
- Python 3:
```
>>> b'\xf0\x89\x89\x89'.decode("utf-8", errors='replace')
'����'
```
- Go: 
```
fmt.Println(string([]byte{0xf0, 0x89, 0x89, 0x89}))
...

����
```
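
For comparison, the three inputs listed at the top of this issue can be run through Python's built-in decoder with `errors="replace"`. This is only a quick check of one reference implementation; the helper name `count_replacements` is made up for the example.

```python
# The three byte sequences from the list at the top of this issue.
samples = [
    bytes([0xF0, 0x89, 0x89]),
    bytes([0xF0, 0x89, 0x89, 0x89]),
    bytes([0xF0, 0xF0, 0xF0]),
]

def count_replacements(data: bytes) -> int:
    """Count U+FFFD characters produced by Python's lenient UTF-8 decoding."""
    decoded = data.decode("utf-8", errors="replace")
    return decoded.count("\ufffd")

for data in samples:
    print(data.hex(" "), "->", count_replacements(data))
```

On CPython 3 this should report 3, 4, and 3 replacement characters respectively, so none of these inputs is collapsed into a single `�`.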
