
Commit 1e10bdb

Test the (meta) prescan algorithm
This change adds a `preparsed` subdirectory in the `encoding` directory. It holds tests whose expected result is the output of running *only* the *encoding sniffing algorithm* at https://html.spec.whatwg.org/#encoding-sniffing-algorithm (of which the main sub-algorithm is the so-called “meta prescan”), without also running the tokenization state machine and tree-construction stage.

This change also adds a README file that explicitly documents what the expected results for the encoding tests are, based on whether or not they’re in the `preparsed` subdirectory.

Without those changes, it’s unclear whether the expected results shown in the existing tests are for the output of fully parsing the test data (through the tokenization state machine and tree-construction stage) or just the output of the encoding sniffing algorithm. And without those changes, we also don’t have any tests a system can use for testing only the output from the encoding sniffing algorithm.

Fixes #28
1 parent 6ddcf58 commit 1e10bdb

2 files changed: +90 -0 lines changed

encoding/README.md

+39
@@ -0,0 +1,39 @@
+Encoding Tests
+==============
+
+Each file containing encoding tests has any number of tests separated by
+two newlines (LF) and a single newline before the end of the file:
+
+[TEST]LF
+LF
+[TEST]LF
+LF
+[TEST]LF
+
+...where [TEST] is the format documented below.
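As a rough illustration of the file layout described above, the following Python sketch splits the raw bytes of a test file into individual tests. The name `split_tests` is hypothetical and not part of the test suite. The sketch also assumes that the test files contain a literal `#data` line (the backslash in the README being Markdown escaping), and it splits only where the two-LF separator is followed by a `#data` line, so blank lines inside test data are left alone.

```python
import re


def split_tests(raw: bytes) -> list[bytes]:
    """Split the raw bytes of a test file into individual [TEST] chunks.

    Hypothetical helper: each returned chunk has its trailing LF removed.
    """
    # Drop the single LF that precedes the end of the file.
    if raw.endswith(b"\n"):
        raw = raw[:-1]
    if not raw:
        return []
    # Tests are separated by two LFs. Requiring "#data" right after the
    # separator is an assumption of this sketch; it keeps blank lines that
    # occur inside test data from being treated as separators.
    return re.split(rb"\n\n(?=#data\n)", raw)
```

Working on bytes rather than decoded text seems the safer choice here, since the test data exists precisely to exercise encoding detection. Test data that itself contains a blank line followed by a `#data` line would still be split incorrectly by this sketch.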
+
+Encoding test format
+====================
+
+Each test must begin with a string "\#data", followed by a newline (LF).
+All subsequent lines until a line that says "\#encoding" are the test data
+and must be passed to the system being tested unchanged, except with the
+final newline (on the last line) removed.
+
+Then there must be a line that says "\#encoding", followed by a newline
+(LF), followed by a string indicating an encoding name, followed by a newline
+(LF). The encoding name indicated is the expected character encoding for
+the output with the given test data as input.
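The test format above can be sketched as a small parser. This is a minimal Python illustration under stated assumptions, not code from the test suite: `parse_test` is a hypothetical name, the markers are assumed to appear in the files as literal `#data` and `#encoding` lines, and the encoding name is assumed to be ASCII.

```python
def parse_test(chunk: bytes) -> tuple[bytes, str]:
    """Return (test data, expected encoding name) for a single test.

    Hypothetical helper; `chunk` is one test, with or without its final LF.
    """
    if chunk.endswith(b"\n"):
        chunk = chunk[:-1]
    lines = chunk.split(b"\n")
    if lines[0] != b"#data":
        raise ValueError('test must begin with a "#data" line')
    try:
        # The first "#encoding" line after "#data" ends the test data.
        marker = lines.index(b"#encoding", 1)
    except ValueError:
        raise ValueError('test has no "#encoding" line') from None
    if marker + 1 >= len(lines):
        raise ValueError('no encoding name after "#encoding"')
    # Joining the data lines with LF leaves out the LF that terminated the
    # last data line, i.e. the "final newline removed" rule.
    data = b"\n".join(lines[1:marker])
    # Assumption of this sketch: encoding names are plain ASCII.
    encoding = lines[marker + 1].decode("ascii")
    return data, encoding
```

The returned bytes are what would be handed to the system under test, and the returned string is the encoding name to compare its answer against.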
+
+For the tests in the `preparsed` subdirectory, the encoding name indicated
+is the expected result of running the *encoding sniffing algorithm* at
+https://html.spec.whatwg.org/#encoding-sniffing-algorithm with the given
+test data as input; that is, it's the expected result of running *only* the
+*encoding sniffing algorithm*, without also running the tokenization state
+machine and tree-construction stage defined in the spec.
+
+For all tests outside the subdirectory named `preparsed`, the encoding name
+indicated is instead the expected character encoding for the output after
+fully parsing the given test data; that is, it's the expected character
+encoding for the output after running the tokenization state machine and
+tree-construction stage.
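A test harness could use the directory rule above to decide which stage of its parser to compare against the expected encoding. The sketch below is only an illustration: the function name, the returned labels, and the example file names are hypothetical and do not come from the commit.

```python
from pathlib import PurePosixPath


def expectation_kind(test_path: str) -> str:
    """Say which result the "#encoding" value of a test file describes.

    Hypothetical helper; the returned labels are made up for this sketch.
    """
    # Only directory components count, so drop the file name itself.
    directories = PurePosixPath(test_path).parts[:-1]
    # Tests under a directory named "preparsed" expect the output of the
    # encoding sniffing algorithm alone; all other tests expect the encoding
    # after tokenization and tree construction have also run.
    return "sniffing-only" if "preparsed" in directories else "full-parse"
```

For example, with hypothetical file names, `expectation_kind("encoding/preparsed/tests1.dat")` yields `"sniffing-only"`, while `expectation_kind("encoding/tests1.dat")` yields `"full-parse"`.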
