Markup Metrics is a testing tool for comparing implementations of Automatic Markup (Auto-Markup) tools.
Install the requirements:

```
pip install -r requirements.txt
```

The first two auto-markup engines use OpenAI, so they need the `OPENAI_API_KEY` environment variable to be set. You can run the suite without it, but you won't actually be testing anything remotely similar to real auto-markup.
If you create an Auto-Markup engine, you just make a driver for it that implements the AutoMarkup interface.
Look at the markup_engines directory to see how to implement
a new driver.
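As a rough illustration, a driver might look something like the sketch below. The class name, `name` attribute, and `markup()` signature are assumptions made for this example, not the project's actual interface; copy the shape of an existing driver in `markup_engines` instead.

```python
# markup_engines/uppercase_automarkup.py -- hypothetical example, not a real driver.
# The name attribute and markup() signature are assumptions for illustration;
# mirror an existing driver in markup_engines for the real AutoMarkup interface.

class UppercaseAutoMarkup:
    """Toy engine: shouts the input text inside a trivial XML envelope."""

    name = "uppercase_automarkup"

    def markup(self, text: str, prompt: str) -> str:
        # A real engine would call an LLM or a rule-based system here,
        # using the schema's prompt.txt contents as guidance.
        body = text.strip().upper()
        return f"<doc><p>{body}</p></doc>"
```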
The metrics themselves are also pluggable, so you can compare each implementation in multiple ways.
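For example (again a sketch with an assumed shape, not the project's real metric API), a metric can be as small as a class that compares the generated XML with the reference XML and returns a score from 0 to 100:

```python
# metrics/length_delta_metric.py -- hypothetical example, not a real metric.
# The class shape and score() signature are assumptions for illustration.

class LengthDeltaMetric:
    """Toy metric: relative difference in document length, as a percentage."""

    name = "length_delta_metric"

    def score(self, output_xml: str, target_xml: str) -> float:
        longer = max(len(output_xml), len(target_xml), 1)
        delta = abs(len(output_xml) - len(target_xml))
        return 100.0 * delta / longer  # 0 = same length, 100 = worst case
```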
New test data can be added to the `data` directory. For every schema, you can add a `prompt.txt` and as many test `.txt` files as you want. Alongside each test `.txt` file you can add a `.xml` file that represents what the output should look like.
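A schema directory might therefore look like this (the exact file names below are just an example):

```
data/
  dita/
    prompt.txt    # prompt handed to the auto-markup engines for this schema
    test1.txt     # plain-text input to be marked up
    test1.xml     # ideal DITA output for test1.txt
    test2.txt
    test2.xml
```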
Set up a venv with `pip3 install -e .` and then run:

```
python markup-metrics.py
```

This runs all metrics against all engines (including two dummy/test engines). The output looks like this:
```
Processing gpt3.5_am1_automarkup with xater_metric
dita
data/dita/test1.txt (out/gpt3.5_am1_automarkup/dita/test1/test1.xml): 2.94%
data/dita/test2.txt (out/gpt3.5_am1_automarkup/dita/test2/test2.xml): 2.94%
data/dita/test3.txt (out/gpt3.5_am1_automarkup/dita/test3/test3.xml): 3.70%
Average gpt3.5_am1_automarkup / xater_metric / dita: 3.20%
html
data/html/test1.txt (out/gpt3.5_am1_automarkup/html/test1/test1.xml): 4.65%
data/html/test2.txt (out/gpt3.5_am1_automarkup/html/test2/test2.xml): 20.00%
data/html/test3.txt (out/gpt3.5_am1_automarkup/html/test3/test3.xml): 16.67%
Average gpt3.5_am1_automarkup / xater_metric / html: 13.77%
Average gpt3.5_am1_automarkup / xater_metric: 8.48%
```

In this output:

- `gpt3.5_am1_automarkup` is an auto-markup system based on GPT-3.5 and prompt engineering.
- `xater_metric` is a metric based on XML tokenization and the industry-standard Translation Edit Rate metric.
- `dita` is a schema under test.
- `data/dita/test1.txt` is a test file to be automatically encoded into DITA. It should have a sibling file `data/dita/test1.xml` which describes the ideal target output.
- `out/gpt3.5_am1_automarkup/dita/test1/test1.xml` is an output file. The same directory may contain other files that the metrics write to explain their scoring. For example, `out/gpt3.5_am1_automarkup/dita/test1/test1.xater_metric.txt` is a difference file showing how the output XML differs from the target.
At the end of each line is a score. For all built-in metrics, 0
is a good score and 100 is a bad score. For example, for
xater, zero means zero edits were needed to match the sample file.
xater_metric ("XML Automarkup Translation Error Rate)
is a metric based on XML tokenization and the industry
standard Translation Edit Rate metric. Zero means zero edits were
needed to match the sample file. 100 means, roughly, "everything
needed to change". It is actually possible for a horrible
TER to be worse than 100%, because the numerator and the denominator
are not counting the same thing.
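To make that concrete, here is a toy token-level edit-rate calculation. This is not the project's xater_metric implementation (it ignores XML tokenization entirely); it only shows how an edits-over-reference-tokens score can exceed 100%:

```python
# Toy edit-rate sketch -- NOT the real xater_metric, just an illustration of
# edits / reference-token-count, which is how TER-style scores can exceed 100%
# when the output is much longer than the reference.

def edit_distance(a: list[str], b: list[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = min(
                dp[j] + 1,                       # deletion
                dp[j - 1] + 1,                   # insertion
                prev + (a[i - 1] != b[j - 1]),   # substitution (or match)
            )
            prev, dp[j] = dp[j], cur
    return dp[-1]

def toy_edit_rate(output: str, reference: str) -> float:
    out_tokens, ref_tokens = output.split(), reference.split()
    return 100.0 * edit_distance(out_tokens, ref_tokens) / max(len(ref_tokens), 1)

print(toy_edit_rate("<p>hello</p>", "<p>hello</p>"))     # 0.0: no edits needed
print(toy_edit_rate("<p>a b c d e f</p>", "<p>hi</p>"))  # well over 100: many deletions needed
```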
`validation_error_metric` is a measure of how many validation errors there are in the document. Zero means zero errors, and 100 means, essentially, that everything was wrong.
If you change these metrics, or create new ones, and want to test them against specially written example documents, run:

```
$ python test-metrics.py
```

This will run all installed metrics against the sample files described in `test_metrics/README.md`.
The `markup_engines` directory includes these engines:

- `dummy_automarkup.py`: does basically nothing; it returns a hard-coded HTML string. It can be used for testing.
- `gpt3.5_am1_automarkup.py`: a simple prompt-engineering-based markup system that uses the gpt-3.5-turbo API.
- `gpt4_am1_automarkup.py`: a simple prompt-engineering-based markup system that uses the gpt-4 API.
- `buggy_automarkup__DISABLED.py`: a buggy markup engine that is disabled by default. It can be used to test what happens when a markup engine fails to produce valid markup.
The `test_metrics` folder contains files that test the extremes of the metric engines.