From the pipeline to the UCD

The following checklist for preparing a pull request with the UCD changes for an encoding proposal was (mostly) followed for https://github.com/unicode-org/unicodetools/pulls?q=label%3Apipeline-16.0. The plan is for this process to be part of the PAG’s review of encoding proposals going forward.

Checklist

Prerequisites: proposal posted to L2, SAH agreed to recommend for provisional assignment (or the proposal is already in the pipeline).

  • UnicodeData.txt — Prepend lines from proposal
  • Commit
  • UTC decision — Check counts, code points, names, properties
  • SAH report — Check counts, code points, names, properties
  • Ken’s UnicodeData draft — Check consistent

If the proposal supplies LineBreak.txt:

  • LineBreak.txt — Prepend lines from proposal
  • Commit

If the proposal does not supply LineBreak.txt:

  • LineBreak.txt — Regenerate [TODO(markus): This should become « invoke Ken’s tool »]
  • Update modified lines
  • Commit

New scripts only:

  • UCD_Names — Check script name

  • Scripts.txt — Prepend ranges (carefully mind any gaps)
  • Commit
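The gaps mentioned for Scripts.txt are easier to spot if the new code points are collapsed into contiguous ranges mechanically. The following is a minimal sketch, not part of unicodetools: `collapse_ranges` and `print_range` are hypothetical helper names, and the input is assumed to be one sorted hexadecimal code point per line (for example, the first field of the new UnicodeData.txt lines).

```shell
#!/bin/sh
# Hypothetical helpers, not part of unicodetools.

# Print one range in the start..end notation used by Scripts.txt
# (a single code point is printed on its own).
print_range() {
  if [ "$1" -eq "$2" ]; then
    printf '%04X\n' "$1"
  else
    printf '%04X..%04X\n' "$1" "$2"
  fi
}

# Read sorted hex code points (one per line) from stdin and print
# maximal contiguous ranges; every gap starts a new output line.
collapse_ranges() {
  start=
  prev=
  while IFS= read -r cp; do
    [ -n "$cp" ] || continue
    n=$((0x$cp))
    if [ -z "$start" ]; then
      start=$n
    elif [ "$n" -ne $((prev + 1)) ]; then
      print_range "$start" "$prev"
      start=$n
    fi
    prev=$n
  done
  if [ -n "$start" ]; then
    print_range "$start" "$prev"
  fi
}

# Example: a gap at 11F03 splits the run in two.
# printf '%s\n' 11F00 11F01 11F02 11F04 | collapse_ranges
# prints 11F00..11F02, then 11F04
```

The script name and the trailing comment still have to be added by hand; the sketch only produces the range column.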

New blocks only:

  • ShortBlockNames.txt — Update, keep sorted
  • Blocks.txt — Update, keep sorted [TODO(egg): This one wants to be generated…]
  • Commit
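Until Blocks.txt is generated, a quick ordering check can catch a block inserted in the wrong place. This is a sketch under the assumption that data lines have the usual `start..end; Name` form; `check_blocks_sorted` is a hypothetical name, and the check covers only Blocks.txt, not ShortBlockNames.txt.

```shell
#!/bin/sh
# Hypothetical helper, not part of unicodetools.
# Reads Blocks.txt-style lines from stdin and reports any range that
# starts at or before the end of the preceding range.
check_blocks_sorted() {
  prev_end=-1
  grep -E '^[0-9A-F]+\.\.[0-9A-F]+;' | while IFS=';' read -r range _name; do
    start=$((0x${range%%..*}))
    end=$((0x${range##*..}))
    if [ "$start" -le "$prev_end" ]; then
      echo "Out of order or overlapping: $range"
    fi
    prev_end=$end
  done
}

# Usage (path as in the scripts in this document):
# check_blocks_sorted < unicodetools/data/ucd/dev/Blocks.txt
```

No output means the ranges are strictly ascending and do not overlap.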

Joining scripts only:

  • ArabicShaping.txt — Merge from proposal, keep sorted
  • Commit

Indic scripts only:

  • IndicPositionalCategory — Prepend lines from proposal
  • IndicSyllabicCategory — Prepend lines from proposal
  • Commit



PR preparation:

  • UTC decision — Cite if available
    • Copy from the minutes (this includes a link), or, if unavailable, use the form UTC-\d\d\d-[MC]\d+.
    • If there is no UTC decision but an L2 document is available, cite as L2/\d\d-\d+.
  • Working group — Mention:
    • Proposals from SAH — Link SAH issue
    • Proposals from ESC or CJK — Mention ESC or CJK in the PR description
  • RMG issue — Link
  • data-for-new — Set label
  • pipeline-* — Set label:
    • pipeline-recommended-to-UTC if the characters are not yet in the pipeline,
    • pipeline-provisionally-assigned, or
    • pipeline-<version> depending on their status in the Pipeline.
  • PR button — Set to DRAFT pull request
    • unless approved for the upcoming version
  • PR button — Press
    • The Check UCA data and Check security data invariants CI checks are suppressed. Many character additions need separate handling in the UCA and security data, but that work is out of scope for the PAG’s preparation of data-for-new, and reporting those failures could distract from real issues in the UCD invariants. UCA and security data issues are addressed later in the process, before the start of β review.
  • PAG review summary for the report — Write
    • For proposals from SAH, use the following template in the SAH issue:
      # PAG Review
      
      [Name] drafted the UCD change in https://github.com/unicode-org/unicodetools/pull/[number].
      
      ## PAG report
      
      [Summarize the propertywise tests, omitting the uninteresting, _e.g._, differences in Block or
      Unicode_1_Name, and calling out the nontrivial (in particular, any issues that were caught by
      the tests).]
    • For proposals from CJK, file a PAG issue of type Document, citing the proposal. Put the review in the Background information / discussion section, and link the pull request in the Internal section. See, e.g., https://github.com/unicode-org/properties/issues/366.
  • PAG dashboard status of SAH or PAG issue — Set to Review
  • Pipeline dashboard PAG status of RMG issue — Set to data review
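The two citation forms given under “UTC decision” above are regular expressions, so a citation typed by hand can be checked before it goes into the PR description. A minimal sketch follows; `citation_kind` is a hypothetical helper, and the identifiers in the usage comment are made-up examples of the forms, not real decisions or documents.

```shell
#!/bin/sh
# Hypothetical helper, not part of unicodetools.
# Classify a citation string according to the forms in the checklist:
#   UTC-\d\d\d-[MC]\d+   (UTC decision)
#   L2/\d\d-\d+          (L2 document)
citation_kind() {
  if printf '%s\n' "$1" | grep -Eq '^UTC-[0-9]{3}-[MC][0-9]+$'; then
    echo utc-decision
  elif printf '%s\n' "$1" | grep -Eq '^L2/[0-9]{2}-[0-9]+$'; then
    echo l2-document
  else
    echo unknown
  fi
}

# Usage (illustrative identifiers only):
# citation_kind UTC-000-C1    # prints utc-decision
# citation_kind L2/00-001     # prints l2-document
# citation_kind UTC-0-C1      # prints unknown
```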

Scripts

There are a variety of setups for unicodetools, depending on OS, in-source vs. out-of-source, git practices, etc. If you take part in UCD development, feel free to add your own.

Ken UnicodeData

Ken’s files come from here (select the appropriate UCD version, e.g., ucd160 for Unicode 16.0). Note: this check probably does not apply to pipeline-provisionally-assigned data, for which Ken does not yet have a draft.

eggrobin (Windows, in-source; the remote corresponding to unicode-org is called la-vache; Ken’s files are downloaded next to the unicodetools repository).

$latestKenFile = (ls ..\UnicodeData-*.txt | sort LastWriteTime)[-1]
$kenUnicodeData = (Get-Content $latestKenFile)
git diff la-vache/main */UnicodeData.txt |
sls ^\+[0-9A-F]                          |
% {
  $headLine = $_.line.Substring(1)
  if (-not $kenUnicodeData.Contains($headLine)) {
    $codepoint = $headLine.Split(";")[0];
    echo "Mismatch for U+$codepoint";
    echo "HEAD : $headLine";
    echo "Ken  : $($kenUnicodeData.Where({$_.Split(";")[0] -eq $codepoint}))";
  }
}
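For setups without PowerShell, the same comparison can be sketched in portable shell. This is an untested-in-production translation of the script above, not an established part of anyone's workflow: `check_against_ken` is a hypothetical name, the diff is read from stdin, and the path to Ken's draft is passed as the only argument.

```shell
#!/bin/sh
# Hypothetical helper mirroring the PowerShell check above.
# stdin: output of git diff for UnicodeData.txt
# $1:    path to Ken's UnicodeData draft
check_against_ken() {
  ken_file=$1
  # Keep only added data lines (the "+++" file header does not match),
  # then drop the leading "+".
  grep '^[+][0-9A-F]' | cut -c2- | while IFS= read -r head_line; do
    # -x: whole-line match, -F: fixed string, as with .Contains() above.
    if ! grep -qxF -e "$head_line" "$ken_file"; then
      codepoint=${head_line%%;*}
      echo "Mismatch for U+$codepoint"
      echo "HEAD : $head_line"
      echo "Ken  : $(grep -e "^$codepoint;" "$ken_file")"
    fi
  done
}

# Usage, assuming main tracks unicode-org/main and the draft file name
# is whatever was downloaded:
# git diff main -- '*/UnicodeData.txt' | check_against_ken ../UnicodeData-draft.txt
```

No output means every added line appears verbatim in Ken's draft.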

Merge

eggrobin (Windows, in-source; the remote corresponding to unicode-org is called la-vache).

git fetch la-vache
git merge la-vache/main
git checkout la-vache/main unicodetools/data/ucd/dev/Derived*;
git checkout la-vache/main unicodetools/data/ucd/dev/extracted/*;
git checkout la-vache/main unicodetools/data/ucd/dev/auxiliary/*;
rm .\Generated\* -recurse -force;
mvn compile exec:java '-Dexec.mainClass="org.unicode.text.UCD.Main"'  '-Dexec.args="build MakeUnicodeFiles"' -am -pl unicodetools  "-DCLDR_DIR=..\cldr\"  "-DUNICODETOOLS_GEN_DIR=Generated"  "-DUNICODETOOLS_REPO_DIR=.";
cp .\Generated\UCD\17.0.0\* .\unicodetools\data\ucd\dev -recurse -force;
rm unicodetools\data\ucd\dev\zzz-unchanged-*;
rm unicodetools\data\ucd\dev\*\zzz-unchanged-*;
rm .\unicodetools\data\ucd\dev\extra\*;
rm .\unicodetools\data\ucd\dev\cldr\*;
git add ./unicodetools/data
git merge --continue

markusicu (Linux, out-of-source; main tracks unicode-org/main)

git merge main
# complains about merge conflicts as expected
git checkout main unicodetools/data/ucd/dev/Derived*
git checkout main unicodetools/data/ucd/dev/extracted/*
git checkout main unicodetools/data/ucd/dev/auxiliary/*
rm -r ../Generated/BIN/17.0.0.0/
rm -r ../Generated/BIN/UCD_Data17.0.0.bin
mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.UCD.Main"  -Dexec.args="version 17.0.0 build MakeUnicodeFiles" -am -pl unicodetools  -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd)  -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd)  -DUNICODETOOLS_REPO_DIR=$(pwd)  -DUVERSION=17.0.0
# fix merge conflicts in unicodetools/src/main/java/org/unicode/text/UCD/UCD_Types.java
#   and in UCD_Names.java
# rerun mvn
cp -r ../Generated/UCD/17.0.0/* unicodetools/data/ucd/dev
rm unicodetools/data/ucd/dev/ZZZ-UNCHANGED-*
rm unicodetools/data/ucd/dev/*/ZZZ-UNCHANGED-*
rm unicodetools/data/ucd/dev/extra/*
rm unicodetools/data/ucd/dev/cldr/*
git add unicodetools/src/main/java/org/unicode/text/UCD/UCD_Names.java
git add unicodetools/src/main/java/org/unicode/text/UCD/UCD_Types.java
git add unicodetools/data
git merge --continue

macchiati (IDE)

sync github
run MakeUnicodeFiles.java -c

Cf. #636

Regenerate UCD

eggrobin (Windows, in-source).

rm .\Generated\* -recurse -force
mvn compile exec:java '-Dexec.mainClass="org.unicode.text.UCD.Main"'  '-Dexec.args="build MakeUnicodeFiles"' -am -pl unicodetools  "-DCLDR_DIR=..\cldr\"  "-DUNICODETOOLS_GEN_DIR=Generated"  "-DUNICODETOOLS_REPO_DIR=."
cp .\Generated\UCD\17.0.0\* .\unicodetools\data\ucd\dev -recurse -force
rm unicodetools\data\ucd\dev\zzz-unchanged-*
rm unicodetools\data\ucd\dev\*\zzz-unchanged-*
rm .\unicodetools\data\ucd\dev\extra\*
rm .\unicodetools\data\ucd\dev\cldr\*
git add unicodetools/data/ucd/dev/*
git commit -m "Regenerate UCD"

Regenerate LineBreak

eggrobin (Windows, in-source).

rm .\Generated\* -recurse -force
mvn compile exec:java '-Dexec.mainClass="org.unicode.text.UCD.Main"'  '-Dexec.args="build MakeUnicodeFiles"' -am -pl unicodetools  "-DCLDR_DIR=..\cldr\"  "-DUNICODETOOLS_GEN_DIR=Generated"  "-DUNICODETOOLS_REPO_DIR=."
cp .\Generated\UCD\17.0.0\LineBreak.txt .\unicodetools\data\ucd\dev

GenerateEnums

eggrobin (Windows, in-source).

mvn compile exec:java '-Dexec.mainClass="org.unicode.props.GenerateEnums"' -am -pl unicodetools  "-DCLDR_DIR=..\cldr\"  "-DUNICODETOOLS_GEN_DIR=Generated"  "-DUNICODETOOLS_REPO_DIR=." -U
mvn spotless:apply
git add *.java
git commit -m GenerateEnums

Run comparison tests

eggrobin (Windows, in-source).

mvn test -am -pl unicodetools "-DCLDR_DIR=$(gl|split-path -parent)\cldr\"  "-DUNICODETOOLS_GEN_DIR=$(gl|split-path -parent)\unicodetools\Generated\"  "-DUNICODETOOLS_REPO_DIR=$(gl|split-path -parent)\unicodetools\" "-DUVERSION=17.0.0" "-Dtest=TestTestUnicodeInvariants#testAdditionComparisons" -DfailIfNoTests=false -DtrimStackTrace=false

Results are in Generated\UnicodeTestResults-addition-comparisons-[RMG issue number].html.