Skip to content

tools/content: Support surveying unimplemented KaTeX features #1600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

rajveermalviya
Copy link
Member

@rajveermalviya rajveermalviya commented Jun 17, 2025

Stacked on top of #1601.

@rajveermalviya rajveermalviya added the maintainer review PR ready for review by Zulip maintainers label Jun 17, 2025
@rajveermalviya rajveermalviya force-pushed the pr-tex-content-survery branch 2 times, most recently from ec6f49e to 8c68c44 Compare June 18, 2025 18:33
Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very useful, thanks!

I read through the code, then ran this on one open math community in order to see what the output looks like. Comments below, which I think should all be pretty quick.

Let's prioritize this next after the PR split described at #1452 (comment) and the small rebase at #1601 (comment) .

In particular it'd be great to be able to merge this right after that new first segment of #1452. Then we can use it to measure the impact of the remaining part of #1452, of #1559, and of #1601.

Comment on lines 98 to 100
buf.writeln('Out of $totalMessageCount total messages,'
' ${katexMessageIds.length} of them were KaTeX containing messages'
' and ${failedKatexMessageIds.length} of them failed.');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
buf.writeln('Out of $totalMessageCount total messages,'
' ${katexMessageIds.length} of them were KaTeX containing messages'
' and ${failedKatexMessageIds.length} of them failed.');
buf.writeln('Out of $totalMessageCount total messages,'
' ${katexMessageIds.length} of them were KaTeX containing messages'
' and ${failedKatexMessageIds.length} of those failed.');

(otherwise it sounds like that's relative to the total messages rather than the KaTeX-containing messages)

Comment on lines 120 to 121
buf.writeln(' Message IDs (upto 100): ${messageIds.take(100).join(', ')}');
buf.writeln(' TeX source (upto 30):');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "up to" is two words

Suggested change
buf.writeln(' Message IDs (upto 100): ${messageIds.take(100).join(', ')}');
buf.writeln(' TeX source (upto 30):');
buf.writeln(' Message IDs (up to 100): ${messageIds.take(100).join(', ')}');
buf.writeln(' TeX source (up to 30):');

@@ -50,7 +50,7 @@ opt_verbose=
opt_steps=()
while (( $# )); do
case "$1" in
fetch|check) opt_steps+=("$1"); shift;;
fetch|check|katex-check) opt_steps+=("$1"); shift;;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update the usage message too (the --help output), to keep it in sync.

Comment on lines 122 to 127
for (final node in failedMathNodes.take(30)) {
final type = switch (node) {
MathBlockNode() => 'block',
MathInlineNode() => 'inline',
};
buf.writeln(' $type: "${node.texSource}"');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running this on one example server, some of the output looks like:

      inline: "\frac{1}{2} \delta_0 + \frac{1}{2} \delta_1"
      inline: "\frac{1}{2} \delta_0 + \frac{1}{2} \delta_1"
      block: "f(x) = \begin{cases}
        diamond(x) & x \text{ not in top corner} \\
        f(4x) & \text{otherwise}
     \end{cases}"
      inline: "\displaystyle{ \frac{\partial y}{\partial x} > 0}"

I think the key things I'd want from this output are to (a) skim it by eye, (b) copy-paste some of the examples into a Zulip test message. The "inline:" and quotes get in the way of (b), and don't help with (a) either.

How about trying to print them in Zulip Markdown syntax? So e.g. $$ \frac{1}{2} \delta_0 + \frac{1}{2} \delta_1 $$, and something appropriate for blocks. (It might need to lose the indentation, which is fine. Just put a separator line before the next heading, similar to the separator line called divider in the other script.)

Comment on lines 129 to 137
buf.writeln(' HTML (upto 10):');
for (final node in failedMathNodes.take(10)) {
buf.writeln(' ${node.debugHtmlText}');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These get pretty voluminous — try cutting them down to 3 examples.

Comment on lines 130 to 138
for (final node in failedMathNodes.take(10)) {
buf.writeln(' ${node.debugHtmlText}');
buf.writeln();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to end up with a blank line separating the examples. Maybe skip the extra writeln?

for (final MapEntry(key: reason, value: messageIds) in failedMessageIdsByReason.entries.sorted(
(a, b) => b.value.length.compareTo(a.value.length),
)) {
final failedMathNodes = failedMathNodesByReason[reason]!;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the TeX examples are looking redundant in the output I'm seeing. Like this:

  Because of unsupported css class: nulldelimiter:
    410 messages failed.
    Oldest message: 191982974, Newest message: 524791558
    Message IDs (upto 100): 507123830, 509468402, […], 489054468, 453621753
    TeX source (upto 30):
      inline: "\phi D \phi = \frac{1}{2}D(\phi^2)"
      inline: "\phi D \phi = \frac{1}{2}D(\phi^2)"
      inline: "\frac{n!}{k!(n-k)!}"
      inline: "\frac{n!}{k!(n-k)!}"
      inline: "k \frac{p_1}{2} = 0"
      inline: "k \frac{p_1}{2} = 0"
      inline: "\frac{\mathsf{Free}(G)}{\sim}"
      inline: "\frac{\mathsf{Free}(G)}{\sim}"
      inline: "\frac{\mathsf{Free}(G)}{\sim} \cong H_1(G)"
      inline: "\frac{\mathsf{Free}(G)}{\sim} \cong H_1(G)"
[…]

In addition to expressions that are exactly equal, there are some that are similar (presumably because they came from the same conversation), which means each one adds less new information after the others. That makes the "up to 30" examples less useful than they could be.

I think two things would help:

  • We can keep the failed math nodes for each reason as a set, rather than a list. That way if the identical node has e.g. the same unsupported CSS class in multiple places, it appears just once.
  • Let's take a random 30 examples instead of the first 30 found:
Suggested change
final failedMathNodes = failedMathNodesByReason[reason]!;
final failedMathNodes = failedMathNodesByReason[reason]!;
failedMathNodes.shuffle();

Shuffling once at the top also means that when we take 3 or 10 for HTML, they'll be from among the examples we're showing TeX for. That's good for being able to compare the two formats to each other.

debugHtmlNode: kDebugMode ? innerSpan : null,
node: innerSpanNode));
} else {
throw _KatexHtmlParseError();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running this on one example server, there's just one "unknown" among the common failure reasons, which is this line:

  Because of hard fail: unknown "#0      _KatexParser._parseSpan (package:zulip/model/katex.dart:317:17)":
    568 messages failed.

(out of 4776 total failed messages, out of 28109 total messages with KaTeX).

So this would be good to fill in with a few words about what the unexpected structure is.

Of the examples I'm currently seeing, the shortest looks to be:
$$ \widehat{C} $$

      <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mover accent="true"><mi>C</mi><mo stretchy="true">^</mo></mover></mrow><annotation encoding="application/x-tex">\widehat{C}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9233em;"></span><span class="mord accent"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9233em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord mathnormal" style="margin-right:0.07153em;">C</span></span><span class="svg-align" style="width:calc(100% - 0.1667em);margin-left:0.1667em;top:-3.6833em;"><span class="pstrut" style="height:3em;"></span><span style="height:0.24em;"><svg width="100%" height="0.24em" viewBox="0 0 1062 239" preserveAspectRatio="none" xmlns="http://www.w3.org/2000/svg"><path d="M529 0h5l519 115c5 1 9 5 9 10 0 1-1 2-1 3l-4 22
  c-1 5-5 9-11 9h-2L532 67 19 159h-2c-5 0-9-4-11-9l-5-22c-1-6 2-12 8-13z"></path></svg></span></span></span></span></span></span></span></span></span>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is most probably because of an unexpected CSS class value for inner span in a vlist when the vlist hosts an svg:

for (final innerSpan in vlist.nodes) {
if (innerSpan case dom.Element(
localName: 'span',
className: '',

So the fix would be somewhat like:

diff --git a/lib/model/katex.dart b/lib/model/katex.dart
index 7022fb5a..a3babee5 100644
--- a/lib/model/katex.dart
+++ b/lib/model/katex.dart
@@ -195,13 +195,17 @@ class _KatexParser {
             for (final innerSpan in vlist.nodes) {
               if (innerSpan case dom.Element(
                 localName: 'span',
-                className: '',
                 nodes: [
                   dom.Element(localName: 'span', className: 'pstrut') &&
                       final pstrutSpan,
                   ...final otherSpans,
                 ],
               )) {
+                if (innerSpan.className != '') {
+                  throw KatexHtmlParseError('vlist inner span has unexpected'
+                    'CSS class: ${innerSpan.className}');
+                }
+
                 var styles = _parseSpanInlineStyles(innerSpan)!;
                 final topEm = styles.topEm ?? 0;

But not sure if it should be included in #1452, as we are trying to keep the state of unmerged but released PRs as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that'd be good to include in this PR instead. This PR wasn't part of the releases, so having the commit appear in this PR helps us easily see that it's new since the releases.

(Though also this doesn't make a user-visible difference, right? It's throwing KatexHtmlParseError either way, and this just changes the string, which has no effect in the app itself. So if there had been a reason it was inconvenient to do as a separate commit and in this separate PR, we wouldn't lose much by including it in #1452.)

@rajveermalviya rajveermalviya force-pushed the pr-tex-content-survery branch from 8c68c44 to b42ed94 Compare June 19, 2025 14:10
@rajveermalviya rajveermalviya force-pushed the pr-tex-content-survery branch 2 times, most recently from 49fa272 to 19f89d8 Compare June 19, 2025 15:27
@rajveermalviya
Copy link
Member Author

Thanks for the review @gnprice! Revision pushed.

@rajveermalviya rajveermalviya force-pushed the pr-tex-content-survery branch from 19f89d8 to 3d99ccd Compare June 19, 2025 17:58
Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revision! Just two small comments — otherwise looks good.

And please go ahead and start posting the results you're finding, in a thread in #mobile-team.

Comment on lines +32 to +33
katex-check Check for unimplemented KaTeX features. This requires the
corpus directory \`CORPUS_DIR\` to contain at least one corpus
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use same indentation as the other steps and options

Suggested change
katex-check Check for unimplemented KaTeX features. This requires the
corpus directory \`CORPUS_DIR\` to contain at least one corpus
katex-check
Check for unimplemented KaTeX features. This requires the
corpus directory \`CORPUS_DIR\` to contain at least one corpus

Comment on lines +32 to +33
final failedMessageIdsByReason = <String, Set<int>>{};
final failedMathNodesByReason = <String, List<MathNode>>{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump part of #1600 (comment)

We can keep the failed math nodes for each reason as a set, rather than a list. That way if the identical node has e.g. the same unsupported CSS class in multiple places, it appears just once.

Less critical now that it's shuffled, but still better not to have dupes.

debugHtmlNode: kDebugMode ? innerSpan : null,
node: innerSpanNode));
} else {
throw _KatexHtmlParseError();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that'd be good to include in this PR instead. This PR wasn't part of the releases, so having the commit appear in this PR helps us easily see that it's new since the releases.

(Though also this doesn't make a user-visible difference, right? It's throwing KatexHtmlParseError either way, and this just changes the string, which has no effect in the app itself. So if there had been a reason it was inconvenient to do as a separate commit and in this separate PR, we wouldn't lose much by including it in #1452.)

@gnprice gnprice added integration review Added by maintainers when PR may be ready for integration and removed maintainer review PR ready for review by Zulip maintainers labels Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration review Added by maintainers when PR may be ready for integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants