Don't include HTML content in title search index #13356

wlach · 2025-02-17T21:39:11Z

Purpose

Fixes an issue where the search index would contain HTML content (e.g. tags and &amp;-style escaped content), which inhibited searching and also bloated the size of the index.

References

Closes #13355.

AA-Turner · 2025-02-17T21:43:22Z

Could you add a CHANGES entry?

A

wlach · 2025-02-17T21:50:25Z

sphinx/builders/html/__init__.py

        title_node = self.env.longtitles.get(docname)
-        title = self.render_partial(title_node)['title'] if title_node else ''
+        title = title_node.astext() if title_node else ''


As best I can tell, the title is only used for indexing purposes here (corroborated, I hope, by the tests not failing).
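
To make the effect of this change concrete, here is a minimal Python sketch (not Sphinx code) comparing a rendered title with its plain-text form. The markup in `rendered_title` is an assumed example of what a rendered literal might look like, and `to_plain_text` is a hypothetical stand-in for the kind of text `astext()` yields, not its implementation.

```python
import html
import re

# Assumed example of a rendered title as it might have appeared in the index.
rendered_title = (
    'Migrating <code class="docutils literal notranslate">'
    '<span class="pre">optparse</span></code> code &amp; tests'
)


def to_plain_text(markup: str) -> str:
    """Hypothetical helper: strip tags and unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", markup))


plain_title = to_plain_text(rendered_title)
query = "optparse code & tests"

print(query in rendered_title)  # False: tags and entities break the match
print(query in plain_title)     # True
print(len(rendered_title), len(plain_title))  # the markup also costs bytes
```

The same phrase matches the plain-text title but not the rendered one, and the plain-text form is shorter, which is the search and index-size problem described in the PR summary.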

AA-Turner · 2025-02-17T21:53:55Z

Will titles with literals still render properly in search results? E.g. https://docs.python.org/dev/search.html?q=migrating+optparse, the second result currently renders as "Migrating `optparse` code to `argparse`"

A

wlach · 2025-02-17T21:54:08Z

For whatever it's worth, here are the changes in the cpython documentation index with this applied:

https://gist.github.com/wlach/d6fd822b8f894fccb7d71f33e1b76198

AA-Turner · 2025-02-17T22:12:03Z

https://docs.python.org/dev/: [screenshot of search results, not preserved]

This PR: [screenshot of search results, not preserved]

wlach · 2025-02-17T22:17:26Z

> Will titles with literals still render properly in search results? E.g. https://docs.python.org/dev/search.html?q=migrating+optparse, the second result currently renders as "Migrating `optparse` code to `argparse`"
>
> A

Ah, good spot: no, this behaviour is lost. Looking again, one of the (primary?) purposes of this metadata is to render the titles, so maybe it is worth the cost/weirdness. That said, some existing behaviour assumes the content is plain text and won't work properly otherwise; note how the relevant document is now higher up in the results in the pictured screenshots.

https://github.com/wlach/sphinx/blob/bdf776a3568a3bc211495acacd56d95d946c8d05/sphinx/themes/basic/static/searchtools.js#L340

I'm not really sure what to suggest now. 😅

jayaddison · 2025-02-17T22:58:02Z

Despite the arguable title-display degradation, I'd tend to agree that HTML doesn't belong in the index representation, and that that's potentially the more important factor.

I'm curious why the total count of results differed (9 vs 10) - is it due to de-duplication of this Migrating optparse code to argparse > Migrating optparse code to argparse result, by any chance?

wlach · 2025-02-17T23:04:32Z

> Despite the arguable title-display degradation, I'd tend to agree that HTML doesn't belong in the index representation, and that that's potentially the more important factor.

I think I'm leaning towards this too, although maybe there's something I'm missing.

> I'm curious why the total count of results differed (9 vs 10) - is it due to de-duplication of this Migrating optparse code to argparse > Migrating optparse code to argparse result, by any chance?

Yup this line now operates as expected:

https://github.com/wlach/sphinx/blob/bdf776a3568a3bc211495acacd56d95d946c8d05/sphinx/themes/basic/static/searchtools.js#L343
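
A rough Python sketch of the kind of comparison discussed here, assuming the linked line decides whether to prefix a matched section title with its document title. The function `result_label` and the markup in `rendered` are illustrative assumptions and are not copied from `searchtools.js`.

```python
def result_label(doc_title: str, section_title: str) -> str:
    """Show "document > section" unless the section title is the document title."""
    if doc_title != section_title:
        return f"{doc_title} > {section_title}"
    return section_title


plain = "Migrating optparse code to argparse"
# Assumed shape of a rendered (HTML) document title; the real markup may differ.
rendered = "Migrating <code>optparse</code> code to <code>argparse</code>"

# With markup on one side only, the equality check can never succeed,
# so the same title is shown twice around the ">" separator.
print(result_label(rendered, plain))

# With plain text on both sides, the title is shown once.
print(result_label(plain, plain))
```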

jayaddison · 2025-02-17T23:09:00Z

There is one quirk I'd like to be careful about: titles containing greater-than / less-than can currently be interpreted as HTML when found in search results.

An obvious example (although others could perhaps be more subtle):

<code>testing</code>
====================

(note the title-displayed-as-a-code-block in the above; the HTML was interpreted by the browser directly)

jayaddison · 2025-02-17T23:15:29Z

Perhaps we could locate the code that retrieves/formats the subtitle (where the literals aren't included in the output), and use the same approach as that?

AA-Turner · 2025-02-18T00:14:33Z

Can we split the title-to-be-searched from the title-to-be-displayed? That would seem to solve the issue?

A
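
One way the suggested split could look, as a minimal Python sketch under stated assumptions: `IndexedTitle` and `find_titles` are hypothetical names, and this is not how the Sphinx search index is actually structured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedTitle:
    """Hypothetical index entry carrying two forms of the same title."""

    search_text: str   # plain text, used only for matching
    display_html: str  # rendered markup, used only for showing results


def find_titles(query: str, entries: list[IndexedTitle]) -> list[str]:
    """Match on the plain text, but return the markup for display."""
    needle = query.lower().strip()
    return [e.display_html for e in entries if needle in e.search_text.lower()]


entries = [
    IndexedTitle(
        search_text="Migrating optparse code to argparse",
        display_html="Migrating <code>optparse</code> code to <code>argparse</code>",
    ),
]
print(find_titles("optparse code", entries))
```

The trade-off in a design like this is that every title is stored twice, so the index grows rather than shrinks.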

wlach · 2025-02-18T00:47:26Z

> Can we split the title-to-be-searched from the title-to-be-displayed? That would seem to solve the issue?

We could; it would be a pretty straightforward extension of the PR. Looking at it more, though, I'm not really sure that seeing the code-type formatting for the titles makes things any easier to read/skim.

wlach · 2025-02-18T00:49:46Z

> There is one quirk I'd like to be careful about: titles containing greater-than / less-than can currently be interpreted as HTML when found in search results.

Ah, that's a good point. If we keep this behaviour (which I'm leaning towards thinking is the right solution), we would definitely want to escape the title before display.
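
As a small illustration of escaping before display, here is a Python sketch using the standard library's `html.escape`. The PR itself does the escaping in `searchtools.js`; Python is used here only to show the idea, and the sample titles are taken from this thread.

```python
import html

# Plain-text titles as they would be stored in the index.
titles = ["<code>testing</code>", "Main Page > Result Scoring"]

# Escape only at display time so the browser shows the characters
# literally instead of interpreting them as markup.
escaped = [html.escape(title) for title in titles]

for line in escaped:
    print(line)
# &lt;code&gt;testing&lt;/code&gt;
# Main Page &gt; Result Scoring
```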

AA-Turner · 2025-02-18T00:54:16Z

> Looking at it more, though, I'm not really sure that seeing the code-type formatting for the titles makes things any easier to read/skim.

Personally I think the value is more that the search-result heading matches the actual heading on the page. I agree it is a fine balance though---we could proceed as currently proposed and see if users complain that the code formatting is gone.

A

wlach · 2025-02-18T12:28:41Z

tests/js/searchtools.spec.js

@@ -184,7 +184,7 @@ describe('Basic html theme search', function() {

      expectedRanking = [
        ['index', 'Main Page', '#index-0'],  /* index entry */
-        ['index', 'Main Page > Result Scoring', '#result-scoring'],  /* title */
+        ['index', 'Main Page &gt; Result Scoring', '#result-scoring'],  /* title */


@jayaddison This update to an existing test shows the HTML-significant characters in a title being escaped.

wlach · 2025-02-18T12:34:15Z

> > Looking at it more, though, I'm not really sure that seeing the code-type formatting for the titles makes things any easier to read/skim.
>
> Personally I think the value is more that the search-result heading matches the actual heading on the page. I agree it is a fine balance though---we could proceed as currently proposed and see if users complain that the code formatting is gone.

Yeah, I think this is what I'm going to recommend for now. I honestly doubt anyone is going to notice the difference, and preserving the old behaviour would take a bunch of extra code and bandwidth (to download the extra headers).

jayaddison · 2025-02-18T13:23:06Z

👍 Looks good to me. I removed myself from the sphinx-doc org, so I can't add the green approved review status.

As I understand it, briefly: we don't escape the user query text, and I think that's as intended. We then match that unescaped query text against the unescaped, text-only title terms (also good; that should improve the match accuracy/fidelity) and other existing index contents, before displaying escaped HTML titles and descriptions in the results.
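
The flow summarised above can be sketched in a few lines of Python. This is an illustrative model only: `search_titles` is a hypothetical function, the matching is a plain substring test rather than Sphinx's real scoring, and the actual implementation lives in `searchtools.js`.

```python
import html


def search_titles(query: str, titles: list[str]) -> list[str]:
    """Match the raw query against plain-text titles; escape only for display."""
    needle = query.lower().strip()
    # Matching happens on unescaped text on both sides.
    matches = [title for title in titles if needle in title.lower()]
    # Escaping is applied last, just before the results are shown.
    return [html.escape(title) for title in matches]


titles = ["Main Page", "Result Scoring", "<code>testing</code>"]
print(search_titles("<code>", titles))  # ['&lt;code&gt;testing&lt;/code&gt;']
```

Keeping escaping as the final step means a query containing `<` or `>` still matches a title containing those characters, while the displayed result cannot be interpreted as markup.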

wlach · 2025-02-22T14:14:09Z

@AA-Turner I think this is ready to merge if we're agreed this is the right approach.

wlach force-pushed the fix-titles-searchindex branch from d876a3d to deeb94d Compare February 17, 2025 21:42

wlach requested a review from AA-Turner February 17, 2025 21:48

AA-Turner added this to the 8.2.0 milestone Feb 17, 2025

wlach added 5 commits February 18, 2025 07:27

Don't include HTML content in title search index

ad6b8df

Closes sphinx-doc#13355.

CHANGES, fix js tests

62779ca

escape html

681ee18

Better escape html

ec63ed4

expect escaped html

d9a7838

wlach force-pushed the fix-titles-searchindex branch from e2096f6 to d9a7838 Compare February 18, 2025 12:27

jayaddison reviewed Feb 18, 2025

Update sphinx/themes/basic/static/searchtools.js

aaa2f59

Co-authored-by: James Addison <[email protected]>

jayaddison reviewed Feb 18, 2025

Escape HTML on display

cdcbbbb

wlach force-pushed the fix-titles-searchindex branch from 4fc2dfd to cdcbbbb Compare February 18, 2025 12:52

AA-Turner modified the milestones: 8.2.0, 8.x Feb 18, 2025
