-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't include HTML content in title search index #13356
base: master
Are you sure you want to change the base?
Conversation
d876a3d
to
deeb94d
Compare
Could you add a CHANGES entry? A |
title_node = self.env.longtitles.get(docname) | ||
title = self.render_partial(title_node)['title'] if title_node else '' | ||
title = title_node.astext() if title_node else '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best I can tell the title is only used for indexing purposes here (corroborated by tests not failing, I hope).
Will titles with literals still render properly in search results? E.g. https://docs.python.org/dev/search.html?q=migrating+optparse, the second result currently renders as "Migrating A |
For whatever it's worth, here are the changes in the cpython documentation index with this applied: https://gist.github.com/wlach/d6fd822b8f894fccb7d71f33e1b76198 |
This PR: |
Ah good spot, no, this behaviour is lost. Looking again, one of the (primary?) purposes of this metadata is to render the titles so maybe it is worth the cost/weirdness. Though there is some existing behaviour that assumed that the content was plain-text and won't work properly without it, note how the relevant document is now higher up in the results in the pictured screenshots. I'm not really sure what to suggest now. 😅 |
Despite the arguable title-display degradation, I'd tend to agree that HTML doesn't belong in the index representation, and that that's potentially the more important factor. I'm curious why the total count of results differed ( |
I think I'm leaning towards this too, although maybe there's something I'm missing.
Yup this line now operates as expected: |
There is one quirk I'd like to be careful about: titles containing greater-than / less-than can currently be interpreted as HTML when found in search results. An obvious example (although others could perhaps be more subtle): <code>testing</code>
==================== (note the title-displayed-as-a-code-block in the above; the HTML was interpreted by the browser directly) |
Perhaps we could locate the code that retrieves/formats the subtitle (where the literals aren't included in the output), and use the same approach as that? |
Can we split the title-to-be-searched from the title-to-be-displayed? That would seem to solve the issue? A |
We could, it would be a pretty straightforward extension of the PR. Looking at it more though, I'm not really sure if seeing the code-type formatting for the titles really makes things any easier to read/skim though. |
Ah that's a good point, if we kept this behaviour (which I'm leaning towards thinking is the right solution) we would definitely want to escape it before display. |
Personally I think the value is more that the search-result heading matches the actual heading on the page. I agree it is a fine balance though---we could proceed as currently proposed and see if users complain that the code formatting is gone. A |
e2096f6
to
d9a7838
Compare
tests/js/searchtools.spec.js
Outdated
@@ -184,7 +184,7 @@ describe('Basic html theme search', function() { | |||
|
|||
expectedRanking = [ | |||
['index', 'Main Page', '#index-0'], /* index entry */ | |||
['index', 'Main Page > Result Scoring', '#result-scoring'], /* title */ | |||
['index', 'Main Page > Result Scoring', '#result-scoring'], /* title */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayaddison This update to an existing test shows quoting HTML content
Yeah I think this is what I'm going to recommend for now. I honestly doubt anyone is going to notice the difference and preserving the old behaviour would be a bunch of extra code and bandwidth (to download the extra headers). |
Co-authored-by: James Addison <[email protected]>
4fc2dfd
to
cdcbbbb
Compare
👍 Looks good to me. I removed myself from the As I understand it, briefly: we don't escape the user query text, and I think that's as-intended; we'll then match that unescaped query text against the unescaped and text-only title terms (also good; that should improve the match accuracy/fidelity), and other existing index contents, before displaying escaped HTML titles and descriptions in the results. |
@AA-Turner I think this is ready to merge if we're agreed this is the right approach. |
Purpose
Fixes issue where the search index would contain HTML content (e.g. tags,
&
escaped content) which inhibited searching and also bloated the size of the index.References
Closes #13355.