Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize PHP html_entity_decode function #18092

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ArtUkrainskiy
Copy link

@ArtUkrainskiy ArtUkrainskiy commented Mar 16, 2025

Improvements affect the C function traverse_for_entities:

  • Use memchr to search for '&' instead of scanning character by character.
  • Use memchr to locate ';' to determine potential entity boundaries instead of process_named_entity_html, avoiding unnecessary per-character validations.
  • Use memcpy instead of character-by-character copying.
  • Refactor code for improved structure and readability.

Benchmark branch - https://github.com/ArtUkrainskiy/php-src/tree/html_entity_decode/benchmark

------------------------------------------------------------------------------
|                      Test |     old avg(ns) |     new avg(ns) |    diff(%) |
------------------------------------------------------------------------------
|                      4k & |            5949 |           21115 |    -71.98% |
------------------------------------------------------------------------------
|             only entities |            8279 |           10202 |    -18.80% |
------------------------------------------------------------------------------
|        400 valid entities |            6439 |            5861 |      7.80% |
------------------------------------------------------------------------------
|        200 valid entities |            4891 |            3178 |     38.12% |
------------------------------------------------------------------------------
|        200 invalid entity |            4777 |            3181 |     37.29% |
------------------------------------------------------------------------------
|             200 ampersand |            4809 |            1221 |    198.35% |
------------------------------------------------------------------------------
|        100 valid entities |            4188 |            1777 |    124.49% |
------------------------------------------------------------------------------
|         50 valid entities |            2885 |             979 |    193.50% |
------------------------------------------------------------------------------
|        String ends with & |            2428 |             176 |   1221.69% |
------------------------------------------------------------------------------

As you can see, the speedup depends on the number of entities and & characters in the string — the fewer there are, the more noticeable the performance improvement.

In edge cases, where the string consists entirely of & characters or valid HTML entities, performance actually worsens. However, I don't think this is a common scenario.

Either way, I plan to continue optimizing and implement & scanning using SIMD instructions, which should significantly improve performance even in high-entity-density cases.

…mize scanning for '&' and ';' using memchr

Use memcpy instead of character-by-character copying

language
@ArtUkrainskiy ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from d166abe to 66f5709 Compare March 16, 2025 17:52
char *output_ptr = ZSTR_VAL(output);
int doctype = flags & ENT_HTML_DOC_TYPE_MASK;

assert(*input_end == '\0');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the input parameter to a zend_string* you'd be guaranteed this. Moreover, please use ZEND_ASSERT() instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok


unsigned code = 0, code2 = 0;
const char *entity_end_ptr = NULL;
int valid_entity = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int valid_entity = 1;
bool valid_entity = true;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@bukka
Copy link
Member

bukka commented Mar 17, 2025

@Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

@Girgias
Copy link
Member

Girgias commented Mar 17, 2025

@Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

Please do review the logic, I only had a cursory glance :)

@bukka
Copy link
Member

bukka commented Mar 17, 2025

Ok I will check it out next week if no one is quicker.

@ArtUkrainskiy ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from 9b3e96d to f093c30 Compare March 17, 2025 16:30
@dragoonis
Copy link
Contributor

Nice idea @ArtUkrainskiy :-) 👍

@ArtUkrainskiy ArtUkrainskiy requested a review from Girgias March 26, 2025 19:04
@ArtUkrainskiy ArtUkrainskiy marked this pull request as draft March 26, 2025 19:05
@ArtUkrainskiy ArtUkrainskiy marked this pull request as ready for review March 29, 2025 11:23
Copy link
Member

@bukka bukka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks reasonable except that introduction of valid_entity boolean and checking that everywhere which doesn't look like performance optimization to me. I understand that it was probably meant to make code more readable but not sure if it's worth it in this case. Might be worth to check if it has any impact.

} else {
const size_t name_len = semi_colon_ptr - name_start;
if (name_len == 0) {
valid_entity = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those extra stores and extra check of valid_entity might be actually slower. Could you try to put beck to goto and measure it.

}
}
}

assert(*next == ';');
/* If entity_end_ptr is not found or does not point to ';', consider the entity invalid */
if (!valid_entity || entity_end_ptr == NULL || *entity_end_ptr != ';') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it cannot be ; so that check is pointless - this is checked in resolve_named_entity_html as well as in process_numeric_entity.

As I mentioned above, I think we should keep the goto.

Copy link
Author

@ArtUkrainskiy ArtUkrainskiy Mar 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @bukka ! Thanks for the code review! I removed the invalid_entity flag, and that actually resulted in a measurable performance gain - you can see the comparison below. I also optimized the condition you mentioned.
Commit: 5f8363b

------------------------------------------------------------------------------
|                      Test |    flag avg(ns) |    goto avg(ns) |    diff(%) |
------------------------------------------------------------------------------
|                      4k & |           21295 |           21148 |      0.27% |
------------------------------------------------------------------------------
|             only entities |           10298 |           10032 |      2.80% |
------------------------------------------------------------------------------
|        400 valid entities |            5407 |            5183 |      2.57% |
------------------------------------------------------------------------------
|        200 valid entities |            3075 |            2729 |      5.15% |
------------------------------------------------------------------------------
|        200 invalid entity |            3181 |            2895 |     10.66% |
------------------------------------------------------------------------------
|             200 ampersand |            1190 |            1180 |      0.18% |
------------------------------------------------------------------------------
|        100 valid entities |            1628 |            1533 |      5.54% |
------------------------------------------------------------------------------
|         50 valid entities |            1008 |             881 |      4.68% |
------------------------------------------------------------------------------
|        String ends with & |             174 |             174 |      0.00% |
------------------------------------------------------------------------------

@ArtUkrainskiy ArtUkrainskiy requested a review from bukka March 30, 2025 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants