Add regex interface with re2 and std::regex implementations #48


Merged
merged 5 commits on Apr 18, 2025

Conversation

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 15, 2025
@facebook-github-bot
Contributor

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```cpp
/**
 * @brief Return all non-overlapping matches found in the input string.
 */
virtual std::vector<Match> findAll(const std::string& text) const override;
```
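For reference, a minimal `std::regex`-backed sketch of what such a `findAll` could do (the `Match` struct below is a hypothetical stand-in; the actual struct in this PR may be shaped differently):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical Match shape; the real one in this PR may differ.
struct Match {
  size_t start;   // byte offset of the match in the input
  size_t length;  // length of the match in bytes
};

// Return all non-overlapping matches of `pattern` in `text`.
std::vector<Match> findAllStd(const std::string& text, const std::string& pattern) {
  std::vector<Match> out;
  std::regex re(pattern);
  // sregex_iterator walks non-overlapping matches left to right.
  for (auto it = std::sregex_iterator(text.begin(), text.end(), re);
       it != std::sregex_iterator(); ++it) {
    out.push_back({static_cast<size_t>(it->position(0)),
                   static_cast<size_t>(it->length(0))});
  }
  return out;
}
```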
Contributor

Should this return a Result? Is this changed in the top PR?

Contributor Author

I don't expect this to error.

@larryliu0820
Contributor

You will need to add a license header to all source files, similar to https://github.com/pytorch-labs/tokenizers/blob/main/include/pytorch/tokenizers/tiktoken.h

@jackzhxng jackzhxng force-pushed the jz/regex-1 branch 8 times, most recently from 6efdb92 to 100430f on April 16, 2025
```cpp
/**
 * @param pattern The regex pattern to compile.
 * @return A unique pointer to an IRegex-compatible object.
 */
Result<std::unique_ptr<IRegex>> createRegex(const std::string& pattern);
```


Do we need to have a constructor in this abstract class?

Contributor Author

This is the factory function mentioned below

```cpp
/**
 * @return true if the pattern is valid and ready to use, false otherwise.
 */
virtual bool ok() const = 0;
```


Ideally regex should either fail at constructor, or stay valid forever once created. Do we need a standalone check like this?

Contributor Author

This is to see if construction failed, so the caller can then fall back on another regex implementation.

@shoumikhin Apr 16, 2025

What if we make this class construction-agnostic and let the subclasses deal with errors during construction? E.g. different regex impls may approach it differently: throw exceptions from the constructor, return an error, have a dedicated "compile" method, etc. The users of this interface shouldn't care how exactly a regex has been constructed; they just want to get a match using an a priori valid interface. Normally it's up to a "regex factory" or some higher-level concept to deal with failures at construction (e.g. pick a proper impl as a fallback according to some logic) and then provide a valid pointer to this interface. For example, a platform-specific impl could leverage Apple's NSRegularExpression, but then fall back to std::regex or something if the former fails. Whoever gets a pointer to this interface would never need to reason about whether it's valid; they'd just use it.

Contributor Author

Yeah, that makes sense. I can make this protected? I just added this to use in the factory function create_regex.
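Roughly, the fallback-in-a-factory idea being discussed could look like the sketch below. This is only an illustration of the shape: the `IRegex`, `StdRegex`, and `createRegex` names are stand-ins, `std::optional` stands in for the library's `Result` type, and the preferred re2-backed impl is stubbed out since only the fallback logic matters here.

```cpp
#include <memory>
#include <optional>
#include <regex>
#include <string>

// Stand-in interface; the real IRegex lives in this PR.
class IRegex {
 public:
  virtual ~IRegex() = default;
  virtual bool ok() const = 0;
};

// std::regex-backed impl whose ok() reports whether compilation succeeded.
class StdRegex : public IRegex {
 public:
  explicit StdRegex(const std::string& pattern) {
    try {
      re_ = std::regex(pattern);
      ok_ = true;
    } catch (const std::regex_error&) {
      ok_ = false;  // Pattern failed to compile; caller can try another impl.
    }
  }
  bool ok() const override { return ok_; }

 private:
  std::regex re_;
  bool ok_ = false;
};

// Factory: construct an impl, check ok(), and fall back (or error) if it
// failed. In the PR the preferred impl would be re2, tried before this one.
std::optional<std::unique_ptr<IRegex>> createRegex(const std::string& pattern) {
  auto regex = std::make_unique<StdRegex>(pattern);
  if (regex->ok()) {
    return std::unique_ptr<IRegex>(std::move(regex));
  }
  return std::nullopt;  // Real code would try the next implementation here.
}
```

The point of the sketch is that `ok()` is only consulted inside the factory; whatever the factory hands out is already known to be valid.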

@shoumikhin

Let's consider how it's going to be used. I guess IRegex will be injected as a dependency into some higher-level concept, like a concrete impl of ITokenizer?

Some pseudocode:

```cpp
// MyModelTokenizer requires regex, so it'll use IRegex.
// Other tokenizers may not need regex at all, btw.
class MyModelTokenizer : public ITokenizer {
  MyModelTokenizer(const std::string& filepath, std::unique_ptr<IRegex> regex)
      : regex_(std::move(regex)) {
    // Open the file, initialize everything else.
  }

  std::vector<size_t> encode(const std::string& text) override {
    // Use regex_ to parse the text, etc.
    // The injected IRegex is guaranteed to be ready to use; no need to validate it again.
    // The tokenizer doesn't need to use anything of IRegex beyond matching text.
    auto tokens = regex_->match_all(text);
    ...
  }
};

// MyModelRunner requires text tokenization, so it'll use ITokenizer.
// Other runners may not need tokenization at all, or may expect other
// components to tokenize and hand them ready-made tokens.
class MyModelRunner : public IRunner {
  MyModelRunner(std::unique_ptr<ITokenizer> tokenizer)
      : tokenizer_(std::move(tokenizer)) {...}

  std::vector<size_t> preprocess(const std::string& text) override {
    return tokenizer_->encode(text);
  }

  size_t generate(const std::vector<size_t>& tokens) override { ... }
};
```

So we can inject various regex implementations into the tokenizers that do need regex, and the latter never have to deal with regex creation or check its validity.

Contributor Author

I ended up removing it

```cpp
};

/**
 * @brief Abstract interface for regex wrappers.
```


Maybe something like this:

```cpp
#pragma once

#include <string>
#include <tuple>
#include <utility>
#include <vector>

class Regex {
public:
  virtual ~Regex() = default;

  // The only method subclasses have to implement.
  virtual std::pair<size_t, size_t> match(const std::string& text, size_t start) const = 0;

  // Convenience overload to match from the beginning.
  std::pair<size_t, size_t> match(const std::string& text) const {
    return match(text, 0);
  }

  // General implementation to match all.
  std::vector<std::pair<size_t, size_t>> match_all(const std::string& text, size_t start = 0) const {
    std::vector<std::pair<size_t, size_t>> matches;
    for (size_t length = 0;; start += length) {
      std::tie(start, length) = match(text, start);
      if (length == 0) {
        break;
      }
      matches.emplace_back(start, length);
    }
    return matches;
  }
};
```
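As a sanity check of the proposed shape, here is a sketch of a `std::regex`-backed subclass implementing the single `match` method (my illustration, not part of the PR; the abstract class is reproduced so the snippet is self-contained):

```cpp
#include <regex>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// The proposed abstract interface, reproduced for self-containment.
class Regex {
 public:
  virtual ~Regex() = default;
  virtual std::pair<size_t, size_t> match(const std::string& text, size_t start) const = 0;
  std::pair<size_t, size_t> match(const std::string& text) const { return match(text, 0); }

  std::vector<std::pair<size_t, size_t>> match_all(const std::string& text, size_t start = 0) const {
    std::vector<std::pair<size_t, size_t>> matches;
    for (size_t length = 0;; start += length) {
      std::tie(start, length) = match(text, start);
      if (length == 0) break;
      matches.emplace_back(start, length);
    }
    return matches;
  }
};

// std::regex-backed subclass: find the next match at or after `start`.
class StdRegexImpl : public Regex {
 public:
  using Regex::match;  // Keep the convenience overload visible.
  explicit StdRegexImpl(const std::string& pattern) : re_(pattern) {}

  std::pair<size_t, size_t> match(const std::string& text, size_t start) const override {
    std::smatch m;
    if (start <= text.size() &&
        std::regex_search(text.cbegin() + start, text.cend(), m, re_)) {
      return {start + static_cast<size_t>(m.position(0)),
              static_cast<size_t>(m.length(0))};
    }
    return {start, 0};  // No match: length 0 stops match_all's loop.
  }

 private:
  std::regex re_;
};
```

Note the `length == 0` convention doubles as the termination signal, so a pattern that can match the empty string would stop the loop at its first empty match.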

Contributor Author

I feel like we should just leave this API as is. We can get into a more granular API design later if necessary, but the main point of all of this was simply to provide a pcre2 fallback if re2 didn't work. I don't really expect people to be adding different regex implementations, to be honest, so I don't want to overengineer too much. Another reason is that I'd rather not touch the current re2 code, which uses FindAndConsume; that call is stateful and would not fit the proposed match API.

@facebook-github-bot
Contributor

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot merged commit bca09a2 into main Apr 18, 2025
4 of 6 checks passed
facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2025
Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng
Labels: CLA Signed
4 participants