Add regex interface with re2 and std::regex implementations #48
Conversation
@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
/**
 * @brief Return all non-overlapping matches found in the input string.
 */
virtual std::vector<Match> findAll(const std::string& text) const override;
This should be a Result? Is this changed in the top PR?
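For reference, the Result-returning variant being suggested would look roughly like this (a sketch, using the same Result type the createRegex factory below already returns):

virtual Result<std::vector<Match>> findAll(const std::string& text) const override;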
Don't expect this to error
You will need to add a license header to all source files, similar to https://github.com/pytorch-labs/tokenizers/blob/main/include/pytorch/tokenizers/tiktoken.h
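For reference, the header in that file is the standard Meta open-source notice, along these lines (check the linked file for the exact wording):

/*
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 * All rights reserved.
 *
 * This source code is licensed under the BSD-style license found in the
 * LICENSE file in the root directory of this source tree.
 */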
Force-pushed from 6efdb92 to 100430f
include/pytorch/tokenizers/regex.h (Outdated)
 * @param pattern The regex pattern to compile.
 * @return A unique pointer to an IRegex-compatible object.
 */
Result<std::unique_ptr<IRegex>> createRegex(const std::string& pattern);
Do we need to have a constructor in this abstract class?
This is the factory function mentioned below
include/pytorch/tokenizers/regex.h (Outdated)
 *
 * @return true if the pattern is valid and ready to use, false otherwise.
 */
virtual bool ok() const = 0;
Ideally, a regex should either fail at construction or stay valid forever once created. Do we need a standalone check like this?
To see if it failed during construction, so it can then fall back to another regex implementation.
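Concretely, the factory can try the primary implementation, check ok(), and fall back, roughly like this (a sketch; Re2Regex, StdRegex, and the error value are assumed names, not verbatim from this PR):

#include <memory>
#include <string>

Result<std::unique_ptr<IRegex>> createRegex(const std::string& pattern) {
  auto re2 = std::make_unique<Re2Regex>(pattern);  // assumed wrapper name
  if (re2->ok()) {
    return std::unique_ptr<IRegex>(std::move(re2));
  }
  // re2 couldn't compile the pattern (e.g. lookbehinds); try std::regex.
  auto fallback = std::make_unique<StdRegex>(pattern);  // assumed wrapper name
  if (fallback->ok()) {
    return std::unique_ptr<IRegex>(std::move(fallback));
  }
  return Error::RegexFailure;  // hypothetical error value for the repo's Result
}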
What if we make this class construction-agnostic and let the subclasses deal with errors during construction? E.g. different regex impls may approach it differently: throw exceptions from the constructor, return an error, have a dedicated "compile" method, etc. The users of this interface shouldn't care how exactly a regex has been constructed; they just want to get a match using an a priori valid interface. Normally it's up to a "regex factory" or some higher-level concept to deal with failures at construction (e.g. pick a proper impl as a fallback according to some logic) and then provide a valid pointer to this interface. For example, a platform-specific impl could leverage Apple's NSRegularExpression, but then fall back to std::regex or something if the former fails. Whoever gets a pointer to this interface would never need to reason about whether it's valid; they'd just use it.
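A sketch of that construction-agnostic arrangement, where failure handling lives only in the factory (PlatformRegex and StdRegexImpl are hypothetical implementation names; assumes the IRegex interface from this PR):

#include <memory>
#include <stdexcept>
#include <string>

std::unique_ptr<IRegex> makeRegex(const std::string& pattern) {
  try {
    // Primary implementation; assume its constructor throws on a bad pattern.
    return std::make_unique<PlatformRegex>(pattern);
  } catch (const std::exception&) {
    // Fall back to std::regex; if this also throws, let it propagate.
    return std::make_unique<StdRegexImpl>(pattern);
  }
}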
Yeah that makes sense, I can make this protected? I just added this to use in the factory function create_regex
Let's consider how it's gonna be used. I guess IRegex will be injected as a dep into some higher-level concept, like some concrete impl of ITokenizer?
Some pseudocode:
// MyModelTokenizer is one that requires regex, so it'll use IRegex.
// Other tokenizers may not need regex at all, btw.
class MyModelTokenizer : public ITokenizer {
 public:
  MyModelTokenizer(const std::string& filepath, std::unique_ptr<IRegex> regex)
      : regex_(std::move(regex)) {
    // Open the file, initialize everything else.
  }

  std::vector<size_t> encode(const std::string& text) override {
    // Use regex_ to parse the text, etc.
    // The injected IRegex is guaranteed to be ready to use, so there's no
    // need to validate it again. The tokenizer doesn't need anything from
    // IRegex beyond matching text.
    auto tokens = regex_->match_all(text);
    ...
  }

 private:
  std::unique_ptr<IRegex> regex_;
};

// MyModelRunner is one that requires text tokenization, so it'll use ITokenizer.
// Other runners may not need tokenization at all, btw, or may expect some
// other component to do the tokenization and hand them ready-made tokens.
class MyModelRunner : public IRunner {
 public:
  MyModelRunner(std::unique_ptr<ITokenizer> tokenizer)
      : tokenizer_(std::move(tokenizer)) {...}

  std::vector<size_t> preprocess(const std::string& text) override {
    return tokenizer_->encode(text);
  }

  size_t generate(const std::vector<size_t>& tokens) override { ... }

 private:
  std::unique_ptr<ITokenizer> tokenizer_;
};
So we can inject various regex implementations into the tokenizers that do need regex, and the latter never have to deal with regex creation or checking its validity.
I ended up removing it
};

/**
 * @brief Abstract interface for regex wrappers.
Maybe something like this:
#pragma once

#include <string>
#include <tuple>
#include <utility>
#include <vector>

class Regex {
 public:
  virtual ~Regex() = default;

  // The only method subclasses have to implement.
  virtual std::pair<size_t, size_t> match(const std::string& text, size_t start) const = 0;

  // Convenience overload to match from the beginning.
  std::pair<size_t, size_t> match(const std::string& text) const {
    return match(text, 0);
  }

  // General implementation to match all.
  std::vector<std::pair<size_t, size_t>> match_all(const std::string& text, size_t start = 0) const {
    std::vector<std::pair<size_t, size_t>> matches;
    for (size_t length = 0;; start += length) {
      std::tie(start, length) = match(text, start);
      if (length == 0) {
        break;  // zero length is the "no match" sentinel
      }
      matches.emplace_back(start, length);
    }
    return matches;
  }
};
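To make the single-method contract concrete, a subclass backed by std::regex could look roughly like this (a sketch; StdRegexAdapter is a hypothetical name, not part of this PR):

#include <regex>
#include <string>
#include <utility>

class StdRegexAdapter : public Regex {
 public:
  explicit StdRegexAdapter(const std::string& pattern) : re_(pattern) {}

  using Regex::match;  // keep the one-argument convenience overload visible

  std::pair<size_t, size_t> match(const std::string& text, size_t start) const override {
    std::smatch m;
    if (start <= text.size() &&
        std::regex_search(text.cbegin() + start, text.cend(), m, re_)) {
      // m.position(0) is relative to the begin iterator we passed in.
      return {start + static_cast<size_t>(m.position(0)),
              static_cast<size_t>(m.length(0))};
    }
    return {text.size(), 0};  // zero length signals "no match" to match_all
  }

 private:
  std::regex re_;  // throws std::regex_error on an invalid pattern
};

Note the subclass only provides positional matching; match_all and the one-argument overload come for free from the base class.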
I feel like we should just leave this API as is. We can get into a more granular API design later if necessary, but the main point of all of this was simply to provide a pcre2 fallback if re2 didn't work. I don't really expect people to be adding different regex implementations, to be honest, so I don't want to overengineer too much. Another reason is I'd like to not mess with the current re2 code, which uses FindAndConsume; that call is stateful and would not fit into the proposed match API.
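For context, the existing re2 path looks roughly like this: FindAndConsume advances the input StringPiece in place, so the scan position lives in the consumed input rather than in an explicit start offset (a sketch; findAllMatches is a hypothetical helper, and the pattern is assumed to have one capturing group):

#include <re2/re2.h>

#include <string>
#include <vector>

std::vector<std::string> findAllMatches(const std::string& text, const RE2& re) {
  std::vector<std::string> out;
  re2::StringPiece input(text);  // shrunk in place as matches are consumed
  std::string match;
  while (RE2::FindAndConsume(&input, re, &match)) {
    out.push_back(match);
  }
  return out;
}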
@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers. Performance stays about the same across test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on the last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this PR). Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50
Differential Revision: D73295314
Pulled By: jackzhxng