Add regex interface with re2 and std::regex implementations #48


Merged
merged 5 commits on Apr 18, 2025

Conversation

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 15, 2025
@facebook-github-bot
Contributor

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```cpp
/**
 * @brief Return all non-overlapping matches found in the input string.
 */
virtual std::vector<Match> findAll(const std::string& text) const override;
```
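For reference, a minimal `std::regex`-backed sketch of what such a `findAll` could do (the `Match` struct below is a hypothetical stand-in; the actual struct in this PR may be shaped differently):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical Match shape; the real one in this PR may differ.
struct Match {
  size_t start;   // byte offset of the match in the input
  size_t length;  // length of the match in bytes
};

// Return all non-overlapping matches of `pattern` in `text`.
std::vector<Match> findAllStd(const std::string& text, const std::string& pattern) {
  std::vector<Match> out;
  std::regex re(pattern);
  // sregex_iterator walks non-overlapping matches left to right.
  for (auto it = std::sregex_iterator(text.begin(), text.end(), re);
       it != std::sregex_iterator(); ++it) {
    out.push_back({static_cast<size_t>(it->position(0)),
                   static_cast<size_t>(it->length(0))});
  }
  return out;
}
```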
Contributor

Should this return a Result? Is this changed in the top PR?

Contributor Author

I don't expect this to error.

@larryliu0820
Contributor

You will need to add a license header to all source files, similar to https://github.com/pytorch-labs/tokenizers/blob/main/include/pytorch/tokenizers/tiktoken.h

@jackzhxng jackzhxng force-pushed the jz/regex-1 branch 8 times, most recently from 6efdb92 to 100430f on April 16, 2025
```cpp
/**
 * @param pattern The regex pattern to compile.
 * @return A unique pointer to an IRegex-compatible object.
 */
Result<std::unique_ptr<IRegex>> createRegex(const std::string& pattern);
```


Do we need to have a constructor in this abstract class?

Contributor Author

This is the factory function mentioned below

```cpp
/**
 * @return true if the pattern is valid and ready to use, false otherwise.
 */
virtual bool ok() const = 0;
```


Ideally regex should either fail at constructor, or stay valid forever once created. Do we need a standalone check like this?

Contributor Author

This is to see if construction failed, so the caller can then fall back on another regex implementation.

@shoumikhin Apr 16, 2025

What if we make this class construction-agnostic and let the subclasses deal with errors during construction? E.g. different regex impls may approach it differently: throw exceptions from the constructor, return an error, have a dedicated "compile" method, etc. The users of this interface shouldn't care how exactly a regex has been constructed; they just want to get a match using an a priori valid interface. Normally it's up to a "regex factory" or some higher-level concept to deal with failures at construction (e.g. pick a proper impl as a fallback according to some logic) and then provide a valid pointer to this interface. For example, a platform-specific impl could leverage Apple's NSRegularExpression, but then fall back to std::regex or something if the former fails. Whoever gets a pointer to this interface would never need to reason about whether it's valid; they'd just use it.

Contributor Author

Yeah, that makes sense. I can make this protected? I just added this to use in the factory function create_regex.
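Roughly, the fallback-in-a-factory idea being discussed could look like the sketch below. This is only an illustration of the shape: the `IRegex`, `StdRegex`, and `createRegex` names are stand-ins, `std::optional` stands in for the library's `Result` type, and the preferred re2-backed impl is stubbed out since only the fallback logic matters here.

```cpp
#include <memory>
#include <optional>
#include <regex>
#include <string>

// Stand-in interface; the real IRegex lives in this PR.
class IRegex {
 public:
  virtual ~IRegex() = default;
  virtual bool ok() const = 0;
};

// std::regex-backed impl whose ok() reports whether compilation succeeded.
class StdRegex : public IRegex {
 public:
  explicit StdRegex(const std::string& pattern) {
    try {
      re_ = std::regex(pattern);
      ok_ = true;
    } catch (const std::regex_error&) {
      ok_ = false;  // Pattern failed to compile; caller can try another impl.
    }
  }
  bool ok() const override { return ok_; }

 private:
  std::regex re_;
  bool ok_ = false;
};

// Factory: construct an impl, check ok(), and fall back (or error) if it
// failed. In the PR the preferred impl would be re2, tried before this one.
std::optional<std::unique_ptr<IRegex>> createRegex(const std::string& pattern) {
  auto regex = std::make_unique<StdRegex>(pattern);
  if (regex->ok()) {
    return std::unique_ptr<IRegex>(std::move(regex));
  }
  return std::nullopt;  // Real code would try the next implementation here.
}
```

The point of the sketch is that `ok()` is only consulted inside the factory; whatever the factory hands out is already known to be valid.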

@shoumikhin

Let's consider how it's going to be used. I guess IRegex will be injected as a dependency into some higher-level concept, like a concrete impl of ITokenizer?

Some pseudocode:

```cpp
// MyModelTokenizer requires regex, so it'll use IRegex.
// Other tokenizers may not need regex at all, btw.
class MyModelTokenizer : public ITokenizer {
  MyModelTokenizer(const std::string& filepath, std::unique_ptr<IRegex> regex)
      : regex_(std::move(regex)) {
    // Open the file, initialize everything else.
  }

  std::vector<size_t> encode(const std::string& text) override {
    // Use regex_ to parse the text, etc.
    // The injected IRegex is guaranteed to be ready to use; no need to validate it again.
    // The tokenizer doesn't need to use anything of IRegex beyond matching text.
    auto tokens = regex_->match_all(text);
    ...
  }
};

// MyModelRunner requires text tokenization, so it'll use ITokenizer.
// Other runners may not need tokenization at all, or may expect other
// components to tokenize and hand them ready-made tokens.
class MyModelRunner : public IRunner {
  MyModelRunner(std::unique_ptr<ITokenizer> tokenizer)
      : tokenizer_(std::move(tokenizer)) {...}

  std::vector<size_t> preprocess(const std::string& text) override {
    return tokenizer_->encode(text);
  }

  size_t generate(const std::vector<size_t>& tokens) override { ... }
};
```

So we can inject various regex implementations into the tokenizers that do need regex, and the latter never have to deal with regex creation or check its validity.

Contributor Author

I ended up removing it

```cpp
};

/**
 * @brief Abstract interface for regex wrappers.
```


Maybe something like this:

```cpp
#pragma once

#include <string>
#include <tuple>
#include <utility>
#include <vector>

class Regex {
public:
  virtual ~Regex() = default;

  // The only method subclasses have to implement.
  virtual std::pair<size_t, size_t> match(const std::string& text, size_t start) const = 0;

  // Convenience overload to match from the beginning.
  std::pair<size_t, size_t> match(const std::string& text) const {
    return match(text, 0);
  }

  // General implementation to match all.
  std::vector<std::pair<size_t, size_t>> match_all(const std::string& text, size_t start = 0) const {
    std::vector<std::pair<size_t, size_t>> matches;
    for (size_t length = 0;; start += length) {
      std::tie(start, length) = match(text, start);
      if (length == 0) {
        break;
      }
      matches.emplace_back(start, length);
    }
    return matches;
  }
};
```
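As a sanity check of the proposed shape, here is a sketch of a `std::regex`-backed subclass implementing the single `match` method (my illustration, not part of the PR; the abstract class is reproduced so the snippet is self-contained):

```cpp
#include <regex>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// The proposed abstract interface, reproduced for self-containment.
class Regex {
 public:
  virtual ~Regex() = default;
  virtual std::pair<size_t, size_t> match(const std::string& text, size_t start) const = 0;
  std::pair<size_t, size_t> match(const std::string& text) const { return match(text, 0); }

  std::vector<std::pair<size_t, size_t>> match_all(const std::string& text, size_t start = 0) const {
    std::vector<std::pair<size_t, size_t>> matches;
    for (size_t length = 0;; start += length) {
      std::tie(start, length) = match(text, start);
      if (length == 0) break;
      matches.emplace_back(start, length);
    }
    return matches;
  }
};

// std::regex-backed subclass: find the next match at or after `start`.
class StdRegexImpl : public Regex {
 public:
  using Regex::match;  // Keep the convenience overload visible.
  explicit StdRegexImpl(const std::string& pattern) : re_(pattern) {}

  std::pair<size_t, size_t> match(const std::string& text, size_t start) const override {
    std::smatch m;
    if (start <= text.size() &&
        std::regex_search(text.cbegin() + start, text.cend(), m, re_)) {
      return {start + static_cast<size_t>(m.position(0)),
              static_cast<size_t>(m.length(0))};
    }
    return {start, 0};  // No match: length 0 stops match_all's loop.
  }

 private:
  std::regex re_;
};
```

Note the `length == 0` convention doubles as the termination signal, so a pattern that can match the empty string would stop the loop at its first empty match.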

Contributor Author

I feel like we should just leave this API as is. We can get into a more granular API design later if necessary, but the main point of all of this was simply to provide a pcre2 fallback if re2 didn't work. I don't really expect people to be adding different regex implementations, to be honest, so I don't want to overengineer too much. Another reason is that I'd rather not touch the current re2 code, which uses FindAndConsume; that call is stateful and would not fit the proposed match API.

@facebook-github-bot
Contributor

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot merged commit bca09a2 into main Apr 18, 2025
4 of 6 checks passed
facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2025
Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng
Labels: CLA Signed
4 participants