-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fuzzy Matching #25
Comments
In theory you can do fuzzy matching with tries via branching search, but a more efficient way might be to use a Levenshtein automata. I've never used it in Python, so I can't recommend a specific package, but there're some, e.g. leven. |
Ah thanks. I figured that Levenshtein automata would work for this, but I couldn't find a python implementation. The one that you linked computes Levenshtein distance but according to the docs, Levenshtein automata are still to be implemented. There is a node module (https://www.npmjs.com/package/node-levenshtein-automata) that implements Levenshtein automata, but I haven't had a chance to take a look at it yet. I may look into implementing a fuzzy (branching) search in marisa-trie in the near future. I imagine there might be some interest in this, since I haven't been able to find much about in-memory fuzzy searching (esp. in Python). ~ Ben |
JFYI there's another blog post with Python code for building and using the automata.
Feel free to submit a PR. I'm not sure if anyone uses tries for insertions/deletions (as in Leveshtein distance), though, so if you come up with a simpler mismatch-only solution, it would be useful as well. |
Hey, I think it won't be possible to use marisa-trie efficiently for that, as it lacks an ability to do step-wise walking (see https://code.google.com/p/marisa-trie/issues/detail?id=9 and #20). https://github.com/kmike/DAWG or https://github.com/kmike/DAWG-Python may be a better fit. |
Here's a fast Levenstein search. It uses ternary search trees. https://github.com/mattandahalfew/Levenshtein_search |
I see that this data structure supports prefix lookups -- does it also support fuzzy lookups (i.e. all records within Levenshtein distance). If that's not supported in this package / this data structure, do you know of any other packages that would let me do in-memory fuzzy searching?
~ Ben
The text was updated successfully, but these errors were encountered: