A high-performance port of TatSu to Rust.
TieXiu (铁修) is a PEG (Parsing Expression Grammar) engine that brings the flexibility and power of the original TatSu lineage to a memory-safe, high-concurrency architecture optimized for modern CPU caches.
TieXiu is functionally complete and correct with respect to its predecessor, TatSu. A beta period will allow the API and its signatures to be adjusted in response to user experience.
TieXiu is a tool that takes grammars in extended EBNF as input, and
outputs memoizing (Packrat) PEG parsers as a Rust model. The classic
variations of EBNF (Tomassetti, EasyExtend, Wirth) and ISO EBNF are
supported as input grammar formats.
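To illustrate what "memoizing (Packrat)" means in practice, here is a minimal sketch that assumes a toy two-rule grammar. The names `Rule`, `Packrat`, and `parse_sum` are hypothetical and are not part of TieXiu's API. The sketch only shows the general technique of caching each rule's outcome per input position, so that backtracking in an ordered choice never re-runs a rule at the same position.

```rust
use std::collections::HashMap;

// Hypothetical toy grammar (not TieXiu's API):
//   sum <- num '+' sum / num
//   num <- [0-9]+
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Rule {
    Sum,
    Num,
}

// A successful match carries the computed value and the end position.
type Outcome = Option<(u64, usize)>;

struct Packrat<'a> {
    input: &'a [u8],
    // (rule, start position) -> cached outcome, including cached failures.
    memo: HashMap<(Rule, usize), Outcome>,
}

impl<'a> Packrat<'a> {
    fn new(text: &'a str) -> Self {
        Packrat { input: text.as_bytes(), memo: HashMap::new() }
    }

    // Every rule invocation goes through here, so each (rule, position)
    // pair is evaluated at most once.
    fn apply(&mut self, rule: Rule, pos: usize) -> Outcome {
        if let Some(cached) = self.memo.get(&(rule, pos)) {
            return *cached;
        }
        let result = match rule {
            Rule::Sum => self.sum(pos),
            Rule::Num => self.num(pos),
        };
        self.memo.insert((rule, pos), result);
        result
    }

    fn sum(&mut self, pos: usize) -> Outcome {
        // First alternative: num '+' sum
        if let Some((left, p)) = self.apply(Rule::Num, pos) {
            if self.input.get(p) == Some(&b'+') {
                if let Some((right, end)) = self.apply(Rule::Sum, p + 1) {
                    return Some((left + right, end));
                }
            }
        }
        // Ordered choice falls back to a plain num, served from the memo.
        self.apply(Rule::Num, pos)
    }

    fn num(&mut self, pos: usize) -> Outcome {
        let mut end = pos;
        let mut value = 0u64;
        while let Some(d) = self.input.get(end).filter(|b| b.is_ascii_digit()) {
            value = value * 10 + u64::from(d - b'0');
            end += 1;
        }
        if end > pos { Some((value, end)) } else { None }
    }
}

// Parses the whole input as a sum; returns None unless all input is consumed.
pub fn parse_sum(text: &str) -> Option<u64> {
    let mut parser = Packrat::new(text);
    match parser.apply(Rule::Sum, 0) {
        Some((value, end)) if end == text.len() => Some(value),
        _ => None,
    }
}
```

Calling `parse_sum("1+2+3")` evaluates `num` once per position even though the second alternative of `sum` retries it after the first alternative has already matched it.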
The TatSu Documentation provides a vision of where the TieXiu project is heading. A copy of the grammar syntax can be accessed locally in the SYNTAX document.
TieXiu is foremost a Rust library that is also published as a Python library with the help of PyO3/Maturin. The Rust API may return objects of types in the internal parser or tree model. The Python API takes strings as input and returns json.dumps()-compatible Python objects as output.
TatSu is a mature project with an important user base, so it's difficult to make certain changes even if they are improvements or fixes for long-standing quirks (as is well known among experienced software engineers, a long-lived quirk becomes a feature). TieXiu is an opportunity to start from scratch, with a modern approach, even if the grammar syntax and its semantics are preserved.
TieXiu was started with the aim of learning Rust and applying AI Agents to a meaningful project (versus books or simple exercises). TatSu has rich parser-generator semantics that had to be replicated in Rust for completeness and compatibility.
Although it was not a primary objective, the expectation was that parsing with optimized Rust would run circles around the Python implementation.
It was not so.
As if implementing the semantics weren't difficult enough in a language so strict about memory management and with so few reasonable and efficient defaults, the first complete runs of parser generation and parsing were 3x slower than the best Python counterpart.
It took a significant amount of Rust-specific optimization and some algorithm redesign to reach the current 1.08x speed. Rust is not friendly to the deep recursion required to parse a language like, for example, Java, and its default data structures, like Vec, don't behave well when used as short-lived containers. The complete history of optimizations, which includes an imported heap manager, is recorded in the Git logs.
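As an illustration of the deep-recursion problem, one common workaround in Rust is to run the recursive code on a thread with an explicitly sized stack. The sketch below is a generic example and is not taken from TieXiu's sources. The functions `depth` and `depth_with_big_stack` are hypothetical stand-ins for a recursive-descent parser and its entry point.

```rust
use std::thread;

// Hypothetical stand-in for a recursive-descent rule: counts leading '('
// characters, using one stack frame per nesting level.
fn depth(input: &[u8], pos: usize) -> usize {
    match input.get(pos) {
        Some(&b'(') => 1 + depth(input, pos + 1),
        _ => 0,
    }
}

// Runs the recursive function on a dedicated thread whose stack size is
// chosen by the caller instead of relying on the platform default.
pub fn depth_with_big_stack(text: String, stack_bytes: usize) -> usize {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || depth(text.as_bytes(), 0))
        .expect("failed to spawn parser thread")
        .join()
        .expect("parser thread panicked")
}
```

A call such as `depth_with_big_stack("(".repeat(10_000), 16 * 1024 * 1024)` gives the recursion 16 MiB of stack. How much is actually needed depends on the frame size of the real parser, which this sketch does not model.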
The PyO3 interface is there, but it's easier and more convenient to use TatSu directly when working with Python.
TieXiu is today a powerful PEG parser generator in Rust, so it may find a home among the rustacean community wanting to convert flat streams into semantic structures.
Most features of TatSu are available in TieXiu. Some features have not yet been implemented, and a few never will:
- Generation of synthetic classes from grammar parameters will not be implemented in Rust.
- Generation of source code with an object model for definitions in the grammar may be implemented if a way is found to make the parser or postprocessing bind the Tree output of a parse to the model.
- Code generation of a parser recently moved in TatSu to the loading of a model of the Grammar and using it as parser. Although the generated procedural parser may produce 1.3x increased throughput in Python, supporting generated code is hard and it complicates the internal interfaces. For Rust, TieXiu already knows how to quickly load a Grammar model from TatSu JSON. A generated copy of the grammar model constructor could be precompiled by Rust.
- Parsing of boolean and numeric values happens in TatSu through synthetic actions, which call the constructors for those types, passing the parsed strings. For TieXiu the preferred way of transforming a tree (semantics) is through post-processing (folding), but basic numeric types and booleans could be supported.
- Semantic actions (transformations) during parse are not implemented. Python is friendly to objects of type `Any`, so semantic actions during parse in TatSu can produce a tree of any type. Rust is different, and trying to have structures of an any type is not rustacean. The result of a parse is a well-defined `Tree`, a small enough enum that writing a walker for it is easy, so type transformations can be done in postprocessing by folding. See the `fold` modules in TieXiu for examples and useful trait definitions.
- Interpolation and evaluation of `constant` expressions hasn't had any known use cases with TatSu. They will not be implemented in TieXiu until a use case appears.
- The `@@include` directive for textual includes was always a bad idea.
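The post-processing-by-folding approach mentioned in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: `Node`, `Fold`, `fold`, and `SumInts` are hypothetical names invented for this example, and TieXiu's real `Tree` enum and the traits in its `fold` modules may look quite different.

```rust
// Hypothetical stand-in for a parse tree; TieXiu's real `Tree` may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    List(Vec<Node>),
    Named(String, Box<Node>),
}

// Hypothetical fold trait: one method per variant, combined bottom-up.
pub trait Fold {
    type Out;
    fn text(&mut self, s: &str) -> Self::Out;
    fn list(&mut self, items: Vec<Self::Out>) -> Self::Out;
    fn named(&mut self, name: &str, value: Self::Out) -> Self::Out;
}

// Walks the tree once, folding children before their parent.
pub fn fold<F: Fold>(f: &mut F, node: &Node) -> F::Out {
    match node {
        Node::Text(s) => f.text(s),
        Node::List(items) => {
            let folded: Vec<F::Out> =
                items.iter().map(|n| fold(&mut *f, n)).collect();
            f.list(folded)
        }
        Node::Named(name, value) => {
            let v = fold(&mut *f, value);
            f.named(name, v)
        }
    }
}

// Example folder: converts text leaves to integers after the parse
// (non-numeric text counts as 0) and sums everything.
pub struct SumInts;

impl Fold for SumInts {
    type Out = i64;
    fn text(&mut self, s: &str) -> i64 {
        s.parse().unwrap_or(0)
    }
    fn list(&mut self, items: Vec<i64>) -> i64 {
        items.into_iter().sum()
    }
    fn named(&mut self, _name: &str, value: i64) -> i64 {
        value
    }
}
```

The associated type `Out` lets each folder choose its own result type, which is how a statically typed tree can still be transformed into arbitrary user types without resorting to an "any" value during the parse.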
The needs of most users are met by parsing input with the rules in a grammar and receiving the structured output as a JSON-compatible value. For other use cases, TieXiu exposes its internal model and APIs (to be documented).
The return values of type `Any` are basic Python types, as defined in the json module documentation (see Encoders and Decoders).
| JSON | Python |
|---|---|
| object | dict |
| array | list |
| string | str |
| number (int) | int |
| number (real) | float |
| true | True |
| false | False |
| null | None |
Keyword arguments can be passed for runtime configuration. The only recognized argument as of writing is trace=.
These functions are available from package tiexiu.
def parse(grammar: str, text: str, **kwargs: Any) -> Any:
def parse_grammar(grammar: str, **kwargs: Any) -> Any:
def parse_grammar_to_json(grammar: str, **kwargs: Any) -> Any:
def parse_to_json(grammar: str, text: str, **kwargs: Any) -> Any:
def pretty(grammar: str, **kwargs: Any) -> str:
def compile_to_json(grammar: str, **kwargs: Any) -> Any:

pub fn parse_grammar(grammar: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_grammar_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_grammar_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_grammar_with<U>(cursor: U, cfg: &CfgA) -> Result<Tree>
pub fn parse_grammar_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn compile(grammar: &str, cfg: &CfgA) -> Result<Grammar>;
pub fn compile_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn compile_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn compile_with<U>(cursor: U, cfg: &CfgA) -> Result<Grammar>
pub fn compile_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn load(json: &str, _cfg: &CfgA) -> Result<Grammar>;
pub fn load_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn load_tree(json: &str, _cfg: &CfgA) -> Result<Tree>;
pub fn load_tree_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn grammar_pretty(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn pretty_tree(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn pretty_tree_json(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn parse(grammar: &str, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_to_json(grammar: &str, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_to_json_string(grammar: &str, text: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_input(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_input_to_json(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_input_to_json_string(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<String>;

The project is functionally complete. Comments about the implementation strategies and possible improvements are now in ROADMAP.
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.