Discuss: Custom token callbacks, or mode value property? #1133
Comments
Anyway, I've managed to integrate the grammar add-on with highlight.js. I'm not sure whether an option for a mode to return a ready token to be highlighted directly has any merit, or how easy it would be to implement (judging from the hljs code, it would probably need changes in more than one place). Plus I'm only able to make it work using the un-packed
The code is in this repository, for anyone interested: https://github.com/foo123/highlightjs-grammar
Wouldn't you have more luck hooking into the keywords pipeline and just dynamically adding keywords? I'm not sure I 100% follow what you're trying to do here. A more general (or perhaps more specific) example would be great. If you want more than just keywords though (which are highlighted in the "gaps"), then yeah, you get into compiling the end expressions and having to recompile those expressions on the fly while parsing, etc... and it seems from the second snap that you want to selectively NOT highlight tags based on whether they've been seen before or not... Not a lot of fun. Don't rules/modes really need some type of callback system to do this sort of thing well? I haven't looked at any of your code, just read this issue. I think that would work much better than CHANGING the rules on the fly... just let them continue to match and then dynamically decide what you want to do about it.
The whole point of the grammar addon is that a user can specify a grammar for a language and then highlight based on that grammar. This is a very general approach. I have implemented the add-on for various syntax highlighters, including highlight.js. The problem (the issue is quite old and I haven't worked on it since then) is that I found no way of adding the parsed tokens into the highlighting flow so that they get highlighted. So I made something of a little hack. I don't know if a new version exists that eases the burden of this. It works as is, and you can see the online interactive example. The problem is that highlight.js assumes from the start that some regexes will be used and is hardcoded into that mindset, which prevents more general parsers (ones that don't simply use a set of regular expressions, like the -grammar addon) from being used. So the only way I found (for the version I am referring to) was to make a hack and override the regular expressions. I think a system that accepts a general callback would be more general and modular: regexes can still be used, but if a callback is passed, the callback is used to retrieve the token.
Well, because Highlight.js is an integrated unit. It's a parser-highlighter. It's not two separate pieces. You can't just highlight an arbitrary stream of things. I'm interested in the grammars perhaps becoming more complex, with callbacks and such things. It'd be helpful to have a very specific example of what you're trying to do, how you hacked it, and how you imagined it'd work in a perfect world. You mention tokens but I'm not sure what tokens you mean... an actual concrete example might be helpful.
Does your project already have its own parser that spits out tokens or an AST that you use with other things? I.e. if we exposed JUST the highlighter (and not the parser), would that be helpful?
Yes, what I mean (like the code example posted in my first comments) is that highlight.js works only with a language that defines a set of regular expressions, which it then parses and highlights on its own. There is no way to use a custom parser (whatever that may be: a full-blown parser, other regexes, anything not following the rationale highlight.js assumes) and have hjs take the result and highlight it. So initially I suggested some way for the parser to be made autonomous and maybe overridden as well, for example by allowing a custom callback which takes the current state as argument and returns the next token (string + class). That would be more modular and allow the addon to work transparently instead of me making this ugly hack. The only way to understand is to check the code of the addon which makes the hack and decide on the best way to make it modular. https://github.com/foo123/highlightjs-grammar/blob/master/src/main.js#L58 You can see in the code of the addon that, in order to make the integration and use the addon's parser, I had to write custom functions that simulate regular expressions and pass this hacky object as the language definition in hjs (as that is what hjs assumes by default, with no way of changing it). That is my point: whether you can think of any way, for example a callback, to de-couple the parsing from the highlighting and make it more general and modular. The default way can still be used (e.g. if a language defines a set of regular expressions), but tokens could be overridden through some extra parameter and, if it is present, parsing would be delegated to that method (a token is a unit of code that is highlighted by itself, e.g. an identifier). Hope this is clear.
Well, I don't think it's a high priority, but one could imagine separating the two pieces in the future... so you'd have the highlighter and the parser... so in your case you'd parse code HOWEVER you wanted and then you'd pass an AST and ask for it to be highlighted. It sounds like you'd rather just take over the parsing process entirely. Once you have an AST though, is it really that hard to just generate the markup yourself? It seems that is actually the easy part... and you could just leverage our CSS themes for the "look"...
I have already solved the issue with a hack. It would be good (like some other highlighters do) to have some process which could be overridden in some modular way and not be hardcoded. My initial intention was to create add-ons for as many syntax highlighters as I was aware of. The benefit is that people create only a single grammar and can use it across all highlighters and editors of their choice without any modification (except the styling names, the rest remains the same). That was the intention and that is why I added support for hjs. Doing so I noticed this issue and suggested a workaround. I leave it up to you if you want to close this issue.
Also I'm not sure you really answered me... don't you really just want an AST from the parser? Or are you doing things the parser isn't technically capable of? In that case it sounds like you'd want your own parser and JUST our highlighting engine. I'm not sure I completely understand the desire to turn our parser into a more general purpose parser. |
I don't think I understand your goals. Let's start with the stated end goal:
Ok... but what you'd quickly find is that the parsers are NOT all equal at all. Our approach is VERY different than Prism.js approach, for example. Because they're a bit more minimalistic in what they choose to parse, for one - which results in a very different design of the parser- and limits what they can do vs what we can do, for example. So I'd think it would be difficult to impossible to achieve "write once run anywhere" simply because all the parsers you work with are so very different (or not even full blown parsers at all, in the sense of the type of parsing toolchain you'd find in a full blown compiler, for example). Highlight.js and Prism.js are really very, very advanced tokenizers. So I truly don't understand why you wouldn't pivot to writing your own parser that did EVERYTHING you wanted, and then just hook to the highlighters at the very end - or perhaps none of them are easily designed to do that? BUT... Speaking for Highlight.js the highlighting part is trivial, vs the parsing (and grammars). If we were ever to split the parser/highlighter the highlighting (parse tree -> HTML) code is likely 20-30 lines... you're just looping over a nested tree and turning it into linear HTML. So if your goal was to have the "Highlight.js look" or be compatible with our themes, all you need is your own parser and a tiny shim to convert to HTML. If you have your own grammar files you don't really want our parser as far as I can see. So I guess I'm wondering why you didn't go that route as it would seem to be MUCH simpler than what you're trying to do. |
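To make the "parse tree -> HTML" step above concrete, here is a minimal sketch of such a shim. The tree shape (`kind` plus `children`) and the function names are assumptions made for illustration, not Highlight.js internals; the only thing taken from Highlight.js is the `hljs-` class prefix that its themes target.

```javascript
// Escape the characters that would otherwise be interpreted as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical tree shape: a node is either a plain string, or
// { kind: string | null, children: [node, ...] }.
function renderTree(node) {
  if (typeof node === 'string') return escapeHtml(node);
  var inner = node.children.map(renderTree).join('');
  // Nodes without a kind contribute no wrapper, only their children.
  return node.kind
    ? '<span class="hljs-' + node.kind + '">' + inner + '</span>'
    : inner;
}

// A tree that some external parser might have produced for: var foo = "a<b";
var tree = {
  kind: null,
  children: [
    { kind: 'keyword', children: ['var'] },
    ' foo = ',
    { kind: 'string', children: ['"a<b"'] },
    ';'
  ]
};

var html = renderTree(tree);
console.log(html);
```

The markup only picks up a theme's colours if the kinds match the class names that theme defines, so a real shim would also need a mapping from the external parser's token types to those names.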
All the *-grammar addons have a full-blown parser of their own (in fact the exact same parser engine, only the integration code changes). What the addons do is the integration with existing editors and highlighters: the addon's parser is used and is integrated in some way into the highlighter or editor. The user simply defines a grammar, and it can be used as is across all highlighters and editors of their choice for that language. This is the intention and how it works. Of course I can create my own syntax highlighting framework (and use my ready-made grammars and parsers), but that is another question. The issue here is integration with existing highlighters as an addon. I am aware that some highlighters are narrowly implemented, in that parsing and highlighting is a single process, strictly coupled and hardcoded, usually with regular expressions (which solve the parsing problem only partially at best). Consider the grammar addon integration as an addon for your own framework. It is made only for that purpose, as an addon which integrates with your framework as well as others, so people can choose freely and use it with yours. A kind of kudos, if you like, for this framework. Nothing more, nothing less. I only made a suggestion that maybe the procedure can de-couple the parser from the highlighter so that other parsers (for example the parser of the grammar addon) can be plugged in in some modular way (while still allowing the default parser and behaviour to work as is). I have already resolved this issue with a hack. I wish I did not have to hack it, but it doesn't matter. It works (the packaged version does not work though, only the unpackaged version of the framework). So if you would like to dismiss this issue, that is fine with me.
I learn by discussing things, hence this conversation. Plus this is related to: I guess I just don't see why you thought you had to use our code at all. That's what perplexes me. You could still say "add-on for Highlight.js" if all you did was piggyback themes and write the 20 LOC or so wrapper to translate from your parsers output to highlight.js style HTML. That wouldn't do 100% of what we do, because we have some weird features, but that'd be like 95% - for what I'd imagine would have been a lot less effort. So I'm trying to ask if you considered that before you actually went the route you did. In hopes that I might learn something to educate my own undertaking. :-) |
I totally support the request from that other issue #1086. In fact the two issues can be grouped together: de-couple parsing from highlighting, allow a plugged-in parser to function if given, and make HTML only one output format among possibly others (e.g. PDF). I totally agree. When something is integrated into something else, it has to hook somewhere (into some existing code) in order to function transparently. So the addon I made has to use and hook into the code of the parent framework. Is this what you don't fully understand? This makes the integration stable. Maybe hjs offers features not covered by my hjs addon. This is fine. Users that want to use the addon with hjs can make a choice based on what they need. It is only an option to save time and complexity if they want to highlight a language of their own, for example, where no ready-made language definition exists and they simply define a grammar (much easier than writing a parser for a language, even if only regular expressions are used). However I am still not sure you understand the original issue and its suggestion. It is really very easy to add some option or callback in the
I'd be happy to review a PR (proof of concept), but we have to watch for "simple to add" things. Not everything that can be added should be added. And we also don't want to add things quickly without thought... if we're adding an API that's going to stick around a LONG time we'd rather take it slow and get it right. What you're describing is a pretty huge edge case for most users of Highlight.js. Also, anything we add we have to maintain for the long term, and troubleshoot, and answer questions about, etc. So it's also possible this kind of thing gets easier, but it may never be officially supported.
How would it even know what the next token is without the regex? The regexes define the tokens and modes.
I'm not sure that's an important goal, but there would be benefits of splitting the two processes and it would make this kind of thing easier. You could just grab which ever part of the pipeline you wanted and use that... that wouldn't be the same as "plug-in" though, at least in my mind. If you already have a parser I think you'd just want to use a |
The thing with the callback option is that it allows the mode to have its own way of tokenizing the input stream, maybe not using regular expressions, or not using the kind of regular expressions that hjs assumes by default. So the callback is simply an entry point for the mode's custom tokenizer into the hjs parsing and highlighting routine. If no callback is given, then the default tokenizer of hjs as it is right now (but made into its own routine) is used, based on the regexes that the mode defines (everything that worked before continues to work). This was my original point. And of course, by making the HTML rendering routine an option as well, hjs can allow rendering output into other formats (also given by the mode, for example, via a custom render function). You can think about it; after all this is your framework and you have the last word.
No, I'm just a maintainer. :-)
That just sounds like making us so generic that now we do almost anything. Maybe we'll get there eventually with refactoring, but I'm not sure that seems like a reasonable goal to start with. It feels a little like you're describing a framework/library that we'd build Highlight.js on TOP of... not what Highlight.js actually is or wants to be. I agree the idea sounds cool, I just think perhaps you're describing a NEW project. :-) Along those lines, perhaps you'd find this interesting: #2212 Although I was imagining "recompiling" the grammars, not just slurping them in by building the Prism parsing engine into Highlight.js. :-) I'm still curious though... Could you give a working example or walk me thru it? What would the callback do? Be passed the string and position and return a token? I can visualize the concept but not the detail. Or perhaps the details aren't really that fleshed out? OTTOMH it still sounds like you'd want to use your OWN parser and then just pipe that into our "convert to styled HTML" pipeline (if we were to make that easier to do). Do you have an example of a C++ or JSX grammar for your parser thingy?
The kind of things I imagine being useful are things that fit into the existing modes/rules/regex model... ie, before/after match hooks, or things that allowed you to change the rules slightly while you parse or perhaps decide how and when particular rules should be applied (or not). IE, seeing a function definition and then later knowing when you saw that identifier that it was a function you'd seen earlier, etc. |
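As a rough illustration of the "after match hook" idea above, the sketch below keeps the rule/regex model but lets each rule run a callback that decides, per match, which class (if any) to emit. The rule shape, the `onMatch` name and the shared `data` object are all hypothetical and are not part of the Highlight.js API; the regexes use the standard sticky (`y`) flag so that each one only matches at the current position.

```javascript
// Each rule: { pattern: sticky RegExp, className, onMatch?(match, data) }.
// onMatch returns the class to use for this particular match (null = plain).
function tokenizeWithHooks(code, rules) {
  var data = { seen: {} };          // state shared by all hooks for one run
  var tokens = [];
  var pos = 0;
  while (pos < code.length) {
    var hit = null;
    for (var i = 0; i < rules.length && !hit; i++) {
      rules[i].pattern.lastIndex = pos;
      var m = rules[i].pattern.exec(code);
      if (m && m[0].length > 0) hit = { rule: rules[i], match: m };
    }
    if (!hit) {                     // nothing matched: emit one plain character
      tokens.push({ text: code[pos], className: null });
      pos += 1;
      continue;
    }
    var className = hit.rule.onMatch
      ? hit.rule.onMatch(hit.match, data)
      : hit.rule.className;
    tokens.push({ text: hit.match[0], className: className });
    pos += hit.match[0].length;
  }
  return tokens;
}

var rules = [
  { // a function definition: remember the name for later
    pattern: /function\s+([A-Za-z_$][\w$]*)/y,
    className: 'function-definition',
    onMatch: function (m, data) {
      data.seen[m[1]] = true;
      return 'function-definition';
    }
  },
  { // any identifier: highlight it only if it was defined earlier
    pattern: /[A-Za-z_$][\w$]*/y,
    className: null,
    onMatch: function (m, data) {
      return data.seen[m[0]] === true ? 'known-function' : null;
    }
  }
];

var tokens = tokenizeWithHooks('function add(a) {} add(1); sub(2);', rules);
var known = tokens
  .filter(function (t) { return t.className === 'known-function'; })
  .map(function (t) { return t.text; });
console.log(known);
```

The rules keep matching exactly as before; only the decision about what to do with a match is dynamic, which is the alternative to changing the rules on the fly that was suggested earlier in the thread.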
I already replied to that other issue and mentioned this addon, which works the same both for highlightjs and prism (and others). Maybe the author of the issue will find it useful. After all, this is the kind of use case that the addon targets: maximum portability and ease of creating highly detailed language definitions. It is very easy to have this optional callback. I provide an example based on the code fragment of highlightjs in my first comment.
function hjs_tokenizer(code, mode, state)
{
// maybe first time entering the tokenizer, init state
state.top = state.top || mode;
state.index = state.index || 0;
// maybe add more things in state object if needed
// ..
var mode_buffer = '';
var relevance = 0;
try {
var match, count;
while (true) {
state.top.terminators.lastIndex = state.index;
match = state.top.terminators.exec(code);
if (!match) break;
count = processLexeme(code.substr(state.index, match.index - state.index), match[0]);
state.index = match.index + count;
}
processLexeme(code.substr(state.index));
for (var current = state.top; current.parent; current = current.parent) { // close dangling modes
if (current.className) {
result += '</span>';
}
}
return {
relevance: relevance,
value: result,
language: name
};
} catch (e) { /*..*/ }
}
// then inside parsing and highlighting routine check if custom tokenizer given else use default
// ..
// initially state is a blank object; the tokenizer can add whatever it needs to it to keep state between calls
var tokenizer = mode.tokenizer || hjs_tokenizer, state = {}, token;
// parse
do {
// call tokenizer repeatedly, until all tokens are exhausted
token = tokenizer(code, mode, state);
// process token
// ..
} while (token);
This is rough but you get the idea, hopefully. The trick is to pass an empty object representing the tokenizer state. Initially it is blank, so the mode knows this is the first time it is called for this string of code. Then it is initialised, and on subsequent calls the state is kept and the tokenizer proceeds normally. The tokenizer handles its own state, i.e. whatever it needs to store between subsequent calls. The state is tokenizer-specific. The framework simply facilitates this by passing an empty object which the tokenizer handles as needed. Being an object, the state persists between calls. Alternatively the tokenizer can parse the whole code in a single call (instead of being called repeatedly). This is fine as well; for my use case, for example, I can use both approaches. In fact they are equivalent if a buffer is used to collect results and return them all at once. So there is no big difference whether the tokenizer is called repeatedly or just once.
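The state-object protocol described in this comment, and the claim that the repeated and one-shot styles are equivalent, can be sketched in a self-contained way. Everything below (the function names, the `{text, type}` token shape, the three token types) is a hypothetical illustration and does not correspond to anything in hjs.

```javascript
// Stepwise tokenizer: each call returns the next token, or null when done.
// It keeps its own position in the caller-supplied state object.
function wordTokenizer(code, state) {
  if (state.index === undefined) state.index = 0;   // first call: init state
  if (state.index >= code.length) return null;      // input exhausted
  var re = /\s+|\w+|[^\s\w]/y;                      // sticky: match at index only
  re.lastIndex = state.index;
  var m = re.exec(code);                            // always matches: the three
  state.index += m[0].length;                       // alternatives cover every char
  var type = /^\s/.test(m[0]) ? 'space' : /^\w/.test(m[0]) ? 'word' : 'punct';
  return { text: m[0], type: type };
}

// Framework side: pass a blank state and call until the tokenizer returns null.
function collectRepeatedly(code, tokenizer) {
  var state = {};
  var tokens = [];
  var token;
  while ((token = tokenizer(code, state)) !== null) tokens.push(token);
  return tokens;
}

// Equivalence, direction 1: a one-shot tokenizer is the stepwise one plus a buffer.
function oneShot(code) {
  return collectRepeatedly(code, wordTokenizer);
}

// Equivalence, direction 2: a stepwise tokenizer built from a one-shot one,
// with the buffered result and a cursor stored in the state object.
function stepwiseFrom(oneShotTokenizer) {
  return function (code, state) {
    if (!state.buffer) {
      state.buffer = oneShotTokenizer(code);
      state.next = 0;
    }
    return state.next < state.buffer.length ? state.buffer[state.next++] : null;
  };
}

var direct = collectRepeatedly('a = b;', wordTokenizer);
var roundTrip = collectRepeatedly('a = b;', stepwiseFrom(oneShot));
console.log(direct.map(function (t) { return t.text; }));
```

Both paths yield the same token list, which is the sense in which the framework could offer either calling convention without losing generality.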
Now this is something else entirely - in this case you're saying you don't need the parser at all (which is true in your case, as I think we've both mentioned). In someone else's case I dunno how it would work, since the tokens themselves are derived from the regex matches... if you take away the regexes (such as in our plaintext grammar) then the WHOLE text becomes a single huge "token" anyway - so there is really no iteration going on. Take a look at: Couldn't your whole project be built as an input plugin that just dumps the "code" (which you've already transformed however you want) thru plaintext (which wouldn't change it one bit)?
And of course you could wrap those two lines to give it a nicer API... Ok, actually it's not that simple since we still have to figure out the parseTree/HTML division of labor and where that happens... hmmm... |
I get the idea, except for what is considered a "token" by the lexer without any regex rules... if you'd like to elaborate that might be helpful, but it's obviously not super relevant in your case where you could just take the content whole.
Hmm, you are confused about a couple of things and maybe this is my fault. A So the tokenizer receives a stream of text and breaks it up into tokens according to some rules. For example the following javascript code:
var foo = "bar";
is split into the following tokens:
[
{token: "var", type: "keyword"},
{token: "foo", type: "identifier"},
{token: "=", type: "operator"},
{token: "\"bar\"", type: "string"},
{token: ";", type: "delimiter"}
]
Hope all is clear so far. So the default tokenizer of hjs ( But the mode can define its own tokenizer which uses some other way to split the input text stream into tokens (ie This is also simple: the tokenizer can be called repeatedly as long as it finds tokens in the text, or be called once and return all the tokens at once (actually both approaches are equivalent, don't be confused by this, they are simply two ways of doing the same thing). One approach can be deterministically transformed into the other (that is why they are equivalent). I present the repeated approach (rough), where a state object is used to track state between subsequent calls to the same tokenizer. Each call to the tokenizer returns one token with its value and type (and possibly other info), as the above tokenizing example demonstrates. Then (unfortunately this part is not well demonstrated in my previous comment) the highlight routine takes the token and creates an output by rendering it (eg in html by wrapping it between
This issue is quite old and I am not aware of such functionality. If this is newer and it helps, I will take a look. Can you explain how this works?
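The tokenize-then-render split described in this comment can be sketched end to end for the `var foo = "bar";` example. This is only an illustration under stated assumptions: the `mode.tokenizer` property, the rule table and the `highlightTokens` name are hypothetical, and the default path is a toy regex tokenizer rather than the real hjs one. Whitespace is emitted as untyped tokens so the output reproduces the input.

```javascript
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Toy default tokenizer: ordered [type, sticky regex] pairs, first match wins.
function defaultTokenizer(code) {
  var rules = [
    ['keyword', /(?:var|let|const)\b/y],
    ['identifier', /[A-Za-z_$][\w$]*/y],
    ['string', /"(?:[^"\\]|\\.)*"/y],
    ['operator', /[=+\-*\/]/y],
    ['delimiter', /;/y]
  ];
  var tokens = [];
  var pos = 0;
  while (pos < code.length) {
    var found = null;
    for (var i = 0; i < rules.length && !found; i++) {
      rules[i][1].lastIndex = pos;
      var m = rules[i][1].exec(code);
      if (m) found = { token: m[0], type: rules[i][0] };
    }
    // Anything no rule recognises (e.g. whitespace) passes through untyped.
    if (!found) found = { token: code[pos], type: null };
    tokens.push(found);
    pos += found.token.length;
  }
  return tokens;
}

// Rendering is independent of how the tokens were produced: a mode that
// supplies its own tokenizer (hypothetical property) bypasses the default.
function highlightTokens(code, mode) {
  var tokenizer = (mode && mode.tokenizer) || defaultTokenizer;
  return tokenizer(code).map(function (t) {
    var text = escapeHtml(t.token);
    return t.type ? '<span class="' + t.type + '">' + text + '</span>' : text;
  }).join('');
}

var defaultHtml = highlightTokens('var foo = "bar";', {});
var customHtml = highlightTokens('x', {
  tokenizer: function (code) { return [{ token: code, type: 'comment' }]; }
});
console.log(defaultHtml);
```

Because the renderer only sees `{token, type}` pairs, swapping it for one that targets another output format would not touch the tokenizer at all, which is the decoupling being argued for.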
See: Soon it should be a LOT easier to do this than in the past... You'd use a There is still no way to tie DIRECTLY into the existing tokenizer in real time via a simple plugin, but if someone really needed to do that you can replace the whole token tree/HTML renderer now by swapping two lines of code in the source. The key lines being:
var emitter = new TokenTree();
// ...
result = new HTMLRenderer(emitter, options).value();
One could even imagine allowing this to be configured:
configure({
emitterClass: TokenTree,
htmlRenderer: (emitter, options) => new HTMLRenderer(emitter, options).value()
})
I'm not sure we want to do that (yet or ever), but I'm thinking about it. The API is nice (I think) but it exposes a lot of internals and would make it harder to change the internals in the future. I do think we'll expose the parse tree emitter somehow (right now it's exposed as
Yes, it's very new. Read the plugin docs and check out the PR regarding callbacks for |
You might also find the plugin example here interesting |
+1 for developing a plugin-friendly culture! I will definitely check out the docs when I get some time.
Closing due to new functionality and lack of any activity on this issue. |
Hello, I'm playing around with highlight.js (9.2.0) and want to integrate it with a -grammar add-on (following previous work on syntax-highlighting, for example here), which enables syntax-highlighting code by defining a grammar specification for the language (e.g. in BNF form).
I have already written some integration code (to be uploaded here), but so far, in order to use the grammar parser and integrate with the hljs core highlighter, some mode boilerplate code is needed (for example multiple modes inside contains and dummy lexemesRe, beginRe, endRe functions). It would be easier (and more flexible) if there was some property that allowed a callback or even a static value with the lexeme (i.e. token) to be directly available from the mode itself (and passed directly to the highlighter to be wrapped in <span>[token value]</span> for highlighting).
To be more explicit, consider the fragment from the hljs highlight method below:
Nikos