[next-master] Question: can we improve the system of dealing with unencoded open/close punctuation? #27

Open
arrowtype opened this issue Mar 17, 2022 · 5 comments

Comments

@arrowtype
Contributor

Currently, the script seems to check for unencoded open/close punctuation only if the glyph name uses the .uc suffix. However, plenty of plausible names fall outside of that. For one, .case is another logical suffix for case-specific punctuation, and punctuation alternates could reasonably use other suffixes as well.

openCloseUnencodedPairs = {
    "parenleft.uc": "parenright.uc",
    "bracketleft.uc": "bracketright.uc",
    "braceleft.uc": "braceright.uc",
    "exclamdown.uc": "exclam.uc",
    "questiondown.uc": "question.uc",
    "guilsinglleft.uc": "guilsinglright.uc",
    "guillemotleft.uc": "guillemotright.uc",
    "guilsinglright.uc": "guilsinglleft.uc",
    "guillemotright.uc": "guillemotleft.uc",
    "slash": "backslash",  # should be encoded, but adding here because those aren't working for some reason
    "backslash": "slash",  # should be encoded, but adding here because those aren't working for some reason
}

I’m making a note of this as something to potentially look into after #26.

@benkiel

benkiel commented Mar 17, 2022

Maybe you don't know about this? robotools/defcon#391 Also, you can use pseudo unicode: split at . see if the first thing has a unicode, use that.
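
The "split at ." idea can be sketched roughly as below. This is only an illustration, not the script's or defcon's actual code: `GLYPH_UNICODES` is a hypothetical stand-in for whatever glyph-name-to-unicode mapping the font object provides, and `pseudo_unicode` is a made-up helper name.

```python
# Hypothetical stand-in for a font's glyph-name -> unicode mapping (assumption).
GLYPH_UNICODES = {
    "parenleft": 0x0028,
    "parenright": 0x0029,
    "parenleft.uc": None,      # unencoded alternate
    "parenright.case": None,   # unencoded alternate
}


def pseudo_unicode(glyph_name, glyph_unicodes):
    """Return the glyph's own unicode, or that of its base name before the first '.'."""
    value = glyph_unicodes.get(glyph_name)
    if value is not None:
        return value
    base_name = glyph_name.split(".", 1)[0]
    # Names like ".notdef" have an empty base; treat them as unencoded.
    if not base_name:
        return None
    return glyph_unicodes.get(base_name)
```

Because the suffix is never inspected, .uc, .case, and any other suffix are all resolved the same way.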

@benkiel

benkiel commented Mar 17, 2022

To be clear, I'd use something to make a pair list of open/close glyphs; BIDI may be good there. Then make a mapping file by splitting off the suffixes to map to the unicode-encoded version, and you can just do a lookup to get the right open/close.
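
A minimal sketch of that lookup, under stated assumptions: `OPEN_TO_CLOSE` is a short hand-written excerpt standing in for a pair list that would really be derived from BIDI data, and `counterpart` is a hypothetical helper, not something the script currently has.

```python
# Hand-written excerpt standing in for a full BIDI-derived pair list (assumption).
OPEN_TO_CLOSE = {
    "parenleft": "parenright",
    "bracketleft": "bracketright",
    "braceleft": "braceright",
    "guilsinglleft": "guilsinglright",
    "guillemotleft": "guillemotright",
}
# Reverse direction, so a closing glyph can find its opening partner too.
CLOSE_TO_OPEN = {close: open_ for open_, close in OPEN_TO_CLOSE.items()}


def counterpart(glyph_name, font_glyph_names):
    """Return the open/close partner of glyph_name with the same suffix, or None."""
    # partition() keeps everything after the first '.', so "a.b.c" -> ("a", ".", "b.c").
    base_name, dot, suffix = glyph_name.partition(".")
    partner_base = OPEN_TO_CLOSE.get(base_name) or CLOSE_TO_OPEN.get(base_name)
    if partner_base is None:
        return None
    partner = partner_base + dot + suffix
    # Only report a partner that actually exists in the font.
    return partner if partner in font_glyph_names else None
```

For example, `counterpart("parenleft.case", {"parenleft.case", "parenright.case"})` gives `"parenright.case"`, with no suffix-specific entries in the table.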

@cjdunn
Owner

cjdunn commented Mar 18, 2022

@benkiel that's a great suggestion! I’m not going to be working on this for a bit, but @arrowtype this seems like it would be helpful if you're going to keep working on this feature. Thank you both!

@arrowtype
Contributor Author

@benkiel thanks so much for pointing this out!

you can use pseudo unicode: split at . see if the first thing has a unicode, use that.

I think that Wei’s suffix handling feature does essentially this, so it might just be a matter of adapting/extending that to work for open/close punctuation, as well.

And then, using BIDI would probably be a big improvement over our current, simplistic way of just listing a bunch of potential open/close punctuation (which is almost certainly not as comprehensive as BIDI).
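
As a rough way to gauge how far BIDI data would get us, the sketch below checks the base characters behind the current list against Python's standard `unicodedata.mirrored()`, which reports the Bidi_Mirrored property. `CURRENT_LIST_BASES` and `split_by_bidi_mirroring` are names made up for this illustration. It only tests whether a character is mirrored; the actual partner characters would still have to come from the Unicode data files, which the standard library does not expose.

```python
import unicodedata

# Encoded base characters behind the entries of the current hand-written list.
CURRENT_LIST_BASES = {
    "parenleft": "(",
    "bracketleft": "[",
    "braceleft": "{",
    "exclamdown": "\u00a1",
    "questiondown": "\u00bf",
    "guilsinglleft": "\u2039",
    "guillemotleft": "\u00ab",
    "slash": "/",
    "backslash": "\\",
}


def split_by_bidi_mirroring(name_to_char):
    """Partition glyph names by whether their character is Bidi_Mirrored."""
    mirrored, not_mirrored = [], []
    for name, char in name_to_char.items():
        if unicodedata.mirrored(char):
            mirrored.append(name)
        else:
            not_mirrored.append(name)
    return mirrored, not_mirrored


mirrored, not_mirrored = split_by_bidi_mirroring(CURRENT_LIST_BASES)
```

The brackets and guillemets should land in `mirrored`, while the inverted exclamation/question marks and slash/backslash should not, so a BIDI-derived list would probably still need a small manual supplement for those.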

@arrowtype
Contributor Author

As a note: I did check whether .case punctuation is handled with the current MM2SC version (0.3.0), and it is not.
