Discussion: can we have programmatic output as well as PDF/LaTeX? #814

Open
yfarjoun opened this issue Feb 16, 2025 · 4 comments
Comments

@yfarjoun
Contributor

There are many constants that are defined in hts-specs, but currently the only way to use them is to manually copy and update them in one's own implementation. If we were to publish one or more artifacts containing these constants, maintainers would be able to import them and use them in their code.

I think that there are two general needed parts here:

  1. Maintain a file that is computer-readable, containing relevant constants organized in a suitable hierarchical format with rich data types.
  2. Emit code that contains classes/constants that the implementers can use directly without having to recode.

Ideally, each of the hts-constants would be defined in a single "original" place, and all the other uses would be automatically generated from that.

Here's my idea for an implementation (based on https://github.com/aantn/reconstant)

Have a configuration file that contains the constants of interest, for example the SamTags, the Sam-header tags, and the "magic" strings.
This configuration file will be the only definitive place for adding/modifying constants.
Artifacts in different languages (Python, Java, C, Rust, LaTeX, R) will be generated via the Makefile.

  • SamTags.tex (for example) will include and use said artifact
  • a version "release" will include packaging up the code and making it available for different languages using the various artifact-distribution options available.

I'm mostly thinking about the SamSpec, but of course other sub-specs (for example VCF, refget, etc.) could each choose whether or not to use this mechanism.
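To make the "single definition, many generated artifacts" idea concrete, here is a minimal Python sketch. It is not reconstant's API and not the code in the PR: the `CONSTANTS` table, the constant names, and both `emit_*` helpers are hypothetical, and the table stands in for whatever the configuration file would contain.

```python
# Hypothetical single source of truth. In the proposal this would live in a
# configuration file; the names on the left are invented for illustration.
CONSTANTS = {
    "SAM_TAG_NM": "NM",
    "SAM_TAG_MD": "MD",
    "SAM_TAG_AS": "AS",
}


def emit_python(constants: dict) -> str:
    """Render the constants as the text of a Python module."""
    lines = ["# Auto-generated; do not edit by hand."]
    lines += [f"{name} = {value!r}" for name, value in constants.items()]
    return "\n".join(lines) + "\n"


def emit_c_header(constants: dict, guard: str = "HTS_CONSTANTS_H") -> str:
    """Render the constants as the text of a C header.

    Assumes every value is plain printable ASCII without quotes or
    backslashes, so no C string escaping is attempted here.
    """
    lines = [
        "/* Auto-generated; do not edit by hand. */",
        f"#ifndef {guard}",
        f"#define {guard}",
    ]
    lines += [f'#define {name} "{value}"' for name, value in constants.items()]
    lines.append(f"#endif /* {guard} */")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(emit_python(CONSTANTS))
    print(emit_c_header(CONSTANTS))
```

A Makefile target could then write each rendered string to a file per language, so that adding a tag means editing only the one table.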

@yfarjoun changed the title from "Discussion can we have programmatic output as well as PDF/LaTeX?" to "Discussion: can we have programmatic output as well as PDF/LaTeX?" Feb 16, 2025
yfarjoun added a commit that referenced this issue Feb 17, 2025
I added all the SamTags and many of the SamHeader tags as well as three Enums and the BAM magic string into three different yaml files. I then ran reconstant on said files and obtained the autogenerated code.

reconstant doesn't currently have an R or a LaTeX output mode, but it's simple enough code that it could be ingested and modified into this code-base, or I could submit a PR and we could continue using it from its current location.

This PR is meant to provide something to discuss in the next meeting, it's not ready for merging.
@zaeleus

zaeleus commented Mar 2, 2025

I like this idea for its formality, but I'm not sure language-specific details or implementations belong with the specification.

Shouldn't libraries, in any particular language, be providing such constants in the first place? For example, in the Rust library noodles, there are SAM data field tags, SAM record flags, etc. Compared with the Rust output in #815, there are differences in nomenclature (e.g., the data tags use full names rather than short codes) and type definitions (e.g., flags have a type-safe wrapper rather than being an integer).

Regarding enums, note that the SAM/BAM specification maintainers don't consider field values to be closed sets (see #725 (comment)). noodles changed enums to common string constants (e.g., SAM header read group platform values) because of this argument.
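The distinction between a closed enum, open string constants, and a type-safe flag wrapper can be sketched in Python. This is not noodles' API (noodles is a Rust library); every class, constant, and function name below is made up for illustration, and only three FLAG bits and two platform values are shown.

```python
from enum import Enum, IntFlag


class Flag(IntFlag):
    """Type-safe wrapper over a few SAM FLAG bits (names are illustrative)."""
    PAIRED = 0x1
    PROPER_PAIR = 0x2
    UNMAPPED = 0x4


class ClosedPlatform(Enum):
    """Closed set: any value that is not listed raises ValueError."""
    ILLUMINA = "ILLUMINA"
    ONT = "ONT"


# Open alternative: plain string constants. Unlisted values stay legal strings.
PLATFORM_ILLUMINA = "ILLUMINA"
PLATFORM_ONT = "ONT"
KNOWN_PLATFORMS = frozenset({PLATFORM_ILLUMINA, PLATFORM_ONT})


def parse_platform_closed(value: str) -> ClosedPlatform:
    """Reject anything outside the enum."""
    return ClosedPlatform(value)


def parse_platform_open(value: str) -> tuple:
    """Return the value unchanged, plus whether it is currently listed."""
    return value, value in KNOWN_PLATFORMS
```

With the closed form, a file that uses a platform value added after the library was released fails to parse; with the open form it is accepted and merely reported as not listed.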

@yfarjoun
Contributor Author

yfarjoun commented Mar 2, 2025

I understand your hesitation, and definitely do not want to pretend that the implementation provided in #815 is ideal or even good. The point of that PR was to provide an implementation that would clarify my intention regarding how hts-specs might provide a single point definition of constants. The details of the implementation can be discussed in that PR, after we discuss here if the idea is worthwhile....

The reason I thought it would make sense for hts-specs to provide a definitive set of constants is that it would make the library easy to include and to recognize. I've seen many (mostly Python) packages that redefine the hts-specs constants that they need. This provides ample opportunity for mistakes and misunderstandings when reading or using these packages.

If the consensus is that such a collection of small libraries is pointless, I'm happy to close this issue....and I also accept the fact that I'm a little late to the game and that the existing libraries are unlikely to include the ones we may release here and make use of them...but I am still curious to see what the community thinks.

@jmarshall
Member

jmarshall commented Mar 4, 2025

I can see the use of something like this: for example, to use @zaeleus's usual bugbear 😄, it would be useful to provide an up-to-date list of the valid @RG-PL values in a machine-readable format. However I don't think the specification should be in the business of inventing additional names for all these tags/keywords/codes/etc, particularly when some implementations may have already invented their own names for them. And IMHO #815's suggested description field is an unnecessary maintenance burden when by definition it accompanies the full description in the spec.

I could support adding something lighter-weight, listing the tags and keywords that are currently defined by the specification that are subject to being added to in future. For example, for SAM/BAM/CRAM this could be a JSON file something like pub/sam.json:

{
  "headers": {"HD": ["VN", "SO", "GO", "SS"], …},
  "HD_SO_values": ["unknown", "unsorted", "queryname", "coordinate"],
  "HD_GO_values": ["none", "query", "reference"],
  "SQ_TP_values": ["linear", "circular"],
  "RG_PL_values": ["CAPILLARY", "DNBSEQ", "ELEMENT", "HELICOS", "ILLUMINA", "IONTORRENT",
                   "LS454", "ONT", "PACBIO", "SINGULAR", "SOLID", "ULTIMA"],
  "record_tags": {"AM": "i", "AS": "i", "BC": "Z", …, "CG": ["B", "I"], …},
  "draft_record_tags": {}
}

IMHO that would suffice, and it would be best to leave it up to implementations what, if anything, they wanted to do with the data in such a file. I don't think it would be worthwhile to have the LaTeX spec derive these items from the machine-readable version; e.g., we have textual descriptions for some of the platform values that would be non-trivial to implement in code in LaTeX. So I don't think adding something like this JSON file would be a big maintenance burden, even though the tags and value keywords are duplicated in it.
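As one illustration of what an implementation might do with such a file, here is a Python sketch that reads an abbreviated copy of the example. The embedded JSON keeps only entries spelled out in the example above (the elided parts are simply omitted, so it is not the real list), and the three helper functions are hypothetical.

```python
import json

# Abbreviated copy of the example file; the elided entries are left out.
SAM_JSON = """
{
  "headers": {"HD": ["VN", "SO", "GO", "SS"]},
  "HD_SO_values": ["unknown", "unsorted", "queryname", "coordinate"],
  "SQ_TP_values": ["linear", "circular"],
  "record_tags": {"AM": "i", "AS": "i", "BC": "Z", "CG": ["B", "I"]},
  "draft_record_tags": {}
}
"""

spec = json.loads(SAM_JSON)


def is_defined_header_tag(spec: dict, record_type: str, tag: str) -> bool:
    """True if the tag is listed for the given header record type."""
    return tag in spec["headers"].get(record_type, [])


def record_tag_type(spec: dict, tag: str):
    """Return the listed type for a record tag, or None if it is not listed."""
    listed = spec["record_tags"].get(tag)
    if listed is None:
        listed = spec["draft_record_tags"].get(tag)
    return listed


def is_listed_value(spec: dict, key: str, value: str) -> bool:
    """True if the value is currently listed under the key.

    False means "not listed in this file", not "invalid": the value sets
    are not closed, so a caller would normally warn rather than reject.
    """
    return value in spec.get(key, [])
```

For instance, `record_tag_type(spec, "BC")` gives `"Z"`, while `is_listed_value(spec, "SQ_TP_values", "plasmid")` gives `False` without implying that the file being read is malformed.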

> Regarding enums, note that the SAM/BAM specification maintainers don't consider field values to be closed sets

Reality also does not consider these field values to be closed sets.

@yfarjoun
Contributor Author

yfarjoun commented Mar 4, 2025

Thanks @jmarshall for the thoughts and comments.

I agree that the autogenerated LaTeX is possibly a step too far, and without it there's no need for the descriptions and types.

The reason I suggested autogenerated code was that it would then be relatively straightforward to autogenerate a collection of libraries/packages (one per language) that could be included into a project with the language-appropriate packaging tool.

I like the idea of a JSON file with the tags/values; I was simply unaware of a good way of including that in a code project. This is not so surprising given that I'm far from being an expert on the matter of software packaging....

Do you or anyone else know of a good way of packaging a JSON file as a first-class citizen in different programming languages?
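For Python at least, one option is to ship the JSON file inside a package and read it with the standard library's `importlib.resources` (the `files()` function, Python 3.9+). The sketch below is only an assumption about how that could look: the package name `hts_constants_demo` and both helper functions are hypothetical, and the throwaway package is created on disk just so the example runs on its own.

```python
import importlib
import importlib.resources
import json
import sys
from pathlib import Path


def load_packaged_json(package: str, resource: str) -> dict:
    """Read a JSON file that ships inside an importable package."""
    path = importlib.resources.files(package).joinpath(resource)
    return json.loads(path.read_text(encoding="utf-8"))


def make_demo_package(root: Path, name: str = "hts_constants_demo") -> str:
    """Create a throwaway package under root that bundles a tiny sam.json.

    Demo only: a real distribution would declare the JSON file as package
    data in its build configuration instead of writing it at runtime.
    """
    pkg = root / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("", encoding="utf-8")
    (pkg / "sam.json").write_text(
        json.dumps({"SQ_TP_values": ["linear", "circular"]}), encoding="utf-8"
    )
    sys.path.insert(0, str(root))   # make the demo package importable
    importlib.invalidate_caches()   # let the import system see the new directory
    return name


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        pkg_name = make_demo_package(Path(tmp))
        print(load_packaged_json(pkg_name, "sam.json"))
```

Other languages have their own ways of embedding or bundling data files, so this covers only the Python side of the question.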

3 participants