Skip to content

Conversation

@Yoshanuikabundi
Copy link
Collaborator

Fixes #

Changes made in this Pull Request:

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@Yoshanuikabundi
Copy link
Collaborator Author

Yoshanuikabundi commented Oct 22, 2025

Hi @mattwthompson, thanks for agreeing to a review! This is obviously a massive changeset and it's going to keep getting bigger, but at this stage I'd just like a review on chapters 1-4 of the "Pablo Book" and any API docs linked therein. I'd specifically like a sanity check on:

  1. Is the writing clear, comprehensible, and interesting?
  2. Does the example code need further explanation or explication?
  3. Are you left with important questions you can't answer?
  4. Was anything a slog to read?
  5. Do any of the described APIs feel unnecessarily painful?
  6. Are there any important undocumented use-cases or things you wish were covered?
  7. Is the information presented in a logical order?
  8. Any other suggestions or comments you might have :)

Spelling and grammar mistakes would be great if you spot them, but I'd prefer a user-like higher level read for information content at this stage and can go over everything with a fine toothed comb when it's more settled. Feel free to just read the rendered docs and leave a big comment if it's easier!

Thanks!!

EDIT: Oh, and I know the PDB files I use in the demo code are a bit of a mess (you'll see when you load the page), planning on tidying them up later!

Yoshanuikabundi and others added 7 commits October 23, 2025 14:47
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.12.10 → v0.12.11](astral-sh/ruff-pre-commit@v0.12.10...v0.12.11)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.12.11 → v0.12.12](astral-sh/ruff-pre-commit@v0.12.11...v0.12.12)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.12.12 → v0.13.0](astral-sh/ruff-pre-commit@v0.12.12...v0.13.0)

* Update monthly

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthew W. Thompson <[email protected]>
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.13.0 → v0.13.1](astral-sh/ruff-pre-commit@v0.13.0...v0.13.1)
Yoshanuikabundi and others added 4 commits October 23, 2025 16:33
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.13.1 → v0.13.3](astral-sh/ruff-pre-commit@v0.13.1...v0.13.3)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Copy link
Member

@mattwthompson mattwthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking great! I would be lying if I said I was excited to read the nitty gritty on loading PDB files and custom modifications therein, but I found this easy to get through. I feel like I understand the tool better now and would be better equipped to leverage the API.

I went through the first four chapters in detail and the API docs with more of a skim. You'll see that most of my comments are a mixture of me not being less of a battle-warn biomolecular modeling practitioner than you and possibly being to nitpicky on some details you're already planning to smooth over.

Comments on "the book" first:

  • The first example, loading 2ZUQ, emits a warning about atom ordering. I wouldn't note this except that it comes immediately after telling the user that Pablo "just works" (except for a separate case).
  • " ... as long as the ligands are complete and named in the standard way." Complete as in not missing hydrogens? And what's the standard way? (If I should already know both answers, take this as gaps in my background.)
  • Multi-model warning with 1A4T, can this be worked around or acknowledged?
  • Why does the 5AP1 ligand need additional_definitions but 3H34 "just works" without it?
  • In the beginning of Chapter 4, custom residue definitions are reference without, I think, fully being introduced. Reordering the next chapter is probably unadvised
  • Many (all?) of the links to API methods/functions are filled out the first time one is mentioned and not the later. Hassle, I know, but having them all point to the API docs would be nice.
  • 4.2.1 - At first I wondered if this handled protonation and deprotonation, since the example only covers adding protons. The text itself is clear enough that it covers both (assuming I understand the chemistry correctly) but it might be helpful to make this doubly explicit or also include a deprotonation example.
  • How would with_vsite_water meaningfully "apply to residues added later" ? I understand how this might be relevant for virtual sites on ligands or proteins, but not water. Doesn't that method already do everything the user is likely to need with 4+-site waters?
  • I struggled to understand what "The patch is a function from a single residue definition to a list of residue definitions." means in prose. The type annotation helps, but I had to look elsewhere for it
  • 4.3 - At first I was pretty confused what novel behavior this was showing (I'm right in assuming the CCD would handle this just fine?). It might be useful to make explicit that this is just showing off a way to use a more minimal residue library. If I'm understanding it correctly that is

API/UX questions:

  • How often is $XDG_CACHE_HOME overriden from the default? It read at first like an important implementation detail of managing the cache but now seems like something in which the default works fine except for users who might override it.
  • What's the motivation of the format argument? The behavior of the default value covers all of what I'd expect a reasonable user to provide
  • STD_CCD_CACHE.with_ looks odd - what's with the trailing underscore?
  • There are plenty of with_x methods and it was confusing at first that some are on CcdCache and others are on ResidueDefinition
  • High-level docstring of CcdCache.without confused me at first since, even though the argument takes residue names, the effect is removing residue definitions. At first, I thought there was an asymmetry compared to .with_(
  • In general, the one-line docstrings of CcdCache.withX are a little inconsistent. It bears repeating in each, I think, that a new CcdCache is returned

@Yoshanuikabundi
Copy link
Collaborator Author

Thanks for your comments Matt! I addressed them today:

The first example, loading 2ZUQ, emits a warning about atom ordering. I wouldn't note this except that it comes immediately after telling the user that Pablo "just works" (except for a separate case).

Have changed to a different PDB that doesn't have the warning

" ... as long as the ligands are complete and named in the standard way." Complete as in not missing hydrogens? And what's the standard way? (If I should already know both answers, take this as gaps in my background.)

Yes complete as in not missing hydrogens - have made this more explicit. The standard way is just whatever it says in the CCD, I discuss it in the last section of the quickstart, but I've tried to make this clearer.

Multi-model warning with 1A4T, can this be worked around or acknowledged?

Have acknowledged it!

Why does the 5AP1 ligand need additional_definitions but 3H34 "just works" without it?

Clarified that this refers to the same standardness discussed previously

In the beginning of Chapter 4, custom residue definitions are reference without, I think, fully being introduced. Reordering the next chapter is probably unadvised

I've made this seem a little less like I forgot something at least.

Many (all?) of the links to API methods/functions are filled out the first time one is mentioned and not the later. Hassle, I know, but having them all point to the API docs would be nice.

This is the ongoing fight between readability and navigability. I've added more links to things that haven't been linked in a paragraph or two.

4.2.1 - At first I wondered if this handled protonation and deprotonation, since the example only covers adding protons. The text itself is clear enough that it covers both (assuming I understand the chemistry correctly) but it might be helpful to make this doubly explicit or also include a deprotonation example.

I've made it clearer that the given example both adds and removes protons!

How would with_vsite_water meaningfully "apply to residues added later" ? I understand how this might be relevant for virtual sites on ligands or proteins, but not water. Doesn't that method already do everything the user is likely to need with 4+-site waters?

People can always add new residue definitions, even for residues already in the cache, but I agree this is covering an unusual edge case. I've softened the language a little.

I struggled to understand what "The patch is a function from a single residue definition to a list of residue definitions." means in prose. The type annotation helps, but I had to look elsewhere for it

Hmm I'm used to "function from x to y" meaning "function that takes arguments x and returns y". Have clarified.

4.3 - At first I was pretty confused what novel behavior this was showing (I'm right in assuming the CCD would handle this just fine?). It might be useful to make explicit that this is just showing off a way to use a more minimal residue library. If I'm understanding it correctly that is

Yeah that's right, I just wanted to stress the point that all chemical info in Pablo can be changed by the user

API/UX questions:

How often is $XDG_CACHE_HOME overriden from the default? It read at first like an important implementation detail of managing the cache but now seems like something in which the default works fine except for users who might override it.

Clarified that this is a cross-OS standard and de-emphasized the role of the environment variable. I have no idea how many people change this variable but there was a big push a few years ago to get devs to stop writing stuff to ~/.appname and start using the XDG standard.

What's the motivation of the format argument? The behavior of the default value covers all of what I'd expect a reasonable user to provide

The format argument is mostly there for loading file-like objects.

STD_CCD_CACHE.with_ looks odd - what's with the trailing underscore?

with is a keyword in Python, and the standard way to reuse a keyword in a name is with a trailing underscore

There are plenty of with_x methods and it was confusing at first that some are on CcdCache and others are on ResidueDefinition

with_x is just my way of saying "make a copy of this with something changed". It comes from Rust and is even more helpful in Python, where the type system doesn't tell you this info. But I have added CcdCache.with_varied_protonation because I can see that confusion!

High-level docstring of CcdCache.without confused me at first since, even though the argument takes residue names, the effect is removing residue definitions. At first, I thought there was an asymmetry compared to .with_(... In general, the one-line docstrings of CcdCache.withX are a little inconsistent. It bears repeating in each, I think, that a new CcdCache is returned

Oof yeah taking a second look at these I see what you mean. Fixed!

@Yoshanuikabundi Yoshanuikabundi marked this pull request as ready for review November 1, 2025 03:02
Copy link
Member

@j-wags j-wags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I focused on reading the book "cover-to-cover" and following my nose down some other functionality rabbit holes. it took me a few hours so I didn't review the non-book code changes too closely. Overall this is absolutely excellent. A lot of suggestions and only a few blockers before I think this is ready for merge.

  • (not blocking) Could be good to tell people where the example files live (I'm pretty good at this stuff and it was the 4th place I looked)
  • The heme example doesn't actually load it correctly, and may be too clever to be a good intro:
    • (blocking) The hemes are supposed to be covalently bound to the protein, but aren't in the pablo-loaded topology (see footnote [1] below) (this is probably a limitation of the preparation process, but it makes it a weaker headline example regardless)
    • (not blocking) We can't parameterize iron, even if it were loaded correctly, which makes this not-the-best-first-example
    • (not blocking) Could consider putting a simpler topology here, and moving this to a later page where we can give more context, since it is really impressive
  • (not blocking) The tooltips are really slick, great work on those
  • (not blocking) Could be cool to include a pdbx/mmcif in the quickstart
  • (not blocking) It could be good to include something like a philosophy/use case overview between 1.1 and 1.2, like
    • Normal usage of pablo expects that things will "just work" if your inputs are made of standard residues, or you'll get a helpful error message explaining how your input file needs to change or which residues need additional information to be loaded. We intend to offer several options in our API for users to teach pablo about nonstandard residues. It's important to note that we intend for it it be very rare for users to custom residues atom-by-atom. Instead we offer several API points to allow users to copy and modify existing residue definitions, and we expect these will be the shortest path to defining custom residues in the vast majority of cases.
  • (not blocking) In section 4: "The fact that a residue library admits multiple residue definitions for each name allows alternate protonation states and resonance forms to be included in a residue library." - The inclusion of "resonance forms" here gives me pause. I think this means to say "you could put the charge on the other oxygen in a carboxylate" but I'm thinking like "you could permute the double bonds in an aromatic ring", but then the implication that this would be additive (like, we're telling the user they could have residue definitions for BOTH permutations of double bonds around a ring) creates more questions than answers. I might recommend just leaving this at "protonation states"
  • (not blocking) Drawing leaving atoms in light grey by default is SO COOL

[1] The PDB website shows two cysteines binding the hemes, but the loaded complex doesn't have them

Image

However if bonds existed between the protein and hemes, they'd all be molecule 0. Instead molecule 1 is a disconnected heme.

topology.molecule(1).to_smiles()

yields

 '[H][O][C](=[O])[C]([H])([H])[C]([H])([H])[C]1=[C]2/[C]([H])=[C]3/[C]([C]([H])([H])[C]([H])([H])[C](=[O])[O][H])=[C]([C]([H])([H])[H])[C]4=[N]3->[Fe@]35<-[N]6=[C]([C]([C]([H])=[C]([H])[H])=[C]([C]([H])([H])[H])/[C]6=[C](\\[H])[C](=[C]1[C]([H])([H])[H])[N]23)/[C]([H])=[C]1/[C]([C]([H])([H])[H])=[C]([C]([H])=[C]([H])[H])/[C](=[C]/4[H])[N]15'

@jameseastwood jameseastwood self-requested a review November 4, 2025 04:03
@Yoshanuikabundi
Copy link
Collaborator Author

(not blocking) Could be good to tell people where the example files live (I'm pretty good at this stuff and it was the 4th place I looked)

I've moved them to a subdirectory of the docs so they live next to the documentation source

The heme example doesn't actually load it correctly, and may be too clever to be a good intro

Great catch, I've replaced this with a freshly prepared PDB file with a more Sage-friendly noncovalent ligand

(not blocking) Could be cool to include a pdbx/mmcif in the quickstart

Will give this a miss as I think it'll distract from what the library is really trying to do.

(not blocking) It could be good to include something like a philosophy/use case overview between 1.1 and 1.2, like

Added with some edits

(not blocking) In section 4: "The fact that a residue library admits multiple residue definitions for each name allows alternate protonation states and resonance forms to be included in a residue library." - The inclusion of "resonance forms" here gives me pause. I think this means to say "you could put the charge on the other oxygen in a carboxylate" but I'm thinking like "you could permute the double bonds in an aromatic ring", but then the implication that this would be additive (like, we're telling the user they could have residue definitions for BOTH permutations of double bonds around a ring) creates more questions than answers. I might recommend just leaving this at "protonation states"

Permuting the double bonds in the aromatic ring is absolutely allowed, and a user could absolutely have both permutations of a double bonds around a ring. This is important for eg. resonance forms of charged substituents of an aromatic system where the charge moves around the ring, changing the formal bond orders in the aromatic system. I've clarified what happens when multiple resonance forms match - as long as they assign the same net formal charge, elements, and connectivity, the first match is chosen.

Copy link
Contributor

@jameseastwood jameseastwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of my comments should be considered blocking. Overall, this documentation looks great.


PDB files are a highly permissive format widely used in molecular modeling. The rules for how a PDB file should be written or read are vague enough --- and the chemistry familiar enough --- that the format often runs on the principle of "you know what I mean". For traditional MD workflows, this permissiveness is an asset. Simulation engines can write PDB files with arbitrary virtual sites or coarse-grainings safe in the knowledge that most tools will be able to make do, at least in the biochemical context where the user has force field parameters.

The OpenFF ecosystem, however, relies on access to detailed chemical information of the modeled molecular system. In particular, we require a complete chemical graph; every atom, including hydrogens, their atomic numbers and formal charges, and their connectivity with bond orders. We use this rich chemical information to assign accurate force field parameters to an enormous variety of molecules. By contrast, PDB files have no specified facility for the specification of bond orders, and routinely omit bonds, formal charges, and hydrogen atoms altogether.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "PDB files have no facility for the specification of bond orders" accurate? To avoid repeating "specify"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's slightly less accurate - there is an unusual/rare convention that duplicate CONECT records can represent higher bond orders, but it's not in the PDB spec (hence the first specified). Have rephrased the sentence "By contrast, PDB files routinely omit bonds, formal charges, and hydrogen atoms, and the PDB spec provides no way at all to define bond orders for arbitrary chemistries."


## PDBx/mmCIF files

Pablo has basic support for PDBx/mmCIF files. This works by reading a subset of data entries in the file that are equivalent to the columns of a PDB file's `ATOM/HETATM` records and passing that data through the same matching and topology construction pipeline used for PDB files. This ignores most of the chemical information in the PDBx/mmCIF file. If you have a use-case for the PDBx/mmCIF format that isn't covered by the existing functionality, please let us know --- if there's demand, we'll implement it!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my professional responsibility to stop developers from promising users that we will implement some feature. Can we re-word this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "please let us know so we can consider implementing it!" work?


Sometimes you need to create a custom residue definition. This might be when you want to augment the CCD cache with a residue of your own, or when creating your own residue library. All residue definitions are instances of the [`ResidueDefinition`] class. That class has a large number of `from_*()` methods that can be used to easily and quickly generate residue definitions.

For ligands and single-residue molecules, the reference docs for that class might be all you need. Unfortunately, preparing residue definitions that can actually link to each other in a polymer requires a little bit more understanding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omit "Unfortunately"

Copy link
Collaborator Author

@Yoshanuikabundi Yoshanuikabundi Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this omission I felt like the whole paragraph doesn't hold its weight - I intended it as a sort of oblique "sorry this is so complicated". I've removed the paragraph.

Copy link
Member

@j-wags j-wags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dope. So excited.

@Yoshanuikabundi Yoshanuikabundi merged commit 6d338cb into main Nov 5, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants