Fix double-forward, prefer legacy forward maps #4289
Conversation
👋 Thanks for assigning @joostjager as a reviewer!

Going to take another look at this tomorrow before un-drafting it.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##              main    #4289     +/-   ##
    =========================================
      Coverage    86.58%   86.59%
    =========================================
      Files          158      158
      Lines       102287   102368      +81
    =========================================
    + Hits         88568    88644      +76
    - Misses       11304    11311       +7
    + Partials      2415     2413       -2
Force-pushed 89f5d07 to c6bb096.

Force-pushed c6bb096 to 425747e.
lightning/src/ln/channelmanager.rs (outdated diff):

    (17, in_flight_monitor_updates, option),
    (19, peer_storage_dir, optional_vec),
    (21, WithoutLength(&self.flow.writeable_async_receive_offer_cache()), required),
    (23, reconstruct_manager_from_monitors, required),
Oops, sorry for the delay. Writing a bool here to determine whether to look at data we're always writing seems quite weird? I'm not sure what the right answer is, but ideally we run tests with both the new and old code. In the past (with block connection) we've taken a somewhat hacky approach of just flipping a coin and using a random value to decide. In general it's worked and we haven't seen many cases of flaky tests making their way upstream, but it's a bit more annoying for devs. Still, absent a better option (I'm not a huge fan of running the entire test suite twice every time, and even running it an extra time in CI kinda sucks...) that seems reasonable.
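A minimal sketch of the coin-flip approach described above, with an invented helper name and entropy source (LDK's actual tests seed randomness differently):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Decide whether this run exercises the new reconstruction path.
/// Always `false` outside of tests; a coin flip inside them, so both
/// codepaths accumulate coverage across many test runs.
fn use_new_codepath() -> bool {
    if cfg!(test) {
        // Cheap entropy without an RNG dependency: flip on the clock's nanoseconds.
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is before the UNIX epoch")
            .subsec_nanos();
        nanos % 2 == 0
    } else {
        false
    }
}
```

The tradeoff mentioned above applies: a given failure may only reproduce on roughly half of the runs, which is what makes flaky tests harder for devs to chase down.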
Agreed, I went with the random option you mention.
We are still running the entire test suite twice, just not in the same CI run. I think better options are:
- Deliberately pick a few tests that sufficiently cover the logic, and only run those twice.
- Do a more extensive test matrix nightly.
I feel more comfortable running the entire test suite against both the new and old code rather than only selected tests, but adding a whole extra CI run to an already slow CI does suck IMO. So I like the current tradeoff.
How about another environment variable? lol
I think that there will be much more of this for the chan mgr refactor. My draft PRs also assume a safe_channels feature flag and an associated CI job. My preference would be to use the same mechanism here, and accept that while the project is underway, we'll have an additional CI job (a single job for all PRs in the project) that only runs on a single platform. It's not that significant.
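For illustration, a minimal sketch of the feature gating proposed here; the `safe_channels` feature name comes from the comment above, while the function and its split are assumptions:

```rust
// Sketch only: gate the refactored path behind a cargo feature (declared as
// `safe_channels = []` under [features] in Cargo.toml) so one dedicated CI
// job, e.g. `cargo test --features safe_channels`, covers the whole project.
#[cfg(feature = "safe_channels")]
fn load_forward_maps() {
    // New path: reconstruct the forward maps from ChannelMonitor data.
}

#[cfg(not(feature = "safe_channels"))]
fn load_forward_maps() {
    // Legacy path: read the persisted forward maps from the ChannelManager.
}
```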
Force-pushed 7b858c1 to 71062f2.
Addressed feedback, main diff is here. Also pushed some whitespace fixes after.
    // to ensure the legacy codepaths also have test coverage.
    #[cfg(not(test))]
    let reconstruct_manager_from_monitors = false;
    #[cfg(test)]
What happened to the idea of using the safe_channels flag here, so that we can gate this and all other changes in the chan mgr refactor project, and make it worth doing a separate CI run for it?
Using conditional compilation for the legacy code (not safe_channels) might improve readability. I noticed that I did have to pay some attention to the control flow interventions, for example continue statements, when reconstructing. It also makes it easier to delete that code eventually.
    #[cfg(not(test))]
    let reconstruct_manager_from_monitors = false;
    #[cfg(test)]
    let reconstruct_manager_from_monitors = {
@TheBlueMatt pointed out that the way this is currently structured, a future version of LDK that does not write the pending_intercepted_htlcs/forward_htlcs maps will not be able to downgrade to this version of the code, because it only runs the reconstruction logic in tests.
So instead of running reconstruction logic in tests only, we should consider running it if the manager's written version is >= X, where X is a future version where we can assume that the new data is always present and the old data stopped being written.
I'm not sure what version that would be, and I also think we can hold off on this a little until we look into reconstructing more maps. Otherwise we might have some additional complexity (e.g. one var for reconstruct_fwd_maps_from_monitors if version > X, one for reconstruct_claimable_map_from_monitors if version > Y, etc.). But it's worth thinking about and incorporating into upcoming PRs.
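A sketch of the per-map gating this could turn into; the cutoff versions X and Y are explicitly unknown today, so the constants below are deliberate placeholders rather than real values:

```rust
// Placeholders only: X and Y are not yet decided, so the stand-in values are
// unreachable. None of this is LDK's actual serialization logic.
const FWD_MAPS_MIN_VERSION: u8 = u8::MAX; // "X": forward maps always reconstructable
const CLAIMABLE_MAP_MIN_VERSION: u8 = u8::MAX; // "Y": claimable map always reconstructable

struct ReconstructionGates {
    fwd_maps_from_monitors: bool,
    claimable_map_from_monitors: bool,
}

fn gates_for(written_version: u8) -> ReconstructionGates {
    ReconstructionGates {
        fwd_maps_from_monitors: written_version >= FWD_MAPS_MIN_VERSION,
        claimable_map_from_monitors: written_version >= CLAIMABLE_MAP_MIN_VERSION,
    }
}
```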
How can we add a conditional like that if we don't yet know in which version the new data is always present?
Force-pushed 71062f2 to f1ea7bb.
Discussed offline; going to add an environment variable to set which manager reconstruction paths to use. Rebased on main to get the changes from #4296.
Force-pushed f1ea7bb to f245139.
Added the environment variable: diff
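A sketch of what such an environment-variable switch could look like; the variable name here is hypothetical, not necessarily the one the PR actually uses:

```rust
use std::env;

// A CI job would export the (hypothetical) variable to force the new
// reconstruction path; normal runs default to the legacy maps.
fn reconstruct_from_env() -> bool {
    env::var("LDK_RECONSTRUCT_MANAGER_FROM_MONITORS")
        .map(|v| v == "1")
        .unwrap_or(false)
}
```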
joostjager left a comment:
As mentioned before, I reluctantly accept the test input randomization and the reload boolean.
The only thing I'd like to be sure I really understand is the upgrade/downgrade plan that was landed on. Ideally the new data would be written and used in the next release. Postponing that another release is a big decision, and it seems not much is gained from it except the removal of some code that already exists in main.
    // persist that state, relying on it being up-to-date on restart. Newer versions are moving
    // towards reducing this reliance on regular persistence of the `ChannelManager`, and instead
    // reconstruct HTLC/payment state based on `Channel{Monitor}` data if
    // `reconstruct_manager_from_monitors` is set below. Currently it is only set in tests, randomly
Bringing everything together, would the upgrade/downgrade situation look like this?
| Version | Read | Write | Upgrade to version (max) | Downgrade to version (min) |
|---|---|---|---|---|
| 0.2 | legacy | legacy | 0.5 | - |
| 0.3 | legacy | legacy+new | 0.6 | 0.2 |
| 0.4 | legacy+new | legacy+new | 0.6 | 0.2 |
| 0.5 | legacy+new | new | 0.6 | 0.4 |
| 0.6 | new | new | - | 0.4 |
Also wondering, in relation to the discussion in yesterday's sync meeting: how does reconstruct_manager_from_monitors make for a faster path?
> would the upgrade/downgrade situation look like this?
My understanding is that our goal is that 0.3 would support reading new as well, so that 0.5 can downgrade to 0.3 rather than 0.4.
> Also wondering, in relation to the discussion in yesterday's sync meeting: how does reconstruct_manager_from_monitors make for a faster path?
So that we get the above. I don't see a reason to want to only allow 0.5 to downgrade to 0.4 rather than 0.3. The code currently doesn't do that, but presumably we could in a follow-up?
@valentinewallace and I discussed this more. It seems that having a dedicated flag for signaling that the old maps are not written is better than guessing at a future version number where this is the case.
This currently assumes we'll skip one version before merging the final changes (new read and new write only), but the flexibility remains to wait more versions and enlarge the downgrade window.
| Version | Read | Write | Upgrade to version (max) | Downgrade to version (min) | Set flag |
|---|---|---|---|---|---|
| 0.2 (current) | legacy | legacy | 0.4 | any | |
| 0.3 | legacy OR new with flag | legacy+new | any | any | |
| 0.4 | legacy OR new with flag | legacy+new | any | any | |
| 0.5 | new | new | any | 0.3 | X |
> It seems that having a dedicated flag for signaling that the old maps are not written is better than guessing at a future version number where this is the case.
You mean instead of using SERIALIZATION_VERSION/MIN_SERIALIZATION_VERSION constants you want to use a TLV? I guess that's fine, but it seems much simpler to use the version numbers so that we can also drop some of the legacy crap that is written as non-TLVs that we'll never be writing anymore. Don't really see a reason to avoid that.
The upgrade path looks right to me, though.
I still think it is not a great idea to assume things about a specific future version number.
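To make the dedicated-flag idea concrete, a sketch with invented names (LDK's real TLV macros and types are not used here):

```rust
// A writer that has stopped persisting the legacy maps sets an explicit
// marker, so a reader never has to guess based on version numbers.
struct ManagerTlvs {
    // Hypothetical TLV, e.g. written as (N, legacy_maps_omitted, required).
    legacy_maps_omitted: bool,
}

fn must_reconstruct_from_monitors(tlvs: &ManagerTlvs) -> bool {
    // If set, the legacy forward maps are simply absent from the serialized
    // manager, and reconstruction from ChannelMonitor data is the only option.
    tlvs.legacy_maps_omitted
}
```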
You may want to update the PR title and description with details.
Force-pushed f245139 to 17b562c.
Necessary for the next commit and makes it easier to read.
We recently began reconstructing ChannelManager::decode_update_add_htlcs on startup, using data present in the Channels. However, we failed to prune HTLCs from this rebuilt map if a given HTLC had already been forwarded to the outbound edge (we pruned correctly if the outbound edge was a closed channel, but not otherwise). Here we fix this bug, which would have caused us to double-forward inbound HTLCs.
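A simplified sketch of the fix this commit message describes, using stand-in types (LDK's real maps are keyed and structured differently):

```rust
use std::collections::{HashMap, HashSet};

// Stand-ins: scid -> inbound HTLC ids awaiting decode, plus the set of HTLC
// ids already present on any outbound edge, open *or* closed.
fn prune_already_forwarded(
    decode_update_add_htlcs: &mut HashMap<u64, Vec<u64>>,
    already_forwarded: &HashSet<u64>,
) {
    for htlcs in decode_update_add_htlcs.values_mut() {
        // The bug: only HTLCs whose outbound edge was a *closed* channel were
        // pruned. The fix drops anything the outbound edge already knows about.
        htlcs.retain(|id| !already_forwarded.contains(id));
    }
    // Tidy up channels whose pending list is now empty.
    decode_update_add_htlcs.retain(|_, htlcs| !htlcs.is_empty());
}
```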
No need to iterate through all entries in the map, we can instead pull out the specific entry that we want.
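Illustrated with the same stand-in types, `HashMap::remove` pulls out the one entry directly:

```rust
use std::collections::HashMap;

// Fetch-and-remove the single entry we care about in O(1) on average,
// rather than iterating every entry in the map to find it.
fn take_htlcs_for_channel(map: &mut HashMap<u64, Vec<u64>>, scid: u64) -> Vec<u64> {
    map.remove(&scid).unwrap_or_default()
}
```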
We are working on removing the requirement of regularly persisting the ChannelManager, and as a result began reconstructing the manager's forwards maps from Channel data on startup in a recent PR, see cb398f6 and parent commits. At the time, we implemented ChannelManager::read to prefer to use the newly reconstructed maps, partly to ensure we have test coverage of the new maps' usage. This resulted in a lot of code that would deduplicate HTLCs that were present in the old maps to avoid redundant HTLC handling/duplicate forwards, adding extra complexity. Instead, always use the old maps in prod, but randomly use the newly reconstructed maps in testing, to exercise the new codepaths (see reconstruct_manager_from_monitors in ChannelManager::read).
Force-pushed 17b562c to 5a4912c.
joostjager left a comment:
Assuming the upgrade story will be refined in the follow-up, all my comments have been addressed.
Addresses a chunk of the feedback from #4227 (review) (tracked in #4280). Splitting it out for ease of review.