Conversation

@valentinewallace (Contributor) commented Dec 16, 2025

Addresses a chunk of the feedback from #4227 (review) (tracked in #4280). Splitting it out for ease of review.

  • Fix a bug that would cause double-forwarding of inbound HTLCs when using reconstructed forward maps
  • Prefer legacy forward maps in production while randomly using reconstructed maps in tests for coverage (see commit message)
  • a few other nits from the aforementioned review

@ldk-reviews-bot commented Dec 16, 2025

👋 Thanks for assigning @joostjager as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@valentinewallace (Contributor Author):

Going to take another look at this tomorrow before un-drafting it

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 74.47917% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.59%. Comparing base (c9f022b) to head (5a4912c).
⚠️ Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lightning/src/ln/channelmanager.rs | 74.73% | 43 Missing and 5 partials ⚠️ |
| lightning/src/ln/channel.rs | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4289   +/-   ##
=======================================
  Coverage   86.58%   86.59%           
=======================================
  Files         158      158           
  Lines      102287   102368   +81     
  Branches   102287   102368   +81     
=======================================
+ Hits        88568    88644   +76     
- Misses      11304    11311    +7     
+ Partials     2415     2413    -2     
| Flag | Coverage | Δ |
|---|---|---|
| fuzzing | 35.95% <15.42%> | -0.90% ⬇️ |
| tests | 85.87% <74.47%> | +<0.01% ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@valentinewallace valentinewallace force-pushed the 2025-12-reconstruct-fwds-followup branch from 89f5d07 to c6bb096 Compare December 17, 2025 20:25
@valentinewallace valentinewallace force-pushed the 2025-12-reconstruct-fwds-followup branch from c6bb096 to 425747e Compare December 17, 2025 21:13
@valentinewallace valentinewallace added this to the 0.3 milestone Jan 6, 2026
(17, in_flight_monitor_updates, option),
(19, peer_storage_dir, optional_vec),
(21, WithoutLength(&self.flow.writeable_async_receive_offer_cache()), required),
(23, reconstruct_manager_from_monitors, required),
Collaborator:

Oops, sorry for the delay. Writing a bool here to determine whether to look at data we're always writing seems quite weird. I'm not sure what the right answer is, but ideally we run tests with both the new and old code. In the past (with block connection) we've taken a somewhat hacky approach of just flipping a coin and using a random value to decide. In general it's worked and we haven't seen many cases of flaky tests making their way upstream, but it's a bit more annoying for devs. Still, absent a better option (I'm not a huge fan of running the entire test suite twice every time; even running it an extra time in CI kinda sucks...), that seems reasonable.

Contributor Author:

Agreed, I went with the random option you mention.

Contributor:

We are now still running the entire test suite twice, but not on the same CI run. I think better options are:

  • Deliberately pick a few tests that sufficiently cover the logic, and only run those twice.
  • Do a more extensive test matrix nightly.

@valentinewallace (Contributor Author) Jan 8, 2026:

I feel more comfortable running the entire test suite with the new vs. old code rather than only selected tests, but adding a whole extra CI run to an already slow CI does suck IMO. So I like this current tradeoff.

How about another environment variable? lol

@joostjager (Contributor) Jan 8, 2026:

I think that there will be much more of this for the chan mgr refactor. My draft PRs also assume a safe_channels feature flag and an associated CI job. My preference would be to use the same mechanism here, and accept that while the project is underway, we'll have an additional CI job (a single job for all PRs in the project) that only runs on a single platform. It's not that significant.

@valentinewallace valentinewallace force-pushed the 2025-12-reconstruct-fwds-followup branch 3 times, most recently from 7b858c1 to 71062f2 Compare January 7, 2026 20:42
@valentinewallace
Copy link
Contributor Author

Addressed feedback, main diff is here. Also pushed some whitespace fixes after.

@valentinewallace valentinewallace marked this pull request as ready for review January 7, 2026 20:44
@valentinewallace valentinewallace requested review from TheBlueMatt and joostjager and removed request for wpaulino January 7, 2026 20:49
// to ensure the legacy codepaths also have test coverage.
#[cfg(not(test))]
let reconstruct_manager_from_monitors = false;
#[cfg(test)]
@joostjager (Contributor) Jan 8, 2026:

What happened to the idea of using the safe_channels flag here, so that we can gate this and all other changes in the chan mgr refactor project, and make it worth doing a separate CI run for it?

Using conditional compilation for the legacy code (not safe_channels) might improve readability. I noticed that I did have to pay some attention to the control-flow interventions when reconstructing, for example the continue statements. It would also make it easier to delete that code eventually.

Comment on lines +17953 to +17946
#[cfg(not(test))]
let reconstruct_manager_from_monitors = false;
#[cfg(test)]
let reconstruct_manager_from_monitors = {
Contributor Author:

@TheBlueMatt pointed out that the way this is currently structured, a future version of LDK that does not write the pending_intercepted_htlcs/forward_htlcs maps will not be able to downgrade to this version of the code, because it only runs the reconstruction logic in tests.

So instead of running reconstruction logic in tests only, we should consider running it if the manager's written version is >= X, where X is a future version where we can assume that the new data is always present and the old data stopped being written.

I'm not sure what version that would be, and I also think we can hold off on this a little until we look into reconstructing more maps. Otherwise we might have some additional complexity (e.g. one var for reconstruct_fwd_maps_from_monitors if version > X, one for reconstruct_claimable_map_from_monitors if version > Y, etc.). But it's worth thinking about and incorporating into upcoming PRs.

Contributor:

How can we add a conditional like that if we don't yet know in which version the new data will always be present?

@valentinewallace valentinewallace self-assigned this Jan 8, 2026
@valentinewallace valentinewallace moved this to Goal: Merge in Weekly Goals Jan 8, 2026
@valentinewallace valentinewallace added the weekly goal Someone wants to land this this week label Jan 8, 2026
@valentinewallace valentinewallace force-pushed the 2025-12-reconstruct-fwds-followup branch from 71062f2 to f1ea7bb Compare January 8, 2026 19:01
@valentinewallace
Copy link
Contributor Author

Discussed offline, going to add an environment variable to set which manager reconstruction paths to use. Rebased on main to get the changes from #4296.

@valentinewallace valentinewallace force-pushed the 2025-12-reconstruct-fwds-followup branch from f1ea7bb to f245139 Compare January 8, 2026 19:31
@valentinewallace
Copy link
Contributor Author

Added the environment variable: diff

@joostjager (Contributor) left a comment:

As mentioned before, I reluctantly accept the test input randomization and the reload boolean.

The only thing I'd like to be sure I really understand is the upgrade/downgrade plan that was landed on. Ideally the new data would be written and used in the next release. Postponing that another release is a big decision, and it seems there is not that much gained from it except for the removal of some code that already exists in main.

// persist that state, relying on it being up-to-date on restart. Newer versions are moving
// towards reducing this reliance on regular persistence of the `ChannelManager`, and instead
// reconstruct HTLC/payment state based on `Channel{Monitor}` data if
// `reconstruct_manager_from_monitors` is set below. Currently it is only set in tests, randomly
Contributor:

Bringing everything together, would the upgrade/downgrade situation look like this?

| Version | Read | Write | Upgrade to version (max) | Downgrade to version (min) |
|---|---|---|---|---|
| 0.2 | legacy | legacy | 0.5 | - |
| 0.3 | legacy | legacy+new | 0.6 | 0.2 |
| 0.4 | legacy+new | legacy+new | 0.6 | 0.2 |
| 0.5 | legacy+new | new | 0.6 | 0.4 |
| 0.6 | new | new | - | 0.4 |

Also wondering - in relation to the discussion in yesterday's sync meet - how the reconstruct_manager_from_monitors makes for a faster path?

@TheBlueMatt (Collaborator) Jan 11, 2026:

> would the upgrade/downgrade situation look like this?

My understanding is that our goal is that 0.3 would support reading new as well, so that 0.5 can downgrade to 0.3 rather than 0.4.

> Also wondering - in relation to the discussion in yesterday's sync meet - how the reconstruct_manager_from_monitors makes for a faster path?

So that we get the above. I don't see a reason to want to only allow 0.5 to downgrade to 0.4 rather than 0.3. The code currently doesn't do that but presumably in a followup we could do that?

@joostjager (Contributor) Jan 12, 2026:

@valentinewallace and I discussed this more. It seems that having a dedicated flag for signaling that the old maps are not written is better than guessing at a future version number where this is the case.

This currently assumes we'll skip one version before merging the final changes (new read and new write only), but the flexibility remains to wait more versions and enlarge the downgrade window.

| Version | Read | Write | Upgrade to version (max) | Downgrade to version (min) | Set flag |
|---|---|---|---|---|---|
| 0.2 (current) | legacy | legacy | 0.4 | any | |
| 0.3 | legacy OR new with flag | legacy+new | any | any | |
| 0.4 | legacy OR new with flag | legacy+new | any | any | |
| 0.5 | new | new | any | 0.3 | X |

Collaborator:

> It seems that having a dedicated flag for signaling that the old maps are not written is better than guessing at a future version number where this is the case.

You mean instead of using SERIALIZATION_VERSION/MIN_SERIALIZATION_VERSION constants you want to use a TLV? I guess that's fine, but it seems much simpler to use the version numbers so that we can also drop some of the legacy crap that is written as non-TLVs that we'll never be writing anymore. Don't really see a reason to avoid that.

The upgrade path looks right to me, though.

Contributor:

I still think it is not a great idea to assume things about a specific future version number.


@joostjager (Contributor):

You may want to update the PR title and description with details

@ldk-reviews-bot:

🔔 1st Reminder

Hey @TheBlueMatt! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.


@valentinewallace valentinewallace changed the title Follow-ups to #4227 (Part 1) Fix double-forward, prefer legacy forward maps Jan 12, 2026
@valentinewallace valentinewallace force-pushed the 2025-12-reconstruct-fwds-followup branch from f245139 to 17b562c Compare January 12, 2026 21:02
Necessary for the next commit and makes it easier to read.
We recently began reconstructing ChannelManager::decode_update_add_htlcs on
startup, using data present in the Channels. However, we failed to prune HTLCs
from this rebuilt map if a given HTLC was already forwarded to the outbound
edge (we pruned correctly if the outbound edge was a closed channel, but not
otherwise). Here we fix this bug, which would have caused us to double-forward
inbound HTLCs.
No need to iterate through all entries in the map, we can instead pull out the
specific entry that we want.
We are working on removing the requirement of regularly persisting the
ChannelManager, and as a result began reconstructing the manager's forwards
maps from Channel data on startup in a recent PR, see
cb398f6 and parent commits.

At the time, we implemented ChannelManager::read to prefer to use the newly
reconstructed maps, partly to ensure we have test coverage of the new maps'
usage. This resulted in a lot of code that would deduplicate HTLCs that were
present in the old maps to avoid redundant HTLC handling/duplicate forwards,
adding extra complexity.

Instead, always use the old maps in prod, but randomly use the newly
reconstructed maps in testing, to exercise the new codepaths (see
reconstruct_manager_from_monitors in ChannelManager::read).
@valentinewallace valentinewallace force-pushed the 2025-12-reconstruct-fwds-followup branch from 17b562c to 5a4912c Compare January 12, 2026 21:15
@joostjager (Contributor) left a comment:

Assuming the upgrade story will be refined in the follow up, all my comments have been addressed.

@TheBlueMatt TheBlueMatt merged commit c5d7b13 into lightningdevkit:main Jan 13, 2026
19 of 20 checks passed
@github-project-automation github-project-automation bot moved this from Goal: Merge to Done in Weekly Goals Jan 13, 2026