boot: bootutil: Refactor erase functionality to fix watchdog feeding, also fix swap scratch corrupt image issue #2275

nordicjm · 2025-04-23T10:39:31Z

Refactors the erase functionality in bootutil so that it can be used alongside feeding the watchdog. This has also optimised some functions out.

Also fixes a serious issue with swap using scratch that would leave primary slot images unbootable and potentially brick devices.
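
As a rough illustration of the first point, erasing one sector at a time leaves room to feed the watchdog between erases. The sketch below is not the bootutil code: `sim_flash`, `sim_flash_erase`, `sim_watchdog_feed` and `erase_region_feeding_watchdog` are hypothetical names, and a uniform 4 KiB sector size is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simulated flash: 8 sectors of 4 KiB, erased value 0xFF. */
#define SIM_SECTOR_SIZE  4096u
#define SIM_SECTOR_COUNT 8u
#define SIM_FLASH_SIZE   (SIM_SECTOR_SIZE * SIM_SECTOR_COUNT)

uint8_t sim_flash[SIM_FLASH_SIZE];
unsigned sim_watchdog_feeds; /* counts feeds; stands in for a real watchdog */

void sim_watchdog_feed(void)
{
    sim_watchdog_feeds++;
}

/* Stand-in for the OS erase call, treated here as unmodifiable. */
int sim_flash_erase(size_t off, size_t len)
{
    if (off > SIM_FLASH_SIZE || len > SIM_FLASH_SIZE - off) {
        return -1;
    }
    memset(&sim_flash[off], 0xFF, len);
    return 0;
}

/* Erase [off, off + size) one sector at a time, feeding the watchdog
 * after each sector so a long erase cannot starve it. */
int erase_region_feeding_watchdog(size_t off, size_t size)
{
    if (off % SIM_SECTOR_SIZE != 0 || size % SIM_SECTOR_SIZE != 0) {
        return -1;
    }
    if (off > SIM_FLASH_SIZE || size > SIM_FLASH_SIZE - off) {
        return -1;
    }
    for (size_t end = off + size; off < end; off += SIM_SECTOR_SIZE) {
        int rc = sim_flash_erase(off, SIM_SECTOR_SIZE);
        if (rc != 0) {
            return rc;
        }
        sim_watchdog_feed();
    }
    return 0;
}
```

Erasing three sectors with this helper feeds the simulated watchdog three times, whereas a single bulk erase call would give no opportunity to feed it until the whole range is done.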

taltenbach · 2025-04-30T11:46:01Z

boot/bootutil/src/bootutil_misc.c

+            off = flash_sector_get_off(&sector);
+            csize = flash_sector_get_size(&sector);
+
+            rc = flash_area_erase(fa, off, csize);


Perhaps we still have an issue here: depending on the flash driver implementation, the sector returned by flash_area_get_sector might be a virtual sector that is composed of multiple physical sectors. For example, I'm doing that on a project where I have an internal flash memory with 128 KiB physical sectors and an external flash memory with 64 KiB sectors. My flash_map_backend implementation simply uses a virtual sector size of 128 KiB for simplicity, but also to be able to support swap-move/offset on that device.

Perhaps we want to add a backwards argument to flash_area_erase then?

flash_area_erase is an OS function, so it can't be changed

Whatever the API returns as a sector is a sector; you do not care beyond that. So your backend just returns sectors. I do not understand what the benefit would be of knowing what type of sector it really is.

I am working on making MCUboot function with software-defined sectors, which may be considered virtual sectors, and on reducing the code that queries this info all over the place. Software sectors also allow devices with storage on RRAM/NVRAM/MRAM/EEPROM to function more effectively, as they drop physical constraints like alignment. Actually, to ramp up support for RRAM devices, we made them pretend to be flash and have a paged layout, even though it does nothing for them and is only annoying for the software above them.

I mean, let's say my scratch area is 128 KiB and composed of two 64 KiB sectors. Imagine my flash backend uses 128 KiB virtual sectors, so when trying to erase the scratch area backwards, the backend will return a single 128 KiB sector to boot_erase_region, which we will erase with flash_area_erase. The backend doesn't know it has to erase the sectors backwards, so it is quite likely it will start by erasing the first 64 KiB sector of the scratch area. If a power cycle occurs at that moment, I guess we'd observe the same issue spotted by @nordicjm.
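
The scenario in this comment can be modelled with a small simulation. It is only a sketch under stated assumptions: `scratch`, `backend_erase_virtual_sector` and `region_is_erased` are hypothetical names, the backend is assumed to erase its physical sectors in ascending order, and the last bytes of the area stand in for the trailer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PHYS_SECTOR_SIZE (64u * 1024u)   /* physical erase unit */
#define VIRT_SECTOR_SIZE (128u * 1024u)  /* sector size the backend reports */
#define PHYS_PER_VIRT    (VIRT_SECTOR_SIZE / PHYS_SECTOR_SIZE)

/* Simulated scratch area: exactly one virtual sector. */
uint8_t scratch[VIRT_SECTOR_SIZE];

/* Hypothetical backend erase of the single virtual sector. It walks the
 * physical sectors in ascending order. `power_budget` is the number of
 * physical erases that complete before a simulated power loss.
 * Returns the number of physical erases performed. */
unsigned backend_erase_virtual_sector(unsigned power_budget)
{
    unsigned done = 0;

    for (unsigned i = 0; i < PHYS_PER_VIRT && done < power_budget; i++) {
        memset(&scratch[i * PHYS_SECTOR_SIZE], 0xFF, PHYS_SECTOR_SIZE);
        done++;
    }
    return done;
}

/* Returns 1 if every byte in [off, off + len) is erased (0xFF), else 0. */
int region_is_erased(size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (scratch[off + i] != 0xFF) {
            return 0;
        }
    }
    return 1;
}
```

With a budget of one physical erase, the first 64 KiB is erased while the second 64 KiB, including the bytes at the very end of the area, is still intact, even though the caller asked for a backwards erase.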

That's different, because MCUboot would see it as a single 128 KiB sector. If we assume this is being used for swap status, the swap status is going to be at the beginning of that, in the first sector, and MCUboot will not split up the sectors. Though, having thought about it, I guess the patch here may only work because of the limited number of sectors. If there were so many sectors that the trailer section spanned multiple sectors, and power was lost while erasing them (backwards), then I assume the swap would resume earlier than it should, also creating a corrupt image... might need a new device in the simulator for that.

Yeah, I have mentioned that problem with swap algorithms too, several times. If your log crosses into the next page, it will be left there in case of an interruption, and once a new swap starts you will end up in a situation where the next swap tries to overwrite the existing log.
A solution would be to check, on every boot, whether the trailer has an erased magic or the magic is somehow damaged, and if so try to erase the entire presumed log range. That check should happen on all boots.
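
A minimal sketch of that boot-time check, assuming a simulated log range with a 16-byte magic in its last bytes. Everything here is hypothetical: `example_magic` is a placeholder and not MCUboot's real magic value, and `classify_magic` and `cleanup_stale_log_on_boot` are illustrative names rather than existing bootutil functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAGIC_SIZE     16u
#define LOG_RANGE_SIZE 1024u /* hypothetical size of the presumed log range */

enum magic_state { MAGIC_GOOD, MAGIC_ERASED, MAGIC_DAMAGED };

/* Placeholder magic for illustration only; not MCUboot's real value. */
const uint8_t example_magic[MAGIC_SIZE] = {
    0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7,
    0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF
};

/* Simulated log range; the magic occupies the last MAGIC_SIZE bytes. */
uint8_t log_range[LOG_RANGE_SIZE];

/* Classify a magic field as intact, fully erased, or anything else. */
enum magic_state classify_magic(const uint8_t *magic)
{
    size_t erased = 0;

    if (memcmp(magic, example_magic, MAGIC_SIZE) == 0) {
        return MAGIC_GOOD;
    }
    for (size_t i = 0; i < MAGIC_SIZE; i++) {
        erased += (magic[i] == 0xFF);
    }
    return (erased == MAGIC_SIZE) ? MAGIC_ERASED : MAGIC_DAMAGED;
}

/* Intended to run on every boot. Returns 1 if the log range was wiped,
 * 0 if the magic was intact and the log was left alone. */
int cleanup_stale_log_on_boot(void)
{
    const uint8_t *magic = &log_range[LOG_RANGE_SIZE - MAGIC_SIZE];

    if (classify_magic(magic) == MAGIC_GOOD) {
        return 0; /* magic intact: treat the log as live and keep it */
    }
    /* Magic erased or damaged: wipe the whole presumed log range so a
     * later swap does not write over leftover entries. */
    memset(log_range, 0xFF, LOG_RANGE_SIZE);
    return 1;
}
```

A real implementation would probably skip the erase when the range is already clean to avoid needless flash wear; that refinement is left out to keep the sketch short.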

Well, you cannot be sure about that. OK, you may be with that hardware, but if you have a device like SPI NOR, the driver may choose to erase data in bulk and select the command it sends to the device depending on the range you are trying to erase. For example, if you ask it to erase 4k, it will send a command to erase 4k; if you ask it to erase 8k, it will send two 4k commands; but if you ask it to erase 64k, it will erase it with one command. In all of the above cases, the SPI NOR will report the page size as 4k, so you do not actually know how it does the erase in the end.
You may also have a modified driver that will jump out of the erase every 4k to allow other software to function, so it is hard to rely on erase as an atomic operation. And actually even the erase, as the hardware does it, is not atomic, and may be interrupted, leaving some data not erased.
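
To make the command-selection point concrete, here is a sketch of one possible driver policy. `plan_erase` and `struct erase_plan` are hypothetical names, and the rule it applies (use a 64 KiB block erase only when the current offset is 64 KiB aligned and at least 64 KiB remains) is an assumption for illustration; real drivers may choose differently.

```c
#include <stdint.h>

#define ERASE_4K  (4u * 1024u)
#define ERASE_64K (64u * 1024u)

/* How many commands of each size the driver would issue. */
struct erase_plan {
    unsigned cmds_4k;
    unsigned cmds_64k;
};

/* Hypothetical policy: prefer one 64 KiB block erase whenever the current
 * offset is 64 KiB aligned and at least 64 KiB remains, otherwise fall back
 * to a 4 KiB erase. Returns 0 on success, -1 if the range is not 4 KiB
 * aligned. */
int plan_erase(uint32_t off, uint32_t len, struct erase_plan *plan)
{
    plan->cmds_4k = 0;
    plan->cmds_64k = 0;

    if (off % ERASE_4K != 0 || len % ERASE_4K != 0) {
        return -1;
    }
    while (len > 0) {
        if (off % ERASE_64K == 0 && len >= ERASE_64K) {
            plan->cmds_64k++;
            off += ERASE_64K;
            len -= ERASE_64K;
        } else {
            plan->cmds_4k++;
            off += ERASE_4K;
            len -= ERASE_4K;
        }
    }
    return 0;
}
```

Under this policy a 4k request yields one 4k command, an 8k request yields two, and an aligned 64k request yields a single block erase, while the device still advertises a 4k page size to the caller.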

You're absolutely right, but in that case does that mean swap-scratch is broken by design? Indeed, if my understanding of the bug described by @nordicjm is correct, it seems swap-scratch assumes the trailer of the scratch area is erased before the other data contained in the scratch area, which wouldn't be the case in the situations you're describing, even though boot_erase_region is erasing backwards.

And in fact, we might even have an issue for swap-move or swap-offset: the swap_scramble_trailer_sectors routine erases the trailer backwards to ensure we cannot have a case where, after a reboot during the erase of the trailer, the magic number (at the end of the trailer) is valid but the rest of the trailer has already been erased and is therefore corrupted. Since we cannot trust the flash driver to erase the last part of our trailer first, and since the erase is not atomic at the hardware level, I'm wondering if there are some cases where we would have an issue.
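
The property that backwards erasing is meant to provide can be sketched as follows. This is not swap_scramble_trailer_sectors itself: `trailer_area`, `erase_backwards` and `sector_is_erased` are hypothetical names, and the model assumes each per-sector erase either completes or does not start, which is exactly the assumption questioned above.

```c
#include <stdint.h>
#include <string.h>

#define T_SECTOR_SIZE  4096u
#define T_SECTOR_COUNT 4u

/* Simulated trailer spanning several sectors; the magic is assumed to live
 * at the end of the last sector. */
uint8_t trailer_area[T_SECTOR_SIZE * T_SECTOR_COUNT];

/* Erase sectors from the last to the first, stopping after `budget` sectors
 * to model a power loss. Returns the number of sectors erased. */
unsigned erase_backwards(unsigned budget)
{
    unsigned done = 0;

    for (unsigned i = T_SECTOR_COUNT; i > 0 && done < budget; i--) {
        memset(&trailer_area[(i - 1) * T_SECTOR_SIZE], 0xFF, T_SECTOR_SIZE);
        done++;
    }
    return done;
}

/* Returns 1 if sector `idx` is fully erased (all 0xFF), else 0. */
int sector_is_erased(unsigned idx)
{
    for (uint32_t i = 0; i < T_SECTOR_SIZE; i++) {
        if (trailer_area[idx * T_SECTOR_SIZE + i] != 0xFF) {
            return 0;
        }
    }
    return 1;
}
```

In this model, any interruption that erased at least one sector has already erased the sector holding the magic, so a valid magic never sits next to a partly erased trailer. The guarantee disappears as soon as a single erase call can stop halfway or run in a different order internally.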

de-nordic · 2025-04-30T12:48:06Z

boot/bootutil/src/loader.c

+        goto done;
+    } else if (off >= flash_area_get_size(fa) || (flash_area_get_size(fa) - off) < size) {
+        rc = -1;
+        goto done;


Want to add the same check to boot_erase_region?

can do, will do it in separate PR though since CI is very slow today

Added. Think mynewt might be broken with this PR...

What will break it?

CI was running for 4 hours on mynewt. I assume it just locked up in some loop; what broke it, though, I have no clue.

If that keeps happening I will try to run this locally; I should still have a mynewt env available.
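
For reference, the bounds check in the loader.c hunk at the top of this thread is written with a subtraction rather than `off + size`, which keeps it safe from integer wrap-around. The sketch below restates it as a predicate next to a naive variant; `range_fits` and `range_fits_naive` are hypothetical names, and 32-bit offsets are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if [off, off + size) lies inside an area of `area_size` bytes.
 * Uses a subtraction so that off + size is never computed and cannot
 * wrap around. Mirrors the condition in the hunk, inverted. */
bool range_fits(uint32_t area_size, uint32_t off, uint32_t size)
{
    if (off >= area_size) {
        return false;
    }
    return (area_size - off) >= size;
}

/* Naive variant, shown for contrast only: off + size can wrap to a small
 * value and make an out-of-range request look valid. */
bool range_fits_naive(uint32_t area_size, uint32_t off, uint32_t size)
{
    return (uint32_t)(off + size) <= area_size;
}
```

For an area of 0x20000 bytes, offset 0x1000 and size 0xFFFFF000, the naive sum wraps to 0 and passes, while the subtraction-based check rejects the request.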

de-nordic

One comment left, but generally looks ok.

de-nordic · 2025-04-30T19:33:48Z

boot/bootutil/src/bootutil_misc.c

+            size_t csize;
+
+            /* Get current sector and, also, correct offset */
+            rc = flash_area_get_sector(fa, off, &sector);


@nordicjm Something is wrong with this function from mynewt. For the second sector, starting at offset 131072, it returns the sector start offset as 0, which causes line 624 to assign 0 to the offset, so the loop never ends.
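
One way to keep such a lookup bug from hanging the loop is to verify that the returned sector actually contains the queried offset. This is only a sketch, not what the PR does: `walk_sectors`, `get_sector_ok` and `get_sector_buggy` are hypothetical, and the buggy variant simply mimics the behaviour reported above.

```c
#include <stdint.h>

#define SECTOR_SIZE 131072u

struct sector {
    uint32_t off;
    uint32_t size;
};

/* Hypothetical lookup callback: fills `out` with the sector holding `off`. */
typedef int (*get_sector_fn)(uint32_t off, struct sector *out);

/* Well-behaved lookup for uniform sectors. */
int get_sector_ok(uint32_t off, struct sector *out)
{
    out->off = (off / SECTOR_SIZE) * SECTOR_SIZE;
    out->size = SECTOR_SIZE;
    return 0;
}

/* Mimics the reported misbehaviour: always reports sector start 0. */
int get_sector_buggy(uint32_t off, struct sector *out)
{
    (void)off;
    out->off = 0;
    out->size = SECTOR_SIZE;
    return 0;
}

/* Walk sectors forward over [off, off + size). Returns the number of
 * sectors visited, or -1 if the lookup fails or returns a sector that does
 * not contain the queried offset (which would otherwise stall the loop). */
int walk_sectors(get_sector_fn get_sector, uint32_t off, uint32_t size)
{
    uint32_t end = off + size;
    int visited = 0;

    while (off < end) {
        struct sector s;

        if (get_sector(off, &s) != 0) {
            return -1;
        }
        if (s.size == 0 || off < s.off || (off - s.off) >= s.size) {
            return -1; /* sector does not cover `off`: no forward progress */
        }
        visited++;
        off = s.off + s.size;
    }
    return visited;
}
```

With the well-behaved lookup, three sectors are visited; with the buggy one, the walk stops with an error at offset 131072 instead of looping forever.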

Refactors the erase functionality in bootutil so that it can be used alongside feeding the watchdog. This has also optimised some functions out.
Signed-off-by: Jamie McCrae <[email protected]>

Fixes an issue with the swap using scratch algorithm that would cause the image loaded into the primary slot to be corrupt and unbootable if a device was rebooted during an erase of the scratch section that had not completed.
Signed-off-by: Jamie McCrae <[email protected]>

Adds some additional fixes that have been fixed for 2.2.0.
Signed-off-by: Jamie McCrae <[email protected]>

d3zd3z · 2025-05-01T18:32:22Z

I did have one minor concern about how helpful this will be if we add support to be able to lie about having a larger erase size than the underlying device. This would then do the erase forward, which might not fix the issue this fixes.

nordicjm force-pushed the fixwatchdogfeed branch 3 times, most recently from d32100f to 5893f04 Compare April 23, 2025 12:36

nordicjm changed the title ~~boot: bootutil: Refactor erase functionality to fix watchdog feeding~~ [DNR] [DNM] boot: bootutil: Refactor erase functionality to fix watchdog feeding Apr 23, 2025

nordicjm marked this pull request as ready for review April 23, 2025 12:46

nordicjm requested review from davidvincze and de-nordic as code owners April 23, 2025 12:46

nordicjm force-pushed the fixwatchdogfeed branch 2 times, most recently from f5badf3 to 374a654 Compare April 30, 2025 08:58

nordicjm requested a review from d3zd3z as a code owner April 30, 2025 08:58

nordicjm changed the title ~~[DNR] [DNM] boot: bootutil: Refactor erase functionality to fix watchdog feeding~~ boot: bootutil: Refactor erase functionality to fix watchdog feeding, also fix swap scratch corrupt image issue Apr 30, 2025

nordicjm force-pushed the fixwatchdogfeed branch 2 times, most recently from 4d58442 to fe35c27 Compare April 30, 2025 09:04

nordicjm added bug area: core Affects core functionality labels Apr 30, 2025

nordicjm added this to the Release 2.2.0 milestone Apr 30, 2025

taltenbach reviewed Apr 30, 2025

de-nordic reviewed Apr 30, 2025

de-nordic approved these changes Apr 30, 2025

nordicjm force-pushed the fixwatchdogfeed branch 2 times, most recently from 82707e0 to 24b3c97 Compare April 30, 2025 13:27

de-nordic reviewed Apr 30, 2025

nordicjm added 3 commits May 1, 2025 07:32

boot: bootutil: Refactor erase functionality to fix watchdog feeding

5223612

docs: release-notes: Add additional fixes

b6e9ce7

nordicjm force-pushed the fixwatchdogfeed branch from f7fb537 to b6e9ce7 Compare May 1, 2025 06:32

de-nordic approved these changes May 1, 2025

d3zd3z approved these changes May 1, 2025

nordicjm merged commit 990b1fc into mcu-tools:main May 2, 2025
58 checks passed
