Skip to content

boot: bootutil: Refactor erase functionality to fix watchdog feeding, also fix swap scratch corrupt image issue #2275

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 2, 2025

Conversation

nordicjm
Copy link
Collaborator

@nordicjm nordicjm commented Apr 23, 2025

Refactors the erase functionality in bootutil so that it can be used alongside feeding the watchdog. This has also optimised some functions out.

Also fixes a serious issue with swap using scratch that would leave primary slot images unbootable and potentially brick devices

@nordicjm nordicjm force-pushed the fixwatchdogfeed branch 3 times, most recently from d32100f to 5893f04 Compare April 23, 2025 12:36
@nordicjm nordicjm changed the title boot: bootutil: Refactor erase functionality to fix watchdog feeding [DNR] [DNM] boot: bootutil: Refactor erase functionality to fix watchdog feeding Apr 23, 2025
@nordicjm nordicjm marked this pull request as ready for review April 23, 2025 12:46
@nordicjm nordicjm force-pushed the fixwatchdogfeed branch 2 times, most recently from f5badf3 to 374a654 Compare April 30, 2025 08:58
@nordicjm nordicjm requested a review from d3zd3z as a code owner April 30, 2025 08:58
@nordicjm nordicjm changed the title [DNR] [DNM] boot: bootutil: Refactor erase functionality to fix watchdog feeding boot: bootutil: Refactor erase functionality to fix watchdog feeding, also fix swap scratch corrupt image issue Apr 30, 2025
@nordicjm nordicjm force-pushed the fixwatchdogfeed branch 2 times, most recently from 4d58442 to fe35c27 Compare April 30, 2025 09:04
@nordicjm nordicjm added bug area: core Affects core functionality labels Apr 30, 2025
@nordicjm nordicjm added this to the Release 2.2.0 milestone Apr 30, 2025
off = flash_sector_get_off(&sector);
csize = flash_sector_get_size(&sector);

rc = flash_area_erase(fa, off, csize);
Copy link
Contributor

@taltenbach taltenbach Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we still have an issue here as depending on the flash driver implementation, the sector returned by flash_area_get_sector might be a virtual sector, that is composed of multiple physical sectors. For example, I'm doing that on a project where I have an internal flash memory with 128 KiB physical sectors and an external flash memory with 64 KiB sector. My flash_map_backend implementation simply use a virtual sector size of 128 KiB for simplicity, but also to be able to support swap-move/offset on that device.

Perhaps we want to add a backwards argument to flash_area_erase then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flash_area_erase is an OS function, so it can't be changed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we still have an issue here as depending on the flash driver implementation, the sector returned by flash_area_get_sector might be a virtual sector, that is composed of multiple physical sectors. For example, I'm doing that on a project where I have an internal flash memory with 128 KiB physical sectors and an external flash memory with 64 KiB sector. My flash_map_backend implementation simply use a virtual sector size of 128 KiB for simplicity, but also to be able to support swap-move/offset on that device.

Perhaps we want to add a backwards argument to flash_area_erase then?

Whatever API returns as a sector is a sector, you do not care beyond that. So your backend just returns sectors. I do not understand what would be benefit of knowing what type of sector that really is.

I am working at making MCUboot funciton with software defined sectors, that may be considered virtual sectors, and reducing code that queries the info allover the place. Software sectors also allow devices with storage on RRAM/NVRAM/MRAM/EEPROM function more effectively as they drop the physical constrains like alignment. Actually, for the purpose to ramp up support of RRAM devices, we made them pretend to be flash and have paged layout, even though it does nothing for them and is only annoying for software above them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, let say my scratch area is 128 KiB and composed of two 64 KiB sectors. Imagine my flash backend is considering 128-KiB virtual sectors, so when trying to erase the scratch area backwards, the backend will return a single 128-KiB sector to boot_erase_region that we will erase with flash_area_erase. The backend doesn't know it has to erase the sectors backwards, so it is quite likely it will start by erasing the first 64 KiB sector of the scratch area. If a power cycle occurs at that moment, I guess we'd observe the same issue spotted by @nordicjm

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's different, because MCUboot would see it as a single 128KiB sector, so if we assume this is being used for swap status, swap status is going to be at the beginning of that in the first sector, MCUboot will not split up the sectors. Though having though about it I guess that maybe the patch here does only work because of the limited number of sectors, if there were a lot of sectors, so much that the trailer section spanned multiple sectors and power was lost during erasing them (backwards) then I assume it means that the swap would resume earlier than it should do, but also creating a corrupt image... might need a new device in simulator for that

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's different, because MCUboot would see it as a single 128KiB sector, so if we assume this is being used for swap status, swap status is going to be at the beginning of that in the first sector, MCUboot will not split up the sectors. Though having though about it I guess that maybe the patch here does only work because of the limited number of sectors, if there were a lot of sectors, so much that the trailer section spanned multiple sectors and power was lost during erasing them (backwards) then I assume it means that the swap would resume earlier than it should do, but also creating a corrupt image... might need a new device in simulator for that

Yeah, I was mentioning that problem with swap algorithms too, several times. If your log crosses to a next page it will be left there, in case of interruption, and once new swap starts you will end up in situation where next swap tries to overwrite existing log.
Solution would be to check on every boot, if trailer has erased magic or magic is somehow damaged, and try to erase the entire, presumed, log range. That check should happen on all boots.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, let say my scratch area is 128 KiB and composed of two 64 KiB sectors. Imagine my flash backend is considering 128-KiB virtual sectors, so when trying to erase the scratch area backwards, the backend will return a single 128-KiB sector to boot_erase_region that we will erase with flash_area_erase. The backend doesn't know it has to erase the sectors backwards, so it is quite likely it will start by erasing the first 64 KiB sector of the scratch area. If a power cycle occurs at that moment, I guess we'd observe the same issue spotted by @nordicjm

Well you are not sure about that. OK, so you may with that hardware, but if you have a device like SPI NOR, the driver may choose to erase data in bulk and select command it will send to device depending on range you are trying to erase. For example if you ask it to erase 4k, then it will send command to erase 4k, if you ask it to erase 8k, it will send two 4k commands, but if you ask it to erase 64k then it will erase it with one command. In all above cases, the spi nor will report page size as 4k, so you actually do not know how it does the erase in the end.
You may also have a modified driver that will jump out of erase every 4k to allow other software to function, so it is hard to relay on erase as atomic operation. And actually even the erase, as hardware does it, is not atomic, and may be interrupted leaving some data not erased.

Copy link
Contributor

@taltenbach taltenbach Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you are not sure about that. OK, so you may with that hardware, but if you have a device like SPI NOR, the driver may choose to erase data in bulk and select command it will send to device depending on range you are trying to erase. For example if you ask it to erase 4k, then it will send command to erase 4k, if you ask it to erase 8k, it will send two 4k commands, but if you ask it to erase 64k then it will erase it with one command. In all above cases, the spi nor will report page size as 4k, so you actually do not know how it does the erase in the end.
You may also have a modified driver that will jump out of erase every 4k to allow other software to function, so it is hard to relay on erase as atomic operation. And actually even the erase, as hardware does it, is not atomic, and may be interrupted leaving some data not erased.

You're absolutely right, but in that case does that mean swap-scratch is broken by design? Indeed, if my understanding of the bug described by @nordicjm is correct, it seems swap-scratch assumes the trailer of the scratch area is erased before the other data contained in the scratch area, which wouldn't be the case in the situations you're describing, even though boot_erase_region is erasing backwards.

And in fact, we might even have an issue for swap-move or swap-offset, the swap_scramble_trailer_sectors routine erases the trailer backwards to ensure we cannot have a case where, after a reboot during the erase of the trailer, the magic number (at the end of the trailer) is valid but the remaining of the trailer has already been erased and is therefore corrupted. Since we cannot trust the flash driver to erase the last part of our trailer first and since the erase is not atomic at hardware-level, I'm wondering if there are some cases where we would have an issue.

Comment on lines 1254 to 1257
goto done;
} else if (off >= flash_area_get_size(fa) || (flash_area_get_size(fa) - off) < size) {
rc = -1;
goto done;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to add the same check to boot_erase_region?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do, will do it in separate PR though since CI is very slow today

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Think mynewt might be broken with this PR...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Think mynewt might be broken with this PR...

What will break it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI was running for 4 hours on mynewt, I assume it just locked up in some loop, what broke it though I have no clue

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that keeps happening I will try to run this locally, should still have mynewet env available.

Copy link
Collaborator

@de-nordic de-nordic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment left, but generally looks ok.

@nordicjm nordicjm force-pushed the fixwatchdogfeed branch 2 times, most recently from 82707e0 to 24b3c97 Compare April 30, 2025 13:27
size_t csize;

/* Get current sector and, also, correct offset */
rc = flash_area_get_sector(fa, off, &sector);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nordicjm Something is wrong with this function from mynewt. For the second sector starting at offset 131072 it returns sector start offset as 0, which causes the line 624 to assign 0 to offset and the loop never ends.

nordicjm added 3 commits May 1, 2025 07:32
Refactors the erase functionality in bootutil so that it can be used
alongside feeding the watchdog. This has also optimised some
functions out.

Signed-off-by: Jamie McCrae <[email protected]>
Fixes an issue with the swap using scratch algorithm that would
cause the image loaded into the primary slot to be corrupt and
unbootable if a device was rebooted during an erase of the scratch
section that had not completed

Signed-off-by: Jamie McCrae <[email protected]>
Adds some additional fixes that have been fixed for 2.2.0

Signed-off-by: Jamie McCrae <[email protected]>
@nordicjm nordicjm force-pushed the fixwatchdogfeed branch from f7fb537 to b6e9ce7 Compare May 1, 2025 06:32
@d3zd3z
Copy link
Member

d3zd3z commented May 1, 2025

I did have one minor concern about how helpful this will be if we add support to be able to lie about having a larger erase size than the underlying device. This would then do the erase forward, which might not fix the issue this fixes.

@nordicjm nordicjm merged commit 990b1fc into mcu-tools:main May 2, 2025
58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: core Affects core functionality bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants