Skip to content

Mutex is not safe on multi-core systems #12

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #2
adamgreig opened this issue Oct 16, 2018 · 24 comments
Closed
Tracked by #2

Mutex is not safe on multi-core systems #12

adamgreig opened this issue Oct 16, 2018 · 24 comments

Comments

@adamgreig
Copy link
Member

On a multi-core system, disabling interrupts does not prevent the other core from operating, and so values protected by a bare_metal::Mutex will be incorrectly marked Sync.

Since the overwhelming majority of embedded use cases are single-core, I propose putting a prominent warning in the Mutex docstring for now, and working to develop a safe multi-core extension to the Mutex which can be enabled with a feature gate. Probably something using an atomic to implement a spinlock on top of requiring a CriticalSection.

bors bot added a commit that referenced this issue Oct 30, 2018
13: Add note about mutex unsafety on multi-core systems r=japaric a=adamgreig

See #12.

Co-authored-by: Adam Greig <[email protected]>
@japaric
Copy link
Member

japaric commented Jan 23, 2019

Probably something using an atomic to implement a spinlock on top of requiring a CriticalSection.

Spinlocks require a CAS operation so it's not possible to provide this on ARMv6-M.

I don't really see a Cargo feature as an option. Enabling a Cargo feature should not break code so neither of these are valid / proper uses:

  • Remove Sync impl when the "mult-core" feature is enabled (breaks code)
  • Add Sync impl when the "multi-core" feature is enabled and have it as a default feature. Theoretically OK but in practice it will be impossible to turn off the feature (also it would break code), so this would split the ecosystem in two: those who enable "multi-core" and those who do not.
  • Have Mutex.borrow fallback to a spinlock when "multi-core" is enabled. Doesn't work on ARMv6-M so enabling it would break compilation for that target.

My proposal would be to remove the Sync impl from Mutex and since that's required for its only use case; we might as well remove the Mutex abstraction.

@adamgreig
Copy link
Member Author

I agree that removing Mutex entirely is the best option but I think we should wait until rust-embedded/wg#294 is at least somewhat resolved.

@eddyp
Copy link

eddyp commented Jan 23, 2019

Since the overwhelming majority of embedded use cases are single-core, I propose putting a prominent warning in the Mutex docstring for now, and working to develop a safe multi-core extension to the Mutex which can be enabled with a feature gate.

Since this API is still not stable, how about renaming the current Mutex<> to something which makes it clear is safe only for single-core?

As you said, multi-core systems are a minority, so depending on the app, this could be enough.

Spinlocks require a CAS operation so it's not possible to provide this on ARMv6-M.

If we have some APIs which have some clear safety boundary, we can have different implementations on different systems, or some might even be missing, I would consider that acceptable.

@jonas-schievink jonas-schievink mentioned this issue Jan 9, 2020
7 tasks
@rubberduck203
Copy link
Contributor

I’ve been doing a bit of research into this topic this morning. Generally speaking, in order to achieve multi-core synchronization, hardware support is required, but perhaps one of these software implementations is feasible?

I am fairly convinced that an implementation of Mutex does not belong in this hardware agnostic crate, but perhaps having a trait here to be implemented in hw specific crates might make sense. However, in that case, I would argue that a Mutex trait belongs in the embedded-hal crate instead.

@jonas-schievink
Copy link
Contributor

jonas-schievink commented Jan 30, 2020

perhaps one of these software implementations is feasible?

Yes, something like a spinlock can be implemented without CAS if you know the number of competing parties beforehand. That's what I did with irq::PriorityLock, for example.

The drawback is that spinlocks can easily lead to deadlock when used across interrupts (PriorityLock addresses that by providing different APIs based on the priority level, but I'm not yet sure if that's the way to go). bare_metal::Mutex cannot deadlock.

However, in that case, I would argue that a Mutex trait belongs in the embedded-hal crate instead.

Agreed. Right now we have one in mutex-trait, but IMO it should go into the embedded HAL (perhaps after we have some experience with it). (maybe even in bare-metal)

@rubberduck203
Copy link
Contributor

After further research on the software implementations, all of them do require a memory barrier. I checked the Cortex-M0, which does not have hardware muted support, and it does have a memory barrier instruction. I don’t know if that’s something we can expect from every mcu though.

bare_metal::Mutex cannot deadlock

While that is a phenomenal property of a mutex, it’s not one I expect to be there. In general, mutexes do not guarantee deadlocks can’t happen and neither does Rust in general. The guarantee is that race conditions can not occur.

I was unaware of the mutex-trait. I have to say that I agree that it should be released as a 0.1 so implementations of it can be created and tested.

rust-embedded/wg#395

That still leaves the question of what to do with the mutex in this crate. I’m a bit concerned about the impact of removing it before another alternative is widely available. Even once another is available, the books will need to be updated accordingly.

@therealprof
Copy link
Contributor

After further research on the software implementations, all of them do require a memory barrier. I checked the Cortex-M0, which does not have hardware muted support, and it does have a memory barrier instruction. I don’t know if that’s something we can expect from every mcu though.

Every MCU in scope of embedded Rust has a memory barrier implementation. I haven't checked those implementations but I'd be very surprised if you would not need a CAS. Typically CAS free algorithms assume the absence of hardware interrupts.

While that is a phenomenal property of a mutex, it’s not one I expect to be there. In general, mutexes do not guarantee deadlocks can’t happen and neither does Rust in general. The guarantee is that race conditions can not occur.

The problem is that in the presence of interrupt handlers deadlocks through to use of e.g. spinlocks are much more likely and cannot be prevented or compile-time checked. This is in stark contrast to a regular operating system, so it is a somewhat important property.

@rubberduck203
Copy link
Contributor

Every MCU in scope of embedded Rust has a memory barrier implementation.

I’m not sure what this means. Isn’t “every MCU in scope of embedded Rust” simply every MCU? Just because there isn’t support now doesn’t mean there won’t be support in the future.

I haven't checked those implementations but I'd be very surprised if you would not need a CAS. Typically CAS free algorithms assume the absence of hardware interrupts.

It may be worthwhile to research. The first software implementation listed, Dekker’s Algoritm, indicates that a spin lock can be implemented without even a test-and-set instruction, let alone a compare-and-swap. That algo has some serious limitations (it only works for 2 processes), but it does seem that it’s worthwhile looking into how a software implementation may be provided as a fallback for MCUs that don’t have mutex primitive instructions. Much like the irq::PriorityLock that was mentioned.

But I digress. There seems to be a fair bit of agreement that Mutex does not belong in this crate. So the question, as I see it, is should it be removed, what’s the impact downstream, and how do we proceed?

@therealprof
Copy link
Contributor

I’m not sure what this means. Isn’t “every MCU in scope of embedded Rust” simply every MCU? Just because there isn’t support now doesn’t mean there won’t be support in the future.

It means exactly what I said: Every currently supported MCU can do memory barriers. There may be some which are problematic in that respect but I don't know which ones. I have my doubts those can be supported in Rust same as I have my doubts some will be supported even if technically possible but you're right that this is speculation.

But I digress. There seems to be a fair bit of agreement that Mutex does not belong in this crate. So the question, as I see it, is should it be removed, what’s the impact downstream, and how do we proceed?

We cannot remove it unless we have an established and working and supported replacement. This Mutex is used pretty much everywhere.

@rubberduck203
Copy link
Contributor

Let's not be too theatrical here.
Yes, there are 185 crates dependent on bare-metal, however very few are actually using Mutex.

https://github.com/search?l=Rust&q=bare_metal%3A%3AMutex&type=Code

Out of the 44 repositories returned in the search above, very few are actually using Mutex.
Many more are using CriticalSection actually.
The breakage here looks pretty minimal in reality.
Of course, this doesn't account for any proprietary usages of the API, but anyone using a pre-1.0 ecosystem in production knows what they signed up for.

I'd also like to point out that I don't think anyone here is talking about outright deleting this Mutex implementation. I admit that "remove" was a poor choice of words on my part. "Remove" is in context of "remove it from this crate". I would expect that this "good enough for many single core use cases" implementation would move to it's own crate (critical-section-mutex?) so anyone who is using it could easily continue doing so.

@therealprof
Copy link
Contributor

Let's not be too theatrical here.
Yes, there are 185 crates dependent on bare-metal, however very few are actually using Mutex.

If you're going to argue with me about essentials and call me theatrical please get at least your data straight: Mutex is re-exported via the cortex-m crate (and possibly other foundational crates for different architectures, too) and mostly gets used from there. It is used all over the map so whatever will be done needs to ensure that we're not breaking the whole ecosystem at once.

@rubberduck203
Copy link
Contributor

There's no reason to be upset. Let's just take a breath here.
I was only trying to come at this with data rather than vague statements.

You're absolutely correct. There are significantly more usages of cortex_m::Interrupt::Mutex.
https://github.com/search?l=Rust&p=1&q=cortex_m%3A%3AInterrupt%3A%3AMutex&type=Code

This could still be easily handled by creating a new crate with the critical section mutex that, with the exception of you @therealprof, people don't seem to believe belongs in this crate.
Once released, cortex-m could reference and re-export it transparently to all of those users.
The only people we break are the people using bare_metal::Mutex directly.

Of course, none of this solves the fact that this Mutex is not safe on multi-core systems, but there also seems to be some consensus that it's dubious at best to think that a reasonable multi-core safe Mutex can be implemented without hardware support.

@therealprof
Copy link
Contributor

This could still be easily handled by creating a new crate with the critical section mutex that, with the exception of you @therealprof, people don't seem to believe belongs in this crate.

Only a small fraction of people actually chimed in here, so it's a bit early to make such statements.

Indeed I don't have any issues with the Mutex being here but I don't have any problems with moving it either. I'm just pointing out the obvious that any change to a foundational crate like this needs to be planned and executed with extreme care.

We still don't have the ability to do something like a crater run to ensure that we're not accidentally causing major damage to the ecosystem, so I'd rather we treat with extreme caution.

Of course, none of this solves the fact that this Mutex is not safe on multi-core systems, but there also seems to be some consensus that it's dubious at best to think that a reasonable multi-core safe Mutex can be implemented without hardware support.

Indeed.

@rubberduck203
Copy link
Contributor

@therealprof I was looking into this again this morning.
It is not Mutex that is unsound for multiple cores.
It's using cortex_m::interrupt::free to provide the CriticalSection that is unsound on multi-core.

It is possible to implement a different mechanism to provide the CritcialSection that would be sound.

https://gist.github.com/rubberduck203/20415cb0bdc0726b2ebf0903e7193665

@rubberduck203
Copy link
Contributor

Just to be clear, the lock I linked to isn’t sound either, it should use a compare and swap, not an exchange, but is just to prove out that sound methods of providing a lock for the existing mutex can be implemented.

@therealprof
Copy link
Contributor

Yeah, this has been discussed back and forth.

Problem is: spinlocks are not ideal either for other reasons and also this implementation will not work on e.g. all Cortex-M0 and M0+ because they don't have CAS instructions so it's not an universally applicable approach.

@rubberduck203
Copy link
Contributor

I think that’s the point. There is no universal approach, but the existing Mutex does allow for individualized approaches that will. It’s maybe not an ideal API, but that’s another matter.

Since the Mutex isn’t the soundness problem, should this issue be closed in preference of the other mutex discussions happening?

@jonas-schievink
Copy link
Contributor

There seems to be some confusion about what CriticalSection and Mutex provide here.

  • CriticalSection is just a token that guarantees, for the duration of its existence, that the current core is in a critical section (ie. has any interrupts that could preempt execution disabled). The contract of this type means that any safe code that constructs a CriticalSection without disabling interrupts is unsound.
  • Mutex then takes this no-interrupts guarantee to provide mutual exclusion. This is sound if we adopt Address Multi-Core Soundness by abolishing Send and Sync wg#419, but not at the current state. It's sound in any case if the data is only accessible from a single core.

I'll improve the docs of CriticalSection to make its contract clearer.

@rubberduck203
Copy link
Contributor

That’s a good idea @jonas-schievink. It took me quite a minute to completely grok how the two interact, and the guarantee that CriticalSection must provide. Would you mind explaining to me why Mutex itself is unsound on multi-core though? I’m not understanding why a CriticalSection couldn’t be provided to it from a global monitor, as described in the ARM synchronization primitives paper.

@eddyp
Copy link

eddyp commented Feb 15, 2020

perhaps having a trait here to be implemented in hw specific crates might make sense. However, in that case, I would argue that a Mutex trait belongs in the embedded-hal crate instead.

I am of the opinion that we're still thinking in terms of ARM cores, or even ARM Cortex M cores.

On SoC with hybrid cores there could be a mix of Cortex A and Cortex M cores, or even non-ARM cores such as RISC-V, so I expect there must be a HW peripheral that could properly implement the synchronization across cores, so, just as I suggested in my 2019 Oxidize Conf presentation (https://www.youtube.com/watch?v=IKXrNlXXfL4#t=29m11s), a trait for such HW-enabled mechanisms is desirable.

@jonas-schievink
Copy link
Contributor

That would be by mutex-trait and custom cross-core Send/Sync traits that do not yet exist.

@jonas-schievink
Copy link
Contributor

cross-core Send/Sync traits that do not yet exist.

Actually you might not even need this if the cores have their own peripherals and a shared mutex peripheral.

@rubberduck203
Copy link
Contributor

I’d like to be clear, I referenced the ARM paper, but the problem & solution are the same for any platform.

@adamgreig
Copy link
Member Author

This issue was effectively closed by rust-embedded/wg#419; the Mutex in bare-metal is only considered sound on single-core systems and some other abstraction will be required for multi-core systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants