cpu/powerpc: implement reservation for lwarx and stwcx instruction pair#15082

Draft
gm-matthew wants to merge 2 commits into mamedev:master from gm-matthew:powerpc

Conversation

@gm-matthew
Contributor

PowerPC has the instructions lwarx and stwcx which are used to perform atomic update (load-modify-store) operations. lwarx loads a word and creates a reservation for a specific memory block; any write to this block will cause the reservation to be lost. stwcx checks if the reservation is still valid, and if it is, it writes the updated memory value, clears the reservation and sets the EQ bit of cr0 to signify success. If it is not valid, it clears the EQ bit to signify failure and no write is performed.
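
For readers unfamiliar with the pair, the usual usage pattern is a retry loop. The following is a minimal sketch of that pattern against a toy single-CPU model; `toy_ppc`, its members and `atomic_add` are hypothetical names for illustration and are not code from this PR or from MAME.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy single-CPU model of the lwarx/stwcx. pair (illustrative names only).
struct toy_ppc
{
    std::vector<uint32_t> ram = std::vector<uint32_t>(64, 0); // word-indexed RAM
    bool reserved = false;                                    // does a reservation exist?

    // lwarx: load a word and create a reservation.
    uint32_t lwarx(uint32_t word)
    {
        assert(word < ram.size());
        reserved = true;
        return ram[word];
    }

    // stwcx.: store only if a reservation exists.
    // The return value stands in for the EQ bit of cr0.
    bool stwcx(uint32_t word, uint32_t value)
    {
        assert(word < ram.size());
        bool const ok = reserved;
        if (ok)
            ram[word] = value;
        reserved = false; // the reservation is cleared either way
        return ok;
    }
};

// Atomic add: reload and retry until the conditional store succeeds.
inline uint32_t atomic_add(toy_ppc &cpu, uint32_t word, uint32_t delta)
{
    for (;;)
    {
        uint32_t const updated = cpu.lwarx(word) + delta;
        if (cpu.stwcx(word, updated))
            return updated;
    }
}
```

In this toy nothing ever cancels the reservation between the two instructions, so the loop always succeeds on the first pass; the interesting cases are the ones where something does.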

The size of the reserved memory block is implementation-dependent, generally matching the cache line size. For many early PowerPC models, including most of those emulated in MAME, the reservation size is 32 bytes so that is what I have implemented.
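
As a small illustration of what a 32-byte reservation size means for address comparison, here is a sketch of the masking involved. The function names are made up for this example, and the block size is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Round an address down to the start of its reservation block.
// block_size is assumed to be a power of two (32 for the parts discussed here).
inline uint32_t reservation_block(uint32_t addr, uint32_t block_size = 32)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return addr & ~(block_size - 1);
}

// Two addresses conflict if they fall within the same reservation block.
inline bool same_block(uint32_t a, uint32_t b, uint32_t block_size = 32)
{
    return reservation_block(a, block_size) == reservation_block(b, block_size);
}
```

Passing a different `block_size` (for example 64) is all that would change for a CPU with a larger reservation size.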

This commit fixes a bug in several Model 3 games that causes them to freeze as a result of the previous incorrect implementation of the lwarx and stwcx instructions.

@angelosa
Member

This commit fixes a bug in several Model 3 games that causes them to freeze as a result of the previous incorrect implementation of the lwarx and stwcx instructions.

I guess: fvipers2, vs298 and swtrilgy going in service mode?

@cuavas
Member

cuavas commented Mar 11, 2026

Isn’t this flawed because you’re operating on logical addresses? These operations are used when multiple devices have access to the same memory. Other devices won’t be aware of the CPU’s address translation setup.

Since the main purpose of this pair of instructions is to implement atomic operations in a system where multiple devices can access the same memory, how do you plan to support this? I imagine this would be important for fixing seemingly random issues in systems with multiple CPUs and/or bus mastering PCI devices.

Also, there should be a way for implementation classes to set the cache line size.

@gm-matthew
Contributor Author

I guess: fvipers2, vs298 and swtrilgy going in service mode?

scud would hang when starting a race and vf3 would hang during attract mode (if 3D rendering is disabled to prevent MAME crashing); both of these no longer occur. I haven't tested the above cases yet but it is plausible that these may have been caused by the same issue.

Isn’t this flawed because you’re operating on logical addresses? These operations are used when multiple devices have access to the same memory. Other devices won’t be aware of the CPU’s address translation setup.

Since the main purpose of this pair of instructions is to implement atomic operations in a system where multiple devices can access the same memory, how do you plan to support this? I imagine this would be important for fixing seemingly random issues in systems with multiple CPUs and/or bus mastering PCI devices.

At the moment this commit works in the event of an interrupt occurring between the lwarx and stwcx instructions where the interrupt handler modifies the reserved memory block, even if it does not use the lwarx/stwcx pair to do so. Monitoring the block for all devices would be a much more complex solution and I am not sure how to implement it properly.

There are two alternative approaches I can think of:

  1. Save the memory block to be reserved to an internal buffer when executing lwarx, and read it back when executing stwcx to ensure that it has not been modified. If it has, we invalidate the reservation. While this doesn't guarantee that no other devices will have written to the memory block, it does guarantee that the contents will not have changed.
  2. Automatically invalidate the reservation when finishing a timeslice. This guarantees that no other device can touch the reserved memory block while the reservation is valid, though it could theoretically cause an infinite loop if the timeslice is too small and the gap between the lwarx and stwcx instructions is too large.
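
To make the first alternative concrete, here is a rough sketch of a snapshot-and-compare reservation over a flat byte array. Everything in it (`snapshot_reservation`, the host-endian word access, the fixed 32-byte block) is an assumption made for illustration rather than anything from the PowerPC core.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Alternative 1 as a toy: copy the block at lwarx, compare it at stwcx.
class snapshot_reservation
{
public:
    static constexpr std::size_t BLOCK = 32;

    explicit snapshot_reservation(std::vector<uint8_t> &ram) : m_ram(ram) { }

    uint32_t lwarx(uint32_t addr)
    {
        m_base = addr & ~uint32_t(BLOCK - 1);
        assert(m_base + BLOCK <= m_ram.size());
        std::memcpy(m_copy.data(), &m_ram[m_base], BLOCK); // snapshot the block
        m_valid = true;
        return read32(addr);
    }

    // Returns true (EQ set) only if the block still matches the snapshot.
    bool stwcx(uint32_t addr, uint32_t value)
    {
        bool const ok = m_valid && std::memcmp(m_copy.data(), &m_ram[m_base], BLOCK) == 0;
        if (ok)
            write32(addr, value);
        m_valid = false;
        return ok;
    }

private:
    // Host-endian accessors, good enough for a sketch.
    uint32_t read32(uint32_t addr) const { uint32_t v; std::memcpy(&v, &m_ram[addr], 4); return v; }
    void write32(uint32_t addr, uint32_t v) { std::memcpy(&m_ram[addr], &v, 4); }

    std::vector<uint8_t> &m_ram;
    std::array<uint8_t, BLOCK> m_copy{};
    uint32_t m_base = 0;
    bool m_valid = false;
};
```

The known weakness is visible in the code: a writer that stores the same bytes back, or changes and then restores them, goes undetected.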

Also, there should be a way for implementation classes to set the cache line size.

This is probably a good idea and should not be too difficult as long as we can find out the cache block size (not cache line, my mistake) for each PowerPC device currently emulated.

@MooglyGuy
Contributor

Out of curiosity, isn't listening for modifications to a particular address-space range precisely what memory taps are useful for? https://docs.mamedev.org/techspecs/memory.html#taps

Install a tap upon lwarx to listen for any modifications to the relevant block. Remove the tap upon stwcx and/or when a modification occurs.

@cuavas
Member

cuavas commented Mar 11, 2026

Out of curiosity, isn't listening for modifications to a particular address-space range precisely what memory taps are useful for? https://docs.mamedev.org/techspecs/memory.html#taps

Install a tap upon lwarx to listen for any modifications to the relevant block. Remove the tap upon stwcx and/or when a modification occurs.

That won’t work, because in MAME the other CPU will be accessing the memory through its own address space, not your address space (e.g. the same RAM installed in both CPUs’ address spaces with .share(...)). The “fastram” optimisation also means accesses to shared RAM likely aren’t going through address spaces at all.

@mamehaze
Contributor

fvipers2 and vs298 can enter test mode, swtrilgy can't, but that appears to be the case even without the patch, this patch changes nothing in that regard.

neither exit test mode cleanly

@cuavas
Member

cuavas commented Mar 11, 2026

Oh, and BTW there is only a stwcx. instruction, there’s no stwcx instruction, which would be a variant that didn’t affect condition codes.

@angelosa
Member

fvipers2 and vs298 can enter test mode, swtrilgy can't, but that appears to be the case even without the patch, this patch changes nothing in that regard.

neither exit test mode cleanly

Sorry, I meant swtrilgy doesn't enter test mode, and fvipers2 and vs298 crash during attract mode.

@mamehaze
Contributor

those are unchanged by this PR, fvipers2 crashes MAME, vs298 shows an onscreen error.

@galibert
Member

That won’t work, because in MAME the other CPU will be accessing the memory through its own address space, not your address space (e.g. the same RAM installed in both CPUs’ address spaces with .share(...)). The “fastram” optimisation also means accesses to shared RAM likely aren’t going through address spaces at all.

In the model3 and the powermac cases I don't think there's any shared RAM with other CPUs or devices; in practice it's more a protection against interrupts, including preemption. So in that specific case a tap would do quite OK.

@rb6502
Contributor

rb6502 commented Mar 11, 2026

FWIW, this is a slightly more strict version of what both Supermodel and DingusPPC do right now, except Dingus does at least translate the address to physical. (Neither of them mask the cache line size off the address, but that's apparently sufficient to boot OS X). So I wouldn't be mad about making this do the logical to physical translation, adding some sufficiently scary comments, and checking this in more or less as-is as long as there's a nice TODO.

The rough sketch in my head of doing this correctly and MAME-idiomatically is to have a device that maintains a list of active reserved physical memory blocks using your favorite fast-lookup-on-random-key data structure. The device is configured with the block size and masks incoming addresses accordingly. The API should just be reserve a block, remove the reservation on a block, and check if a block is reserved.

CPUs that have multiprocessor bus primitives (68K, PPC, and i960 off the top of my head, but there are likely more) would have a configurable device_finder for this lookup device. In this way, the driver can decide which CPU(s) share a physical address space in multiprocessor systems (BeBox and some later PowerMacs).
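
The three-call API described above could look roughly like the following. This is only a sketch of the data structure: `reservation_tracker` is a hypothetical name, it is a plain class rather than a MAME device, and the `device_finder` wiring is left out entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical tracker for reserved physical blocks (not an existing MAME device).
class reservation_tracker
{
public:
    // block_size is assumed to be a non-zero power of two.
    explicit reservation_tracker(uint32_t block_size) : m_mask(~(block_size - 1))
    {
        assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    }

    // Incoming physical addresses are masked down to their block.
    void reserve(uint32_t phys)           { m_blocks.insert(phys & m_mask); }
    void release(uint32_t phys)           { m_blocks.erase(phys & m_mask); }
    bool is_reserved(uint32_t phys) const { return m_blocks.count(phys & m_mask) != 0; }

private:
    uint32_t m_mask;
    std::unordered_set<uint32_t> m_blocks; // one entry per reserved block
};
```

A plain set cannot tell which CPU owns a reservation, so two CPUs reserving the same block would share one entry; a real version would presumably need to key on the owner as well.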

@cuavas
Member

cuavas commented Mar 12, 2026

(Disclaimer: my PowerPCjutsu isn’t as sharp as it once was, due to atrophy from disuse, but I’m still pretty confident on most of this.)

The reservation granularity for lwarx/stwcx. is 32 bytes (eight words) for the 600, 750 and 7400 series, and 64 bytes (16 words) for the 970 and POWER5 series. I don’t know what it is for other families off the top of my head.

The CPUs support a single reservation, which operates on physical addressing (i.e. the address after address translation).

A misaligned lwarx/stwcx. always causes an exception, even if the access fits within a bus access (not on a four-byte boundary, but not crossing an eight-byte boundary when the CPU is configured for a 64-bit data bus).

Using lwarx/stwcx. on an address within a caching-inhibited or direct-store page/block causes a DSI exception. (Note that this does not include write-through pages/blocks – lwarx/stwcx. is permitted for write-through locations.)

stwcx. only checks that the effective address is valid (aligned and not within a cache-inhibited or direct-store page/block) and that a reservation exists. It does not check that the effective address is within the reserved block. It will succeed for locations outside the reserved block.

The reservation is cancelled when any of the following occur:

  • A stwcx. instruction is executed with any effective address.
  • Some other device attempts to modify a location within the reservation granularity.
  • The reserved block is evicted, whether due to cache pressure, a forced cache purge, or an attribute change that marks the page/block as cache-inhibited (IIRC, I’d have to confirm this to be 100% sure).

Well-behaved interrupt handlers and process switching code will explicitly clear any reservation before returning (usually by executing a stwcx. instruction to some dummy location).

See, for example, the recommendations in the 603 user manual section 4.4:

The operating system should execute one of the following when processes are switched:

  • The sync instruction, which orders the effects of instruction execution. All instructions previously initiated appear to have completed before the sync instruction completes, and no subsequent instructions appear to be initiated until the sync instruction completes. For an example showing use of the sync instruction, see Chapter 2, “Register Set,” of the Programming Environments Manual.
  • The isync instruction, which waits for all previous instructions to complete and then discards any fetched instructions, causing subsequent instructions to be fetched (or refetched) from memory and to execute in the context (privilege, translation, protection, etc.) established by the previous instructions.
  • The stwcx. instruction, to clear any outstanding reservations, which ensures that an lwarx instruction in the old process is not paired with an stwcx. instruction in the new process.

The operating system should set the MSR[RI] bit as described in Section 4.2.4, “Setting MSR[RI].”

Or footnote 2 of this blog post by Raymond Chen:

Interrupts and traps do not clear the reservation. This means that if the operating system wants to perform a context switch, it needs to perform a stwcx. to a harmless location to force the reservation to be cleared. Otherwise, the thread being switched to might be in the middle of an atomic operation, and its stwcx. might succeed based on the previous thread’s reservation! This is a rare case where you will intentionally perform a stwcx. to an address that doesn’t match the preceding lwarx.

(Note that that blog post incorrectly states that, “If you attempt to store back to a location different from the most recent preceding lwarx, and the reservation is still valid, the store might or might not succeed, and the eq bit will be unpredictable; it need not reflect the actual success of the store. So don’t do that.” It’s documented that it will succeed for the 600 series.)

Simply writing to another location within the reservation granularity from the same CPU will not cancel the reservation. If it did, it would be impossible to write future-proof code, as future architecture implementations can increase the size of the reservation granularity (and have done so).
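
The rules in this comment can be condensed into a small toy model, shown here as a sketch under a few assumptions: a 32-byte granularity, physical addresses only, cache eviction not modelled, and an `assert` standing in for the alignment exception. The type and member names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model of a single reservation on physical addresses.
struct reservation_model
{
    static constexpr uint32_t GRANULE = 32;

    std::map<uint32_t, uint32_t> mem; // sparse memory keyed by word address
    bool reserved = false;
    uint32_t reserved_block = 0;

    static uint32_t block_of(uint32_t pa) { return pa & ~(GRANULE - 1); }

    uint32_t lwarx(uint32_t pa)
    {
        assert((pa & 3) == 0); // real hardware raises an exception instead
        reserved = true;
        reserved_block = block_of(pa);
        return mem[pa];
    }

    // Succeeds whenever a reservation exists, even outside the reserved
    // block, and always clears the reservation.
    bool stwcx(uint32_t pa, uint32_t value)
    {
        assert((pa & 3) == 0);
        bool const ok = reserved;
        if (ok)
            mem[pa] = value;
        reserved = false;
        return ok;
    }

    // Ordinary store from the same CPU: the reservation is left alone.
    void store(uint32_t pa, uint32_t value) { mem[pa] = value; }

    // Write from another bus master: cancels the reservation if it
    // lands inside the reserved block.
    void external_write(uint32_t pa, uint32_t value)
    {
        mem[pa] = value;
        if (reserved && block_of(pa) == reserved_block)
            reserved = false;
    }
};
```

Note that `stwcx` never compares its address against `reserved_block`; only `external_write` does.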

@gm-matthew
Contributor Author

gm-matthew commented Mar 13, 2026

I decided to whip out my PowerBook 1400cs so I could test the lwarx and stwcx. instructions on an actual PowerPC, and I discovered that writing to a memory address previously reserved by lwarx does not cancel the reservation (unless stwcx. is used); this means that my current commit is incorrect.

Are there any drivers in MAME using a PowerPC that have other devices that share RAM with the PowerPC?

@cuavas
Member

cuavas commented Mar 13, 2026

konami/konamim2.cpp is a dual PPC602 SMP setup.

apple/macpdm.cpp supports bus mastering NuBus cards with a PPC601 in principle.

konami/cobra.cpp has a grab bag of PowerPC CPUs (PPC603e main CPU, PPC403 system controller, PPC604 T&L processor) with plenty of DMA capabilities, but emulation is very incomplete.

be/bebox.cpp is a dual PPC603 SMP setup, but I don’t know how complete emulation is.

@angelosa
Member

BeBox is very incomplete, aside from using legacy PCI & SCSI. For starters, it doesn't hook up the ISA bus and the Super I/O, which means it can't even output to the debug port. Also, as far as I know none of the boot CDs actually boot. I'm not sure whether it was supposed to do more, but to be honest it's in a kind of limbo where rewriting it from scratch is probably for the best (note: it reuses the MPC105, as Sega Model 3 does).
https://www.netbsd.org/ports/bebox/hardware.html

…t reserved it does not cancel the reservation
@gm-matthew
Contributor Author

I've made it so that writing to a reserved block from the same CPU that reserved it no longer cancels the reservation, as per my test on an actual PowerPC. The m_core->reserve_address value is currently unused so we should either find a way to use it effectively or delete it.

Out of curiosity I looked at the code for Xenia, which has to emulate multiple PowerPC cores. Its approach for lwarx and stwcx. is to use a global lock, which I presume means no other threads can run until the lock is released. In the event that a global lock is not available, it also uses an atomic compare-and-swap so that stwcx. only updates the reserved value if it still holds the value that lwarx loaded, i.e. it has not been modified.
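
As a sketch of the compare-and-swap idea in general (this is not Xenia's code, and `cas_reservation` is a made-up name), the value seen by lwarx can be remembered and used as the expected value of a `std::atomic` compare-exchange:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Value-based reservation: stwcx. succeeds only if the word still holds
// the value that lwarx loaded.
struct cas_reservation
{
    bool valid = false;
    uint32_t expected = 0;

    uint32_t lwarx(std::atomic<uint32_t> &word)
    {
        expected = word.load();
        valid = true;
        return expected;
    }

    // Returns true (EQ set) if the store happened.
    bool stwcx(std::atomic<uint32_t> &word, uint32_t value)
    {
        // compare_exchange_strong writes 'value' only if 'word' still
        // equals 'expected'; on failure it overwrites 'expected'.
        bool const ok = valid && word.compare_exchange_strong(expected, value);
        valid = false;
        return ok;
    }
};
```

Like any value comparison, this cannot detect a word that was changed and then changed back between the two instructions.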

I think attempting to implement it so that any other device writing to a reserved memory block cancels the reservation would be a lot of effort for not much benefit; it would be much easier to either force the PowerPC to keep executing until the reservation is cleared (thus ensuring that nothing else can touch the reserved block), or make it so that stwcx. reloads the reserved value to confirm that it has not been modified before writing the updated value. But that's just my opinion and I shall leave it to the other devs to decide on the best solution.

@cuavas
Member

cuavas commented Mar 16, 2026

Look, I don’t think this implementation can work:

  • You’re still setting the reservation address before address translation, which is incorrect.
  • You’re setting the “reservation exists” flag before address translation, which is also incorrect, as an exception during address translation will result in the reservation not being created.
  • You can’t just call the m_read32align handler, because you need to hook into address translation to check whether the address is in a direct-store or cache-inhibited area and raise a DSI exception.
  • Does m_read32align check alignment? I thought it just masked off the low bits. A misaligned lwarx always raises an alignment exception.

It’s similar for stwcx. – you need to hook into address translation to generate DSI exceptions properly, and it always needs an alignment check.
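
The ordering being asked for can be sketched as follows: check alignment, translate, check the storage attributes, and only then record the reservation against the physical address. The translation function here is a stand-in with an invented memory map, the relative priority of the alignment and DSI checks is an assumption, and leaving an earlier reservation untouched on a fault is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

enum class fault { none, alignment, dsi };

struct translation
{
    bool mapped;          // false: translation itself faulted
    bool cache_inhibited; // true: lwarx is not permitted here
    uint32_t phys;
};

// Stand-in translation with an invented map: 0xf0000000 and up is
// unmapped, 0x80000000 and up is cache-inhibited, and the physical
// address is the low 28 bits of the effective address.
inline translation toy_translate(uint32_t ea)
{
    if (ea >= 0xf0000000u)
        return { false, false, 0 };
    return { true, ea >= 0x80000000u, ea & 0x0fffffffu };
}

struct lwarx_state
{
    bool reserved = false;
    uint32_t reserve_phys = 0;
};

// The reservation is created only after every check has passed.
inline fault do_lwarx(lwarx_state &s, uint32_t ea)
{
    if (ea & 3)
        return fault::alignment;       // misaligned: no reservation created
    translation const x = toy_translate(ea);
    if (!x.mapped || x.cache_inhibited)
        return fault::dsi;             // faulting access: no reservation created
    s.reserve_phys = x.phys & ~31u;    // 32-byte block, physical address
    s.reserved = true;
    return fault::none;
}
```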

@gm-matthew gm-matthew marked this pull request as draft March 16, 2026 16:44
@galibert
Member

Something is unclear to me. The CPU does not snoop every write to the address, and only another stwcx. can invalidate the reservation?

@gm-matthew
Contributor Author

Something is unclear to me. The CPU does not snoop every write to the address, and only another stwcx. can invalidate the reservation?

If a particular CPU uses lwarx to create a reservation, using the same CPU to write to the same address using an instruction other than stwcx., for example stwx, does not invalidate the reservation.

@galibert
Member

And if another cpu or a dma does it?

@gm-matthew
Contributor Author

And if another cpu or a dma does it?

Then the cache block containing the reserved value is invalidated, resulting in the reservation also being invalidated.

@gm-matthew
Contributor Author

  • Does m_read32align check alignment? I thought it just masked off the low bits. A misaligned lwarx always raises an alignment exception.

Yes, m_read32align (and likewise m_write32align) raises an alignment exception if the address is not aligned.

I'm currently working on moving the logic over to the static_generate_memory_accessor function, adding an additional parameter isreserve that will be set to true for lwarx and stwcx. and false for all other memory access instructions. This will make it easier to access the translated physical address to use for the reservation.

7 participants