Nintendo
Why Nintendo's Satoru Iwata refuses to lay off staff - https://www.polygon.com/2013/7/5/4496512/why-nintendos-satoru-iwata-refuses-to-lay-off-staff
...
Nintendo Will Pay Its Workers 10% More - https://www.gamespot.com/articles/nintendo-will-pay-its-workers-10-more/1100-6511268/ - The move is meant to invest in the workforce and address inflation.
== A one in a million bug in Switch kernel ==
Nintendo Switch firmware 14.0.0 was released yesterday. It contained many minor
changes to the kernel. One of them is that user-mode cache operations
(flush / clean / zero) now set a secret byte in thread-local storage (TLS)
to 1 while they run.
If an interrupt is received, the kernel reads that user-mode byte from TLS and,
if it's equal to 1, performs a memory barrier.
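The handshake described above can be sketched as a toy model. This is only an illustration in Python, not Nintendo's code: the names `Thread`, `Core`, `user_cache_op_begin`, `user_cache_op_end` and `kernel_on_interrupt` are hypothetical, and clearing the byte back to 0 once the cache operation finishes is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    # Hypothetical stand-in for the secret TLS byte.
    tls_cache_op_flag: int = 0


@dataclass
class Core:
    barriers_issued: int = 0

    def dsb(self) -> None:
        # Stand-in for a memory barrier executed on this core.
        self.barriers_issued += 1


def user_cache_op_begin(thread: Thread) -> None:
    # User mode: entering a flush / clean / zero routine.
    thread.tls_cache_op_flag = 1


def user_cache_op_end(thread: Thread) -> None:
    # Assumption: the byte is cleared once the routine has finished.
    thread.tls_cache_op_flag = 0


def kernel_on_interrupt(core: Core, thread: Thread) -> None:
    # Kernel mode: read the user-mode byte and only pay for a barrier
    # when the interrupted thread was in the middle of cache ops.
    if thread.tls_cache_op_flag == 1:
        core.dsb()
```

In this model an interrupt outside a cache routine costs nothing extra, which is presumably the point of using a flag instead of issuing a barrier on every interrupt.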
Why is this complicated TLS communication scheme necessary between user-mode
and kernel? Nintendo would not introduce this out of the blue; there must be
some weird hardware phenomenon going on.
This took some time to figure out, but imagine the following sequence of
instructions executing:
    dc civac, x8
    add x8, x8, #32
    dc civac, x8
    add x8, x8, #32
    dc civac, x8        ← what happens if you take an interrupt here?
    add x8, x8, #32
    dc civac, x8
    add x8, x8, #32
    dsb sy              ← memory barrier
    ret
An interrupt may be received by the CPU at any point during game execution.
Interrupts may lead to "core migration", which is when the kernel scheduler
moves a thread to a different CPU core.
If we imagine a core migration in this code sequence, we can clearly see the
problem:
    dc civac, x8        ← Core 0
    add x8, x8, #32     ← Core 0
    dc civac, x8        ← Core 0
    add x8, x8, #32     ← Core 0
    dc civac, x8        ← Core 1 (interrupt! core migration)
    add x8, x8, #32     ← Core 1
    dc civac, x8        ← Core 1
    add x8, x8, #32     ← Core 1
    dsb sy              ← Core 1 (memory barrier)
    ret
Do you see the problem? There was never a memory barrier on core 0!
This means that the cache ops issued on core 0 are *not necessarily* all
completed by the time the function returns! For a brief time, physical DRAM
may hold stale data for some of the cache lines.
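To make the failure concrete, here is a toy simulation of the sequence above. It is a sketch under the same assumption the post makes, namely that a `dsb` only guarantees completion of cache ops issued on the core that executes it. The `Core` class, `flush_range` and its parameters are hypothetical names, and where exactly the kernel issues its barrier is also an assumption.

```python
LINE_SIZE = 32  # stride used by the listing (add x8, x8, #32)


class Core:
    """Toy core: remembers cache ops it has issued but not yet completed."""

    def __init__(self):
        self.pending = []

    def dc_civac(self, addr):
        # Issue a clean+invalidate; completion is not guaranteed yet.
        self.pending.append(addr)

    def dsb(self):
        # Barrier: every op issued on THIS core is now complete.
        self.pending.clear()


def flush_range(start, lines, migrate_before=None, kernel_barrier=False):
    """Flush `lines` cache lines; return addresses still pending at `ret`.

    migrate_before: index of the dc civac before which an interrupt moves
                    the thread from core 0 to core 1 (None = no migration).
    kernel_barrier: model the 14.0.0 behaviour, where the kernel issues a
                    barrier on the interrupted core because the TLS byte is 1.
    """
    cores = [Core(), Core()]
    current = cores[0]
    addr = start
    for i in range(lines):
        if i == migrate_before:
            if kernel_barrier:
                current.dsb()   # barrier on the old core before leaving it
            current = cores[1]  # core migration
        current.dc_civac(addr)
        addr += LINE_SIZE
    current.dsb()               # the function's final dsb sy
    return [hex(a) for core in cores for a in core.pending]


print(flush_range(0x1000, 4))                    # []
print(flush_range(0x1000, 4, migrate_before=2))  # ['0x1000', '0x1020']
print(flush_range(0x1000, 4, migrate_before=2, kernel_barrier=True))  # []
```

The middle call reproduces the bug: the two lines flushed on core 0 are still pending when the function returns, because the only barrier ran on core 1.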
So to summarize, if:
(1) the CPU takes an interrupt inside a function like this (super rare)
AND
(2) the scheduler decides to perform core migration (super rare)
then you'd get some graphical glitches (games mainly use cache operations when
talking to the GPU).
In this situation, devs would probably blame faulty DRAM chips or CPU errata,
but this is purely a software bug!
This bug has existed since day zero, which means that it took 5 years (!) for
Nintendo to track it down.
Credit to whichever nameless employee at Nintendo found this bug! The attention
to detail is incredible. And how do you even find / debug a bug like this?
Makes you think: do Linux, Windows and Mac handle this properly? Honestly, I
doubt it!
Thanks to SciresM for discussion / diff.
—plutoo