@gabrielesvelto This was a phenomenal write-up, thank you!
-
@gabrielesvelto fantastic thread thank you

-
@gabrielesvelto Nice thread!
You seem to imply that bugs have become considerably more frequent, largely due to the increased complexity. Right?
To me it's not obvious that the larger number of known issues isn't to a large degree due to much better visibility (we didn't have anywhere close to today's automatic crash collection systems in the past) and due to the vastly increased number of CPUs... Do you have any gut feeling about that?
-
@gsuberland thanks, I was playing a bit fast and loose with the terminology. As I was writing these toots I reminded myself that entire books have been written just to model transistor behavior and propagation delay, and my very crude wording would probably give their authors a heart attack.
-
@AndresFreundTec I've been in charge of Firefox stability for ten years now and some of my early work to detect hardware issues dates back to then. In pre-2020 years we would get 2-3 bugs per year, usually across different CPUs. Now we get dozens, it's really on another level.
-
@AndresFreundTec admittedly we get a lot more after a new microarchitecture launches, and then they go down as microcode updates get rolled out. If Microsoft hadn't started shipping microcode updates with their OS updates we'd be swamped.
-
@gabrielesvelto
There’s also metastability. If a value is snapshotted halfway through changing, the output may occasionally be neither one nor zero, but some ‘half’ value. Depending on the circuits using that result, it may be interpreted as either 1 or 0, and different parts of the circuit may use different interpretations. Such intermediate states are only metastable, and will flip to a firm 1 or 0 at some indeterminate time later, possibly propagating the problem.
-
@KimSJ ah yes, very good point. It's been a while since my days in hardware land and I had forgotten about it.

-
@dubiousblur glad you liked it!
-
@tehstu yes, absolutely. I've encountered several bugs in AMD CPUs, not many on ARM just yet, but our ARM user-base is very small compared to x86, so it's just less likely for us to stumble upon them. Plus we have some machinery that can detect some hardware bugs automatically but it doesn't work on ARM just yet.
-
@gabrielesvelto but UEFI is already quite complex, it has to find block devices, read their partition tables, read FAT file systems, read directories and files, load data in memory and transfer execution. Wouldn't a patch applied after all that be too late?
-
@gabrielesvelto Intel's officially stated reason is that (too) high voltage (and temperature) caused fast degradation of clock trees inside cores. This degradation resulted in a duty cycle shift (square wave no longer square?), which caused general instability. If they use both posedge and negedge as triggers, then a change in duty cycle will definitely violate timing.
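
To make the duty-cycle argument concrete, here is a rough sketch of the arithmetic. It assumes a simplified model in which a path launched on one clock edge must settle before the opposite edge, so the time available is just the length of the high or low phase. The function name and all the numbers (a 200 ps period, a 95 ps path delay) are made up for illustration and are not taken from Intel's statement.

```python
def half_cycle_slack(period_ps, duty_cycle, path_delay_ps):
    """Timing slack for paths that span half a clock cycle.

    duty_cycle is the fraction of the period the clock spends high.
    A negative slack means the data arrives after the capturing edge,
    i.e. a timing violation. Illustrative model only.
    """
    high_phase = period_ps * duty_cycle          # rising edge -> falling edge
    low_phase = period_ps * (1.0 - duty_cycle)   # falling edge -> rising edge
    return {
        "rise_to_fall": high_phase - path_delay_ps,
        "fall_to_rise": low_phase - path_delay_ps,
    }


# Hypothetical numbers: 200 ps period (5 GHz), 95 ps logic path.
fresh = half_cycle_slack(200.0, 0.50, 95.0)  # ideal 50% duty cycle
aged = half_cycle_slack(200.0, 0.45, 95.0)   # duty cycle shifted to 45%

print(fresh)
print(aged)
```

With a 50% duty cycle both half-cycle paths have a small positive slack. After the shift to 45% the high phase shrinks to about 90 ps, so the rise-to-fall path goes negative while the fall-to-rise path gains margin. Paths that only use one edge see the full period and are unaffected in this model.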
-
@arclight timing degradation should not be visible outside of the highest-spec desktop CPUs which are really pushing the envelope even when they're new. Embedded systems and even mid-range desktop CPUs will never fail because of it. What might become visible is increased power consumption over time though.
-
@arclight on the other hand watch out for memory errors. Those can crop up much sooner than CPU problems due to circuit degradation: https://fosstodon.org/@gabrielesvelto/112407741329145666
-
@gabrielesvelto there was also no meaningful computer security, nor much need for it, in the days of the 6502. It's very different now that most computers are connected to the internet and can be infected with malware within seconds of connecting.
-
@mdione yes, it's very complex, but motherboard firmware has a mechanism to load the new microcode right as the CPU is bootstrapped. That is even before the CPU is capable of accessing DRAM. All the rest of the UEFI machinery runs after that. Note that this early bootstrap mechanism usually involves a separate bootstrap CPU, typically an embedded microcontroller whose task is to get the main x86 core up and running.
-
@gabrielesvelto I wonder if they could use said statistical toys as part of a large-scale fuzzing process to detect such bugs?
-
Fascinating thread. Do you know if the same issues exist on low power, embedded CPUs like ESP32, or is this something that mostly affects high-end stuff?
-
@perpetuum_mobile @gabrielesvelto I even used to code in assembler on 8-bit platforms. For years I could not quite get my head round how modern CPUs worked until this thread (and now I know a bit more)
-
I don’t cut any slack for Intel producing two whole generations of CPUs with manufacturing flaws then trying to cover it up and never really offering full restitution to any customers.