performance patterns not comparable across reboots

Paul

2024-07-05 10:06:32 UTC

Post by Christian Dürrhauer
Hi,
i would like to get a fresh view on my topic.
The awkward behavior on an Ivy Bridge-based machine (stable operation, no
question) is that a few of the hardware components perform differently across
reboots. There is no reliable pattern, at least none that I was able
to find, but it happens in roughly 5 out of 15 reboots. The machine is a
digital video recorder and it certainly does not need to be replaced.
There is an NVMe 1.3 SSD in a PCIe 4.0 card in a PCIe 3.0 x4 slot (Samsung 990
Pro).
There is a Realtek 8125B 2.5GbE network card (PCIe 2.0 x1) in a PCIe 2.0 x1
slot.
Ubuntu 22.04.4 current (kernel 5.15.101 plus Realtek driver package, r8169
driver blacklisted), booting from SATA drive.
When the issue occurs, SSD delivers 1.9GB/s. Network card delivers 169MB/s.
In normal cases, SSD delivers 3.5GB/s, network card delivers 275MB/s (so the
difference is significant, but still functionally ok).
Like i said, i fail to see a pattern. System log files are just too huge, but
despite that i tried to compare them and am relatively confident i did not
find anything striking.
I have swapped power supply, mainboard, SSD, RAM, CPU, keyboard/mouse. Booting
other Ubuntu (clonezilla images) - looks similar.
Tried googling it but no way finding something, google is too smart and knows
what i was looking for (totally polluted with same search terms but totally
different context).
Anyone having an idea what is happening here?

The numbers suggest improper PCI Express negotiation.
275MB/sec is close to the expected 280MB/sec for a 2.5GbE LAN.
This means the PCI Express link is running at the expected rate in
the same boots where the SSD gets 3.5GB/sec. If the PCIe link to the
Realtek is running at half the rate, then the network output
will also be "clipped" accordingly. Peripheral cards,
for the most part, neatly survive starvation and still function.
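
As a rough sanity check on those numbers, here is a small Python sketch of the arithmetic. It uses the standard per-lane signalling rates and line encodings for each PCIe generation, and ignores packet and protocol overhead, so real throughput will land somewhat below these ceilings. The function names are just illustrative.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency per PCIe generation.
# 1.x and 2.0 use 8b/10b encoding; 3.0 and 4.0 use 128b/130b.
PCIE_GENERATIONS = {
    "1.x": (2.5, 8 / 10),
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
}


def link_bandwidth_mb_s(generation, lanes):
    """Link ceiling in MB/s, before packet/protocol overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    # GT/s * efficiency = Gbit/s of data; /8 = GB/s; *1000 = MB/s
    return gt_per_s * efficiency / 8 * 1000 * lanes


def slowest_generation_that_fits(observed_mb_s, lanes):
    """Lowest generation whose ceiling still covers the observed rate."""
    for generation in PCIE_GENERATIONS:  # dicts preserve insertion order
        if link_bandwidth_mb_s(generation, lanes) >= observed_mb_s:
            return generation
    return None


if __name__ == "__main__":
    # Throughput figures taken from the original post (MB/s).
    cases = [
        ("SSD, good boot", 4, 3500),
        ("SSD, bad boot", 4, 1900),
        ("NIC, good boot", 1, 275),
        ("NIC, bad boot", 1, 169),
    ]
    for label, lanes, observed in cases:
        print(label, "fits in PCIe", slowest_generation_that_fits(observed, lanes))
```

With the figures from the post, the good-boot rates need at least a 3.0 x4 link (SSD) and a 2.0 x1 link (NIC), while the bad-boot rates fit inside 2.0 x4 and 1.x x1. That is consistent with both links training one generation low, though it does not prove it.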

An NVMe drive can be connected to the processor directly, or it can use
the PCH (Southbridge) x4 interface, which usually runs one PCIe
revision lower than the CPU one. There can be two NVMe connectors
on the motherboard. The one nearest the CPU runs at 2x the speed
of the one nearest the Southbridge heatsink.

   CPU --- PCIe Rev4 --- NVMe connector
    |
   DMI Rev3                              ^
    |                                     \
   PCH --- PCIe Rev3 --- NVMe connector  <=== PCIe can rate-reduce down to version 1.1 by itself,
                                              as part of the startup procedure for it. Some modern
                                              video cards have done this, without telling you.

The Southbridge (PCH) is usually over-subscribed, which means if all
the "peripherals" on the Southbridge become busy at the same
time, the DMI from CPU to PCH does not have the bandwidth for
that. But I don't think that is happening. And that does not
affect PHY negotiation in any case. The DMI bus, even if it were
forced into a pathological test case, continues to run, and most
of the time the user might not even be aware there is an issue.

Like the gears on a vehicle, for some reason a PCIe hub is running
one standards version too slow. There is probably a way to "jam" this
in software. For example, a few video cards were given a different
video BIOS, to force their bus interface to the PCIe 1.1 revision rate
for stability reasons (8800 era). This made the video card not quite as fast
as it could have been, but it also ensured the video card always worked,
which is pretty important. No more black screens.

I don't know if "dmesg" has log entries for PCIe or not. The
hardware itself can negotiate for the highest rate. But there
should be more than one mechanism for interfering with that.
I'm a bit worried this is a BIOS code issue (SMI/SMM runs multiple
times a second).
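
Independent of what dmesg logs, Linux exposes the negotiated link state through sysfs. The following is a minimal sketch, assuming the `current_link_speed`, `current_link_width`, `max_link_speed` and `max_link_width` attributes are present under `/sys/bus/pci/devices` (they are on reasonably recent kernels; devices without them are skipped). `link_snapshot` is just an illustrative name.

```python
from pathlib import Path

LINK_ATTRS = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_attr(device_dir, name):
    """Return a stripped sysfs attribute value, or None if it cannot be read."""
    try:
        return (device_dir / name).read_text().strip()
    except OSError:
        return None


def link_snapshot(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> (cur_speed, cur_width, max_speed, max_width)."""
    snapshot = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return snapshot
    for device_dir in sorted(root.iterdir()):
        values = tuple(read_attr(device_dir, name) for name in LINK_ATTRS)
        if None in values:
            continue  # not a PCIe device, or attributes not exposed
        snapshot[device_dir.name] = values
    return snapshot


if __name__ == "__main__":
    for address, (cur_s, cur_w, max_s, max_w) in link_snapshot().items():
        # Flag any link that trained below the device's own maximum.
        flag = "" if (cur_s, cur_w) == (max_s, max_w) else "  <-- below device maximum"
        print(f"{address}: {cur_s} x{cur_w} (max {max_s} x{max_w}){flag}")
```

Save the output after a "good" boot and after a "bad" boot and compare the two. Note that the maximum is the capability of the device, not of the slot, so a PCIe 4.0 SSD in a 3.0 slot will always show as below maximum. What matters is whether the current speed changes between boots.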

One thing I have discovered to my horror, is the BIOS is
pretty autonomous and not above mischief. My processor was
crashing, but this was no ordinary crash. This was not an MCE
(Machine Check Exception) like on a legacy CPU. Instead, it would
appear the BIOS "parked" my processor and turned off both the
keyboard +5V and the mouse +5V (PS/2 and USB). None of the
USB ports worked. The mains power (measured by a meter which
is always present) showed 54W, versus 36W at idle. It's
my belief the BIOS did this in an SMI service routine. But, I
cannot find any documentation, nor a means of monitoring the
BIOS while the OS is running.

Placing the CPU into another motherboard, the CPU runs normally.
The BIOS will eventually "tune something" to the point of ruin,
but it might take a long time before one of these "crashes" comes back.
And it's not really a crash, it is a kind of Safe Mode for Hardware.
There is no documentation. Other people have noted something is
wrong with C state control, and switching off C states (CPU runs warmer),
also apparently eliminates this BIOS issue.

When a BIOS ("UEFI") screws around, that destroys the "trust" we had
in the Legacy BIOS era. UEFI can be programmed from the OS. UEFI
can even agree to flash itself (automatic updates from motherboard
maker). There is a huge attack surface for mischief on these
newer motherboards. Such a bad, bad idea. We have learned nothing,
apparently, over the years, about defensive design.

Paul