performance - What happens after a L2 TLB miss?


1 Answer

深蓝 · Answer 1 · 2021-10-16T22:10:23+0000

(Some of this is x86 and Intel-specific. Most of the key points apply to any CPU that does hardware page walks. I also discuss ISAs like MIPS that handle TLB misses with software.)

Modern x86 microarchitectures have dedicated page-walk hardware. They can even speculatively do page-walks to load TLB entries before a TLB miss actually happens. And to support hardware virtualization, the page-walkers can handle guest page tables inside a host VM. (Guest physical memory = host virtual memory, more or less. VMware published a paper with a summary of EPT, and benchmarks on Nehalem.)
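To get a feel for why nested (guest + host) page walks are expensive, here is a rough count of the memory references a fully uncached walk needs. This is only a sketch: the function name is made up for illustration, and it assumes no TLB hits, no paging-structure caches, and no nested-TLB hits, so it is a worst case rather than what real hardware typically pays.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case page-table loads for one guest virtual address.

    Every guest page-table pointer is a guest-physical address, so reading
    each guest entry first needs a full host walk (host_levels loads) and
    then the load of the guest entry itself. The final guest-physical
    address of the data needs one more host walk.
    """
    per_guest_level = host_levels + 1
    final_translation = host_levels
    return guest_levels * per_guest_level + final_translation


if __name__ == "__main__":
    # 4-level guest tables on top of 4-level host tables
    print(nested_walk_refs(4, 4))
    # no virtualization: just the 4 native loads
    print(nested_walk_refs(4, 0))
```

With 4-level tables on both sides this works out to 24 loads instead of 4, which is part of why caching inside the page-walk hardware matters so much under virtualization.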

Skylake can even have two page walks in flight at once; see Section 2.1.3 of Intel's optimization manual. (Intel also lowered the page-split load penalty from ~100 to ~5 or 10 extra cycles of latency, about the same as a cache-line split but with worse throughput. This may be related, or maybe adding a 2nd page-walk unit was a separate response to discovering that page-split accesses (and TLB misses?) were more important in real workloads than they had previously estimated.)

Some microarchitectures protect you from speculative page-walks by treating it as mis-speculation when an un-cached PTE is speculatively loaded but then modified with a store to the page table before the first real use of the entry. That is, they snoop for stores to the page-table entries behind speculative-only TLB entries that haven't been architecturally referenced by any earlier instruction.

(Win9x depended on this, and not breaking important existing code is something CPU vendors care about. When Win9x was written, the current TLB-invalidation rules didn't exist yet so it wasn't even a bug; see Andy Glew's comments quoted below). AMD Bulldozer-family violates this assumption, giving you only what the x86 manuals say on paper.

The page-table loads generated by the page-walk hardware can hit in L1, L2, or L3 caches. Broadwell perf counters, for example, can count page-walk hits in your choice of L1, L2, L3, or memory (i.e. cache miss). For example, the event PAGE_WALKER_LOADS.DTLB_L1 counts "Number of DTLB page walker hits in the L1+FB", and there are other events for the ITLB and for other levels of cache.

Since modern page tables use a radix-tree format, with page directory entries pointing to the tables of page table entries, higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. This means you need to flush the TLB in cases where you might think you didn't need to. Intel and AMD actually do this, according to this paper (section 3). So does ARM, with their Intermediate table walk cache.
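A toy model of the radix-tree walk and of caching upper-level entries may make this more concrete. It is a minimal sketch, not how any real walker is built: it assumes x86-64 style 4-level tables with 4 KiB pages, models each table as a Python dict, leaves out the TLB itself, and the names (split_vaddr, map_page, PageWalker, pde_cache) are invented for illustration.

```python
LEVELS = 4          # PML4, PDPT, PD, PT
BITS_PER_LEVEL = 9  # 512 entries per table
PAGE_SHIFT = 12     # 4 KiB pages


def split_vaddr(vaddr):
    """Return ([pml4_idx, pdpt_idx, pd_idx, pt_idx], page_offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def map_page(root, vaddr, frame):
    """OS side: install a translation in the nested-dict page table."""
    indices, _ = split_vaddr(vaddr)
    table = root
    for idx in indices[:3]:
        table = table.setdefault(idx, {})
    table[indices[3]] = frame


class PageWalker:
    def __init__(self, root):
        self.root = root
        self.pde_cache = {}  # (pml4, pdpt, pd) indices -> leaf page table
        self.loads = 0       # page-table loads issued so far

    def walk(self, vaddr):
        indices, offset = split_vaddr(vaddr)
        upper = tuple(indices[:3])
        table = self.pde_cache.get(upper)
        if table is None:
            # Full walk: three dependent loads to reach the leaf table.
            table = self.root
            for idx in indices[:3]:
                self.loads += 1
                table = table[idx]  # KeyError models a not-present entry
            self.pde_cache[upper] = table
        self.loads += 1             # load of the final PTE
        return (table[indices[3]] << PAGE_SHIFT) | offset

    def flush(self):
        self.pde_cache.clear()
```

In this model the first walk of a 2 MiB region costs four loads and later walks in the same region cost one. It also shows the downside: if the OS repoints an upper-level entry at a different table, the walker keeps using the stale cached table until flush() is called, which is the "flush when you might think you didn't need to" case.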

That paper says that page-walk loads on AMD CPUs ignore L1, but do go through L2. (Perhaps to avoid polluting L1, or to reduce contention for read ports). Anyway, this makes caching a few high-level PDEs (that each cover many different translation entries) inside the page-walk hardware even more valuable, because a chain of pointer-chasing is more costly with higher latency.

But note that x86 guarantees no negative caching of TLB entries. Changing a page from Invalid to Valid doesn't require invlpg. (So if a real implementation does want to do that kind of negative caching, it has to snoop or somehow still implement the semantics guaranteed by x86 manuals.)

(Historical note: Andy Glew's answer to a duplicate of this question over on electronics.SE says that in P5 and earlier, hardware page-walk loads bypassed the internal L1 cache (it was usually write-through so this made pagewalk coherent with stores). IIRC, my Pentium MMX motherboard had L2 cache on the mobo, perhaps as a memory-side cache. Andy also confirms that P6 and later do load from the normal L1d cache.)

That other answer has some interesting links at the end, too, including the paper I linked at the end of the last paragraph. It also seems to think the OS might update the TLB itself, rather than just the page table, on a page fault (HW pagewalk doesn't find an entry), and wonders if HW page walking can be disabled on x86. (But actually the OS just modifies the page table in memory, and returning from #PF re-runs the faulting instruction so HW pagewalk will succeed this time.) Perhaps the paper is thinking of ISAs like MIPS where software TLB management / miss-handling is possible.

I don't think it's actually possible to disable HW pagewalk on P5 (or any other x86). That would require a way for software to update TLB entries with a dedicated instruction (there isn't one), or with wrmsr or an MMIO store. Confusingly, Andy says (in a thread I quoted below) that software TLB handling was faster on P5. I think he meant it would have been faster if it had been possible. He was working at Imation (on MIPS) at the time, where SW page walk is an option (sometimes the only option), unlike x86 AFAIK.

As Paul Clayton points out (on another question about TLB misses), the big advantage of hardware page-walks is that TLB misses don't necessarily stall the CPU. (Out-of-order execution proceeds normally, until the re-order buffer fills because the load/store can't retire. Retirement happens in-order, because the CPU can't officially commit anything that shouldn't have happened if a previous instruction faulted.)

BTW, it would probably be possible to build an x86 CPU that handles TLB misses by trapping to microcode instead of having a dedicated hardware state-machine. This would be (much?) less performant, and maybe not worth triggering speculatively (since issuing uops from microcode means you can't be issuing instructions from the code that's running).

Microcoded TLB handling could in theory be non-terrible if you run those uops in a separate hardware thread (interesting idea), SMT-style. You'd need it to have much less start/stop overhead than normal Hyperthreading for switching from single-thread to both logical cores active (has to wait for things to drain until it can partition the ROB, store queue, and so on) because it will start/stop extremely often compared to a usual logical core. But that may be possible if it's not really a fully separate thread but just some separate retirement state, so cache misses in it don't block retirement of the main code, and have it use a couple hidden internal registers for temporaries. The code it has to run is chosen by the CPU designers, so the extra HW thread doesn't have to have anywhere near the full architectural state of an x86 core. It rarely has to do any stores (maybe just for the accessed flags in PTEs?), so it wouldn't be bad to let those stores use the same store queue as the main thread. You'd just partition the front-end to mix in the TLB-management uops and let them execute out of order with the main thread. If you could keep the number of uops per pagewalk small, it might not suck.

No CPUs actually do "HW" page-walks with microcode in a separate HW thread that I'm aware of, but it is a theoretical possibility.

Software TLB handling: some RISCs are like this, not x86

In some RISC architectures (like MIPS), the OS kernel is responsible for handling TLB misses. A TLB miss results in execution of the kernel's TLB miss interrupt handler. This means the OS is free to define its own page table format on such architectures. I guess marking a page as dirty after a write also requires a trap to an OS-provided routine, if the CPU doesn't know about the page table format.
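The software-managed flow can be sketched as a small simulation. This is an illustration only, not MIPS code: the class and function names are hypothetical, tlb_write stands in for whatever privileged TLB-write instruction the ISA provides, the eviction policy (FIFO) is an arbitrary choice, and the "page table" is a flat dict precisely because the OS is free to pick any format.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4 KiB pages (assumed)


class PageFault(Exception):
    pass


class SoftTLBCpu:
    def __init__(self, tlb_entries, miss_handler):
        self.tlb = OrderedDict()          # vpn -> pfn, oldest entry first
        self.tlb_entries = tlb_entries
        self.miss_handler = miss_handler  # the kernel's TLB-miss routine
        self.misses = 0

    def tlb_write(self, vpn, pfn):
        """Stand-in for a privileged TLB-write instruction."""
        if len(self.tlb) >= self.tlb_entries:
            self.tlb.popitem(last=False)  # evict the oldest entry (FIFO)
        self.tlb[vpn] = pfn

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.tlb:
            # Hardware only knows the TLB; a miss traps to the OS,
            # and the access is retried after the handler returns.
            self.misses += 1
            self.miss_handler(self, vpn)
        return (self.tlb[vpn] << PAGE_SHIFT) | offset


def make_handler(page_table):
    """Build a miss handler over an OS-defined table (here: dict vpn -> pfn)."""
    def handler(cpu, vpn):
        if vpn not in page_table:
            raise PageFault(hex(vpn << PAGE_SHIFT))
        cpu.tlb_write(vpn, page_table[vpn])
    return handler
```

For example, SoftTLBCpu(2, make_handler({0x10: 0xA0})).translate(0x10123) takes one miss, refills the TLB through the handler, and returns 0xA0123; a second access to the same page hits without involving the handler at all.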

This chapter from an operating systems textbook explains virtual memory, page tables, and TLBs. They describe the difference between software-managed TLBs (MIPS, SPARCv9) and hardware-managed TLBs (x86). A paper, "A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations", shows some example code from what it says is the TLB miss handler in Ultrix, if you want a real example.

Comments about TLB coherency from Andy Glew, one of the architects of Intel P6 (Pentium Pro / II / III), who later worked at AMD.

The main reason Intel started running the page ta
