One guess that occurs to me is that Intel intended ring 1 code, while it is running, to be the supervisor that "supervises" ring 3 code, rather than something that runs under ring 0.
If the ring 1 code wants to call ring 0 code, it can call through a call-gate, and the ring 0 code can change CR3 to a page table that includes mappings for physical pages that weren't present in the page table the ring 1 or 2 code was using.
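To make the call-gate idea a bit more concrete, here is a minimal sketch of how a 32-bit call gate descriptor is laid out in the GDT. The field layout follows the documented 8-byte gate format; the names call_gate_t and make_call_gate are hypothetical, made up for this illustration. The gate's DPL is what decides which rings may call through it, so a gate with DPL 1 would be usable from ring 1 but not from ring 3.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit call gate descriptor as stored in the GDT or LDT (8 bytes). */
typedef struct {
    uint16_t offset_low;   /* bits 15:0 of the entry point */
    uint16_t selector;     /* code segment selector of the target */
    uint8_t  param_count;  /* bits 4:0: dwords copied to the new stack */
    uint8_t  access;       /* P (bit 7), DPL (bits 6:5), S=0, type=0xC */
    uint16_t offset_high;  /* bits 31:16 of the entry point */
} call_gate_t;

/* Hypothetical helper: build a present 32-bit call gate.
   gate_dpl is the least privileged ring allowed to call through it. */
static call_gate_t make_call_gate(uint32_t entry, uint16_t target_cs,
                                  uint8_t gate_dpl, uint8_t params)
{
    assert(gate_dpl <= 3 && params <= 31);
    call_gate_t g;
    g.offset_low  = (uint16_t)(entry & 0xFFFFu);
    g.selector    = target_cs;
    g.param_count = params;
    g.access      = (uint8_t)(0x80u | (gate_dpl << 5) | 0x0Cu);
    g.offset_high = (uint16_t)(entry >> 16);
    return g;
}
```

For example, make_call_gate(entry, 0x08, 1, 0) would describe a gate into a ring 0 code segment at selector 0x08 (an assumed GDT layout) that ring 1 code may call with no stack parameters.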
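I really don't know a lot about this stuff, but https://wiki.osdev.org/Task_State_Segment shows that the TSS includes a CR3 field, so I'm guessing that with hardware task-switching, a far call through a task gate can trigger the CR3 change directly. (A plain call gate only changes CS:EIP and the stack; it's the task switch that loads CR3 from the new TSS.) So the call target does not already have to be mapped, otherwise ring 1 / 2 code could have modified it. Or it could be mapped read-only, along with the page table itself and the GDT, to stop the ring 1 code from taking over ring 0 by modifying it. (That only helps with CR0.WP set, which exists on 486 and later: as far as paging is concerned, rings 0 to 2 are all supervisor mode, and without WP supervisor writes ignore the read-only bit.)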
"This means that an OS that only uses paging [...] unable to benefit from the existence of rings 1 and 2"
That's your mistake: you can't "only use paging". Even making interrupt handling from user-space work on a normal x86 OS (with a flat memory model) requires setting up a TSS so the CPU can load SS:ESP with the kernel stack pointer (the SS0 and ESP0 fields) when switching to kernel mode, even if you don't otherwise use hardware task-switching.
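As a rough illustration of what the TSS involves, here is a sketch of the 32-bit TSS layout as a C struct, with compile-time checks on the offsets that matter here: the ring 0 stack fields and the CR3 field. The struct name tss32_t and the helper tss_set_kernel_stack are hypothetical names for this example; a real kernel would also need a TSS descriptor in the GDT and an ltr instruction, which are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit Task State Segment: 104 bytes, every field naturally aligned. */
typedef struct {
    uint16_t prev_task, reserved0;
    uint32_t esp0;                 /* stack pointer loaded on entry to ring 0 */
    uint16_t ss0, reserved1;       /* stack segment loaded on entry to ring 0 */
    uint32_t esp1;
    uint16_t ss1, reserved2;       /* same pair for ring 1 */
    uint32_t esp2;
    uint16_t ss2, reserved3;       /* same pair for ring 2 */
    uint32_t cr3;                  /* loaded only on a hardware task switch */
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint16_t es, reserved4, cs, reserved5, ss, reserved6;
    uint16_t ds, reserved7, fs, reserved8, gs, reserved9;
    uint16_t ldt, reserved10;
    uint16_t trap;                 /* bit 0: debug trap on task switch */
    uint16_t iomap_base;           /* offset of the I/O permission bitmap */
} tss32_t;

static_assert(offsetof(tss32_t, esp0) == 4,  "ESP0 lives at offset 4");
static_assert(offsetof(tss32_t, ss0)  == 8,  "SS0 lives at offset 8");
static_assert(offsetof(tss32_t, cr3)  == 28, "CR3 lives at offset 28");
static_assert(sizeof(tss32_t) == 104, "32-bit TSS is 104 bytes");

/* Hypothetical helper: the only TSS setup a flat-model OS typically needs
   is the stack to use when an interrupt arrives in user mode. */
static void tss_set_kernel_stack(tss32_t *tss, uint16_t kernel_ss,
                                 uint32_t kernel_esp)
{
    tss->ss0  = kernel_ss;
    tss->esp0 = kernel_esp;
}
```

The esp1/ss1 and esp2/ss2 pairs are the part of this structure that exists specifically for rings 1 and 2: the CPU picks the stack matching the ring it is switching to.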
x86 has "task gates" and "call gates" and all kinds of really complex stuff I hope I don't ever have to fully understand, but I expect that spending some time reading up on it might shed some light on the kind of things the architects of 386 thought OSes might want to do.
Separate from my previous guess (about ring 1 supervising ring 3), perhaps Intel expected OSes to use segmentation to separate ring 1 / 2 from ring 0 memory in the same page table if desired (see footnote 1). As you say, they probably weren't trying to create something that portable microkernel OSes could just use as a bonus.
A kernel has the luxury of deciding the layout of virtual address space, so it could well assign chunks of that for use by ring 1 code, setting up CS/DS/ES/SS appropriately when calling it.
I think that would have to mean a non-flat model, though, because x86 segmentation makes addresses go from 0..limit, not e.g. allowing access to a range of virtual addresses from low..high without changing the meaning of a pointer.
Footnote 1:
Is it necessary to have full memory protection between ring 0 and ring 1? An OS might use ring 1 for semi-trusted code.
Some privileged instructions require ring 0, so running code in ring 1 would stop it from executing those by accident. The I/O privilege level (IOPL) can be set separately to allow cli and in/out in ring > 0, but other instructions like invlpg, lgdt, and mov cr, reg require actual ring 0.
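The two kinds of check can be summarized in a small model. This is only a sketch of the rule as described in this footnote, assuming protected mode (not virtual-8086 mode); it also ignores the TSS I/O permission bitmap, which can additionally allow in/out on specific ports when CPL > IOPL. The names insn_t and insn_allowed are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    INSN_CLI, INSN_IN, INSN_OUT,          /* IOPL-sensitive */
    INSN_INVLPG, INSN_LGDT, INSN_MOV_CR   /* require CPL 0 */
} insn_t;

/* Would this instruction run without faulting at the given current
   privilege level (cpl) and EFLAGS.IOPL value (iopl)? */
static bool insn_allowed(insn_t insn, unsigned cpl, unsigned iopl)
{
    assert(cpl <= 3 && iopl <= 3);
    switch (insn) {
    case INSN_CLI:
    case INSN_IN:
    case INSN_OUT:
        return cpl <= iopl;   /* numerically lower ring = more privileged */
    default:
        return cpl == 0;      /* IOPL is irrelevant for these */
    }
}
```

So with IOPL set to 1, ring 1 code could run cli and port I/O, while lgdt would still fault there no matter what IOPL is.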