A core always tries to commit its store buffer to L1d cache (and thus become globally visible) as fast as possible, to make room for more stores.
You can use a barrier (like `atomic_thread_fence(memory_order_seq_cst)`) to make a thread wait for its stores to become globally visible before doing any more loads or stores, but that works by blocking this core, not by speeding up flushing the store buffer.
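For reference, a minimal sketch of what that looks like in C11 (the `publish` function and `data` variable are made-up names for illustration, not from the question):

```c
#include <stdatomic.h>

_Atomic int data;            // hypothetical shared variable

void publish(int v) {
    atomic_store_explicit(&data, v, memory_order_relaxed);
    // Full barrier: this thread waits until its earlier stores are
    // globally visible before doing any later loads or stores.
    // It doesn't make the store buffer drain any faster.
    atomic_thread_fence(memory_order_seq_cst);
}
```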
Obviously to avoid undefined behaviour in C11, the variable has to be `_Atomic`. If there's only one writer, you might use `tmp = atomic_load_explicit(&x, memory_order_relaxed)` and a `store_explicit` of `tmp+1` to avoid a more expensive seq_cst store or atomic RMW. acq/rel ordering would work too; just avoid the default seq_cst, and avoid an `atomic_fetch_add` RMW if there's only one writer.
You don't need the whole RMW operation to be atomic if only one thread ever modifies it, and other threads access it read-only.
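A minimal sketch of that single-writer pattern (the `counter` name and the acquire/release choice here are illustrative assumptions, not requirements):

```c
#include <stdatomic.h>

_Atomic int counter;         // written by exactly one thread

// Writer: separate load and store, no lock-prefixed RMW needed, because
// no other thread ever writes `counter`, so the read-modify-write can't race.
void writer_increment(void) {
    int tmp = atomic_load_explicit(&counter, memory_order_relaxed);
    atomic_store_explicit(&counter, tmp + 1, memory_order_release);
    // release only if readers need to see earlier stores when they
    // observe the new value; otherwise relaxed is enough.
}

// Readers: read-only access.
int reader_get(void) {
    return atomic_load_explicit(&counter, memory_order_acquire);
}
```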
Before another core can read data you wrote, it has to make its way from Modified state in the L1d of the core that wrote it out to L3 cache, and from there to the L1d of the reader core.
You might be able to speed this part along, which happens after the data leaves the store buffer. But there's not much you can usefully do. You don't want to `clflush`/`clflushopt`, which would write back + evict the cache line entirely, so the other core would have to get it from DRAM if it didn't try to read it at some point along the way (if that's even possible).
Ice Lake has `clwb`, which (hopefully) leaves the data cached as well as forcing write-back to DRAM. But again that forces data to actually go all the way to DRAM, not just to a shared outer cache, so it costs DRAM bandwidth and is presumably slower than we'd like. (Skylake-Xeon has it, too, but handles it the same as `clflushopt`. I expect & hope that Ice Lake client/server has / will have a proper implementation.)
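If you did want to try it, the intrinsic is `_mm_clwb`. A hedged sketch (the helper name is mine, and the fence is only needed if you require ordering against later stores):

```c
#include <immintrin.h>       // _mm_clwb: compile with -mclwb, needs CLWB hardware support

// Hypothetical helper: force write-back of the cache line containing *p.
// On a "proper" CLWB implementation the line stays cached; on Skylake-Xeon
// it behaves like clflushopt and evicts it too.
static inline void writeback_line(void *p) {
    _mm_clwb(p);
    _mm_sfence();            // order the write-back before later stores, if that matters
}
```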
Tremont (successor to Goldmont Plus, the Atom/Silvermont series) has `_mm_cldemote` (`cldemote`). That's like the opposite of a SW prefetch; it's an optional performance hint to write the cache line out to L3, but it doesn't force it to go to DRAM or anything.
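A sketch of what using it might look like (GCC/Clang spell the intrinsic `_cldemote`, while Intel's intrinsics guide lists it as `_mm_cldemote`; the helper name here is made up):

```c
#include <immintrin.h>       // compile with -mcldemote

// Hypothetical producer-side helper: after writing a line the consumer will
// soon read, hint the CPU to push it out toward the shared (L3) cache.
// It's only a hint; on CPUs without CLDEMOTE it should execute as a NOP.
static inline void demote_line(void *p) {
    _cldemote(p);
}
```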
Without special instructions, maybe you can write to 8 other locations that alias the same set in L2 and L1d cache, forcing a conflict eviction. That would cost extra time in the writing thread, but could make the data available sooner to other threads that want to read it. I haven't tried this.
And this would probably evict other lines, too, costing more L3 traffic = system wide shared resources, not just costing time in the producer thread. You'd only ever consider this for latency, not throughput, unless the other lines were ones you wanted to write and evict anyway.
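An untested sketch of that conflict-eviction idea, under assumptions that aren't from the answer itself: a 32 KiB, 8-way L1d with 64-byte lines, so addresses 4 KiB apart map to the same L1d set. L2 would need a larger stride, and an adaptive replacement policy might defeat this anyway.

```c
#include <stdint.h>
#include <stddef.h>

#define LINE        64
#define SET_STRIDE  4096     // assumed L1d way size: 32 KiB / 8 ways
#define N_ALIASES   9        // more than the assumed 8-way associativity

// volatile so the dead stores don't get optimized away
static _Alignas(SET_STRIDE) volatile char alias_buf[N_ALIASES * SET_STRIDE];

// Dirty N_ALIASES lines mapping to the same L1d set as *target, hoping to
// conflict-evict target's (already-written) line out toward shared cache.
static void force_evict_hint(const void *target) {
    size_t set_off = (uintptr_t)target & (SET_STRIDE - 1) & ~(size_t)(LINE - 1);
    for (int i = 0; i < N_ALIASES; i++)
        alias_buf[(size_t)i * SET_STRIDE + set_off] = 1;
}
```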