I believe your question is the same as asking if mfence has the same barrier semantics as the lock-prefixed instructions on x86, or if it provides fewer[1] or additional guarantees in some cases.
My current best answer is that it was Intel's intent, and that the ISA documentation guarantees, that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.
We know this because Intel says so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which state that locked instructions don't fence a subsequent non-temporal read from WC memory:
MOVNTDQA From WC Memory May Pass Earlier Locked Instructions
Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an
earlier locked instruction that accesses a different cache line.
Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.
Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA
should insert an MFENCE instruction between the locked instruction
and subsequent (V)MOVNTDQA instruction.
From this, we can determine that (1) Intel probably intended that locked instructions fence NT loads from WC-type memory, or else this wouldn't be an erratum[0.5], and (2) that locked instructions don't actually do that, that Intel wasn't able to or chose not to fix this with a microcode update, and that mfence is recommended instead.
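The workaround in the erratum can be sketched in C with intrinsics. This is only an illustration under some assumptions: the function name `locked_then_nt_load` and its parameters are made up for this example, and the buffer used here is ordinary write-back memory, where MOVNTDQA behaves like a normal load. Getting an actual WC mapping is OS-specific and not shown. On non-x86 targets the sketch falls back to a portable placeholder so that it still compiles.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <immintrin.h>

/* Hypothetical helper: do a locked RMW, then a streaming load of 16 bytes. */
__attribute__((target("sse4.1")))
void locked_then_nt_load(atomic_int *flag, void *src, uint8_t out[16])
{
    assert(((uintptr_t)src & 15) == 0);  /* MOVNTDQA requires 16-byte alignment */

    /* Locked instruction: on x86 this is typically emitted as XCHG
       (implicitly locked) or another LOCK-prefixed RMW. */
    atomic_exchange(flag, 1);

    /* Erratum workaround: an explicit MFENCE between the locked
       instruction and the subsequent (V)MOVNTDQA. */
    _mm_mfence();

    __m128i v = _mm_stream_load_si128((__m128i *)src);  /* MOVNTDQA */
    _mm_storeu_si128((__m128i *)out, v);
}
#else
/* Placeholder for non-x86 targets: no MOVNTDQA here, just a fenced copy. */
void locked_then_nt_load(atomic_int *flag, void *src, uint8_t out[16])
{
    assert(src != NULL);
    atomic_exchange(flag, 1);
    atomic_thread_fence(memory_order_seq_cst);
    memcpy(out, src, 16);
}
#endif
```

Without the `_mm_mfence()` call, the affected parts may let the streaming load appear to pass the locked instruction when the source really is WC memory, which is exactly the case the erratum describes.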
In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. This has pretty much the same text as the lock-instruction erratum, but applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".
This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with an apparently less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions work all the way back to Haswell was probably either impossible or expensive given their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.
It doesn't really explain, however, why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum.
One might speculate on what Intel will do in the future. Since they weren't able or willing to change the lock instruction for Haswell through Skylake, representing hundreds of millions (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so that they do fence such reads, but as a practical matter you probably can't rely on this for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.
Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.
Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.
The remaining part of this answer is my original answer, before I knew about the errata, and which is left mostly for historical interest.
Old Answer
My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).
That said, this is just an informed guess and you'll find details of my investigation below.
Details
Documentation
It isn't exactly clear to what extent the memory consistency effects of mfence differ from those provided by lock-prefixed instructions (including xchg with a memory operand, which is implicitly locked).
I think it is safe to say that, solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.
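For the plain write-back case, the two kinds of barrier can be used interchangeably to prevent StoreLoad reordering. Below is a rough sketch; the function names are invented for this example, and it leans on x86 hardware behavior rather than on anything the C11 memory model promises for the locked variant. On non-x86 targets the mfence variant falls back to a portable fence so the code still compiles.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Full barrier via MFENCE (portable fence used as a stand-in elsewhere). */
static inline void full_barrier_mfence(void)
{
#if defined(__x86_64__)
    _mm_mfence();
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Full barrier via a locked RMW on a private dummy location.
   On x86, compilers typically emit a LOCK-prefixed add for this. */
static inline void full_barrier_locked(void)
{
    static _Thread_local atomic_uint dummy;
    atomic_fetch_add_explicit(&dummy, 1u, memory_order_seq_cst);
}

/* One side of a Dekker-style handshake: store to our own flag, barrier,
   then load the other side's flag. Without a StoreLoad barrier, x86 may
   let the load complete before the store becomes globally visible. */
int dekker_side(atomic_int *mine, atomic_int *theirs, void (*barrier)(void))
{
    assert(barrier != NULL);
    atomic_store_explicit(mine, 1, memory_order_relaxed);
    barrier();
    return atomic_load_explicit(theirs, memory_order_relaxed);
}
```

Two threads would call `dekker_side` with the flags swapped. With either barrier, on WB memory, both threads should not be able to read 0 from the other's flag.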
What is open for debate is whether mfence differs at all from lock-prefixed instructions when it comes to scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.
For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.
For example, quoting Dr. McCalpin in this thread (emphasis added):
The fence instruction is only needed to be absolutely sure that all of
the non-temporal stores are visible before a subsequent "ordinary"
store. The most obvious case where this matters is in a parallel
code, where the "barrier" at the end of a parallel region may include
an "ordinary" store. Without a fence, the processor might still have
modified data in the Write-Combining buffers, but pass through the
barrier and allow other processors to read "stale" copies of the
write-combined data. This scenario might also apply to a single
thread that is migrated by the OS from one core to another core (not
sure about this case).
I can't remember the detailed reasoning (not enough coffee yet this
morning), but the instruction you want to use after the non-temporal
stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the
SWDM, the MFENCE is the only fence instruction that prevents both
subsequent loads and subsequent stores from being executed ahead of
the completion of the fence. I am surprised that this is not
mentioned in Section 11.3.1, which tells you how important it is to
manually ensure coherence when using write-combining, but does not
tell you how to do it!
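The pattern described in the quote looks roughly like the following. It is a sketch only: `publish_with_nt_stores` and its parameters are names invented for this example, and the non-x86 branch is a placeholder so the code compiles elsewhere. The quote recommends MFENCE, so that is what is used here; SFENCE is also commonly used to order NT stores before a later ordinary store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Fill dst[0..n) using non-temporal stores, then publish a "ready" flag
   with an ordinary store. The fence keeps the flag from becoming visible
   while data is still sitting in the write-combining buffers. */
void publish_with_nt_stores(int32_t *dst, size_t n, int32_t value,
                            atomic_int *ready)
{
    assert(dst != NULL && ready != NULL);

    for (size_t i = 0; i < n; i++) {
#if defined(__x86_64__)
        _mm_stream_si32((int *)&dst[i], value);  /* MOVNTI: weakly-ordered store */
#else
        dst[i] = value;                          /* placeholder on other targets */
#endif
    }

#if defined(__x86_64__)
    _mm_mfence();  /* make the NT stores globally visible before continuing */
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif

    /* The "ordinary" store from the quote, e.g. part of a barrier. */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

A consumer on another core that observes `ready == 1` can then expect to see the filled buffer rather than stale data.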
Let's check out the referenced section 8.2.5 of the Intel SDM:
Strengthening or Weakening the Memory-Ordering Model
The Intel 64 and
IA-32 architectures provide several mechanisms for strengthening or
weakening the memory-ordering model to handle special programming
situations. These mechanisms include:
• The I/O instructions, locking
instructions, the LOCK prefix, and serializing instructions force
stronger ordering on the processor.
• The SFENCE instruction
(introduced to the IA-32 architecture in the Pentium III processor)
and the LFENCE and MFENCE instructions (introduced in the Pentium 4
processor) provide memory-ordering and serialization capabilities for
specific types of memory operations.
These mechanisms can be used as follows:
Memory mapped devices and
other I/O devices on the bus are often sensitive to the order of
writes to their I/O buffers. I/O instructions can be used to (the IN
and OUT instructions) impose strong write ordering on such accesses as
follows. Prior to executing an I/O instruction, the processor waits
for all previous instructions in the program to complete and for all
buffered writes to drain to memory. Only instruction fetch and page
tables walks can pass I/O instructions. Execution of subsequent
instructions do not begin until the processor determines that the I/O
instruction has been completed.
Synchronization mechanisms in multiple-processor systems may depend
upon a strong memory-ordering model. Here, a program can use a locking
instruction such as the XCHG instruction or the LOCK prefix to ensure
that a read-modify-write operation on memory is carried out
atomically. Locking operations typically operate like I/O operations
in that they wait for all previous instructions to complete and for
all buffered writes to drain to memory (see Section 8.1.2, “Bus
Locking”).
Program synchronization can also be carried out with
serializing instructions (see Section 8.3). These instructions are
typically used at critical procedure or task boundaries to force
completion of all previous instructions before a jump to a new section
of code or a context switch occurs. Like the I/O and locking
instructions, the processor waits until all previous instructions have
been completed and all buffered writes have been drained to memory
before executing the serializing instruction.
The SFENCE, LFENCE, and
MFENCE instructions provide a performance-efficient way of ensuring
load and store memory ordering between routines that produce
weakly-ordered results and routines that consume that data. The
functions of these instructions are as follows:
• SFENCE — Serializes
all store (write) operations that occurr