c++ - Benefits & drawbacks of as-needed conditional std::atomic_thread_fence acquire?
The code below shows two ways of acquiring shared state via an atomic flag. The reader thread calls poll1() or poll2() to check whether the writer has signaled the flag.
Poll option #1:

    bool poll1() {
        return (flag.load(std::memory_order_acquire) == 1);
    }

Poll option #2:

    bool poll2() {
        int snapshot = flag.load(std::memory_order_relaxed);
        if (snapshot == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }
Note that option #1 was presented in an earlier question, and option #2 is similar to example code at cppreference.com.
Assuming the reader agrees to examine the shared state only if a poll function returns true:

Are the two poll functions both correct and equivalent?

Does option #2 have a standard name?

What are the benefits and drawbacks of each option?

Is option #2 more efficient in practice? Could it possibly be less efficient?
Here is a full working example:

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    int x; // regular variable, could be a complex data structure

    std::atomic<int> flag { 0 };

    void writer_thread() {
        x = 42; // release the value x to the reader thread
        flag.store(1, std::memory_order_release);
    }

    bool poll1() {
        return (flag.load(std::memory_order_acquire) == 1);
    }

    bool poll2() {
        int snapshot = flag.load(std::memory_order_relaxed);
        if (snapshot == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    int main() {
        x = 0;
        std::thread t(writer_thread);
        // "reader thread" ...
        // sleep-wait is only for this test;
        // production code calls poll() at specific points
        while (!poll2()) // poll1() or poll2() here
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        std::cout << x << std::endl;
        t.join();
    }
I think I can answer most of these questions.
Both options are correct, but they are not quite equivalent, due to the broader applicability of stand-alone fences (they are equivalent in terms of what you want to accomplish here, but a stand-alone fence could technically apply to other operations as well -- imagine this code being inlined). An example of how a stand-alone fence differs from a store/fetch fence is explained in this post by Jeff Preshing.
The check-then-fence pattern in option #2 does not have a standard name as far as I know. It's not uncommon, though.
In terms of performance, with g++ 4.8.1 on x64 (Linux), the assembly generated for both options boils down to a single load instruction. That is hardly surprising given that x86(-64) loads and stores have acquire and release semantics at the hardware level anyway (x86 is known for its quite strong memory model).
For ARM, though, where memory barriers do compile down to actual individual instructions, the following output was produced (using gcc.godbolt.com with -O3 -DNDEBUG):
For while (!poll1());:

    .L25:
        ldr r0, [r2]
        movw r3, #:lower16:.LANCHOR0
        dmb sy
        movt r3, #:upper16:.LANCHOR0
        cmp r0, #1
        bne .L25
For while (!poll2());:

    .L29:
        ldr r0, [r2]
        movw r3, #:lower16:.LANCHOR0
        movt r3, #:upper16:.LANCHOR0
        cmp r0, #1
        bne .L29
        dmb sy
You can see the difference in where the synchronization instruction (dmb) is placed -- inside the loop for poll1, and after it for poll2. So poll2 is more efficient in this real-world case :-) (But read further on for why this might not matter when the calls sit in a loop that blocks until the flag changes.)
For ARM64, the output is different, because there are special load/store instructions with barriers built in (ldar -> load-acquire).
For while (!poll1());:

    .L16:
        ldar w0, [x1]
        cmp w0, 1
        bne .L16
For while (!poll2());:

    .L24:
        ldr w0, [x1]
        cmp w0, 1
        bne .L24
        dmb ishld
Again, poll2 leads to a loop with no barriers inside it and a single one outside, whereas poll1 incurs a barrier on each pass through the loop.
Now, determining which one is more performant would require running a benchmark, and unfortunately I don't have the setup for that. poll1 and poll2, counter-intuitively, may end up being equally efficient in this case: if the flag variable is one of the memory effects that needs to propagate anyway, then time spent waiting for those effects inside the loop may not be wasted (i.e. the total time taken until the loop exits may be the same even though the individual (inlined) calls to poll1 take longer than those to poll2). That assumes, of course, a loop waiting for the flag to change -- individual calls to poll1 do require more work than individual calls to poll2.
So, I think overall it's safe to say that poll2 should never be less efficient than poll1 and can sometimes be faster, as long as the compiler can eliminate the branch when the call is inlined (which seems to be the case on at least these three popular architectures).
My (slightly different) test code for reference:

    #include <atomic>
    #include <thread>
    #include <cstdio>

    int sharedState;
    std::atomic<int> flag(0);

    bool poll1() {
        return (flag.load(std::memory_order_acquire) == 1);
    }

    bool poll2() {
        int snapshot = flag.load(std::memory_order_relaxed);
        if (snapshot == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    void __attribute__((noinline)) threadFunc() {
        while (!poll2());
        std::printf("%d\n", sharedState);
    }

    int main(int argc, char** argv) {
        std::thread t(threadFunc);
        sharedState = argc;
        flag.store(1, std::memory_order_release);
        t.join();
        return 0;
    }