c++ - Benefits & drawbacks of as-needed conditional std::atomic_thread_fence acquire?
The code below shows two ways of acquiring shared state via an atomic flag. The reader thread calls poll1() or poll2() to check whether the writer has signaled the flag.
Poll option #1:

    bool poll1() {
        return (flag.load(std::memory_order_acquire) == 1);
    }

Poll option #2:

    bool poll2() {
        int snapshot = flag.load(std::memory_order_relaxed);
        if (snapshot == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }
Note that option #1 was presented in an earlier question, and option #2 is similar to example code at cppreference.com.
Assuming the reader agrees to examine the shared state only if a poll function returns true:

Are the two poll functions both correct and equivalent?

Does option #2 have a standard name?

What are the benefits and drawbacks of each option?

Is option #2 more efficient in practice? Could it possibly be less efficient?
Here is a full working example:

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    int x; // regular variable, could be a complex data structure

    std::atomic<int> flag { 0 };

    void writer_thread() {
        x = 42; // release the value x to the reader thread
        flag.store(1, std::memory_order_release);
    }

    bool poll1() {
        return (flag.load(std::memory_order_acquire) == 1);
    }

    bool poll2() {
        int snapshot = flag.load(std::memory_order_relaxed);
        if (snapshot == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    int main() {
        x = 0;
        std::thread t(writer_thread);
        // "reader thread" ...
        // sleep-wait is only for this test;
        // production code calls poll() at specific points
        while (!poll2()) // poll1() or poll2() here
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        std::cout << x << std::endl;
        t.join();
    }
I think I can answer most of these questions.
Both options are correct, but they are not quite equivalent, due to the broader applicability of stand-alone fences (they are equivalent in terms of what you want to accomplish here, but a stand-alone fence could technically apply to other operations as well -- imagine this code being inlined). An example of how a stand-alone fence differs from a store/fetch fence is explained in this post by Jeff Preshing.
The check-then-fence pattern in option #2 does not have a standard name as far as I know. It's not uncommon, though.
In terms of performance, with g++ 4.8.1 on x64 (Linux), the assembly generated for both options boils down to a single load instruction. That is hardly surprising given that x86(-64) loads and stores have acquire and release semantics at the hardware level anyway (x86 is known for its quite strong memory model).
For ARM, though, where memory barriers do compile down to actual individual instructions, the following output was produced (using gcc.godbolt.com with -O3 -DNDEBUG):
For while (!poll1());:

    .L25:
        ldr r0, [r2]
        movw r3, #:lower16:.LANCHOR0
        dmb sy
        movt r3, #:upper16:.LANCHOR0
        cmp r0, #1
        bne .L25
For while (!poll2());:

    .L29:
        ldr r0, [r2]
        movw r3, #:lower16:.LANCHOR0
        movt r3, #:upper16:.LANCHOR0
        cmp r0, #1
        bne .L29
        dmb sy
You can see the difference in where the synchronization instruction (dmb) is placed -- inside the loop for poll1, and after it for poll2. So poll2 is more efficient in this real-world case :-) (But read further on for why this might not matter when the calls sit in a loop that blocks until the flag changes.)
For ARM64, the output is different, because there are special load/store instructions with barriers built in (ldar -> load-acquire).
For while (!poll1());:

    .L16:
        ldar w0, [x1]
        cmp w0, 1
        bne .L16
For while (!poll2());:

    .L24:
        ldr w0, [x1]
        cmp w0, 1
        bne .L24
        dmb ishld
Again, poll2 leads to a loop with no barriers inside it and a single one outside, whereas poll1 incurs a barrier on each pass through the loop.
Now, determining which one is more performant would require running a benchmark, and unfortunately I don't have the setup for that. poll1 and poll2, counter-intuitively, may end up being equally efficient in this case: if the flag variable is one of the memory effects that needs to propagate anyway, then time spent waiting for those effects inside the loop may not be wasted (i.e. the total time taken until the loop exits may be the same even though the individual (inlined) calls to poll1 take longer than those to poll2). That assumes, of course, a loop waiting for the flag to change -- individual calls to poll1 do require more work than individual calls to poll2.
So, I think overall it's safe to say that poll2 should never be less efficient than poll1 and can sometimes be faster, as long as the compiler can eliminate the branch when the call is inlined (which seems to be the case on at least these three popular architectures).
My (slightly different) test code for reference:

    #include <atomic>
    #include <thread>
    #include <cstdio>

    int sharedState;
    std::atomic<int> flag(0);

    bool poll1() {
        return (flag.load(std::memory_order_acquire) == 1);
    }

    bool poll2() {
        int snapshot = flag.load(std::memory_order_relaxed);
        if (snapshot == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    void __attribute__((noinline)) threadFunc() {
        while (!poll2());
        std::printf("%d\n", sharedState);
    }

    int main(int argc, char** argv) {
        std::thread t(threadFunc);
        sharedState = argc;
        flag.store(1, std::memory_order_release);
        t.join();
        return 0;
    }