
This problem isn't specific to global variables; it happens with all shared mutable state. I would assume that the author only used global variables because that lets them keep the working examples as short as possible, and minimize irrelevant details.

And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Compared to using a register, accessing main memory typically takes something on the order of 100 times as long. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.



> And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.

Indeed.

Just a reminder to everyone: your pthread_mutex_lock() and pthread_mutex_unlock() functions already contain the appropriate compiler and cache memory barriers in the correct locations.

This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.

This is the stuff of cutting-edge research right now: it's a niche subject. Your typical programmer _SHOULD_ just stick a typical pthread_mutex_t onto an otherwise single-threaded data structure and call it a day. Locks work. They're not "the best", but "the best" is constantly being researched and developed right now. I'm pretty sure that any new lock-free data structure with decent performance is pretty much instant Ph.D. thesis material.

-----------

Anyway, the reason "single-threaded data structure behind a mutex" works is that your data structure keeps all of its performance benefits (from staying in L1 cache, or letting the compiler "manually cache" data in registers when appropriate), and you only lose performance on the lock() and unlock() calls (which innately include memory barriers to publish the results).

That's two memory barriers (one for lock() and one for unlock()). The thing about lock-free algorithms is that they __might__ get you down to __one__ memory barrier per operation if you're a really, really good programmer. But it's not exactly easy. (Or: they might still have two memory barriers, but the lock-free properties of guaranteed forward progress and/or deadlock freedom might be easier to prove.)

Writing a low-performance but otherwise correct lock-free algorithm isn't actually that hard. Writing a lock-free algorithm that beats your typical mutex + data structure, however, is devilishly hard.


> This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.

Actually, most practitioner code has bugs caused by implicit assumptions that writes to shared variables are visible or ordered the way the author thinks they are.


But the practitioner doesn't need to know the memory model (aside from "memory models are complicated").

To solve that problem, the practitioner only needs to know that "mutex.lock()" and "mutex.unlock()" orders reads/writes in a clearly defined manner. If the practitioner is wondering about the difference between load-acquire and load-relaxed, they've probably gone too deep.


> To solve that problem, the practitioner only needs to know that "mutex.lock()"

This is true, but they do not know that. If you do not give some kind of substantiation, they will shrug it off and go back to "nah, this thing doesn't need a mutex", as with a polling variable (a contrived example).


Can you explain what you mean by a "polling variable" needing a mutex? Usually polling is done using atomic instructions instead of a mutex. Are you referring to condition variables?


In a lot of code I've seen, there are threads polling some variable without using any sort of special guard. The assumption (based, I assume, on how you really could get away with this back in the days of single-core, single-CPU computers) is that you only need to worry about race conditions when writing to primitive variables, and that simply reading them is always safe.


Okay but the poster mentioned a mutex, which would not be a good way to go about polling a variable in Java. All you need to guarantee synchronization of primitive values in Java is the use of volatile [1]. If you need to compose atomic operations together, then you can use an atomic or a mutex, but it would not occur to me to use a mutex to perform a single atomic read or write on a variable in Java.

[1] https://docs.oracle.com/javase/specs/jls/se8/html/jls-8.html...


> All you need to guarantee synchronization of primitive values in Java is the use of volatile

I think I know what you mean, but that's a very dangerous way to word it when speaking in public. It would be more correct to say that "all you need to guarantee reads are protected by memory barriers is volatile."

The distinction matters because, to someone who doesn't already know all about volatile, the way you worded it might lead them to believe that `x++;` is an atomic statement if x is volatile, which is not true. That's a specific example of where things like atomic types are necessary.

(For the curious: https://www.baeldung.com/java-atomic-variables)

I think maybe what you're missing about what I'm saying is that I'm trying to mainly talk for the benefit of people who don't have a solid understanding of how to do safe and performant multithreading. Which is the vast majority of programmers. For that sort of audience, I tend to agree with dragontamer that "just use a mutex" is probably the safest advice to start out. Producing results faster doesn't count for much if you're producing wrong results faster.


Java is somewhat cheating, because it got its memory model figured out years before other languages like C or C++.

In C++, you had to use OS- and compiler-specific routines like InterlockedIncrement64 to get guarantees about when or how it was safe to read and write shared variables.

Not anymore of course: C++11 provides us with atomic-load and atomic-store routines with the proper acquire / release barriers (and seq-cst default access very similar to Java's volatile semantics).

-----------

Anyway, put yourself in the mindset of a 2009-era C++ programmer for a sec. InterlockedIncrement works on Windows but not on Linux. You had atomics on GCC, but they didn't work the same as Visual Studio's atomics.

Answer: mutex lock and mutex unlock. And then condition variables for polling. Yeah, it's slower than InterlockedIncrement / seq-cst atomic variables with proper memory barriers, but it works and is reasonably portable. (Well, CriticalSections on Windows, because I never found a good pthreads library for Windows.)

------

It's still relevant because you still see these threading issues come up in old C++ code.


I don't understand the relevance of your point. The point I originally asked for clarification about was the use of a mutex for a "polling variable".

Java has had volatile variables since the year 2000; I don't see how it's "cheating" that Java provided a standardized way of accessing a synchronized value before C and C++ did. Can you elaborate on your point that it's cheating?

In C and C++, for 10 years now, there has been a standard library providing atomic data types and atomic instructions. Prior to standardization, one used platform-specific atomic facilities. Boost has provided cross-platform atomic operations that work on virtually every platform since 2002. Prior to 2002 there were no multicore x86 processors. There would have been mainframe computers that were multicore; is it your argument that code written for those mainframes is of relevant use today to fairly typical C and C++ developers?

At any rate, at no point did any of Java, C, or C++ require the use of a mutex in order to properly synchronize access to a "polling variable". Atomic operations were widely available to all three languages in various ways and would have been the preferred method.


While I agree with your general point, there were multiprocessor x86 systems well before 2002. Dual- and four-socket systems were relatively common, and the likes of SGI and HP would have been happy to sell you x86 systems with even higher socket counts.



