Fiber in C++: Understanding the Basics (agraphicsguynotes.com)
135 points by ibobev on Sept 24, 2023 | hide | past | favorite | 75 comments


I'm confused about how this article defines "fiber." At the beginning of the article, fibers are described as lightweight userspace threads, similar to what is provided in the runtimes of Haskell and Go, and what has recently been revived in Java as "virtual threads."

However, later in the article (after a long and apparently irrelevant digression about stack management), they describe Windows API functions that are required for using fibers and seem to suggest that fibers are OS-level, and not application-managed.

Am I missing something?

[edit: typo]


The Windows fiber library is just that, a library. It's not an OS component. It's provided because stack switching in Windows is a somewhat fraught [1] and not officially documented procedure which requires modifying the TIB [2].

Doing it correctly isn't hard, and has been done many times in bullet-proof libs (notably boost::context), but MS would maybe rather you not.

[1]: https://devblogs.microsoft.com/oldnewthing/20080215-00/?p=23...

[2]: https://en.wikipedia.org/wiki/Win32_Thread_Information_Block


It is an OS component, it's provided in kernel32.dll and accesses internal OS data structures that are not publicly defined or stable between versions. You cannot safely implement fibers in Windows without the fibers API, period. Boost.Context's direct Windows context switching code hacks undocumented fields in the TIB: https://github.com/boostorg/context/blob/6fa6d5c50d120e69b2d...

...and this causes problems, because it can't guarantee that all fields are initialized or switched successfully: https://lists.boost.org/boost-bugs/2014/10/38476.php

Microsoft continually adds and changes fields in the TIB with each new release of Windows. Attempting to implement fibers manually is a ticking time bomb that should never be used in production.


That bug was caused by the TLS slots pointer being uninitialized in the newly created Boost fiber, not by a change in the Windows API. TLS pre-dates boost::context; the field had existed since the very first commit.

I already said "MS would rather you not", obviously, otherwise they would have documented the TIB.

The fact remains that tons of production systems rely on officially-unofficial elements of NT's architecture and this is one such element that is heavily relied on by everyone who uses boost stackful coroutines. Hyrum's Law in action.


The bug was caused by Boost.Context manipulating OS data structure fields that it has no business changing.

And yes, those fields do now have to be maintained for backwards compatibility -- because libraries like Boost.Context hardcoded this into applications, unbeknownst to the users of that library.


"Fibers", "green threads", "stack switching", "cooperative multitasking" are essentially all the same thing, they all rely on being able to switch execution context by switching to a different stack within the same OS thread. As such they can be implemented either in (CPU/OS-specific) user code or in "OS APIs" (like the Windows Fiber functions).
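The mechanism is small enough to sketch. Here's a hedged, illustrative example (our own, not from the article) of switching execution context by switching stacks within one OS thread, using the POSIX ucontext API (deprecated on some platforms, but fine for illustration):

```cpp
// Two execution contexts sharing one OS thread, switched in user space.
#include <ucontext.h>
#include <string>
#include <vector>

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];          // the fiber's own stack
static std::vector<std::string> trace;

static void fiber_body() {
    trace.push_back("fiber: step 1");
    swapcontext(&fiber_ctx, &main_ctx);      // cooperative yield back to main
    trace.push_back("fiber: step 2");
}   // returning resumes uc_link (main_ctx)

std::vector<std::string> run_fiber_demo() {
    trace.clear();
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;           // where control goes when fiber_body returns
    makecontext(&fiber_ctx, fiber_body, 0);

    trace.push_back("main: start");
    swapcontext(&main_ctx, &fiber_ctx);      // run fiber until it yields
    trace.push_back("main: fiber yielded");
    swapcontext(&main_ctx, &fiber_ctx);      // resume fiber to completion
    trace.push_back("main: fiber done");
    return trace;
}
```

Libraries like boost::context (or the Windows fiber API) do the same register/stack swap with hand-written assembly instead of ucontext, which is why they are faster.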

The only downside of the technique is that it cannot be implemented in WASM (and maybe some other esoteric runtime environments), because WASM has separate data and call stacks and the call stack is not accessible from within the WASM virtual machine (while 'async-await', which relies on code transformation done by the compiler, can be implemented in WASM just fine).

There is a 'stack-switching' proposal for WASM though, but I don't know what the state of that is:

https://github.com/WebAssembly/stack-switching


I've always been curious why WASM has the program flow restrictions it has. Most "ASM" just lets you set whatever register.

There isn't much sandboxing value (WASM shouldn't be able to touch addresses outside the sandbox anyways). Was the reason ease of compilation/optimization for arbitrary architectures? Easier to run inside an interpreter?

The default idea for WASM feels like it should have been more like RISC-V (but with enough wiggle-room to allow easy JITing to x86/ARM).

Or are most of WASM's quirks just because it has ASM.js in its lineage?


Having a native stack switch for WASM would be nice but it's not necessary. Map all memory segments as shared into several instances of a single module -- each instance is a fiber.

Implementation is left as an exercise for the reader.


This doesn't address the shadow stack at all. It doesn't work.


Each instance has its own stack. Switching between them is cooperative on a single thread. What am I missing?


WebAssembly stores stack data - parameters, return addresses, local variables - in a "shadow stack" that does not live in the native heap your application has access to. There's no way for you to switch it.


Each instantiated module has its own complete function call stack. A call out of the module, to a bit of control code in JS, and then into a different instance of the same module, with the same memory mappings, would look (to the program itself) like a stack switch, no?

I guess it raises the philosophical question of what is a stack switch, anyway? If you end up cooperatively running multiple "green threads" within a single "OS thread" does it matter that what you actually switched was the entire execution environment?


Setting aside certain implementation details, here's the basics:

Kernel Threads (aka threads): OS-Level and preemptive.

Green Threads (aka lightweight threads, virtual threads or goroutines): User-level and preemptive.

Fibers: User-level (sometimes OS-level or offered as a OS library, like in Windows) and cooperative.

Coroutines: User-level and cooperative.


The devil's in the implementation details though. The operating system may provide fully-preemptive threads on which user-space cooperative threads are scheduled (switching on system calls). At that point green threads (e.g. threading implemented by a virtual machine on top of the operating system) do not provide the same value. This is actually why green threads were effectively pulled out of Java for a few decades.

Fibers are distinct in that they have no scheduling and are code being run on a thread - if the thread is preempted, it will resume on that same thread. Unlike cooperative threading, they must explicitly yield.

Coroutines have such varying implementations that you would need to define requirements to know if they count as fibers or not - for instance, whether you mandate a C-compatible stack.


There's one thing missing here, whereas fibers are cooperative, they have a user space scheduler deciding which fiber to switch to on a thread when you yield, coroutines specify who they're yielding to.
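To make that distinction concrete, here's a hypothetical sketch of such a user-space scheduler (our own illustration, built on POSIX ucontext): fibers call a bare yield() with no target, and a naive round-robin loop picks the next one.

```cpp
// Round-robin fiber scheduler sketch: a fiber yields without naming a
// successor; the scheduler decides which fiber runs next.
#include <ucontext.h>
#include <deque>
#include <utility>
#include <vector>

static ucontext_t sched_ctx;
static ucontext_t* current = nullptr;
static std::vector<std::pair<int, int>> log_;   // (fiber id, iteration)

static void yield() { swapcontext(current, &sched_ctx); }  // fiber -> scheduler

static void worker(int id) {
    for (int i = 0; i < 2; ++i) {
        log_.push_back({id, i});
        yield();
    }
}

std::vector<std::pair<int, int>> run_scheduler_demo() {
    log_.clear();
    static ucontext_t f[2];
    static char stacks[2][64 * 1024];
    std::deque<ucontext_t*> run_queue;
    for (int i = 0; i < 2; ++i) {
        getcontext(&f[i]);
        f[i].uc_stack.ss_sp = stacks[i];
        f[i].uc_stack.ss_size = sizeof stacks[i];
        f[i].uc_link = &sched_ctx;               // on return, fall back to the scheduler
        makecontext(&f[i], reinterpret_cast<void (*)()>(worker), 1, i);
        run_queue.push_back(&f[i]);
    }
    // Naive scheduler loop: 2 fibers * (2 yields + 1 return) = 6 resumes.
    // A real scheduler would track which fibers have finished.
    for (int n = 0; n < 6; ++n) {
        current = run_queue.front(); run_queue.pop_front();
        swapcontext(&sched_ctx, current);
        run_queue.push_back(current);
    }
    return log_;
}
```

Replace the scheduler loop with a direct swapcontext to a named fiber and you get the coroutine-style "yield to X" transfer instead.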


As per the article, Windows fibers literally have a SwitchToFiber call and come with no scheduler. So there is really no difference.


You'll notice the function's documentation is "Schedules a fiber. The function must be called on a fiber." So while it may not be a literal difference (after all, they're both semi-user space), they are different systems to use, but they're both roughly at the same level of "power".

If you're not using a scheduler with fibers then I'd argue you shouldn't be calling it a fibers system (and rather coroutines), since otherwise it's a distinction without a difference.


Windows has its own fibers facility which does some things for you but still leaves the scheduling to the application so it's still fibers. It's a little confusing in the article but they look at the Windows API as a kind of API template - what do you need to implement to have fibers. Then they describe their own implementation.


My general understanding is that a fiber is a lightweight thread, and conceptually can be implemented inside any program that wants to schedule its own fibers.

They’re distinct from CPU threads.

When I last dived into fibers I discovered windows was trying to make them a thing quite a while ago which is why we have things like [0].

Fibers being virtual threads, are recursive. In the same way you can run a VM inside a VM, you could build fibers on top of fibers (which sounds like a recipe for unpredictable performance).

[0] https://learn.microsoft.com/en-us/windows/win32/procthread/u...


Userspace != Application-managed

On Windows, many things exist in user space that are managed by the OS. This is in contrast to Linux, where the kernel is its own thing with its own stable API.

Fibers are an OS-level concept that exists in user-space.


Fiber is a concept in the Windows API for user-space threads.

It was designed for MS SQL Server. It has a very confusing API and did not gain much use in 3rd-party applications.


Personally if I had a choice of just one I’d pick the threads over fibers if the threads facility is really high performance like the one in Java.

Today you have quad cores even on a cheap phone, 16 on a desktop, and 60+ on a server, and the OS and some runtimes make it pretty easy to get large speedups on many tasks even when subtasks are closely coordinated. (One thing I like about Java is that it has low-level thread primitives that really scale, as opposed to many systems that have a small set of low-level primitives which in theory let you do everything but don't scale.)

Contrast that to fibers and similar things (JS and Python async) that do a great job of keeping a CPU busy when you are waiting for network activity but are limited to one CPU.


Some random historical notes on coroutine stuff which are perhaps of interest to some:

1. You can implement "stack-free coroutines" in C with some really entertaining macros https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

2. Does anyone remember Cilk? It's still the fastest stackful coroutine model I know of.
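For anyone curious what the macros in [1] boil down to, here's a minimal hedged sketch of the switch-on-__LINE__ trick (simplified from Tatham's article; one limitation: at most one crReturn per source line):

```cpp
// The "coroutine" stores its resume point in a static int; each call
// re-enters via the switch and jumps straight to the last crReturn.
#include <cassert>

#define crBegin     static int state = 0; switch (state) { case 0:
#define crReturn(x) do { state = __LINE__; return (x); case __LINE__:; } while (0)
#define crFinish    }

// Yields 0, 1, 2 across successive calls, then -1 when exhausted.
int counter() {
    static int i;          // locals must be static: the stack is not preserved
    crBegin;
    for (i = 0; i < 3; ++i)
        crReturn(i);
    crFinish;
    return -1;
}
```

This is the "stack-free" half of the design space: no stack switching at all, just a resumable state machine, which is conceptually what async/await compilers generate for you.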

Having implemented and used both before (for different purposes), I like both, but with a mild preference for the stackful version. Largely because you don't have function coloring problems, and you can actually get a proper stack trace, so it's significantly easier to debug when things go wrong.

It is also good to differentiate coroutines from async (they seem to be very interleaved these days). Coroutines are a mechanic to achieve mutual recursion / generators. And that is fine for expressing certain algorithms and systems in a much cleaner fashion. Note that parallelism is not necessarily implied by "coroutine".

Async is a mechanic for "doing something else" while waiting for IO. Fibers or state machines are both different solutions to this problem. Certain coroutine implementations can help with this, but I think it is massively overused. Rust, due to the function coloring problem, ends up requiring almost everything to be marked "async". And some golang code I have read seems to overuse goroutines unnecessarily. I think use of async should be narrow, minimized, and localized to only the places where it makes sense.


> Does anyone remembers Cilk?

Yes! It's a shame it never gained traction and Intel abandoned it, eventually getting dropped from icc and gcc.


I believe that fibers are considered more trouble than they're worth? [0]

[0] https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=10...


The Gor Nishanov paper linked in that piece goes into a lot of detail of some of the downsides of fibers in an unmanaged language like C++.

[0] http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136...


Thanks for the link to Gor's paper.

Apparently the table in the 2018 paper only covered the top 10 languages in the TIOBE index; I wish it could be updated and extended to the top 50.

Another popular compiled language with GC (by default), namely the D language (TIOBE rank 37 as of Sept 2023), does support fibers, and they are used in its vibe.d web framework [1],[2].

Gor also mentioned a significant 160 ns overhead for a stackful fiber in Go (goroutine) to interact with a C library, i.e. the cost of switching between Go's goroutines and C threads (as of 2018). It would be interesting to know the fiber overhead in the D language, since C compilation is now natively supported by the D compiler [3].

[1] Fibers in Programming in D:

https://ddili.org/ders/d.en/fibers.html

[2] Fibers, what for:

https://forum.dlang.org/thread/eowfajvnzqnncmfnjhnf@forum.dl...

[3] Adding ANSI C11 C compiler to D so it can import and compile C files directly:

https://news.ycombinator.com/item?id=27102584


Since coroutines have been in the standard since C++20, I wish he would have explained better what are the advantages of using a Fiber over a Coroutine.


Unfortunately in 2023 c++20 coroutines are just barely starting to become usable. Compilers have bugs, especially with symmetric transfer, e.g. msvc doesn't suspend until await_suspend returns (which causes races and double-frees), gcc stops using tail calls in some modes (which can cause stack overflow), I ended up emulating symmetric transfer because you can't rely on it, and emulating it is of course slower. Compilers spill local variables from await_suspend to coroutine frames, which again causes races, which necessitates workarounds like marking await_suspend noinline (in a non-portable way). Having noinline anything in the code path disables heap allocation elision, so all coroutines start to allocate from the heap. When comparing a function call to a coroutine call there's a vast difference in performance. And you need to use coroutines all the way down, which really adds up.

Fibers have their own problems: you need to allocate stacks (and that may be expensive), you can't switch fibers between system threads (compilers cache thread local addresses, unfortunately with x86 linux tls often uses fs: segment addressing, so it may appear to work until it doesn't, and weirdly this problem existed in c++20 coroutines too until recent clang versions), widely used open source implementations (e.g. boost context / boost coroutine / boost fiber) are not even exception safe, because you need to save/restore exception globals when switching fibers, and nobody seems to care. :-/

One huge upside to fibers is that a function call is just a normal function call, and it's fast as a result. I wish C++ had taken fibers more seriously and added steps for making them safe to use (a portable way to handle global state, a portable way to mark switching functions as invalidating TLS addresses, etc.); we could have something similar to Java virtual threads or Go goroutines then.


> exception safe, because you need to save/restore exception globals when switching fibers,

Do you have more details about this issue? What exception globals? And for which ABI? You mean some sort of current pending exception when switching during unwinding?

[I'm a huge fan of stackful coroutines and I wish were blessed by the standard]


The C++ ABI has some per-thread globals: the number that is returned from std::uncaught_exceptions(), and the chain of currently caught exceptions. For example in llvm this is available with a __cxa_get_globals call:

https://github.com/llvm/llvm-project/blob/b05f1d93469fbd6451...

These need to be saved/restored when switching fibers, otherwise fiber switches from catch clauses (and destructors!) are unsafe, throw without argument may rethrow incorrect exception, code that commits/rollbacks based on uncaught exceptions counter will not work correctly, etc.

One example I know where this save/restore is implemented is the userver framework, but it seems to be unexpectedly rare in fiber implementations last time I looked.


Thanks. It is unsurprising that most libraries punt on that. It seems that saving/restoring this state could easily dominate the cost of the switch itself. IIRC on gcc the current exception is protected by a global mutex, at least until recently.

It is another reason for having this built into the language/standard library so that it can be implemented optimally.


Yes, in my (limited) use case with coroutines for a recursive descent generator I got it working by following this tutorial:

https://www.scs.stanford.edu/~dm/blog/c++-coroutines.html

but I was unpleasantly surprised by how much extra code it required to make it run.

The Tiny Fiber library in the article looks pretty hacky right now and the amount of x64/arm specific code might cause trouble with portability, but the concept itself is intriguing. Thank you for the details.


Advantage: No function colouring. Just like with threads, you don't need to declare which functions might potentially yield. There's none of that noisy async/await syntax.

Advantage: Fast switching. Switching between fibers is done by swapping in new values for the program counter and stack pointer. Compare with async, where yielding only takes you up one level of stack, so your code ends up walking a stack of coroutine invocations to get all the way back to the scheduler.

Disadvantage: Expensive setup - each fiber needs its own pre-allocated stack.

Disadvantage: Operating systems don't provide I/O systems readymade to work with fiber schedulers, so if you want to yield to a different fiber during blocking I/O, then you need to invent your own system to handle that.

PS about naming: "Fiber" is the Microsoft Windows name for their implementation of stackful coroutines, but the proper platform-independent name for it is actually "coroutine". I've kept with the word "fiber" here to avoid misunderstandings.

PS about C++: This is based on async like you find it in C# or Python, I don't really know about C++ coroutines.


A disadvantage of the ‘no function coloring’ in fibers is that it makes lockless programming harder. A nested function call can switch from under you without your knowledge, making it hard to know where the preemption points are and whether to take locks when making updates to shared state. With function coloring you immediately see whether a function might switch as there’s a different calling syntax.

I’ve programmed both fiber based systems and coroutines. I even created my own fiber libraries for Python (https://github.com/geertj/gruvi) and C++ (https://github.com/geertj/cgreenlet, mostly an experiment, and incorrectly named coroutines for C++ while it’s really fibers). In the Python version I experimented with some features to help you know whether a nested function might switch.

In the end, for me and for the problem domains I worked in, the explicit async/await co-routine style wins over fibers. It gives you most of the performance benefits of user mode switching, all of the memory benefits, while keeping your code mostly lock free.


That is where an Actor or CSP style comes in useful: one then avoids the locks by having that state mutated in the context of one coroutine alone.

Most problems can be recast into that form; only a small subset are better handled by mutable shared state (and hence explicit locks). In those cases one should probably be modelling the updates as 'Compare and Swap' / 'Compare and Set' types of mutable operation, hence avoiding the situation where unexpected preemption can occur while holding a lock.

From my perspective, having use both forms (manually and with language support), I generally find the coroutine (plus Actor/CSP) scheme to be easier to work with and reason about.


[I assume you are using lock free in a loose way, as lacking explicit mutual exclusion].

In practice in most real world applications in C++, absence of async calls is not enough to guard against re-entrancy. Shared state might still be mutated by callbacks, by recursively re-entering the event loop or by other threads. So you need to either document explicitly all the assumptions you are making and vet every single function you are calling in your implicit critical section, or you need some other mechanism that is going to look like a critical section anyway.


Yeah I was using the term lockless loosely here, not referring to atomics.

I agree that using explicitly switched coroutines is not sufficient for mostly “lockless” programming (but I do think it’s necessary). I was coming from a thread per core perspective where you run one event loop per core and limit cross core communications (like in the Seastar framework). With this you should be able to perform most state manipulations without locks.


> A disadvantage of the ‘no function coloring’ in fibers is that it makes lockless programming harder.

But if you treat fiber-style coroutines like threads, and use locks and message queues exactly the same as you would with threads (although differently implemented, of course), then coroutines are basically better-behaved, and nicely deterministic, threads. It's a wonderful programming model, with the one big drawback that you can't take advantage of multiple cores.


C++ coroutines are based on the C# async model, as they were originally proposed by Microsoft. They even use similar concepts regarding the compiler magic of Awaitable types, with special methods recognised by the compiler as a means to make an awaitable type out of any kind of data structure.


That's what I thought, thanks for confirming.


Fun fact, what you call async is also widely referred to as coroutines. Stackless coroutines vs stackful coroutines seems like the most appropriate way to disambiguate the two.


Thanks, stackful/stackless is noted for next time. I still kinda like using "async" as a shorthand for stackless coroutines though.


Before async for I/O became popular, stackless coroutines were mainly used for generator abstractions (for example yield in python), so async can be reductive.


And that's exactly why I would say "async", because then everyone will know that I'm talking about something like async in Python/C#, and not something like Python generators.

Technically, they may well be almost the same thing, but practically, they're used very differently.


Just a note on naming: fibers, goroutines, green threads, virtual threads and the like are stackful coroutines (miniature threads, essentially). Then there are stackless coroutines, which are state machine objects that have function coloring (async/await).


I see. Seems pretty similar conceptually to C++ coroutines.

Fast switching between fibers is mainly what got me interested, of course. Thank you for the explanations.


Thank you for this in-depth article.

I am less than a C++ beginner, but I asked Stack Overflow how to run C++ coroutines in a thread pool. It seems coroutine support around C++20 was still in flux (don't quote me on that), but I did get some source code written against older drafts of the C++20 standard.

I used Marce Coll's excellent blog post about how to create coroutines in assembly by adjusting the RSP register.

https://blog.dziban.net/posts/coroutines/

I extended Marce's code to run the coroutines in kernel threads:

https://github.com/samsquire/assembly (see threadedcoroutines.S)

I have been thinking of coroutines in terms of query compilation for database engines and the volcano query model and this article:

https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

Tying together two pieces of code that call each other in a push- or pull-driven style is really powerful. Or if you're running multiple independent tasks that need their own state. This, as I understand it, is the original intent of object orientation that Alan Kay wanted, and is represented by Erlang and partly Go.

Specifically, I am thinking of compiler created coroutines where code can be interleaved at compile time rather than at runtime.


Somewhat related to Fibres are the Trails of synchronous reactive programming languages. Both allow efficient logical concurrency based on cooperative scheduling. The nice thing with Trails is that the scheduling can be determined by the compiler by extracting dependencies from the synchronous reactive program thus increasing determinism of your app.


Remind me again why no OS other than Solaris ever provided "scheduler activations" (i.e. the kernel calls back into a user-space scheduler whenever a kernel-level scheduling event occurs)?



Windows had a custom scheduler API (https://learn.microsoft.com/en-us/windows/win32/procthread/u...) but they removed it, presumably because it wasn't used much


As the Wikipedia page shows, NetBSD had an implementation but later removed it. There isn't really enough information to make a sound judgement on scheduler activations. Although if I had to guess, there might be too much overhead (maintenance of code and/or actual performance) for something that is rarely used. It might make more sense now that we're moving towards user-level concurrency everywhere (async/await, goroutines, virtual threads, etc.).


Basically yes. M:N just wasn't a good fit for the way applications used (and use) pthreads. For those that need pure CPU parallelism M:N is just useless overhead, while highly concurrent applications were either using one process per request for simplicity and robustness, or directly building on top of poll/epoll/kqueue for performance.

So, at least at that time, M:N and scheduler activation was a solution in search of a problem. It might still see a resurgence on one way or another.


But that's not what SA was about.

Even with an N:N architecture, if one kernel thread blocks (say, on I/O), the point is for the user-space scheduler to decide which thread should run next. The point was that the kernel scheduler cannot/should not understand the application's thread scheduling requirements, and that when a kernel thread allocated to the task blocks, only user space can (correctly) decide what to do.


> Fiber is quite a lightweight thread of execution. Like coroutine, fiber allows yielding at any point inside it. To some degree, we can regard fiber as a form of stackful coroutine [...]

Can someone explain what the "stackful coroutine" term means (ideally with an example)? And how can one implement it?


It's exactly what it says, it's a coroutine with its own stack.

A fiber isn't to some degree a stackful coroutine, it is a stackful coroutine.

A stackless coroutine does not have its own stack; it uses the calling context's stack and therefore can only yield to the calling context from the top-level coroutine function. A stackful coroutine has its own stack, and so can yield to the calling context from anywhere.

As far as how to implement, I always liked Malte Skarupke's blog post on the subject: https://probablydance.com/2013/02/20/handmade-coroutines-for...


I assume it's this:

In Python, coroutines cannot yield from within another function call: if coroutine A calls function B, B cannot yield. In Lua, it's possible to yield a coroutine at any function depth: B can yield.

To implement something like this in a compiled language, you just need multiple stacks instead of the usual 1. Most architectures, such as x86, have a stack pointer register; when yielding a coroutine, just change the sp register to point to some other stack -- when resuming, restore the sp.

This is not a new concept. Pokémon on the Game Boy did this for its UI fiber, for example.


> In Python, coroutines cannot yield from within another function call: if coroutine A calls function B, B cannot yield.

I know you likely know this, but just for context: this is technically correct of course, B cannot yield, but that's not how you would use it in real life. If B needs to yield because it calls C, which needs to yield, then B needs to be a coroutine itself. This allows for a 'nested yield' where A, B, and C are all paused by yielding to their direct parent. Python made this explicit before async/await, as you had to use "yield from" instead of just yield. With async/await that became "await".

This does lead to what’s typically mentioned as the biggest disadvantage of stackless coroutines which is that you get parallel sets of functions, one async and one sync. This mostly means that IO and networking libraries end up being either sync or async but not both.


    Some readers may have a question by now. What is the rationale behind the choices of the registers that need to be stored?

Yet another article that goes into the gory details of how fibers work but not how fibers are used.

How do we write the "Applications own scheduler"?


I’ve plenty of Classic Mac experience with cooperative threads. You’d basically yield to run another “thread”.

Can someone tell me the difference from those to fibres?


Still looking for a lib that supports fibers... boost::asio is... well, boost, and therefore annoying. It also does not support it out of the box


Boost asio is available as a standalone lib as well. What do you mean it doesn't support "it" out of the box?


Asio has no coroutine implementation and C++20 Coros are stackless. So, to build fibers with asio, you need boost::coroutine for stackful coros.



Asio has strands which may be an alternative depending on the use case.


Why are more and more languages embracing Fibers / Green Threads now? I'm curious.


No reasons, probably. No, really. If you flip a coin 10 times, 3-7 or 7-3 is a perfectly OK outcome, and our n (the more and more languages) is not even 10.


pushback against the stackless coroutine model mostly (i.e. async/await).

But people have been playing with stackful coroutines in C++ for decades.


Again, academia and the tech press neglect to talk about Actors, and we live in a world of poorly implemented pretend-Actor subsets, without any of the true Actor features (work stealing, message passing, etc., and best of all, no contention, since there are no shared resources other than Actor references).

Fibres, goroutines, promises, futures, job queues, etc, are poor substitutes for Actors.


> Fibres, goroutines, promises, futures, job queues, etc, are poor substitutes for Actors.

They are not necessarily substitutes, they are lower level primitives that allow more complex models, like Actors, to be implemented.

All execution models have trade offs; and sometimes we might prefer an execution model without deadlock, livelock and starvation. So for a general purpose language, a complex model like Actors is ideally a library and not enforced as the only way of achieving concurrency.


Perfect is the enemy of good.

Just like with FP and LP, I rather have something close enough, than nothing at all.


Having no shared resources can be a performance problem; I can see why there are Rust's and Vale's efforts to come to a slightly different tradeoff.


Pony/Ponylang claims that they can do actors fast and safe (even safer than rust). There is a type system that allows one to share data between actors safely, or so they claim.


I believe they use a GC, so they may be able to leverage the runtime to help out. If it's all type system-based, the compiler probably has to be stricter to ensure safety (Rice's theorem implies that compilers err on the side of rejecting some valid programs in order to reject all invalid programs). Maybe Pony could be a blend of Erlang and Rust?



