I think two things are interesting here. 1: "when
does Erlang GC a process' heap?" and 2: "where does
Erlang keep a process' data?".
1: Erlang GCs a process' heap whenever that process'
heap gets full, or when you call
erlang:garbage_collect() explicitly.
2: Erlang stores most data associated with a process
on the process heap; there's one such heap per
Erlang process. Binaries are a special case: if
they're large (> 64 octets), the heap only
contains a reference to the binary, while the
binary itself is stored in an area reserved for
binaries.
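A quick sketch of point 2 (the 64-byte threshold and the shape of
the process_info output are implementation details, so treat this
as an illustration, not a guarantee):

```erlang
-module(bin_demo).
-export([show/0]).

show() ->
    _Small = binary:copy(<<0>>, 64),   %% <= 64 bytes: heap binary, lives on the process heap
    Large  = binary:copy(<<0>>, 65),   %% > 64 bytes: refc binary, stored off-heap
    %% process_info/2 lists the off-heap (refc) binaries this process
    %% references; each entry is {Id, ByteSize, RefCount}. The small
    %% heap binary will not appear in this list.
    {binary, Refs} = erlang:process_info(self(), binary),
    {byte_size(Large), Refs}.
```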
1+2 can result in a lot of binary garbage left lying
around. Here's how it can happen:
a) A process creates a relatively large amount of
unused heap space. This could happen by
temporarily using a large amount of heap, e.g. by
calling binary_to_list() on a large-ish binary,
doing something with the list then dropping
it. For argument's sake, let's say we made a heap
(for one Erlang process) with 30M free.
b) Now the process moves on to a new phase: creating
large but short-lived binaries. Let's say 1M
each. Those binaries don't live on the heap; only
a reference to them does. So they only consume 8
(?) octets on the heap.
c) If the references are the only thing using the
heap, then you can make 4M of them before filling
the process heap. But since they're 1M each,
they'll eat 4T of the binary area, i.e. you'll
run out of memory long before a GC is triggered.
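The a/b/c scenario can be condensed into a hypothetical loop like
this one: each iteration allocates a 1M off-heap binary but leaves
only a small ProcBin on the process heap, so the heap fills very
slowly and the GC that would release the binaries rarely runs.

```erlang
-module(bin_leak).
-export([churn/1]).

churn(0) ->
    ok;
churn(N) ->
    _Big = binary:copy(<<0>>, 1024 * 1024),   %% 1M refc binary, stored off-heap
    %% _Big is garbage after this iteration, but the off-heap binary is
    %% only released once this process is actually garbage collected.
    churn(N - 1).
```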
As you found out, setting 'fullsweep_after' to 0
doesn't help, since a GC is never triggered. But
explicitly calling erlang:garbage_collect() does.
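A sketch of that workaround (the counter and the threshold of 1000
are made up for illustration):

```erlang
%% Force a collection every 1000 iterations. garbage_collect/0 runs a
%% full collection of the calling process, which drops dead ProcBins
%% and decrements the refcounts of the binaries they point to.
maybe_gc(N) when N rem 1000 =:= 0 ->
    true = erlang:garbage_collect(),
    ok;
maybe_gc(_N) ->
    ok.
```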
You can investigate a bit more using
tracing. erlang:trace/3 can generate a message
whenever the target process is GCed (the
garbage_collection entry in flaglist). If my guess
is right, then you should see that your processes
holding all the binaries are never (or rarely)
GCed. Tracing the GC is cheap; you won't notice any
performance difference if you only do it on a few
processes.
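Something like this, assuming Pid is bound to one of the suspect
processes (the exact trace message tags vary by OTP release, e.g.
gc_minor_start/gc_major_start in newer releases, gc_start in older
ones):

```erlang
%% Ask the runtime to send us a trace message on every GC of Pid.
1 = erlang:trace(Pid, true, [garbage_collection]),
receive
    {trace, Pid, GcEvent, Info} ->
        io:format("GC event ~p: ~p~n", [GcEvent, Info])
after 5000 ->
    io:format("no GC within 5s - consistent with the theory~n")
end.
```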
The process_info BIF can also tell you quite a bit,
e.g. the process heap size.
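For example (Pid again assumed bound to the process under
suspicion):

```erlang
{heap_size, HeapWords} = erlang:process_info(Pid, heap_size),
{binary, Bins}         = erlang:process_info(Pid, binary),
%% Total bytes of off-heap binaries this process keeps alive:
OffHeapBytes = lists:sum([Size || {_Id, Size, _RefCount} <- Bins]),
io:format("heap: ~p words, pinned binaries: ~p bytes~n",
          [HeapWords, OffHeapBytes]).
```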
Disclaimer #1: my knowledge about the details above may
be wrong or out of date. But the mechanism is known.
I've seen it in embedded systems and handled it in
ways similar to your approach.
Disclaimer #2: obviously I'm taking a guess. It's possible
you've run into something else entirely. That's why I
suggested some things to look at to confirm or reject my
suspicion.
Might this be considered a symptom of an architecture problem? I'm far from an Erlang expert, but my understanding is that processes generally should be short-lived, except for supervisors, which should only supervise. Instead of having a single long-lived process handling a lot of large binaries, would a better design be to have separate processes handling each binary?
It depends. Some state needs to live, some needs to go. Long-lived processes should ideally do few state manipulations, or be easy to replace (so they store less state). Risky or frequent operations should be done far down in the supervision tree.
The processes with complex state, things that can't be lost, might tend to be long-lived. In these cases, they should either only do very simple operations, or be isolated from the operations on that state.
These processes will generally live higher up in the supervision tree structure and delegate the risky work to processes lower in the hierarchy; these short-lived workers will thus have their impact limited, but will also have their state known before some unit of work, a bit like an invariant. If the short-lived worker dies, then restarting it with its short-lived state is a cheap operation.
Restarting the long-lived process is a difficult thing because the state might be a) possible to re-compute, but complex to do so, or b) bound to events that cannot be repeated, and can't be lost.
In my case the processes needing gc are streaming TCP sockets. Their module is just a loop function which receives a binary and sends it to the client then does a tail call. So there should be no reason for the process to die. They have been running indefinitely except when the system ran into swap.
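A minimal sketch of such a loop (the {data, Bin} message shape is
assumed): forcing a collection after each send, or hibernating
between messages, keeps dead ProcBins from accumulating in a
process that otherwise never fills its heap.

```erlang
loop(Socket) ->
    receive
        {data, Bin} ->
            ok = gen_tcp:send(Socket, Bin),
            %% The ProcBin for Bin is dead after the send; collect it
            %% now (alternatively: erlang:hibernate(?MODULE, loop,
            %% [Socket]), which also compacts the heap).
            erlang:garbage_collect(),
            loop(Socket);
        stop ->
            ok
    end.
```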
Binaries are either stored on a private heap or in a global area where they are reference counted. Binary reference counting depends on ProcBin objects stored on a process heap.
The reference count is only decremented when garbage collection occurs; thus forcing an explicit collection on a process removes the dead ProcBin objects and releases the binaries they point to.
The system has no way of knowing if a collection should be triggered to free ref counted memory. Collection is local to a process.
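Which is why finding the culprit takes some manual correlation:
erlang:memory(binary) gives the total refc-binary bytes, but not
which process pins them, so you rank processes by their
process_info binary lists (a sketch; dead processes are skipped
because process_info/2 returns undefined for them):

```erlang
Total  = erlang:memory(binary),
ByProc = lists:reverse(lists:sort(
           [{lists:sum([S || {_Id, S, _Refs} <- Bins]), P}
            || P <- erlang:processes(),
               {binary, Bins} <- [erlang:process_info(P, binary)]])),
io:format("total: ~p bytes, top pinner: ~p~n", [Total, hd(ByProc)]).
```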
It could be argued that this is a poor reason to switch from Erlang to C -- this is a problem with how binaries are handled in user code, not something intrinsic to the Erlang VM.
Eh, every factor involved is implementation detail. The threshold at which binaries are heap-alloc'ed, gc behavior, etc. I really fail to see how this is user error.
Specifically we're working to get Erlang out of the data path. So the code where throughput and latency really matter is written by hand and compiles directly to machine code, and Erlang continues to do what it does best - manage distributed systems and asynchronous processes.
In theory a great VM could out-perform C, but after you've chased down a few Erlang VM WTFs, there's something nice about being closer to the metal.