Figuring out a gen_tcp:recv limitation

In which a suprisingly pernicious framed payload leads to OTP spelunking.

erlang gen_tcp framing

The setup: sending a string over TCP

Let's say you want to send the ASCII string Fiat lux! to an Erlang process listening on the other side of a TCP connection. Not a big deal, right?

Our sending application is written in Python. Here's what it might look like:

#!/usr/bin/env python3
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", 7777))
data_to_send = b"Fiat Lux!"
sock.sendall(data_to_send)

... and here's the receiving Erlang application:

#!/usr/bin/env escript
main(_) ->
  {ok, L} = gen_tcp:listen(7777, [binary, {active, false}, {reuseaddr, true}]),
  {ok, Sock} = gen_tcp:accept(L),
  {ok, String} = gen_tcp:recv(Sock, 0),
  io:format("Got string: ~ts~n", [String]),
  erlang:halt(0).

If we start the Erlang receiver (in shell 1), then run the Python sender (in shell2), we should see the receiver emit the following:

$ ./receive.escript
Got string: Fiat Lux!
$

As you can see, we optimistically sent all our data over TCP from the Python app, and received all that data, intact, on the other side. What's important here is that our Erlang socket is in passive mode, which means that incoming TCP data needs to be recv'd off of the socket. The second argument in gen_tcp:recv(Sock, 0) means that we want to read however many bytes are available to be read from the OS's network stack. In this case all our data was kindly provided to us in one nice chunk.

Success! Our real production application will be dealing with much bigger pieces of data, so it behooves us to test with a larger payload. Let's try a thousand characters.

More data

We update the sender and receiver as follows:

#!/usr/bin/env python3
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", 7777))
data_to_send = b'a' * 1000
sock.sendall(data_to_send)
#!/usr/bin/env escript
main(_) ->
  {ok, L} = gen_tcp:listen(7777, [binary, {active, false}, {reusaddr, true}]),
  {ok, Sock} = gen_tcp:accept(L),
  {ok, String} = gen_tcp:recv(Sock, 0),
  io:format("Got string of length: ~p~n", [byte_size(String)]),
  erlang:halt(0).

When we run our experiment, we see that our Erlang process does indeed get all 1000 bytes. Let's add one more zero to the payload.

#!/usr/bin/env python3
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", 7777))
data_to_send = b'a' * 10000
sock.sendall(data_to_send)

And we hit our first snag!

Got string of length: 1460

Aha! Our gen_tcp:recv(Sock, 0) call asked the OS to give us whatever bytes it had ready in the TCP buffer, and so that's what we received. TCP is a streaming protocol, and there is no guarantee that a given sequence of bytes received on the socket will correspond to a logical message in our application layer. The low-effort way of handling this issue is by prefixing every logical message on the TCP socket with a known-width integer, representing the length of the message in bytes. "Low-effort" sounds like the kind of thing you put in place when the deadline was yesterday. Onward!

Let's take our initial string as an example. Instead of sending the following sequence of 9 bytes on the wire:

Ascii:     F    i   a    t   ␣    l    u    x   !

Binary:   70  105  97  116  32  108  117  120  33

We'd first prefix it with an 32-bit integer representing its size in bytes, and then append the binary, giving 13 bytes in total.

Ascii:     ␀   ␀  ␀   ␉  F    i   a    t   ␣    l    u    x   !

Binary:    0   0   0   9  70  105  97  116  32  108  117  120  33

Now, the first 4 bytes that reach our receiver can be interpreted as the length of the next logical message. We can use this number to tell gen_tcp:recv how many bytes we want to read from the socket.

To encode an integer into 32 bits, we'll use Python's struct module. struct.pack(">I", 9) will do exactly what we want: encode a 32-bit unsigned Integer (9, in this case) in Big-endian (or network) order.

#!/usr/bin/env python3
import socket
import struct
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", 7777))
data_to_send = b'a' * 10000
header = struct.pack(">I", len(data_to_send))
sock.sendall(header + data_to_send)

On the decoding side, we'll break up the receiving into two parts:

1) Read 4 bytes from the socket, interpret these as Header, a 32-bit unsigned int.

2) Read Header bytes off the socket. The receiving Erlang process will 'block' until that much data is read (or until the other side disconnects). The received bytes constitute a logical message.

#!/usr/bin/env escript
main(_) ->
  {ok, L} = gen_tcp:listen(7777, [binary, {active, false}, {reuseaddr, true}]),
  {ok, Sock} = gen_tcp:accept(L),
  {ok, <<Header:32>>} = gen_tcp:recv(Sock, 4),
  io:format("Got header: ~p~n", [Header]),
  {ok, String} = gen_tcp:recv(Sock, Header),
  io:format("Got string of length: ~p~n", [byte_size(String)]),
  erlang:halt(0).

When we run our scripts, we'll see the Erlang receiver print the following:

Got header: 10000
Got string of length: 10000

Success! But apparently, our application needs to handle messages much bigger than 10 kilobytes. Let's see how far we can take this approach.

Yet more data

Can we do a megabyte? Ten? A hundred? Let's find out, using the following loop for the sender:

#!/usr/bin/env python3
import socket
import struct
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", 7777))
for l in [1000, 1000*1000, 10*1000*1000, 100*1000*1000]:
    data_to_send = b'a' * l
    header = struct.pack(">I", len(data_to_send))
    sock.sendall(header + data_to_send)
sock.close()

...and a recursive receive function for the receiver:

#!/usr/bin/env escript
recv(Sock) ->
  {ok, <<Header:32>>} = gen_tcp:recv(Sock, 4),
  io:format("Got header: ~p~n", [Header]),
  {ok, String} = gen_tcp:recv(Sock, Header),
  io:format("Got string of length: ~p~n", [byte_size(String)]),
  recv(Sock).

main(_) ->
  {ok, L} = gen_tcp:listen(7777, [binary, {active, false}, {reuseaddr, true}]),
  {ok, Sock} = gen_tcp:accept(L),
  recv(Sock).

Running this will lead to our Erlang process crashing with an interesting message:

Got header: 1000
Got string of length: 1000
Got header: 1000000
Got string of length: 1000000
Got header: 10000000
Got string of length: 10000000
Got header: 100000000
escript: exception error: no match of right hand side value {error,enomem}

enomem looks like a strange kind of error indeed. It happens when we get the 100-megabyte header and attempt to read that data off the socket. Let's go spelunking to find out where this error is coming from.

Spelunking for {error, enomem}

First, let's take a look at what gen_tcp:recv does with its arguments. It seems that it checks inet_db to find our socket, and calls recv on that socket.

OK, let's check out inet_db. Looks like it retrieves module information stored via erlang:set_port_data, in the call above.

A grepping for a call to inet_db:register_module reveals that multiple modules register themselves this way. Among these, we find one of particular interest.

lib/kernel/src/inet_tcp.erl
169:        inet_db:register_socket(S, ?MODULE),
177:        inet_db:register_socket(S, ?MODULE),

Let's see how inet_tcp.erl implements recv. Hmm, just a pass-through to prim_inet. Let's look there.

It seems here that our erlang call-chain bottoms out in a call to ctl_cmd, which is itself a wrapper to erlang:port_control, sending control data over into C-land. We'll need to look at out TCP port driver to figure out what comes next.

    case ctl_cmd(S, ?TCP_REQ_RECV, [enc_time(Time), ?int32(Length)])

A slight hitch is finding the source code for this driver. Perhaps the marco ?TCP_REQ_RECV can help us find what we're after?

$  rg 'TCP_REQ_RECV'
lib/kernel/src/inet_int.hrl
100:-define(TCP_REQ_RECV,           42).

erts/preloaded/src/prim_inet.erl
584:    case ctl_cmd(S, ?TCP_REQ_RECV, [enc_time(Time), ?int32(Length)]) of

erts/emulator/drivers/common/inet_drv.c
735:#define TCP_REQ_RECV           42
10081:    case TCP_REQ_RECV: {
10112:  if (enq_async(INETP(desc), tbuf, TCP_REQ_RECV) < 0)

A-ha! inet_drv.c, here we come!

Indeed, this C function here, responsible for the actual call to sock_select, will proactively reject recv calls where the requested payload size n is bigger than TCP_MAX_PACKET_SIZE:

if (n > TCP_MAX_PACKET_SIZE)
    return ctl_error(ENOMEM, rbuf, rsize);

and TCP_MAX_PACKET_SIZE itself is defined in the same source file as:

#define TCP_MAX_PACKET_SIZE 0x4000000 /* 64 M */

thereby explaining our weird ENOMEM error.

Now, how to solve this conundrum? A possible approach would be to maintain some state in our receiver, optimistically read as much data as possible, and then try to reconstruct the logical messages, perhaps using something like erlang:decode_packet to take care of the book-keeping for us.

Taking a step back — and finding a clean solution

Before we jump to writing more code, let's consider our position. We're trying to read a framed message off of a TCP stream. It's been done thousands of times before. Surely the sagely developers whose combined experience is encoded in OTP have thought of an elegant solution to this problem?

It turns out that if you read the very long man entry for inet:setopts, you'll eventually come across this revealing paragraph:

{packet, PacketType}(TCP/IP sockets)

Defines the type of packets to use for a socket. Possible values:

raw | 0

No packaging is done.

1 | 2 | 4

Packets consist of a header specifying the number of bytes in the packet, followed by that number of bytes. The header length can be one, two, or four bytes, and containing an unsigned integer in big-endian byte order. Each send operation generates the header, and the header is stripped off on each receive operation.

The 4-byte header is limited to 2Gb.

Packets consist of a header specifying the number of bytes in the packet, followed by that number of bytes. Yes indeed they do! Let's try it out!

#!/usr/bin/env escript
recv(Sock) ->
  {ok, String} = gen_tcp:recv(Sock,0),
  io:format("Got string of length: ~p~n", [byte_size(String)]),
  recv(Sock).

main(_) ->
  {ok, L} = gen_tcp:listen(7777, [binary, {active, false}, {reuseaddr, true}, {packet, 4}]),
  {ok, Sock} = gen_tcp:accept(L),
  recv(Sock).

And the output is:

Got string of length: 1000
Got string of length: 1000000
Got string of length: 10000000
Got string of length: 100000000
escript: exception error: no match of right hand side value {error,closed}

Problem solved! (The last error is from a recv call on the socket after it has been closed from the Python side). Turns out that our TCP framing pattern is in fact so common, it's been subsumed by OTP as a mere option for gen_tcp sockets!

If you'd like to know why setting this option lets us sidestep the TCP_MAX_PACKET_SIZE check, I encourage you to take a dive into the OTP codebase and find out. It's suprisingly easy to navigate, and full of great code.

And if you ever find yourself fighting a networking problem using brute-force in Erlang, please consider the question: "Peraphs this was solved long ago and the solution lives in OTP?" Chances are, the answer is yes!