First the code for a small socket example.
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <sys/socket.h>

#define MSG "Hello, World!"

int main(const int argc,
         const char *const argv[])
{
    char read_buf[sizeof(MSG)];
    int socket[2];
    const int ret =
        socketpair(AF_UNIX, SOCK_STREAM,
                   0, socket);

    if (ret < 0) {
        perror("socketpair");
        return 1;
    }

    const ssize_t write_len =
        write(socket[0], MSG, sizeof(MSG));
    if (write_len < 0) {
        perror("write");
        return 1;
    }

    printf("Wrote %zu of %zu bytes\n",
           write_len, sizeof(MSG));

    const ssize_t read_len =
        read(socket[1], read_buf,
             sizeof(read_buf));
    if (read_len < 0) {
        perror("read");
        return 1;
    }

    printf("Read %zu of %zu bytes\n",
           read_len, sizeof(MSG));

    return 0;
}
You may copy this to a file sockets.c, compile and run it as follows.
$ gcc -Wall -pedantic sockets.c -o sockets
$ ./sockets
Wrote 14 of 14 bytes
Read 14 of 14 bytes
Preamble
Now let me tell you about man pages. On almost any Linux distribution you can type the following in a terminal.
$ man 2 socketpair
Or often just man socketpair will do. Exactly what happens in this case is dependent on your distribution. There are things, such as Emacs’s helm mode or man -k, which can search the man pages. This is useful if, like myself, you are irritated by searching the web.
Most system calls are documented in man pages. They are not always accurate, complete or easy to read. However it is expected that Linux (and POSIX) behave the way the man pages describe.
socketpair is a system call. System calls are how the user, in user land, tells the Linux kernel, in kernel land, to do something. Usually kernel land is where the network stack and sockets live. In user land we are just given an ID number, a file descriptor, representing the socket. We never interact with the socket ‘object’ [1] directly.
Usually system calls are required to issue commands to the kernel. These are like function calls in C except that they cause a context switch. That is, a switch between user land context and kernel context. Exactly what that entails changes with every kernel version, hardware architecture and configuration.
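To make the comparison with function calls concrete, here is a minimal sketch which asks the kernel for our process ID through glibc's generic syscall() entry point. The helper name raw_getpid is made up for illustration. Normally you would just call getpid(), which is a thin libc wrapper around the same system call.

```c
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

/* Ask the kernel for our process ID via the generic syscall() entry
 * point, instead of the usual getpid() wrapper from libc.
 * raw_getpid is a hypothetical name used only for this sketch. */
static pid_t raw_getpid(void)
{
    /* SYS_getpid is the system call number for getpid on this machine */
    return (pid_t)syscall(SYS_getpid);
}
```

Both routes should end up in the same place in the kernel, so raw_getpid() and getpid() are expected to return the same value.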
You can find out more with man 2 syscalls. More importantly right now, there is a useful tool for tracking system calls.
$ strace -e read,write,socketpair ./sockets >/dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p|\2\0\0\0\0\0"..., 832) = 832
socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 4]) = 0
write(3, "Hello, World!\0", 14) = 14
read(4, "Hello, World!\0", 14) = 14
write(1, "Wrote 14 of 14 bytes\nRead 14 of "..., 41) = 41
+++ exited with 0 +++
On SUSE this can be installed with zypper in strace. It is probably similar on other distributions.
The above command prints the system calls that our sockets program makes. The -e flag filters out all calls except read, write and socketpair. The first read call is loading the libc library and can be ignored. Try running strace with no filter. You will see that the system call trace does not match the source code exactly.
Calls to read and write take a file descriptor (FD) as the first argument. This is an index number for a row in the FD table. Each process has its own FD table. It is managed by the kernel; we can’t access the table directly, only via system calls.
Sockets are not files; the name “file descriptor” is historical. Lots of things can be represented by an FD. This includes, but is not limited to, files and sockets. The above program will have an FD table similar to the one below by the end.
ID | Description |
---|---|
0 | stdin |
1 | stdout |
2 | stderr |
3 | UNIX socket 0 |
4 | UNIX socket 1 |
You can inspect a program’s FDs either by looking in /proc/ or using lsof. Programs like netstat and ss can display more socket specific information.
UNIX
Sockets are an interface centered around the socket ‘object’. As sockets live in the kernel, we are just given a file descriptor as a reference to a socket. Usually sockets are used to send and receive data over a network. However in the example above we are not sending data over a network. Just from a process to one or more buffers in the kernel and back again.
Usually socketpair is used in a program which forks a child process. Let’s make the example above a little more realistic by creating a child process.
...

static int child_proc(int socket)
{
    char read_buf[sizeof(MSG)];
    const ssize_t read_len =
        read(socket, read_buf,
             sizeof(read_buf));

    if (read_len < 0) {
        perror("read");
        return 1;
    }

    printf("Read %zu of %zu bytes\n",
           read_len, sizeof(MSG));

    return 0;
}

int main(const int argc,
         const char *const argv[])
{
    int socket[2];
    const int ret =
        socketpair(AF_UNIX, SOCK_STREAM,
                   0, socket);

    if (ret < 0) {
        perror("socketpair");
        return 1;
    }

    const pid_t child_pid = fork();
    if (child_pid < 0) {
        perror("fork");
        return 1;
    }

    if (!child_pid) {
        close(socket[0]);
        return child_proc(socket[1]);
    }

    close(socket[1]);

    const ssize_t write_len =
        write(socket[0], MSG, sizeof(MSG));
    if (write_len < 0) {
        perror("write");
        return 1;
    }

    printf("Wrote %zu of %zu bytes\n",
           write_len, sizeof(MSG));

    return 0;
}
When forking (with fork()) the file descriptor table is copied from the parent to the child process. So we can use socketpair to create a pair of connected sockets, then assign one to each process by closing one in the child and the other in the parent. Closing them avoids confusion, but it is possible to leave both ends open.
Again we can run strace on this program to see what is happening. However an extra flag is needed (-f) to see what the child process does.
$ strace -f -e read,write,socketpair,close,clone ~/c/scratch/sockets > /dev/null
...
socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd2148ad850) = 12370
strace: Process 12370 attached
[pid 12369] close(4) = 0
[pid 12369] write(3, "Hello, World!\0", 14 <unfinished ...>
[pid 12370] close(3 <unfinished ...>
[pid 12369] <... write resumed>) = 14
[pid 12370] <... close resumed>) = 0
[pid 12370] read(4, "Hello, World!\0", 14) = 14
[pid 12369] write(1, "Wrote 14 of 14 bytes\n", 21) = 21
[pid 12370] write(1, "Read 14 of 14 bytes\n", 20) = 20
[pid 12369] +++ exited with 0 +++
+++ exited with 0 +++
The output of strace is becoming more confusing. Our call to fork actually resulted in a call to clone. Also, because some system calls were executed in parallel, they interrupt each other’s log messages. You may wish to try playing with the strace options to see what information can be revealed.
There are many different socket families which support various types of socket and protocols. Additionally there are many socket options. These change the operations (system calls) available and their behaviour. These changes are significant and can be surprising.
Currently we are using the stream type of UNIX socket. These are otherwise known as local sockets, because they only allow communication between processes on the same machine. As usual there is a man page (man 7 unix).
The way we are currently using UNIX sockets is almost identical to a pipe (man 2 pipe). Indeed, to use a pipe all we need to do is replace socketpair() with pipe() then swap the FD numbers. Unlike UNIX sockets a pipe is unidirectional, so we need to read and write to the correct FD. There are many other subtle differences. However we are unlikely to notice the difference with our simple program.
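As a rough sketch of that substitution, here is the first example's round trip done with a pipe inside a single process. The function name pipe_round_trip is hypothetical. Note that the write now goes to index 1 and the read to index 0, the opposite of what we did with the socket pair.

```c
#include <stdio.h>
#include <unistd.h>

#define MSG "Hello, World!"

/* Send MSG through a pipe and read it back in the same process.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t pipe_round_trip(char *const read_buf, const size_t buf_len)
{
    int fds[2];
    ssize_t len;

    /* fds[0] is the read end and fds[1] is the write end */
    if (pipe(fds) < 0) {
        perror("pipe");
        return -1;
    }

    len = write(fds[1], MSG, sizeof(MSG));
    if (len < 0) {
        perror("write");
    } else {
        len = read(fds[0], read_buf, buf_len);
        if (len < 0)
            perror("read");
    }

    close(fds[0]);
    close(fds[1]);

    return len;
}
```

Trying to write to fds[0] or read from fds[1] fails, which is the unidirectional part.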
As well as being bidirectional there are other things a UNIX stream socket can do. For one thing we can use the send, recv, sendmsg and recvmsg interfaces. Before continuing, you may wish to convert the program to use these yourself.
Something to note is that we only send a very small amount of data. We also don’t interrupt our program with signals. So read and write are likely to receive or send the full amount. However, in general, there is no guarantee they will read or write the full amount. This means the above programs are technically incorrect.
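One common way to deal with short reads and writes is to loop until everything has been transferred. The following is only a sketch of that idea; write_all and read_all are names I have made up, and a real program may want timeouts or non-blocking handling as well.

```c
#include <errno.h>
#include <unistd.h>

/* Keep calling write() until len bytes have been written.
 * Returns 0 on success and -1 on error with errno set. */
static int write_all(const int fd, const void *const buf, const size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        const ssize_t n = write(fd, p, left);

        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal, try again */
            return -1;
        }

        p += n;
        left -= (size_t)n;
    }

    return 0;
}

/* Keep calling read() until len bytes have arrived or the other
 * end is closed. Returns the number of bytes read, which is less
 * than len only at end of stream, or -1 on error. */
static ssize_t read_all(const int fd, void *const buf, const size_t len)
{
    char *const p = buf;
    size_t got = 0;

    while (got < len) {
        const ssize_t n = read(fd, p + got, len - got);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        if (n == 0)
            break; /* end of stream */

        got += (size_t)n;
    }

    return (ssize_t)got;
}
```

In the examples above, write(socket[0], MSG, sizeof(MSG)) could become write_all(socket[0], MSG, sizeof(MSG)), and likewise for the read.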
Next let’s start using sockets capable of remote communication.
UDP
User Datagram Protocol allows us to send packets (datagrams) to an IP address. We do not need to set up a connection. We can send and receive packets immediately, although usually one participant needs to bind to a known address and port. UDP will automatically choose a port and address, but remote peers won’t know what this is until we message them.
Of course there are connections. However these are maintained by lower parts of the stack, such as the IP, ARP and Ethernet layers. Our program usually doesn’t need to set these up. We just aim a packet at an IP address, send it and hope it is routed to the correct location.
UDP is not reliable; it will happily let us send messages to a location that doesn’t exist. The below program is also unreliable and contains a race condition; note the usleep.
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#define PING "PING"
#define PONG "PONG"

#define PONG_ADDR { \
    .sin_family = AF_INET, \
    .sin_port = htons(21000), \
    .sin_addr = { \
        htonl(INADDR_LOOPBACK) \
    } \
}

static int udp_socket(void)
{
    const int sk =
        socket(AF_INET, SOCK_DGRAM, 0);

    if (sk < 0) {
        perror("socket");
        exit(1);
    }

    return sk;
}

static ssize_t udp_recvfrom(const int sk,
                            const struct iovec *const iov,
                            const struct sockaddr_in *const addr)
{
    socklen_t addr_len = sizeof(*addr);
    const ssize_t recv_len =
        recvfrom(sk,
                 iov->iov_base,
                 iov->iov_len - 1,
                 0,
                 (struct sockaddr *)addr,
                 addr ? &addr_len : NULL);

    if (recv_len < 0) {
        perror("recvfrom");
        exit(1);
    }

    if (addr_len != sizeof(*addr)) {
        printf("address is not expected size\n");
        exit(1);
    }

    ((char *)iov->iov_base)[recv_len] = '\0';

    return recv_len;
}

static ssize_t udp_sendto(const int sk,
                          const struct iovec *const iov,
                          const struct sockaddr_in *const addr)
{
    const ssize_t send_len =
        sendto(sk,
               iov->iov_base,
               iov->iov_len,
               MSG_DONTROUTE,
               (struct sockaddr *)addr,
               sizeof(*addr));

    if (send_len < 0) {
        perror("sendto");
        exit(1);
    }

    return send_len;
}
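
static int can_print(const char *buf)
{
    /* Stop at the first byte which is not printable; that includes '\0' */
    while (isprint((unsigned char)*buf))
        buf++;

    /* The string is printable only if we stopped at the terminator */
    return *buf == '\0';
}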

static int pinger(void)
{
    const struct sockaddr_in pong_addr = PONG_ADDR;
    char buf[BUFSIZ];
    const struct iovec recv_iov = {
        .iov_base = buf,
        .iov_len = BUFSIZ
    };
    const struct iovec send_iov = {
        .iov_base = PING,
        .iov_len = sizeof(PING)
    };
    const int sk = udp_socket();

    udp_sendto(sk, &send_iov, &pong_addr);

    const ssize_t recv_len =
        udp_recvfrom(sk, &recv_iov, NULL);

    if (can_print(buf))
        printf("pinger recv: %s\n", buf);
    else
        printf("pinger recv %zd bytes\n", recv_len);

    return 0;
}

static int ponger(void)
{
    const struct sockaddr_in pong_addr = PONG_ADDR;
    /* Not const: recvfrom() fills this in with the sender's address */
    struct sockaddr_in ping_addr;
    char buf[BUFSIZ];
    const struct iovec recv_iov = {
        .iov_base = buf,
        .iov_len = BUFSIZ
    };
    const struct iovec send_iov = {
        .iov_base = PONG,
        .iov_len = sizeof(PONG)
    };
    const int sk = udp_socket();
    const int ret =
        bind(sk,
             (struct sockaddr *)&pong_addr,
             sizeof(pong_addr));

    if (ret < 0) {
        perror("bind");
        return 1;
    }

    const ssize_t recv_len =
        udp_recvfrom(sk, &recv_iov, &ping_addr);

    if (can_print(buf))
        printf("ponger recv: %s\n", buf);
    else
        printf("ponger recv %zd bytes\n", recv_len);

    udp_sendto(sk, &send_iov, &ping_addr);

    return 0;
}

int main(const int argc,
         const char *const argv[])
{
    int ret;
    siginfo_t infop;

    const pid_t pinger_pid = fork();
    if (!pinger_pid)
        return pinger();

    const pid_t ponger_pid = fork();
    if (!ponger_pid)
        return ponger();

    /* Loop until there are no children left to wait for. A
     * do { } while (0) would stop after the first child because
     * continue jumps to the loop condition. */
    for (;;) {
        ret = waitid(P_ALL, 0, &infop, WEXITED);
        if (!ret)
            /* should read infop here */
            continue;

        switch (errno) {
        case EINTR:
            continue;
        case ECHILD:
            return 0;
        default:
            perror("waitid");
            return 1;
        }
    }
}
This starts two processes. Ponger binds to localhost:21000 and waits for a packet. When it receives a packet it prints the contents and sends “PONG” back. Meanwhile Pinger sends a packet to localhost:21000 and waits for a response. When it gets a response it prints it.
Pinger does not choose an address to bind to. It is automatically assigned a port and is bound to any local address. Meanwhile we bind Ponger to localhost or the address of the loopback device. Usually localhost (127.0.0.1, ::1, lo etc.) cannot receive messages from a remote host. So Ponger probably won’t receive messages from a remote device.
Pinger on the other hand will receive messages from anywhere, so long as they are addressed to some network interface on the local machine (or in the process’s network namespace) and to the port which was automatically assigned to it. This means Pinger could randomly receive a packet from some remote source. Ponger could also receive some unexpected data from a local process. Possibly port 21000 is used for something else.
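If you are curious which port the kernel picked, getsockname will tell you. Below is a small sketch; local_port and bind_any_port are hypothetical helpers, and binding explicitly to port 0 on INADDR_ANY is used here to mimic the automatic assignment described above.

```c
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Return the local port of an IPv4 socket in host byte order, or
 * -1 on error. Zero means no port has been assigned yet. */
static int local_port(const int sk)
{
    struct sockaddr_in addr;
    socklen_t addr_len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    if (getsockname(sk, (struct sockaddr *)&addr, &addr_len) < 0)
        return -1;

    return ntohs(addr.sin_port);
}

/* Bind to any local address and let the kernel choose the port
 * by asking for port zero. Returns what bind() returns. */
static int bind_any_port(const int sk)
{
    const struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(0),
        .sin_addr = { htonl(INADDR_ANY) }
    };

    return bind(sk, (const struct sockaddr *)&addr, sizeof(addr));
}
```

Calling local_port on Pinger's socket after its first sendto should show the port that Ponger sees as the source of the packet.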
Which brings me onto constructing an address. Let’s look at Ponger’s address with the macro expanded.
const struct sockaddr_in pong_addr = {
    .sin_family = AF_INET,
    .sin_port = htons(21000),
    .sin_addr = (struct in_addr){
        htonl(INADDR_LOOPBACK)
    }
};
What catches this author out time and again is that the port and address are in network byte order. This happens to be big endian, whereas my computer uses little endian. So we need to swap the bytes around. Consider that 21000 = 0x5208.
Order | Byte 0 | Byte 1 |
---|---|---|
Little Endian | 0x08 | 0x52 |
Big Endian | 0x52 | 0x08 |
If byte 0 is on the left, then the end is considered to be on the left. This is, of course, nonsensical as this means Big Endian starts the transmission with the high (i.e. big) order byte. Perhaps it should be called Big Startian or HOBAZ (High Order Byte At Zero)?
Another way to visualise it is from top to bottom. Address zero is at the top end and there is no bottom end; it goes all the way down to infinity. So the end is address zero.
The littleness or bigness of the end depends on the significance of the byte. The significance is greater if the byte has a greater effect on the number’s magnitude. So the least significant byte can only add at most 255 (0xff) to a number. The next byte can add at most 255 * 256 (0xff00).
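To check the byte layout for yourself, something like the following sketch can be used. The helper names are invented for illustration. The first builds the big endian layout by hand from the byte significance, the second just copies whatever the machine has in memory.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Write value in big endian (network) order: the most significant
 * byte goes to out[0], the least significant to out[1]. */
static void big_endian_bytes(const uint16_t value, unsigned char out[2])
{
    out[0] = (unsigned char)(value >> 8);
    out[1] = (unsigned char)(value & 0xff);
}

/* Copy the bytes of value exactly as they sit in memory, lowest
 * address first. The result depends on the machine's byte order. */
static void memory_bytes(const uint16_t value, unsigned char out[2])
{
    memcpy(out, &value, sizeof(value));
}
```

For 21000, big_endian_bytes gives 0x52 then 0x08. Passing htons(21000) to memory_bytes should give the same two bytes on any machine, which is the whole point of htons.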
To be clear we are discussing bytes not bits. Binary numbers written in Arabic numerals (that is 0 and 1) have the high order bit on the left. Generally programming languages and machine instructions follow this convention. What order the bits are stored or transmitted by hardware is irrelevant.
Let’s say we shift bits left (<<) in a 64-bit int. Then we expect the low order bit to now be zero. All other bits are expected to move one place to the left, regardless of whether they cross a byte boundary or what order the bytes are handled in by the CPU. Nor do we care what the actual bit order is within bytes.
Individual bits are not directly addressable. You need to use a combination of shifts and masking to get a single bit’s value. Which bit you consider to be index zero is arbitrary. It can be the low or high order bit.
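A minimal sketch of the shift and mask combination follows. Here I have arbitrarily chosen index zero to be the low order bit; get_bit is a made up name.

```c
#include <stdint.h>

/* Return bit number idx of value (0 or 1), counting from the low
 * order bit. idx must be less than 64. */
static unsigned int get_bit(const uint64_t value, const unsigned int idx)
{
    /* Shift the wanted bit down to position zero, then mask off
     * everything above it. */
    return (unsigned int)((value >> idx) & 1u);
}
```

Taking the port number from earlier, 0x5208 has bits 3, 9, 12 and 14 set, so get_bit(0x5208, 9) is 1 and get_bit(0x5208, 8) is 0. Bit 9 lives in the second byte, but the code does not need to know that.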
Now let’s look at receiving a packet.
static ssize_t udp_recvfrom(const int sk,
                            const struct iovec *const iov,
                            const struct sockaddr_in *const addr)
{
    socklen_t addr_len = sizeof(*addr);
    const ssize_t recv_len =
        recvfrom(sk,
                 iov->iov_base,
                 iov->iov_len - 1,
                 0,
                 (struct sockaddr *)addr,
                 addr ? &addr_len : NULL);

    if (recv_len < 0) {
        perror("recvfrom");
        exit(1);
    }

    if (addr_len != sizeof(*addr)) {
        printf("address is not expected size\n");
        exit(1);
    }

    ((char *)iov->iov_base)[recv_len] = '\0';

    return recv_len;
}
The struct iovec is used to wrap the buffer and length into a single argument. It’s not necessary, however it’s commonly used in networking.
We reserve one byte of the receive buffer for null termination. That is, we add a sentinel value which marks the end of a string. Pinger and Ponger already send a null terminated string. However we could get some random data from another source. It’s also possible to receive corrupted data. Although UDP does have a checksum to mitigate that. It can happen so it will happen.
When we receive a UDP packet the kernel informs us of the source address. This allows us to respond. The source address could be fraudulent. It’s only some data sent in the packet’s header. There is no encryption or signing in basic UDP. So we can’t trust anything.
It’s worth noting that send and recv only ever accept or return one packet. The data in this packet can be between 0 and the maximum-transmission-unit in size. The buffer we use to receive the packet data must be large enough to contain all of it. Furthermore the order the packets are sent in may not be the order they are received in.
TCP & HTTP
This is quite unlike files or streams where we can read or write arbitrarily sized chunks of data, and where the data is usually in the order it was sent or written. If we want to use a stream instead then we can use TCP. The above example can be converted to TCP by using the listen, accept and connect system calls and switching to read and write.
TCP is connection or stream orientated, meaning we have to establish a connection before sending or receiving data. Once we have a connection then we can write bytes to a socket on one end and expect them to be read in the same order at the other end. Of course things can still go wrong, but it is more reliable than UDP. On the other hand we can no longer read and write single packets. Nor can we just send a packet immediately.
Although things like QUIC now exist, TCP is generally used to serve web content. Let’s make a minimal HTTP web server to serve my static website. Now I have to warn you that HTTP is hugely complicated. We can get away with ignoring most of that complication, but we still end up with a fair old chunk of code.
#define _GNU_SOURCE
#include <limits.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

const char *const http_head =
    "HTTP/1.1 200 OK\r\n"
    "Connection: close\r\n"
    "Content-Type: %s\r\n"
    "Content-Length: %lu\r\n"
    "\r\n";

static void serve_file(const int sk, const int public_dir)
{
    char recv_buf[BUFSIZ];
    char head_buf[BUFSIZ];
    const size_t buf_len = BUFSIZ - 1;
    char path_buf[256];
    char *file_path;
    ssize_t recv, sent;
    size_t recv_total = 0, sent_total = 0;
    size_t head_len;
    int body_fd;

    while (recv_total < buf_len) {
        recv = read(sk,
                    recv_buf + recv_total,
                    buf_len - recv_total);

        if (recv < 0) {
            perror("[-] read");
            return;
        }

        if (!recv) {
            dprintf(STDERR_FILENO,
                    "[-] End of data before header was received\n");
            return;
        }

        recv_total += recv;
        recv_buf[recv_total] = 0;

        if (strstr(recv_buf, "\r\n\r\n"))
            goto got_header;
    }

    dprintf(STDERR_FILENO,
            "Exceeded buffer reading header\n");
    return;

got_header:
    printf("[*] <<<\n%s\n", recv_buf);
    /* sscanf can return EOF, which is non-zero, so compare with 1 */
    if (sscanf(recv_buf, "GET %250s HTTP/1.1", path_buf) != 1) {
        dprintf(STDERR_FILENO,
                "[-] 'GET <file_path> HTTP/1.1' not matched in:\n %s",
                recv_buf);
        return;
    }

    if (!strcmp("/", path_buf)) {
        strcpy(path_buf, "index.html");
        file_path = path_buf;
    } else if (path_buf[0] == '/') {
        file_path = path_buf + 1;
    } else {
        /* No leading slash; use the path as given */
        file_path = path_buf;
    }

    printf("[*] Opening %s", file_path);
    body_fd = openat(public_dir, file_path, O_RDONLY);

    if (body_fd < 0 && errno == ENOENT) {
        strcpy(file_path + strlen(file_path), ".html");
        body_fd = openat(public_dir, file_path, O_RDONLY);
        printf(" failed trying with .html");
    }
    printf("\n");

    if (body_fd < 0) {
        perror("[-] openat");
        return;
    }

    const char *mime = "text/html";
    if (strstr(file_path, ".css"))
        mime = "text/css";
    if (strstr(file_path, ".map"))
        mime = "application/json";
    if (strstr(file_path, ".svg"))
        mime = "image/svg+xml";
    if (strstr(file_path, ".jpg"))
        mime = "image/jpeg";
    if (strstr(file_path, ".png"))
        mime = "image/png";

    struct stat body_stat;
    if (fstat(body_fd, &body_stat)) {
        perror("[-] fstat");
        goto close_body;
    }

    snprintf(head_buf, sizeof(head_buf), http_head,
             mime, (unsigned long)body_stat.st_size);
    printf("[*] >>>\n%s", head_buf);

    /* Write the formatted header, resuming after any short write */
    head_len = strlen(head_buf);
    while (sent_total < head_len) {
        sent = write(sk, head_buf + sent_total, head_len - sent_total);

        if (sent < 0) {
            perror("[-] write");
            goto close_body;
        }

        sent_total += sent;
    }

    /* sendfile advances the file offset and returns 0 at end of file */
    do {
        sent = sendfile(sk, body_fd, NULL, body_stat.st_size);

        if (sent < 0) {
            perror("[-] sendfile");
            goto close_body;
        }

        sent_total += sent;
    } while (sent > 0);

close_body:
    close(body_fd);
}

int main(const int argc, const char *const argv[])
{
    const pid_t orig_parent = getppid();
    const struct sockaddr_in self_addr = {
        .sin_family = AF_INET,
        .sin_port = htons(9000),
        .sin_addr = {
            htonl(INADDR_LOOPBACK)
        }
    };
    struct sockaddr client_addr;
    socklen_t addr_len;

    /* Check argc before touching argv[1] */
    if (argc < 2) {
        dprintf(STDERR_FILENO,
                "usage: %s <dir to serve files from>\n",
                argv[0]);
        return 1;
    }

    const int listen_sk = socket(AF_INET, SOCK_STREAM, 0);
    const int public_dir = open(argv[1], O_PATH);

    if (bind(listen_sk, (struct sockaddr *)&self_addr, sizeof(self_addr))) {
        perror("bind");
        return 1;
    }

    if (listen(listen_sk, 8)) {
        perror("listen");
        return 1;
    }

    printf("[+] Listening; press Ctrl-C to exit...\n");

    while (orig_parent == getppid()) {
        /* accept() reads addr_len as the size of client_addr */
        addr_len = sizeof(client_addr);

        const int sk = accept(listen_sk, &client_addr, &addr_len);
        if (sk < 0) {
            perror("[-] accept");
            break;
        }

        printf("[+] Accepted Connection\n");

        serve_file(sk, public_dir);
        close(sk);
    }

    return 0;
}
I tested this on Firefox and Chromium. Neither seemed too concerned that most of the things they asked for were ignored. They didn’t cope very well without the content-length header though, and Chromium also needs the MIME type to be spelled out for it.
All of the HTTP complication is in serve_file. So if we look in main, this shows what is involved in accepting an incoming TCP connection. The client side is simpler; you just need to call connect.
Inside serve_file we first load the whole HTTP header into a buffer. We do this by looking for the first instance of a newline followed by a newline (\r\n\r\n). HTTP doesn’t appear to set any limit on the size of a header. It also has a dreadful feature which allows “comments”, delimited by ( and ), to be put in some header fields. These can contain \r\n\r\n. It doesn’t matter to us though because we ignore most of the header and are not trying to be standards compliant.
The browser would prefer it if we kept the connection open between requests, but it’s easier for us just to close it. However it should be noted that opening and closing TCP connections is expensive. It seems that Firefox even preemptively opens a connection when you move your mouse towards a link.
Anyway, once we have some complete data then we scan the first line of it to get the URI path. We only accept paths up to 250 characters long which leaves another 5 characters for “.html” to be added, plus \0, the null character.
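The request line scan can be pulled out into a helper, roughly as sketched below. The name parse_get_path is hypothetical. It differs slightly from the server's inline check in that it compares the sscanf result against 1 and uses %n to confirm that the trailing “HTTP/1.1” was actually matched.

```c
#include <stdio.h>

/* Extract the URI path from a request starting with
 * "GET <path> HTTP/1.1". path_buf must have room for at least 251
 * bytes: 250 characters plus \0. Returns 0 on success, -1 if the
 * request line does not match or the path is too long. */
static int parse_get_path(const char *const request, char *const path_buf)
{
    int end = 0;

    /* %n stores how many characters were consumed so far. It is only
     * reached if everything before it matched and it does not count
     * towards the return value of sscanf. */
    const int matched = sscanf(request, "GET %250s HTTP/1.1%n",
                               path_buf, &end);

    if (matched != 1 || end == 0)
        return -1;

    return 0;
}
```

A caller that wants to append “.html” afterwards still needs the extra 5 bytes discussed above, so a 256 byte buffer like the server's path_buf is a sensible choice.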
Unfortunately the C library’s string functions are prone to dangerous errors. It’s easy to overwrite the null terminating character \0 or to forget it requires extra space in buffers. Also you need to pay attention to whether functions like strlen count \0. Then there are the attempted fixes for these functions, like strncpy, which make matters worse by potentially leaving strings unterminated.
C itself does not help because by default there is no bounds checking. Although thorough testing with the address sanitizer enabled can help with that.
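One way to reduce the risk is to do the length arithmetic explicitly before copying anything. The sketch below appends a suffix only if the result, including \0, fits in the buffer. The name append_suffix is my own, and it assumes buf already holds a terminated string.

```c
#include <string.h>

/* Append suffix to the \0 terminated string in buf, where buf_len is
 * the total size of buf in bytes. Returns 0 on success, or -1 if the
 * result would not fit, in which case buf is left untouched. */
static int append_suffix(char *const buf, const size_t buf_len,
                         const char *const suffix)
{
    const size_t used = strlen(buf);          /* does not count \0 */
    const size_t suffix_len = strlen(suffix); /* nor does this */
    const size_t need = used + suffix_len + 1; /* +1 for the final \0 */

    if (need > buf_len)
        return -1;

    /* Copy the suffix along with its terminator */
    memcpy(buf + used, suffix, suffix_len + 1);

    return 0;
}
```

With this, the “.html” fallback in the server could be written as append_suffix(path_buf, sizeof(path_buf), ".html") and would refuse to write past the end of the buffer.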
Eventually we open the file requested. As the file path is not validated, this could include any file on your system. We use openat which takes, as the first argument, a file descriptor for a path to a directory. Not the directory itself, just the path to that directory. The second argument is the file path relative to the directory described by the FD. This avoids having to construct the full file path with sprintf or similar.
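If you wanted to add some validation, a first step might look like the sketch below. It rejects absolute paths and any “..” component before the path is handed to openat. The function name path_is_contained is hypothetical, and this is far from a complete defence; symbolic links inside the served directory are not considered, for example.

```c
#include <string.h>

/* Return 1 if path is relative and contains no ".." component,
 * otherwise 0. This is only a partial check; it knows nothing
 * about symbolic links. */
static int path_is_contained(const char *path)
{
    /* Reject empty and absolute paths; openat ignores the
     * directory FD when given an absolute path. */
    if (path[0] == '\0' || path[0] == '/')
        return 0;

    while (*path) {
        /* Length of the next component, up to '/' or the end */
        const size_t len = strcspn(path, "/");

        if (len == 2 && path[0] == '.' && path[1] == '.')
            return 0;

        path += len;
        if (*path == '/')
            path++;
    }

    return 1;
}
```

On recent Linux kernels, openat2 with the RESOLVE_BENEATH flag is meant to enforce this kind of containment in the kernel instead.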
We then stat the file to get its size for the content-length header. The header is formatted and sent before writing the file content to the socket with sendfile.
The sendfile system call shown here is unique to Linux, although FreeBSD has a similar one as no doubt other kernels do. It avoids having to read the file into a buffer before writing it back to the socket. The reason for this function’s existence is probably performance. However it also happens to make things simpler, which is why it’s used here.
Once we are finished sending the file, the FD and socket are closed. Then we wait for the next connection.
[1] I’m using the term object in a loosely defined way. There are a number of C structs and associated data used to represent a socket in the kernel. Exactly what is encapsulated in the socket object and what is external to it is unclear.