First the code for a small socket example.
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <sys/socket.h>
#define MSG "Hello, World!"
int main(const int argc,
	 const char *const argv[])
{
	char read_buf[sizeof(MSG)];
	int socket[2];
	/* Create two connected, anonymous UNIX sockets */
	const int ret =
		socketpair(AF_UNIX, SOCK_STREAM,
			   0, socket);

	if (ret < 0) {
		perror("socketpair");
		return 1;
	}

	const ssize_t write_len =
		write(socket[0], MSG, sizeof(MSG));
	if (write_len < 0) {
		perror("write");
		return 1;
	}
	/* ssize_t is signed, so it is printed with %zd */
	printf("Wrote %zd of %zu bytes\n",
	       write_len, sizeof(MSG));

	const ssize_t read_len =
		read(socket[1], read_buf,
		     sizeof(read_buf));
	if (read_len < 0) {
		perror("read");
		return 1;
	}
	printf("Read %zd of %zu bytes\n",
	       read_len, sizeof(MSG));

	return 0;
}
You may copy this to a file sockets.c, compile and run it as follows.
$ gcc -Wall -pedantic sockets.c -o sockets
$ ./sockets
Wrote 14 of 14 bytes
Read 14 of 14 bytes
Preamble
Now let me tell you about man pages. On almost any Linux distribution you can type the following in a terminal.
$ man 2 socketpair
Or often just man socketpair will do. Exactly what happens in this case is dependent on your distribution. There are things, such as Emacs’s helm mode or man -k, which can search the man pages. This is useful if, like myself, you are irritated by searching the web.
Most system calls are documented in man pages. They are not always accurate, complete or easy to read. However it is expected that Linux (and POSIX) behave the way the man pages describe.
socketpair is a system call. System calls are how the user, in user land, tells the Linux kernel, in kernel land, to do something. Usually kernel land is where the network stack and sockets live. In user land we are just given an ID number, a file descriptor, representing the socket. We never interact with the socket ‘object’[1] directly.
Usually system calls are required to issue commands to the kernel. These are like function calls in C except that they cause a context switch. That is, a switch between user land context and kernel context. Exactly what that entails changes with every kernel version, hardware architecture and configuration.
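To make the difference between a library function and a system call a little more concrete, here is a minimal sketch which asks the kernel for our process ID directly. It assumes Linux with glibc, where syscall() is the generic wrapper and SYS_getpid is the call number; the name raw_getpid is made up for this example.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our process ID without using the getpid() wrapper.
 * syscall() places the call number and arguments where the kernel
 * expects them and then switches to kernel context. */
static pid_t raw_getpid(void)
{
	return (pid_t)syscall(SYS_getpid);
}
```

Calling raw_getpid() should give the same answer as getpid(), which is normally just a thin wrapper around the same system call.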
You can find out more with man 2 syscalls. More importantly right now, there is a useful tool for tracking system calls.
$ strace -e read,write,socketpair ./sockets >/dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p|\2\0\0\0\0\0"..., 832) = 832
socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 4]) = 0
write(3, "Hello, World!\0", 14) = 14
read(4, "Hello, World!\0", 14) = 14
write(1, "Wrote 14 of 14 bytes\nRead 14 of "..., 41) = 41
+++ exited with 0 +++
On SUSE this can be installed with zypper in strace. It is probably similar on other distributions.
The above command prints the system calls that our sockets program makes. The -e flag filters out all calls except read, write and socketpair. The first read call is loading the libc library and can be ignored. Try running strace with no filter. It can be seen that the system call trace does not exactly match the source code. For example, the two printf calls end up as a single write, because stdout is buffered when it is redirected.
Calls to read and write take a file descriptor (FD) as the first argument. This is an index number for a row in the FD table. Each process has its own FD table. This is managed by the kernel; we can’t access the table directly, only via system calls.
Sockets are not files; the name “file descriptor” is historical. Lots of things can be represented by an FD. This includes, but is not limited to, files and sockets. By the end, the above program will have an FD table similar to the one below.
ID | Description |
---|---|
0 | stdin |
1 | stdout |
2 | stderr |
3 | UNIX socket 0 |
4 | UNIX socket 1 |
You can inspect a program’s FDs either by looking in /proc/ or using lsof. Programs like netstat and ss can display more socket specific information.
UNIX
Sockets are an interface centered around the socket ‘object’. As sockets live in the kernel, we are just given a file descriptor as a reference to a socket. Usually sockets are used to send and receive data over a network. However in the example above we are not sending data over a network. Just from a process to one or more buffers in the kernel and back again.
Usually socketpair is used in a program which forks a child process. Let’s make the example above a little more realistic by creating a child process.
...
static int child_proc(int socket)
{
	char read_buf[sizeof(MSG)];
	const ssize_t read_len =
		read(socket, read_buf,
		     sizeof(read_buf));

	if (read_len < 0) {
		perror("read");
		return 1;
	}
	/* ssize_t is signed, so it is printed with %zd */
	printf("Read %zd of %zu bytes\n",
	       read_len, sizeof(MSG));

	return 0;
}

int main(const int argc,
	 const char *const argv[])
{
	int socket[2];
	const int ret =
		socketpair(AF_UNIX, SOCK_STREAM,
			   0, socket);

	if (ret < 0) {
		perror("socketpair");
		return 1;
	}

	const pid_t child_pid = fork();
	if (child_pid < 0) {
		perror("fork");
		return 1;
	}

	/* fork() returns zero in the child */
	if (!child_pid) {
		close(socket[0]);
		return child_proc(socket[1]);
	}

	close(socket[1]);

	const ssize_t write_len =
		write(socket[0], MSG, sizeof(MSG));
	if (write_len < 0) {
		perror("write");
		return 1;
	}
	printf("Wrote %zd of %zu bytes\n",
	       write_len, sizeof(MSG));

	return 0;
}
When forking (with fork()) the file descriptor table is copied from the parent to the child process. So we can use socketpair to create a pair of connected sockets, then assign one to each process by closing one in the child and the other in the parent. Closing them avoids confusion, but it is possible to leave both ends open.
Again we can run strace on this program to see what is happening. However an extra flag is needed (-f) to see what the child process does.
$ strace -f -e read,write,socketpair,close,clone ~/c/scratch/sockets > /dev/null
...
socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd2148ad850) = 12370
strace: Process 12370 attached
[pid 12369] close(4) = 0
[pid 12369] write(3, "Hello, World!\0", 14 <unfinished ...>
[pid 12370] close(3 <unfinished ...>
[pid 12369] <... write resumed>) = 14
[pid 12370] <... close resumed>) = 0
[pid 12370] read(4, "Hello, World!\0", 14) = 14
[pid 12369] write(1, "Wrote 14 of 14 bytes\n", 21) = 21
[pid 12370] write(1, "Read 14 of 14 bytes\n", 20) = 20
[pid 12369] +++ exited with 0 +++
+++ exited with 0 +++
The output of strace is becoming more confusing. Our call to fork actually resulted in a call to clone. Also, because some system calls were executed in parallel, they interrupt each other’s log messages. You may wish to try playing with the strace options to see what information can be revealed.
There are many different socket families which support various types of socket and protocols. Additionally there are many socket options. These change the operations (system calls) available and their behaviour. These changes are significant and can be surprising.
Currently we are using the stream type of a UNIX socket. UNIX sockets are otherwise known as local sockets, because they only allow communication between processes on the same machine. As usual there is a man page (man 7 unix).
The way we are currently using UNIX sockets is almost identical to a pipe (man 2 pipe). Indeed to use a pipe all we need to do is replace socketpair() with pipe() then swap the FD numbers. Unlike UNIX sockets a pipe is unidirectional, so we need to read and write to the correct FD (index 0 is the read end and index 1 is the write end). There are many other subtle differences. However we are unlikely to notice the difference with our simple program.
As well as being bidirectional there are other things a UNIX stream socket can do. For one thing we can use the send, recv, sendmsg and recvmsg interfaces. Before continuing, you may wish to convert the program to use these yourself.
Something to note is that we only send a very small amount of data. We also don’t interrupt our program with signals. So read and write are likely to receive or send the full amount. However, in general, there is no guarantee they will read or write the full amount. This means the above programs are technically incorrect.
Next let’s start using sockets capable of remote communication.
UDP
User Datagram Protocol allows us to send packets (datagrams) to an IP address. We do not need to set up a connection. We can send and receive packets immediately. Usually, though, one participant needs to bind to a known address and port. The kernel will automatically choose a port and address for an unbound UDP socket, but remote peers won’t know what these are until we message them.
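If we want to know which port was chosen for us, we can ask the kernel with getsockname. The sketch below does this a little differently to the automatic assignment described above: it explicitly binds to port 0, which asks the kernel to pick any free port. The function name udp_bind_any_port is made up for this example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket to the loopback address with port 0, so that the
 * kernel picks a free port. Stores the chosen port, in host byte order,
 * in *port. Returns the socket FD, or -1 on error. */
static int udp_bind_any_port(unsigned short *const port)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(0),
		.sin_addr = { htonl(INADDR_LOOPBACK) }
	};
	socklen_t addr_len = sizeof(addr);
	const int sk = socket(AF_INET, SOCK_DGRAM, 0);

	if (sk < 0)
		return -1;

	/* getsockname() fills addr with the address the socket now has */
	if (bind(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(sk, (struct sockaddr *)&addr, &addr_len) < 0) {
		close(sk);
		return -1;
	}

	*port = ntohs(addr.sin_port);
	return sk;
}
```

The port still has to be communicated to the peer somehow, which is why a well known port is usually simpler.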
Of course there are connections. However these are maintained by lower parts of the stack. Such as the IP, ARP and Ethernet layers. Our program usually doesn’t need to set these up. We just aim a packet at an IP address, send it and hope it is routed to the correct location.
UDP is not reliable; it will happily let us send messages to a location that doesn’t exist. The below program is also unreliable and contains a race condition: nothing guarantees that Ponger has bound to its port before Pinger sends its packet.
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#define PING "PING"
#define PONG "PONG"
#define PONG_ADDR { \
.sin_family = AF_INET, \
.sin_port = htons(21000), \
.sin_addr = { \
htonl(INADDR_LOOPBACK) \
} \
}
static int udp_socket(void)
{
	const int sk =
		socket(AF_INET, SOCK_DGRAM, 0);

	if (sk < 0) {
		perror("socket");
		exit(1);
	}

	return sk;
}
static ssize_t udp_recvfrom(const int sk,
			    const struct iovec *const iov,
			    const struct sockaddr_in *const addr)
{
	socklen_t addr_len = sizeof(*addr);
	const ssize_t recv_len =
		recvfrom(sk,
			 iov->iov_base,
			 iov->iov_len - 1,
			 0,
			 (struct sockaddr *)addr,
			 addr ? &addr_len : NULL);

	if (recv_len < 0) {
		perror("recvfrom");
		exit(1);
	}

	if (addr_len != sizeof(*addr)) {
		printf("address is not expected size\n");
		exit(1);
	}

	((char *)iov->iov_base)[recv_len] = '\0';

	return recv_len;
}
static ssize_t udp_sendto(const int sk,
			  const struct iovec *const iov,
			  const struct sockaddr_in *const addr)
{
	const ssize_t send_len =
		sendto(sk,
		       iov->iov_base,
		       iov->iov_len,
		       MSG_DONTROUTE,
		       (struct sockaddr *)addr,
		       sizeof(*addr));

	if (send_len < 0) {
		perror("sendto");
		exit(1);
	}

	return send_len;
}
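static int can_print(const char *buf)
{
	/* Stop on the first character which is not printable, without
	 * stepping past it */
	while (isprint((unsigned char)*buf))
		buf++;

	/* The string is printable if we stopped on the terminator */
	return *buf == '\0';
}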
static int pinger(void)
{
	const struct sockaddr_in pong_addr = PONG_ADDR;
	char buf[BUFSIZ];
	const struct iovec recv_iov = {
		.iov_base = buf,
		.iov_len = BUFSIZ
	};
	const struct iovec send_iov = {
		.iov_base = PING,
		.iov_len = sizeof(PING)
	};
	const int sk = udp_socket();

	udp_sendto(sk, &send_iov, &pong_addr);
	const ssize_t recv_len =
		udp_recvfrom(sk, &recv_iov, NULL);

	if (can_print(buf))
		printf("pinger recv: %s\n", buf);
	else
		printf("pinger recv %zd bytes\n", recv_len);

	return 0;
}
static int ponger(void)
{
	const struct sockaddr_in pong_addr = PONG_ADDR;
	const struct sockaddr_in ping_addr;
	char buf[BUFSIZ];
	const struct iovec recv_iov = {
		.iov_base = buf,
		.iov_len = BUFSIZ
	};
	const struct iovec send_iov = {
		.iov_base = PONG,
		.iov_len = sizeof(PONG)
	};
	const int sk = udp_socket();
	const int ret =
		bind(sk,
		     (struct sockaddr *)&pong_addr,
		     sizeof(pong_addr));

	if (ret < 0) {
		perror("bind");
		return 1;
	}

	const ssize_t recv_len =
		udp_recvfrom(sk, &recv_iov, &ping_addr);

	if (can_print(buf))
		printf("ponger recv: %s\n", buf);
	else
		printf("ponger recv %zd bytes\n", recv_len);

	udp_sendto(sk, &send_iov, &ping_addr);

	return 0;
}
int main(const int argc,
	 const char *const argv[])
{
	int ret;
	siginfo_t infop;

	const pid_t pinger_pid = fork();
	if (!pinger_pid)
		return pinger();

	const pid_t ponger_pid = fork();
	if (!ponger_pid)
		return ponger();

	/* Wait until every child has exited. A 'continue' inside
	 * do { } while (0) would leave the loop after the first child. */
	for (;;) {
		ret = waitid(P_ALL, 0, &infop, WEXITED);
		if (!ret)
			/* should read infop here */
			continue;

		switch (errno) {
		case EINTR:
			continue;
		case ECHILD:
			/* no children left to wait for */
			return 0;
		default:
			perror("waitid");
			return 1;
		}
	}
}
This starts two processes. Ponger binds to localhost:21000 and waits for a packet. When it receives a packet it prints the contents and sends “PONG” back. Meanwhile Pinger sends a packet to localhost:21000 and waits for a response. When it gets a response it prints it.
Pinger does not choose an address to bind to. It is automatically assigned a port and is bound to any local address. Meanwhile we bind Ponger to localhost, that is, the address of the loopback device. Usually localhost (127.0.0.1, ::1, lo etc.) cannot receive messages from a remote host. So Ponger probably won’t receive messages from a remote device.
Pinger on the other hand will receive messages from anywhere, so long as they are addressed to some network interface on the local machine (or in the process’s network namespace) and to the port which was automatically assigned to it. This means Pinger could randomly receive a packet from some remote source. Ponger could also receive some unexpected data from a local process. Possibly port 21000 is used for something else.
Which brings me onto constructing an address. Let’s look at Ponger’s address with the macro expanded.
const struct sockaddr_in pong_addr = {
	.sin_family = AF_INET,
	.sin_port = htons(21000),
	.sin_addr = (struct in_addr){
		htonl(INADDR_LOOPBACK)
	}
};
What catches this author out time and again is that the port and address are in network byte order. This happens to be big endian, whereas my computer uses little endian. So we need to swap the bytes around. Consider that 21000 = 0x5208.
Byte order | Byte 0 | Byte 1 |
---|---|---|
Little Endian | 0x08 | 0x52 |
Big Endian | 0x52 | 0x08 |
If byte 0 is on the left, then the end is considered to be on the left. This is, of course, nonsensical as this means Big Endian starts the transmission with the high (i.e. big) order byte. Perhaps it should be called Big Startian or HOBAZ (High Order Byte At Zero)?
Another way to visualise it is from top to bottom. Address zero is at the top end and there is no bottom end; it goes all the way down to infinity. So the end is address zero.
The littleness or bigness of the end depends on the significance of the byte. The significance is greater if the byte has a greater effect on the number’s magnitude. So the least significant byte can only add at most 255 (0xff) to a number. The next byte can add at most 255 * 256 (0xff00).
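We can check the byte layout directly by copying the converted port number into a byte array. This is a small sketch; the function name port_wire_bytes is made up for this example.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Convert `port` to network byte order, then copy its bytes as they sit
 * in memory. out[0] is the byte at the lowest address (byte 0). */
static void port_wire_bytes(const uint16_t port, unsigned char out[2])
{
	const uint16_t net = htons(port);

	memcpy(out, &net, sizeof(net));
}
```

For port 21000 this gives 0x52 in byte 0 and 0x08 in byte 1, whatever the byte order of the host. On a big endian machine htons simply does nothing.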
To be clear we are discussing bytes not bits. Binary numbers written in Arabic numerals (that is 0 and 1) have the high order bit on the left. Generally programming languages and machine instructions follow this convention. What order the bits are stored or transmitted by hardware is irrelevant.
Let’s say we shift bits left (<<) in a 64-bit int. Then we expect the low order bit to now be zero. All other bits are expected to move one place to the left, regardless of whether they cross a byte boundary or what order the bytes are handled in by the CPU. Nor do we care what the actual bit order is within bytes.
Individual bits are not directly addressable. You need to use a combination of shifts and masking to get a single bit’s value. Which bit you consider to be index zero is arbitrary. It can be the low or high order bit.
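As a short sketch of the shifting and masking just described, here are two helpers which treat the low order bit as index zero. That choice, and the names get_bit and set_bit, are assumptions made for this example. The index must be less than 64.

```c
#include <stdint.h>

/* Return the bit at `index`, where index 0 is the low order bit */
static unsigned get_bit(const uint64_t value, const unsigned index)
{
	/* Shift the wanted bit down to position 0, then mask the rest */
	return (unsigned)((value >> index) & 1u);
}

/* Return `value` with the bit at `index` set to one */
static uint64_t set_bit(const uint64_t value, const unsigned index)
{
	return value | ((uint64_t)1 << index);
}
```

Note that shifting 0x80 left by one gives 0x100: the bit moves from index 7 to index 8, crossing a byte boundary, and the result is the same on any CPU.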
Now let’s look at receiving a packet.
static ssize_t udp_recvfrom(const int sk,
			    const struct iovec *const iov,
			    const struct sockaddr_in *const addr)
{
	socklen_t addr_len = sizeof(*addr);
	const ssize_t recv_len =
		recvfrom(sk,
			 iov->iov_base,
			 iov->iov_len - 1,
			 0,
			 (struct sockaddr *)addr,
			 addr ? &addr_len : NULL);

	if (recv_len < 0) {
		perror("recvfrom");
		exit(1);
	}

	if (addr_len != sizeof(*addr)) {
		printf("address is not expected size\n");
		exit(1);
	}

	((char *)iov->iov_base)[recv_len] = '\0';

	return recv_len;
}
The struct iovec is used to wrap the buffer and length into a single argument. It’s not necessary, however it’s commonly used in networking.
We reserve one byte of the receive buffer for null termination. That is, we add a sentinel value which marks the end of a string. Pinger and Ponger already send a null terminated string. However we could get some random data from another source. It’s also possible to receive corrupted data. Although UDP does have a checksum to mitigate that. It can happen so it will happen.
When we receive a UDP packet the kernel informs us of the source address. This allows us to respond. The source address could be fraudulent. It’s only some data sent in the packet’s header. There is no encryption or signing in basic UDP. So we can’t trust anything.
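It’s worth noting that, on a UDP socket, send and recv only ever accept or return one packet. The data in this packet can be anywhere from 0 bytes up to the maximum datagram size (65,507 bytes of payload for UDP over IPv4), although a datagram larger than the maximum transmission unit has to be fragmented by the IP layer. The buffer we use to receive the packet data must be large enough to contain all of it; any excess is discarded. Furthermore the order the packets are sent in may not be the order they are received in.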
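The one packet per call behaviour can be seen without touching the network. The sketch below is an assumption laden stand-in for UDP: it uses a socketpair of local datagram sockets (AF_UNIX with SOCK_DGRAM), which also preserve message boundaries but, unlike UDP, do not reorder or drop messages. The function name datagram_boundaries is made up for this example.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send two datagrams over a pair of local datagram sockets, then receive
 * them into a buffer of `buf_size` bytes. The length returned by each
 * recv() is stored in lens[0] and lens[1]. Returns 0, or -1 on error. */
static int datagram_boundaries(char *const buf, const size_t buf_size,
			       ssize_t lens[2])
{
	int sk[2];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sk) < 0)
		return -1;

	if (send(sk[0], "PING", 5, 0) < 0 || send(sk[0], "PONG!", 6, 0) < 0)
		goto out;

	/* Each recv() returns at most one datagram, however big the buffer
	 * is. Any part of a datagram which does not fit is discarded. */
	lens[0] = recv(sk[1], buf, buf_size, 0);
	lens[1] = recv(sk[1], buf, buf_size, 0);
	if (lens[0] >= 0 && lens[1] >= 0)
		ret = 0;
out:
	close(sk[0]);
	close(sk[1]);
	return ret;
}
```

With a large buffer the two calls return 5 and 6 bytes, never 11. With a two byte buffer each call returns 2 and the rest of each datagram is lost.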
TCP & HTTP
This is quite unlike files or streams where we can read or write arbitrarily sized chunks of data, and where the data is usually in the order it was sent or written. If we want to use a stream instead then we can use TCP. The above example can be converted to TCP by using the listen and connect system calls and switching to read and write.
TCP is connection or stream orientated, meaning we have to establish a connection before sending or receiving data. Once we have a connection then we can write bytes to a socket on one end and expect them to be read in the same order at the other end. Of course things can still go wrong, but it is more reliable than UDP. On the other hand we can no longer read and write single packets. Nor can we just send a packet immediately.
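To show the connection setup in one place, here is a sketch which plays both client and server inside a single process over the loopback device. This only works because the kernel completes the handshake for a listening socket and queues the connection until accept is called. The function name tcp_loopback and the use of port 0 (any free port) are assumptions made for this example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Listen on an automatically chosen loopback port, connect to it, send
 * `msg` (with its terminator) from the client end and read it on the
 * accepted end. Returns the number of bytes read, or -1 on error. */
static ssize_t tcp_loopback(const char *const msg, char *const buf,
			    const size_t buf_size)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(0),
		.sin_addr = { htonl(INADDR_LOOPBACK) }
	};
	socklen_t addr_len = sizeof(addr);
	ssize_t len = -1;
	int server_sk = -1;
	const int listen_sk = socket(AF_INET, SOCK_STREAM, 0);
	const int client_sk = socket(AF_INET, SOCK_STREAM, 0);

	if (listen_sk < 0 || client_sk < 0)
		goto out;

	/* Server: bind, find out which port we were given, then listen */
	if (bind(listen_sk, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(listen_sk, (struct sockaddr *)&addr, &addr_len) < 0 ||
	    listen(listen_sk, 1) < 0)
		goto out;

	/* Client: connect() performs the TCP handshake */
	if (connect(client_sk, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;

	/* Server: take the established connection off the queue */
	server_sk = accept(listen_sk, NULL, NULL);
	if (server_sk < 0)
		goto out;

	if (write(client_sk, msg, strlen(msg) + 1) < 0)
		goto out;

	/* In general a single read() may return less than was written */
	len = read(server_sk, buf, buf_size);
out:
	if (server_sk >= 0)
		close(server_sk);
	if (client_sk >= 0)
		close(client_sk);
	if (listen_sk >= 0)
		close(listen_sk);
	return len;
}
```

Note that accept returns a new socket for the connection, while the listening socket carries on waiting for further clients.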
Although things like QUIC now exist, TCP is generally used to serve web content. Let’s make a minimal HTTP web server to serve my static website. Now I have to warn you that HTTP is hugely complicated. We can get away with ignoring most of that complication, but we still end up with a fair old chunk of code.
#define _GNU_SOURCE
#include <limits.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
const char *const http_head =
"HTTP/1.1 200 OK\r\n"
"Connection: close\r\n"
"Content-Type: %s\r\n"
"Content-Length: %lu\r\n"
"\r\n";
static void serve_file(const int sk, const int public_dir)
{
char recv_buf[BUFSIZ];
char head_buf[BUFSIZ];
const size_t buf_len = BUFSIZ - 1;
char path_buf[256];
char *file_path;
ssize_t recv, sent;
size_t recv_total = 0, sent_total = 0;
int body_fd;
	while (recv_total < buf_len) {
		recv = read(sk,
			    recv_buf + recv_total,
			    buf_len - recv_total);

		if (recv < 0) {
			perror("[-] read");
			return;
		}

		if (!recv) {
			dprintf(STDERR_FILENO,
				"[-] End of data before header was received\n");
			return;
		}

		recv_total += recv;
		recv_buf[recv_total] = 0;

		if (strstr(recv_buf, "\r\n\r\n"))
			goto got_header;
	}
	dprintf(STDERR_FILENO,
		"Exceeded buffer reading header\n");
	return;

got_header:
	printf("[*] <<<\n%s\n", recv_buf);
	/* sscanf returns the number of fields matched, or EOF */
	if (sscanf(recv_buf, "GET %250s HTTP/1.1", path_buf) != 1) {
		dprintf(STDERR_FILENO,
			"[-] 'GET <file_path> HTTP/1.1' not matched in:\n %s",
			recv_buf);
		/* path_buf is not set, so give up on this request */
		return;
	}
	if (!strcmp("/", path_buf)) {
		strcpy(path_buf, "index.html");
		file_path = path_buf;
	} else if (path_buf[0] == '/') {
		file_path = path_buf + 1;
	} else {
		/* Otherwise file_path would be used uninitialised */
		dprintf(STDERR_FILENO, "[-] Path does not start with '/'\n");
		return;
	}

	printf("[*] Opening %s", file_path);
	body_fd = openat(public_dir, file_path, O_RDONLY);

	if (body_fd < 0 && errno == ENOENT) {
		strcpy(file_path + strlen(file_path), ".html");
		body_fd = openat(public_dir, file_path, O_RDONLY);
		printf(" failed trying with .html");
	}
	printf("\n");

	if (body_fd < 0) {
		perror("[-] openat");
		return;
	}
	const char *mime = "text/html";
	if (strstr(file_path, ".css"))
		mime = "text/css";
	if (strstr(file_path, ".map"))
		mime = "application/json";
	if (strstr(file_path, ".svg"))
		mime = "image/svg+xml";
	if (strstr(file_path, ".jpg"))
		mime = "image/jpeg";	/* the registered type for JPEG */
	if (strstr(file_path, ".png"))
		mime = "image/png";

	struct stat body_stat;
	if (fstat(body_fd, &body_stat)) {
		perror("[-] fstat");
		goto close_body;
	}

	/* st_size is an off_t, so cast it to match %lu */
	sprintf(head_buf, http_head, mime,
		(unsigned long)body_stat.st_size);
	printf("[*] >>>\n%s", head_buf);

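	/* Loop on the formatted header (head_buf), not the format string,
	 * and only ask for the bytes which are still unsent */
	while (sent_total < strlen(head_buf)) {
		sent = write(sk, head_buf + sent_total,
			     strlen(head_buf) - sent_total);

		if (sent < 0) {
			perror("[-] write");
			goto close_body;
		}

		sent_total += sent;
	}
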
	do {
		sent = sendfile(sk, body_fd, NULL, body_stat.st_size);

		if (sent < 0) {
			perror("[-] sendfile");
			goto close_body;
		}

		sent_total += sent;
	} while (sent > 0);

close_body:
	close(body_fd);
}

int main(const int argc, const char *const argv[])
{
	const pid_t orig_parent = getppid();
	const struct sockaddr_in self_addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9000),
		.sin_addr = {
			htonl(INADDR_LOOPBACK)
		}
	};
	struct sockaddr client_addr;
	socklen_t addr_len;

	/* Check argc before argv[1] is used */
	if (argc < 2) {
		dprintf(STDERR_FILENO,
			"usage: %s <dir to serve files from>\n",
			argv[0]);
		return 1;
	}

	const int listen_sk = socket(AF_INET, SOCK_STREAM, 0);
	const int public_dir = open(argv[1], O_PATH);

	if (listen_sk < 0) {
		perror("socket");
		return 1;
	}

	if (public_dir < 0) {
		perror("open");
		return 1;
	}

	if (bind(listen_sk, (struct sockaddr *)&self_addr, sizeof(self_addr))) {
		perror("bind");
		return 1;
	}

	if (listen(listen_sk, 8)) {
		perror("listen");
		return 1;
	}

	printf("[+] Listening; press Ctrl-C to exit...\n");

	while (orig_parent == getppid()) {
		/* accept() reads addr_len as the size of client_addr and
		 * overwrites it, so set it before every call */
		addr_len = sizeof(client_addr);
		const int sk = accept(listen_sk, &client_addr, &addr_len);
		if (sk < 0) {
			perror("[-] accept");
			break;
		}

		printf("[+] Accepted Connection\n");

		serve_file(sk, public_dir);
		close(sk);
	}

	return 0;
}
I tested this on Firefox and Chromium. Neither seemed too concerned that most of the things they asked for were ignored. They didn’t cope very well without the content-length header though, and Chromium also needs the MIME type to be spelled out for it.
All of the HTTP complication is in serve_file. So if we look in main, this shows what is involved in accepting an incoming TCP connection. The client side is simpler, you just need to call connect.
Inside serve_file we first load the whole HTTP header into a buffer. We do this by looking for the first instance of a line ending followed by another line ending (\r\n\r\n). HTTP doesn’t appear to set any limit on the size of a header. It also has a dreadful feature which allows “comments”, delimited by ( and ), to be put in some header fields. These can contain \r\n\r\n. It doesn’t matter to us though because we ignore most of the header and are not trying to be standards compliant.
The browser would prefer it if we kept the connection open between requests, but it’s easier for us just to close it. However it should be noted that opening and closing TCP connections is expensive. It seems that Firefox even preemptively opens a connection when you move your mouse towards a link.
Anyway, once we have some complete data then we scan the first line of it to get the URI path. We only accept paths up to 250 characters long which leaves another 5 characters for “.html” to be added, plus \0, the null character.
Unfortunately the C library’s string functions are prone to dangerous errors. It’s easy to overwrite the null terminating character \0 or to forget it requires extra space in buffers. Also you need to pay attention to whether functions like strlen count \0. Then there are the attempted fixes for these functions, like strncpy, which make matters worse by potentially leaving strings unterminated.
C itself does not help because by default there is no bounds checking. Although thorough testing with the address sanitizer enabled can help with that.
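One way to reduce the risk is to make the buffer size an explicit argument and check it before copying. The following is a sketch of that approach for the “.html” case; the helper append_suffix is made up for this example and is not part of the server above.

```c
#include <string.h>

/* Append `suffix` to the string in `buf`, which has `size` bytes of
 * storage. Returns 0 on success, or -1 if the result and its '\0' would
 * not fit, in which case `buf` is left unchanged. */
static int append_suffix(char *const buf, const size_t size,
			 const char *const suffix)
{
	const size_t len = strlen(buf);		/* does not count '\0' */
	const size_t suffix_len = strlen(suffix);

	/* We need len + suffix_len + 1 bytes in total */
	if (len >= size || suffix_len >= size - len)
		return -1;

	/* Copies the suffix characters plus the '\0' */
	memcpy(buf + len, suffix, suffix_len + 1);
	return 0;
}
```

For example, appending “.html” to “index” needs 11 bytes, so it succeeds with an 11 byte buffer and fails with a 10 byte one.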
Eventually we open the file requested. Which, as the file path is not validated, could include any file on your system. We use openat which takes, as the first argument, a file descriptor for a path to a directory. Not the directory itself, just the path to that directory. The second argument is the file path relative to the directory described by the FD. This avoids having to construct the full file path with sprintf or similar.
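A minimal sketch of the missing validation could reject absolute paths and any path with a .. component before calling openat. The function name path_is_contained is made up for this example, and it is only a partial check: a symbolic link inside the served directory can still point outside of it. Newer Linux kernels also provide openat2 with the RESOLVE_BENEATH flag, which lets the kernel enforce this instead.

```c
#include <string.h>

/* Return 1 if `path` is relative and has no ".." component, else 0 */
static int path_is_contained(const char *path)
{
	/* openat() ignores the directory FD for absolute paths */
	if (path[0] == '\0' || path[0] == '/')
		return 0;

	while (*path) {
		/* Length of the current component, up to the next '/' */
		const size_t len = strcspn(path, "/");

		if (len == 2 && path[0] == '.' && path[1] == '.')
			return 0;

		path += len;
		if (*path == '/')
			path++;
	}

	return 1;
}
```

In serve_file this could be called on file_path just before the first openat, returning early when it gives 0.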
We then stat the file to get its size for the content-length header. The header is formatted and sent before writing the file content to the socket with sendfile.
The sendfile system call shown here is unique to Linux, although FreeBSD has a similar one, as no doubt other kernels do. It avoids having to read the file into a buffer before writing it back to the socket. The reason for this function’s existence is probably performance. However it also happens to make things simpler, which is why it’s used here.
Once we are finished sending the file, the FD and socket are closed. Then we wait for the next connection.
[1] I’m using the term object in a loosely defined way. There are a number of C structs and associated data used to represent a socket in the kernel. Exactly what is encapsulated in the socket object and what is external to it is unclear.