ZSV: Swiss army knife CLI for CSV
ZSV is a collection of high-performance tools for reading, manipulating and viewing CSV files. At its core is a library for processing CSV files in C, and on top of that sits an extensible CLI and TUI.
The CLI has commands for converting CSV to JSON, selecting a subset of columns or rows from the CSV, and normalising the results. It is compatible with Excel, so it can be used to clean files that other tools would reject.
In this article we are going to focus on the sheet command: a CSV viewer that displays the file in a terminal window and lets you browse it.
Reading very large CSV files
CSV files are a simple text format where each row can vary in length. The only way to know a row’s length is to read it into memory and parse the data. If you are only interested in the millionth row of a CSV file, you still have to read all of the rows preceding it.
This presents a problem if you wish to open a CSV file and skip around in it, especially if you want to skip to the end and still know what the row numbers are.
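To make that cost concrete, here is a minimal sketch (not ZSV’s actual parser, and ignoring quoted fields that can contain embedded newlines) of what “skipping” to a given row really entails:

#include <stdio.h>

/* Naive scan to the start of row n (0-based). Returns the byte offset,
 * or -1 on failure. Every preceding byte must be read: there is no way
 * to find row n without parsing rows 0..n-1 first. */
long offset_of_row(FILE *f, long n) {
    long row = 0;
    int c;
    if (fseek(f, 0, SEEK_SET) != 0) return -1;
    while (row < n && (c = fgetc(f)) != EOF) {
        if (c == '\n') row++;  /* real CSV parsing must also honour quoting */
    }
    return (row == n) ? ftell(f) : -1;
}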
ZSV can parse even quite large CSV files from start to finish in under a second, without using a large amount of memory. The sheet command in particular only loads the chunks of the file that it needs to display.
However if you have a 13GB CSV file then parsing it from start to finish can take several seconds. If you want to jump around to particular line numbers then the closer you get to the end of the file, the longer it’ll take to jump.
The full 13GB of CSV is not loaded into memory all at once. My laptop has more than enough memory to do that, but imagine you are trying to browse several of these files simultaneously. Being forced to close one file to open another is inconvenient.
So you can imagine that if the user is trying to jump between rows towards the end of the file, there will be unbearable latency. And if the CSV is appended to so that the newest entries are at the end, skipping to the end is exactly what users commonly do.
Improving performance with an index
Let’s imagine for a second that our CSV file has a fixed row size, where each line is the same number of bytes. In this case we could jump directly to any line we desired using a simple calculation:

offset = row_number * row_length
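If that were true, a reader could seek straight to any row without an index. A sketch under that (purely hypothetical) fixed-width assumption:

#include <stdio.h>

/* With fixed-width rows, the offset is pure arithmetic: no parsing needed. */
int read_fixed_row(FILE *f, long row_number, long row_length,
                   char *buf /* at least row_length bytes */) {
    if (fseek(f, row_number * row_length, SEEK_SET) != 0) return -1;
    return fread(buf, 1, (size_t)row_length, f) == (size_t)row_length ? 0 : -1;
}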
CSV files are not usually fixed width, however, so we have to parse the file and count the rows. On smaller files this operation is so quick that the user will not notice. On larger files it can take considerable time.
For smaller files we can load each line into an array entry. Getting a row then becomes a case of looking up an array entry. Array lookups are a constant-time operation (O(1)) and are very quick, faster than a lookup on a more complicated data structure that also offers constant-time lookups (e.g. a hash map).
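As a rough sketch of that small-file approach (hypothetical helper names, not ZSV’s internal code, and assuming no line exceeds the buffer):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read every line into memory once; afterwards any row is an O(1) lookup. */
typedef struct { char **lines; size_t count; } line_table;

int load_lines(FILE *f, line_table *t) {
    char buf[4096];  /* assumes lines shorter than 4096 bytes */
    t->lines = NULL;
    t->count = 0;
    while (fgets(buf, sizeof buf, f)) {
        char **grown = realloc(t->lines, (t->count + 1) * sizeof *grown);
        if (!grown) return -1;
        t->lines = grown;
        t->lines[t->count++] = strdup(buf);  /* strdup is POSIX */
    }
    return 0;
}

/* Constant-time row access. */
const char *get_row(const line_table *t, size_t n) {
    return n < t->count ? t->lines[n] : NULL;
}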
Meanwhile, for larger files we don’t want to load the full CSV data into memory. If the CSV file had fixed row sizes then we wouldn’t have to: we could calculate where the row exists inside the file and read just that location. However, the rows are different lengths.
What we can do is store the offset of each row’s line end or line beginning in an array. Then when we want to read a particular row we can look up its location in the array. This array is our index: it allows us to read any row in constant time, and we never need to load the whole CSV file into memory at once.
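One possible shape for such an index, with hypothetical names rather than ZSV’s internal API (and again ignoring quoted fields for brevity):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical full index: one 64-bit file offset per row. */
typedef struct { long long *offsets; size_t count; } row_index;

/* Record the start offset of every row. */
int build_index(FILE *f, row_index *ix) {
    size_t cap = 1024;
    int c;
    ix->offsets = malloc(cap * sizeof *ix->offsets);
    if (!ix->offsets) return -1;
    ix->count = 0;
    ix->offsets[ix->count++] = 0;            /* row 0 starts at offset 0 */
    while ((c = fgetc(f)) != EOF) {
        if (c != '\n') continue;
        if (ix->count == cap) {
            long long *grown = realloc(ix->offsets, (cap *= 2) * sizeof *grown);
            if (!grown) return -1;
            ix->offsets = grown;
        }
        ix->offsets[ix->count++] = ftell(f); /* the next row starts here */
    }
    return 0;
}

/* Any row is now one array lookup and one seek away. */
int seek_to_row(FILE *f, const row_index *ix, size_t n) {
    if (n >= ix->count) return -1;
    return fseek(f, (long)ix->offsets[n], SEEK_SET);
}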
Note that the performance improvement from this change is so dramatic that we can rely on simple functional testing to validate it. Focusing on radical performance improvements has a number of advantages, not least that their effect is easy to verify.
Reducing the index’s memory footprint
If a CSV file contains a large number of rows where each row is only a few bytes, then the index may not be much smaller than the data itself. Indeed, if we use 64-bit file offsets then that is 8 bytes per index entry, so a CSV file averaging fewer than 8 ASCII characters per line will be smaller than its index.
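To put rough numbers on it: if the 13GB file from earlier averages 100 bytes per row, it holds around 130 million rows, so a full 64-bit-offset index would occupy roughly 1GB on its own.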
In addition, the ZSV sheet command typically loads 1024 lines at a time. So even if we were to systematically jump around a file until we had seen every line, most of the index entries would never be used.
Finally, reading 1024 lines with ZSV is very fast unless the lines are exceptionally large. So storing the location of every line end is both wasteful and unnecessary.
Therefore ZSV takes the approach of only storing every 1024th line end in the index. If a line number is requested that is not divisible by 1024, then the nearest preceding index entry is used, and the rows leading up to the requested line are parsed and discarded.
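A sketch of a lookup against such a sparse index (hypothetical names; for simplicity it stores line beginnings and, as before, ignores quoted fields):

#include <stdio.h>

#define INDEX_STRIDE 1024  /* one stored offset per 1024 rows */

/* Sparse index: entry i holds the offset of row i * INDEX_STRIDE,
 * shrinking the index by a factor of 1024. */
typedef struct { long long *offsets; size_t count; } sparse_index;

/* Seek to row n: jump to the nearest preceding indexed row, then
 * parse and discard at most INDEX_STRIDE - 1 rows. */
int seek_to_row_sparse(FILE *f, const sparse_index *ix, size_t n) {
    size_t entry = n / INDEX_STRIDE;
    size_t skip  = n % INDEX_STRIDE;
    int c;
    if (entry >= ix->count) return -1;
    if (fseek(f, (long)ix->offsets[entry], SEEK_SET) != 0) return -1;
    while (skip > 0 && (c = fgetc(f)) != EOF)
        if (c == '\n') skip--;               /* discard intervening rows */
    return skip == 0 ? 0 : -1;
}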