Generating type specific deserialisers for BSON.jl

This was orgiinally posted here

I have spent quite some time trying to optimise the excellent BSON.jl library as we make heavy use of it. The result of this so far is some updates to the original library and a fork called BSONqs which has more radical changes (more on that at the bottom).

I like working with Julia’s structs and organise most of my data into structs. Even if there is no need to do so for performance reasons and it introduces some rigidity, I still prefer it over working with generic data structures like Dicts or tables. I have a bad memory so I will forget what properties the data has if it is not written down somewhere.

After trying a few different methods which will serialise arbitrary Julia structs in a reliable way I found that BSON.jl worked the best for what I am doing. I won’t go into any more details on that because at the time many libraries were simply broken in Julia 1.0 or didn’t exist.

We store several GBs of very boring Linux test results in Redis and load large chunks of it into RAM. This can take up to 30s and a lot of that time is spent by BSON.jl. So I decided to try optimising its read performance as much as possible.

Like most (de)serialisers BSON.jl first takes the raw BSON document and parses that into an intermediate format. Which originally was a set of nested Dicts. The dictionaries mirror the raw BSON very closely. Below you can see how that looks.

julia> struct Foo

julia> io = IOBuffer();

julia> BSON.bson(io, Dict(:foo => Foo(Set(["baz"]))))

julia> seek(io, 0);

julia> BSON.parse(io)
Dict{Symbol,Any} with 1 entry:
  :foo => Dict{Symbol,Any}(:tag=>"struct",:type=>Dict{Symbol,Any}(:tag=>"datatype",:params=>Any[],:name=>Any["Main", "Foo"]),:data=

julia> ans[:foo]
Dict{Symbol,Any} with 3 entries:
  :tag  => "struct"
  :type => Dict{Symbol,Any}(:tag=>"datatype",:params=>Any[],:name=>Any["Main", "Foo"])
  :data => Any[Dict{Symbol,Any}(:tag=>"struct",:type=>Dict{Symbol,Any}(:tag=>"datatype",:params=>Any[Dict{Symbol,Any}(:tag=>"dataty…

After doing a few basic optimisations, the profiler showed that a lot of time was being spent simply allocating dictionaries for the intermediate format. From the above you can see the dictionaries have very few members and these are always the same for serialised Julia data structures.

So instead of using Dicts I defined some structs for use in the intermediate layer instead.

"""Type class which represents a tagged dictionary

Tagged dictionaries are used to represent complex Julia types. Using a struct
instead of an actual Dictionary requires less memory allocation and allows us
to use multiple dispatch on the resulting tree structure.

It inherits abstract dict just for show."""
abstract type Tagged <: AbstractDict{Symbol, Any} end

"Type class for types which can occupuy the 'type' field in a struct"
abstract type TaggedStructType <: Tagged end


mutable struct TaggedType <: TaggedStructType

mutable struct TaggedStruct <: TaggedStructType

This resulted in an annoying and complex type system (maybe necessarily, maybe not), but saved a huge amount of memory. Initially the performance was worse, which was pretty upsetting, but I figured out this was because I had increased the number of type unstable functions resulting in more dynamic dispatch. After fixing some of that I got a 100% speedup.

At this point I decided to go in a completely different direction and try to cut out the intermediate layer altogether. This basically means we now have three BSON parsers:

  1. Using Dicts for the intermediate layer (in BSON.jl)
  2. Using dedicated types for the intermediate layer (in BSONqs.jl with load_compat)
  3. No intermediate layer and type specific parsing (in BSONqs.jl with load)

One could argue that 3. still has an intermediate data structure, but that it is allocated on the stack in the form of a finite state machine. The wisdom of removing the intermediate layer is debatable as it makes a lot of things harder to debug and reason about. Also the intermediate layer may also have better cache efficiency.

On the other hand, with no intermediate layer we can copy data directly from the input stream/buffer to the native Julia data structures. In theory we could even make it zero-copy if Julia’s memory system allows it. This could be extremely useful when parsing massive, contiguous arrays of numerical data from a memory-mapped file. As we would not even need to read most of the data until it is used in a computation.

The third parser is by far the quickest for large sets of structs. However while I have differentiated the parsers by the intermediate layer or lack thereof. The third parser is also more aware of the resultant Julia type of the data being parsed.

In many cases we know in advance what type the data should be that we are parsing. For example take the following structure.

struct Foo

bson(io, Dict(:foo => Foo(rand(Int, 1000))))

When we come to parse the :foo entry, we will have to determine its type (which is Foo) first. However after that point we can switch to a parser specifically generated to parse Foo. For Foo’s only member bar, we don’t even need to parse its type because we already know what it is from the containing struct’s field definition. In theory we can call a generated method which does the bare minimum necessary for parsing a Vector{Int} and is type stable.

In practice, this is pretty close to how the third parser actually works, although there is definitely room for improvement. It achieves this through the use of two strategies. Firstly by using multiple-dispatch on concrete types, thus allowing the Julia compiler to work its magic, creating type specific code from generic methods.

In places where that doesn’t work, it uses @generated methods, which allow us to explicitly generate code based on the types of the method arguments. I doubt the below code will make much sense taken out of context (or in context for that matter), but hopefully it can give you an idea of what this looks like.

# This is called when we know what type 'T' we want. There are many
# definitions of parse_specific, this one is the most generic and so
# is called when all the others have failed to match. The assumption here is
# that T will be a struct type (if it is concrete) as all other types will
# match with a more specific parse_specific method (say what you like about my
# naming)
function parse_specific(io::IO, ::Type{T}, tag::BSONType,
                        ctx::ParseCtx)::T where T

  # structs are always represented by BSON documents
  # note that this runs normally at 'runtime' before any generated code or the
  # fallback
  @asserteq tag document

  # This tells the compiler this is a generated method. The code in the first
  # branch of this if statement is run at/before 'compile time' and returns an
  # AST (expression). The AST is then compiled, this is the same as for macros.
  # The difference is macro's have no access to type information.
  if @generated
    # If it is not a concrete type we fall back to a method which will first load
    # the embedded type information from the BSON document. This will happen
    # if (for example) you put abstract or union types in your struct definitions
    if !isconcretetype(T)
      return :(parse_tag(io, tag, ctx))

    # If it is a concrete struct, this chunk of code looks through the BSON
    # document for members we are 
    # expecting in a serialised struct or reference to a struct and saves
    # pointers to what it finds. This can be simplified if we can garuantee
    # the order the elements will appear in, I am being cautious for now
      startpos = position(io)
      len = read(io, Int32)
      local data::BSONElem
      ref = nothing

      for _ in 1:4
        if (tag = read(io, BSONType)) == eof

        k = parse_cstr_unsafe(io)
        if k == b"tag"
          @asserteq tag string
        elseif k == b"ref"
          ref = BSONElem(tag, io)
        elseif k == b"type"
          @asserteq tag document
        elseif k == b"data"
          @asserteq tag array
          data = BSONElem(tag, io)

        skip_over(io, tag)

      endpos = position(io)
      @asserteq (startpos + len) endpos

      # Objects which appear in the data more than once are lowered into a special
      # array and their instances replaced with indexes/references into that
      # array before serialisation to BSON.
      if ref ≠ nothing
        # Follow the reference into the array to get/parse the actual struct
        ret = parse_specific_ref(io, $T, ref, ctx)::$T
        seek(io, endpos)
        return ret

      seek(io, data.pos)
      # This is where most the work is done to reconstitute the struct T
      ret = load_struct(io, $T, data.tag, ctx)
      seek(io, endpos)
    # Fallback code to use if the compiler decides not to use the
    # generated version
    parse_tag(io::IO, tag, ctx)::T


function load_struct(io::IO, ::Type{T}, dtag::BSONType, ctx::ParseCtx)::T where T
  if @generated

    # This time we have 4 alternative code blocks which may be produced
    # depending on the type of struct. Fistly we have some code to build C
    # style structs with a straight forward memory layout. These structs
    # only contain bits types themselves (like Int, Float, ComplexF64).
    if isprimitive(T)
      @assert isbitstype(T)
        @asserteq dtag binary
        bits = parse_bin_unsafe(io, ctx)
        # This calls a C function in the Julia runtime which copies 'bits'
        # directly into a new Julia struct.
        ccall(:jl_new_bits, Any, (Any, Ptr{Cvoid}), $T, bits)
    elseif T <: Dict
      # dictionaries have their own special representation in BSON
      :(load_dict!(io, $T(), ctx))
    elseif fieldcount(T) < 1
      # Now we have to deal with the type of struct which has an
      # unknown memory layout and we have to build one field at a time.
      n = fieldcount(T)
      @assert n > 0
      FT = fieldtype(T, 1)

      block = :(x = ccall(:jl_new_struct_uninit, Any, (Any,), $T);
                setref(x, ctx);
                parse_array_len(io, ctx);
                tag = parse_array_tag(io, ctx);
                f = parse_specific(io, $FT, tag, ctx)::$FT;
                ccall(:jl_set_nth_field, Nothing, (Any, Csize_t, Any), x, 0, f))
      for i in 2:n
    # this loop is at compile time, we concatenate the previous contents
        # of 'block' with the current field expression. There will be no loop in the
        # final code just a list of type specific calls to parse and set field
        FT = fieldtype(T, i)
        block = :($block;
                  tag = parse_array_tag(io, ctx);
                  f = parse_specific(io, $FT, tag, ctx)::$FT;
                  ccall(:jl_set_nth_field, Nothing, (Any, Csize_t, Any), x, $i-1, f))

      :($block; x)
    # Fallback code omitted, it looks similar to the above but without expressions

This results in a ~3.5x performance improvement in my benchmarks. Probably in the best case, with only concrete types, it would be ~4x. However this is on the second run with a large data set. On the first run, with any dataset which is not huge, the compilation time can easily come to dominate and the compilation time for the third parser is much longer. This is probably because it must generate far more type specific code.

I am not sure exactly what to do next with this, if anything, although it clearly would benefit from some cleanup, but who has time for that? I have forked the package because I have practically rewritten most of the library and changed its behavior (although it is still mostly compatible). Definitely some of these improvements can be added to the original (with effort), for others it is not so clear if they are wanted as no one has shown an interest in it.

There is nothing strictly binding me to this particular BSON format either and there is probably something intrinsically more efficient. For a start a lot of type information could be omitted. On the other hand, I’d like to keep up the pretense of being compatible with something that has an existing user base.