Today we’re going to look at a simple serialization library written in Rust. Simple and utterly terrifying.
The library, Abomonation (typo intentional), uses Rust’s
unsafe keyword in some interesting ways. The intent is that it is actually safe if you only deserialize binary data serialized by the library.
I should say that many of the things going on in here look a lot like what goes on in CapnProto. If you are serious about being a grown-up about things, I would absolutely look over there. Among other things, it has the advantage of not necessarily lowering your opinion of me, unlike this post.
Serialization and Deserialization
Our goal is to write a pair of methods with the following signatures:
That is, if you present a slice of data
&[T] we can populate a
Vec<u8> with appropriate binary data. Similarly, if you provide us some binary data
&[u8], we can interpret it as typed data
&[T] for you.
We aren’t going to do this by the end of the post, but we’ll have something pretty similar. These are the actual signatures at the moment (the implementations are at the end of the post; don’t read them yet!):
There are important distinctions here, so let’s briefly think about what’s being said by the interfaces.
- We are only going to be able to do this for some types
T, those implementing
Abomonation. This makes sense, because some types are hard to serialize. The name doesn’t make sense. Yet.
- We are taking a
&mut [u8]rather than a
&[u8]. That’s weird. Why would we need exclusive access to the input data? Are we going to change the binary data we’ve received? (I’m so sorry.)
- We are returning a reference to a
Vec<T>. Where did that
Vec<T>come from and who owns it? The
decodefunction doesn’t take one as a parameter, and you (usually) don’t just get references to owned data out of nowhere.
Well isn’t that just a big pile of mysterious language? Apparently we are going to do some things, and you should all be terrified.
Encoding and Decoding nice data
Let’s start with some easy examples, ones that use the
unsafe keyword but which we’ll pretty easily convince ourselves are really pretty friendly. We’ll get to learn a bit about how Rust manages its data under the covers, and maybe get our feet wet. What could go wrong?
The following function takes a slice of typed data and presents it back as a slice of binary data.
unsafe label on the function. This is because it uses the
from_raw_parts method, which is itself unsafe (it doesn’t check to make sure that the underlying data look correct). In this case it is safe, because we only want to look at the binary data (not mutate it) and every byte is a valid
I could choose to remove the
unsafe label and put unsafe interally like so, declaring the method safe:
I have been warned this is bad, though, because reading a struct-padding byte is undefined behavior. This will not be the most irresponsible thing that happens in this post.
On the other end of the safety spectrum, I present:
This not only exposes mutable slices, but permits any type at all. Imagine what would happen if
String, for example:
String owns its
[u8] buffer and cointains a pointer to it. You would be able to dereference arbitrary places in memory! Danger!! Danger!! The
unsafe label stays on for this one.
Imagine you just saw
bytes_to_typed eat a goat. It is powerful, but the
unsafe fence protects us. Just like in that movie with all those cautionary tales that I didn’t stay for the end of.
You said “nice data”
These two methods work great if all you want to serialize are slices of primitive types. If you have a
&[u64] you can just transmute it over to a
&[u8] and write out the data. If you get a
&[u8] back you can (in good conscience) know that transmuting it to
&[u64] works well.
We are going to use the above two functions in our implementation of
decode, so let’s be brave and add the following sorts of implementations:
Notice that these implementations don’t do anything, they just indicate that it is safe for
decode to transmute them around. In fact,
Abomonation will have a few methods, but these types can just use the default implementations.
The Abomonation stirs …
Primitive types are neat and all, but we are serious people! What about things like tuples? Surely I should be able to transmute a
&[(u64, i8)] to binary and back, right? Yes, you should. But…
… it’s complicated. You see, we built this trait to reach beyond primitive types. Much further beyond.
Ok, let’s just write down this trait.
Wow, grim. In case
unsafe wasn’t enough, I chose creepy method names. Creepy, but appropriate!
The two methods have the following intents:
entomb: having written
&selfas binary, serialize any further data it might own.
exhume: having populated
&mut selfwith binary data, deserialize data it owned.
The reason this trait exists is because our method
encode(&Vec<T>, &mut Vec<u8>) is just going to copy the contents of the typed vector into the binary vector. Seems harmless enough for the types implementing
Abomonation so far, but we will use
entomb to check with each
T to see if they want to write any more. Similarly,
exhume on each element to recover any owned data.
Wait, owned data? What? I mean,
u64 doesn’t own any data, so does it need to do anything? Nope!
All those other primitive types don’t need to either. They can use the default empty implementation.
All those happy townsfolk, oblivious to what awaits…
We got this far because we needed to talk about tuples. Let’s see how we might implement the weirdly capable
Abomonation for a pair of abomonable types.
Really, the whole interface is probably just messsing with forces we don’t understand.
For example, you probably don’t understand why
exhume uses a
&[u8] when we said up above that we were going to use
&mut [u8]. In retrospect, there were warning signs that something was wrong.
Late in the lab one night …
Some drinks in, an ambitious scientist finds she needs to serialize a
&[Vec<Vec<u64>>]. She thinks:
Hey, couldn’t I implement
Well, you can certainly type it.
You know, this might just work. A
Vec is a pointer, a length, and a capacity. Three
usize values. Serializing this triple doesn’t serialize the
Vec’s contents, but we can just do that in
entomb. Similarly, when we pick up the serialized form of the
Vec, a triple, it doesn’t point at anything valid (to our great shame, it points at some location in memory wherever we serialized this), but
exhume gives us a chance to read valid data back and make things right.
If the scientist did something responsible in
exhume, like create a new
Vec<T> with new backing memory and assign it to
*self, the story ends happily. But our story doesn’t follow that path.
Writing the code that shouldn’t be written
The scientist, unfortunately, was mad with power. She demanded zero allocations. She would use the binary data as provided to back her creepy, hard-to-reason-about deserialization machinations.
Rust provides unsafe functions capable of assembling a
Vec out of “raw parts”: a pointer, a length, and a capacity. Our scientist is going to connect wires not meant to be connected, by building a
Vec using a pointer into the supplied byte array.
Now, our scientist may be power-mad, but she is still a scientist so she’s going to do this properly.
Pretty much all of these lines (everything except step
1.) are unsafe. Because the method has the
unsafe tag, we don’t need to flag individual regions as unsafe. They are all quite unsafe.
Let’s take a tranquil moment to reflect on what’s been done, and what might happen. We’ve minted a
Vec<T> pointing at memory that cannot be freed. The memory is valid, and the rest of the
Vec<T> appears legitimate (length, capacity). But, as soon as we return from this method that
Vec<T> is loose in the country-side and who knows what might happen.
Well, what could go wrong, really?
You know, actually it all sort of works.
Yeah, there is a disasterous mess of things here, and really Rust’s memory safety isn’t touching any of this with a ten foot lifetime bound. But let’s talk about the possible issues.
First, let’s be clear that
unsafe. Don’t call it. I haven’t figured out how to keep people from calling it yet (public traits export all their methods), but just don’t. It makes a few assumptions about how it will be called. It assumes that:
&[u8]data is, in fact, exclusively held. We go and transmute it to
&mut [u8]which is very wrong unless we are sure that we have the only reference. Fortunately,
&mut [u8]as its argument, and this ensures that our references are exclusive. We are also careful to partition the
&[u8]into disjoint parts when we use it, to maintain this invariant.
&mut selfparameter to
exhumewill only be presented outwards as a
&self. We can’t let anyone mutate or own this data. Trying to add elements to
Vecwill trigger a re-alloction, which would attempt to release the memory. It is very important that all references to the data are just that, only references. And references with lifetimes bounded by that of the source
&mut selfparameter to
exhumewill never be dropped. Dropping the
Vecwould attempt to release the backing memory, which … well it isn’t going to work out. Fortunately, there never was a
Vec. There were some bytes, and we pretended that a reference to the bytes was actually a reference to a
Vec, but there was never a
There are probably some other assumptions. Everything breaks if you pass in invalid data, for example. Don’t do that either.
In fact, sorting out whether these assumptions are properly nailed down, or whether something truly horrible has happened, is basically what I’m up to now. I don’t really know, and it isn’t particularly clear what I need to do to use
unsafe correctly. I invite your thoughts and criticism.
Rust’s position is, perhaps refreshingly honestly: “yeah, it’s unsafe. you even said so.”
What about our ambitious, power-mad scientist; where did she get off to?
While we have been fretting about what might have gone wrong, she is experiencing excellent throughput numbers. Let’s look at a few using Rust’s benchmarking facilities.
Here is a helpful routine to spin Rust’s
Bencher struct on any type implementing
We try it out with types
Vec<Vec<(u64, String)>> and see:
test bench_enc_u64 ... bench: 411 ns/iter (+/- 84) = 19990 MB/s test bench_enc_string ... bench: 12039 ns/iter (+/- 3330) = 2893 MB/s test bench_enc_vec_u_s ... bench: 12578 ns/iter (+/- 1665) = 3482 MB/s
String is just
format!("grawwwwrr!"). Results may vary with the string’s length, silliness.
I should also point out that the throughput is not goodput. It includes the extra pointers and capacities and stuff we didn’t need to send. The numbers do get a bit worse, especially for short strings. It’s the sort of thing you could fix by having a buffer implement
decode and push/pull just lengths. However, you do need to stage the
Vecs somewhere, and our approach let
bytes do that for us.
What about deserialization? That is where all the mess is, right? Well, it is a bit hard to measure deserialization speed fairly, because we are really not doing much other than fixing up some pointers.
I did put some testing code in place to make sure that we were getting the right results back:
These give pretty comparable results. Apparently pushing a bunch of bytes works at about the same rate as checking that those bytes are what you expected. Again, not goodput numbers; caveat emptor.
test bench_dec_u64 ... bench: 525 ns/iter (+/- 262) = 15649 MB/s test bench_dec_string ... bench: 11289 ns/iter (+/- 2432) = 3086 MB/s test bench_dec_vec_u_s ... bench: 12557 ns/iter (+/- 2183) = 3488 MB/s
Just to show how silly things are, here are the numbers where we comment out the assert loop, just checking the resulting lengths.
test bench_dec_u64 ... bench: 2 ns/iter (+/- 0) = 4108000 MB/s test bench_dec_string ... bench: 2625 ns/iter (+/- 1031) = 13272 MB/s test bench_dec_vec_u_s ... bench: 3020 ns/iter (+/- 1266) = 14503 MB/s
Hey look, 4TB/s deserialization. Maybe it is time to head back to Silicon Valley, maybe start things up.
Appendix: encode and decode
I wanted to put the implementations up here, but they aren’t very pretty. In particular, I suspect they could be greatly improved (possibly by deletion).
The whole pile of stuff will be up on github and crates.io eventually, but I wanted to take at least a bit of time with other eyeballs on this to see if a) it is all horribly wrong, and b) whether the risk is ever worth it.
Also, I should obviously avoid serializing pointers to your memory locations. No one needs to see that.