If you’ve ever built a non-trivial software system, especially any kind of distributed system, you’ve probably found yourself in need of a portable and efficient mechanism for storing and exchanging data. This is precisely what both Apache Thrift and Google’s Protocol Buffers provide: a language- and platform-neutral way of serializing structured data for use in communication protocols, data storage and the like. Of course, several smart people have attacked this problem over the years, and as a result there are several good open source alternatives to choose from, including (but not limited to) Avro and plain old JSON or XML.
But Thrift and protobuf are by far the most popular, and a common question people ask is: which one should I use? Most discussion around Thrift and protobuf is polarized around performance and/or features, but I’m afraid the reality is more nuanced than that. My hope with this post is to shed some more light on these two systems and how you should go about evaluating which is best for your needs.
Features

One of the key attractions of Thrift is that out of the box it has more features and supports more languages and platforms than protobuf. Here are a few concrete examples:
- The protobuf runtime only provides serialization and deserialization. It can generate some generic RPC stubs but the runtime does not ship with an RPC implementation. Thrift provides RPC implementations (both client and server) across multiple languages, including asynchronous variants in many languages.
- The thrift grammar is much richer than protobuf in terms of supported constructs — you can specify typedefs, constants, unions, lists, sets and maps.
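To make the grammar difference concrete, here is a hypothetical Thrift IDL sketch using these constructs together (the type and field names are invented for illustration, not taken from any real schema):

```thrift
// Hypothetical Thrift IDL showing constructs that the protobuf
// grammar has no direct equivalent for.

typedef i64 Timestamp            // typedef: alias for a base type

const i32 MAX_RETRIES = 5        // named constant

union Payload {                  // union: exactly one field is set
  1: string text,
  2: binary blob,
}

struct LogEntry {
  1: Timestamp when,
  2: Payload payload,
  3: list<string> tags,          // ordered collection
  4: set<i32> shard_ids,         // unique elements
  5: map<string, string> attrs,  // key/value pairs
}
```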
However, these features do come at a price: the protobuf implementations are far more consistent and robust across all the officially supported target languages. For instance, even though the Thrift grammar supports unions, native support for unions is only provided for Ruby and Java; all other languages fall back to a regular ‘struct’. Read through the following sections for more details on where Thrift is still sub-par compared to protobuf.
Bottom line: Thrift has more features and supports more languages out of the box, but protobuf has more robust and consistent implementations.
Performance

The serialization/deserialization performance of Thrift and protobuf is very similar, and I would not consider it a useful metric for evaluation. RPC framework performance is a different matter, though: Thrift uses its own network stack (as opposed to something like Netty), and there are no official RPC implementations for protobuf, so a fair comparison is hard to do.
If you really want some performance numbers, Eishay Smith maintains a good benchmark.
Bottom line: Serialization/deserialization performance are unlikely to be a decisive factor
Update: This page made it to Hacker News! As some of the commenters have noted, it seems like Thrift serialization/deserialization for Python is much faster than the protobuf default implementation.
Code Quality and Design
Thrift was originally written by some ex-Googlers who left for Facebook, so naturally the two systems have a lot in common. However, the protobuf code base feels a lot more mature and robust than Thrift’s. This is in part due to Thrift’s turbulent evolution: it was open sourced by Facebook in April 2007, probably to speed up development and leverage the community’s efforts. Since then it has done time in Apache incubation and recently graduated to a top-level Apache project. Between mailing list transitions, constantly shifting home pages, unclear style guides and conflicting priorities, no one person or organization had good ownership over Thrift development. And the code reflects this evolution.
In contrast, protobuf had been in heavy use at Google for many years before they finally open sourced it in July 2008. Google puts a very strong emphasis on coding style, consistency and testability and again, the code reflects this.
To get a feel for the code bases, here are a few examples:
Let me also give some examples where the protobuf design works better (some of this is arguably my personal opinion):
- the protobuf compiler basically generates a tree of “Descriptor” objects. Various “CodeGenerator”s then operate on this Descriptor tree to output language-specific code. This neatly decouples the parsing process from the code generation process. A nice side-effect is that now people can independently write code generators for new languages and the protobuf compiler can basically invoke these “plugins” at runtime. In contrast, the Thrift compiler is tightly coupled with the code generators. This leads to a ton of code duplication when writing generators for a new language. Not to mention having to work with the monstrous thrift codebase just to write a new code generator.
- protobuf-generated Java classes use the builder pattern, which makes it easy to detect errors such as required fields not being set at construction time. In contrast, Thrift-generated Java objects can be constructed willy-nilly and will not catch such errors until they are serialized/deserialized.
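To see why the builder pattern helps, here is a minimal hand-written sketch of the pattern as protobuf-generated Java classes use it. The `Person` class and its fields are invented for illustration; this is not actual generated code, just the shape of it:

```java
// A hypothetical message class, sketched in the style of
// protobuf-generated Java code: immutable, built via a Builder.
class Person {
    private final String name;
    private final int id;

    private Person(String name, int id) {
        this.name = name;
        this.id = id;
    }

    String getName() { return name; }
    int getId() { return id; }

    static Builder newBuilder() { return new Builder(); }

    static class Builder {
        private String name;   // null until set
        private Integer id;    // boxed so "unset" is detectable

        Builder setName(String name) { this.name = name; return this; }
        Builder setId(int id) { this.id = id; return this; }

        Person build() {
            // Required fields are validated at construction time,
            // not at serialization time as with a plain mutable struct.
            if (name == null || id == null) {
                throw new IllegalStateException("missing required field");
            }
            return new Person(name, id);
        }
    }
}
```

A plain mutable Thrift-style struct would happily let a half-initialized object float around until serialization; the builder moves that failure to `build()` time, much closer to the actual bug.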
There are many other such examples throughout the code base.
Bottom line: protobuf has a better design and overall higher code quality (generated as well as protobuf compiler and libraries) than thrift.
Development Process and Open-ness
Thrift is an Apache project, so arguably Thrift development is as open as it gets. Most of the development is driven by an open issue tracker; anyone is free to contribute patches; no person or organization has tight control over the project direction. In general, the Apache way is based on meritocracy and has worked very well for several highly successful projects (such as httpd and various XML parsers). Of course, the amorphous and evolutionary nature of this process has its shortcomings: there is a lot of feature creep, the review process is quite ad hoc, there is no well-defined roadmap or predictable release cycle, and no single party has strong incentives to ensure that ALL of the code base is in top shape.
Protocol Buffers, while open source, are hardly open when it comes to the development process. You are free to file issues and submit patches, but at the end of the day development is largely controlled and guided by Google. Of course, one is free to fork off protobuf at any point, if this process is unsatisfactory. So far, it hasn’t been an issue for many users. In fact, I feel that many protobuf users (especially businesses) probably like the fact that there is a strong ownership of the project.
Bottom line: Thrift is definitely more open, but protobuf is doing fine.
Who is using it?
Facebook is obviously using Thrift and likewise Google for protobufs. But many other companies are fast adopting Thrift – here’s a list from the Thrift Wiki. Also note that many projects in the Hadoop ecosystem require or support Thrift such as Cassandra and Scribe, which also contributes to Thrift usage.
I couldn’t find a list of protobuf users anywhere. If you know of one, drop me a note!
Bottom line: Both systems have their users, but Thrift seems to have a larger and faster-growing user base.
Documentation

Protobuf ships with excellent, comprehensive documentation, while Thrift’s documentation is severely lacking. That said, I’m doing my part in improving Thrift documentation. Check out Thrift: The Missing Guide.
Bottom line: Protobuf documentation is way better, but Thrift is catching up (and I’m helping!)
The following table summarizes the discussion:
| | Thrift | Protocol Buffers |
|---|---|---|
| Features | Richer feature set, but varies from language to language | Fewer features, but robust implementations |
| Performance | Not a differentiator | Marginally better than Thrift, if at all |
| Code Quality/Design | Haphazard development; a design that works, but not necessarily well | Better designed, extensible and robust |
| Open-ness | Apache project | Open mailing list, code base and issue tracker, but Google still drives development |
| Documentation | Severely lacking, but catching up | Excellent documentation |