Thrift vs. Protocol Buffers

Background

If you’ve ever built a non-trivial software system, especially any kind of distributed system, you’ve probably found yourself in need of a portable and efficient mechanism for storing and exchanging data. This is precisely what both Apache Thrift and Google’s Protocol Buffers provide: a language- and platform-neutral way of serializing structured data for use in communications protocols, data storage and so on. Of course, several smart people have attacked this problem over the years, and as a result there are several good open source alternatives to choose from, including but not limited to Avro, plain-old JSON and XML.


But thrift and protobuf are by far the most popular, and a common question people ask is: which one should I use? Most discussions around thrift and protobuf are polarized around performance and/or features, but I’m afraid the question is more nuanced than that. My hope with this post is to shed some more light on these two systems and on how you should go about evaluating which is best for your needs.

Features

One of the key attractions of thrift is that out-of-the-box it has more features and support for more languages and platforms than protobuf. Here are a few concrete examples:

  • By default, protobuf only supports C++, Java and Python as target languages. There are third-party code generators for other languages, but they are not “official”. Thrift, on the other hand, ships with code generators for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk and OCaml. Phew!
  • The protobuf runtime only provides serialization and deserialization. It can generate some generic RPC stubs but the runtime does not ship with an RPC implementation. Thrift provides RPC implementations (both client and server) across multiple languages, including asynchronous variants in many languages.
  • The thrift grammar is much richer than protobuf in terms of supported constructs — you can specify typedefs, constants, unions, lists, sets and maps.

However, these features do come at a price: the protobuf implementations are far more consistent and robust across all the officially supported target languages. For instance, even though the thrift grammar supports unions, native support for unions is only provided for Ruby and Java; all other languages fall back to a regular ‘struct’. Read through the following sections for more details on where thrift is still sub-par compared to protobuf.
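
To make the union point concrete, here is a minimal Python sketch of the difference between a native union and a union that degrades to a plain struct. Both classes and their field names (ContactAsStruct, ContactUnion, email, phone) are hypothetical and written by hand. This is not output from the thrift compiler, only an illustration of the behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContactAsStruct:
    """Union degraded to a struct: every field is optional, nothing is enforced."""
    email: Optional[str] = None
    phone: Optional[str] = None


class ContactUnion:
    """Native-style union: exactly one of the known fields must be set."""
    _FIELDS = ("email", "phone")

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self._FIELDS)
        if unknown:
            raise ValueError(f"unknown field(s): {sorted(unknown)}")
        set_fields = {k: v for k, v in kwargs.items() if v is not None}
        if len(set_fields) != 1:
            raise ValueError("a union must have exactly one field set")
        # Remember which alternative is active and its value.
        self.field, self.value = next(iter(set_fields.items()))


# The struct fallback silently accepts a state a union should never be in.
loose = ContactAsStruct(email="a@example.com", phone="555-0100")

# The native-style union rejects the same input up front.
try:
    ContactUnion(email="a@example.com", phone="555-0100")
except ValueError as err:
    print("rejected:", err)
```

With the struct fallback, the "only one field is set" rule has to be enforced by hand in application code, which is the kind of cross-language inconsistency the paragraph above refers to.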

Bottom line: Thrift has more features and supports more languages out of the box, but protobuf has more robust and consistent implementations.

Performance

The serialization/deserialization performance of thrift and protobuf is very similar, and I would not consider it a metric for evaluation. The RPC framework performance is a different matter though: Thrift uses its own network stack (as opposed to something like Netty) and there are no official RPC implementations for protobuf, so a fair comparison is hard to do.

If you really want some performance numbers, Eishay Smith maintains a good benchmark.

Bottom line: Serialization/deserialization performance is unlikely to be a decisive factor.

Update: This page made it to Hacker News! As some of the commenters have noted, it seems like Thrift serialization/deserialization for Python is much faster than the protobuf default implementation.

Code Quality and Design

Thrift was originally written by some ex-Googlers who left for Facebook, so naturally both the systems have a lot in common. However, the protobuf code base feels a lot more mature and robust compared to that of thrift. This is in part due to thrift’s turbulent evolution: it was open sourced by Facebook in April 2007, probably to speed up development and leverage the community’s efforts. Since then it has done time in Apache incubation and recently graduated to a top-level Apache project. In between mailing list transitions, constantly shifting home pages, unclear style guides and conflicting priorities, no one person or organization had good ownership over Thrift development. And the code reflects this evolution.

In contrast, protobuf had been in heavy use at Google for many years before they finally open sourced it in July 2008. Google puts a very strong emphasis on coding style, consistency and testability and again, the code reflects this.

To get a feel for the code bases, here are a few examples:

Let me also give some examples where the protobuf design works better (some of this is arguably my personal opinion):

  • the protobuf compiler basically generates a tree of “Descriptor” objects. Various “CodeGenerator”s then operate on this Descriptor tree to output language-specific code. This neatly decouples the parsing process from the code generation process. A nice side-effect is that now people can independently write code generators for new languages and the protobuf compiler can basically invoke these “plugins” at runtime. In contrast, the Thrift compiler is tightly coupled with the code generators. This leads to a ton of code duplication when writing generators for a new language. Not to mention having to work with the monstrous thrift codebase just to write a new code generator.
  • protobuf generated Java classes use the builder pattern, which makes it easy to detect errors such as required fields not set at construction time. In contrast, thrift generated Java objects can be constructed willy-nilly and will not catch errors until they are serialized/deserialized.
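
The builder-pattern point can be sketched in a few lines. The post talks about generated Java classes; the snippet below uses Python only to show the idea, and every name in it (Person, Person.Builder, LoosePerson, the name and id fields) is hypothetical rather than anything either compiler actually emits.

```python
class Person:
    """Immutable message; instances are meant to be created via Person.Builder."""
    REQUIRED = ("name", "id")

    def __init__(self, fields):
        self._fields = dict(fields)

    def get(self, key):
        return self._fields[key]

    class Builder:
        def __init__(self):
            self._fields = {}

        def set(self, key, value):
            self._fields[key] = value
            return self  # allow chained calls

        def build(self):
            # Required fields are checked at construction time.
            missing = [f for f in Person.REQUIRED if f not in self._fields]
            if missing:
                raise ValueError(f"missing required field(s): {missing}")
            return Person(self._fields)


class LoosePerson:
    """Plain mutable object: nothing is checked until serialize() is called."""
    REQUIRED = ("name", "id")

    def __init__(self):
        self.name = None
        self.id = None

    def serialize(self):
        missing = [f for f in self.REQUIRED if getattr(self, f) is None]
        if missing:
            raise ValueError(f"missing required field(s): {missing}")
        return f"{self.id}:{self.name}"


person = Person.Builder().set("name", "Ada").set("id", 1).build()

incomplete = LoosePerson()
incomplete.name = "Ada"  # id is never set, but no error is raised yet
```

In the builder style the incomplete object never exists, so the error surfaces where the mistake was made. In the loose style the incomplete object can travel through the program and only fails when something finally calls serialize().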

There are many other such examples throughout the code base.
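
As a rough illustration of the Descriptor/CodeGenerator split described above, here is a toy Python sketch. The schema syntax, the descriptor classes and the generator registry are all made up for this example, and the in-process registry is a simplification of how compiler plugins are actually invoked. The point is only that generators depend on the descriptor tree and never on the parser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FieldDescriptor:
    name: str
    type_name: str


@dataclass
class MessageDescriptor:
    name: str
    fields: List[FieldDescriptor]


def parse(source: str) -> MessageDescriptor:
    """Parse a toy schema such as 'Person: name string, id int'."""
    message_name, _, body = source.partition(":")
    fields = []
    for part in body.split(","):
        field_name, type_name = part.split()
        fields.append(FieldDescriptor(field_name, type_name))
    return MessageDescriptor(message_name.strip(), fields)


# Generators only see descriptors, so new ones can be added independently.
GENERATORS: Dict[str, Callable[[MessageDescriptor], str]] = {}


def register(language: str):
    def decorator(func):
        GENERATORS[language] = func
        return func
    return decorator


@register("python")
def generate_python(msg: MessageDescriptor) -> str:
    args = ", ".join(f.name for f in msg.fields)
    lines = [f"class {msg.name}:", f"    def __init__(self, {args}):"]
    lines += [f"        self.{f.name} = {f.name}" for f in msg.fields]
    return "\n".join(lines)


@register("c")
def generate_c(msg: MessageDescriptor) -> str:
    c_types = {"string": "char *", "int": "int "}
    body = "".join(f"    {c_types[f.type_name]}{f.name};\n" for f in msg.fields)
    return f"struct {msg.name} {{\n{body}}};"


descriptor = parse("Person: name string, id int")
for language, generate in GENERATORS.items():
    print(f"--- {language} ---")
    print(generate(descriptor))
```

Adding a third target language here means writing one more function that takes a MessageDescriptor. Nothing in parse() has to change, which is the decoupling the first bullet credits protobuf with.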

Bottom line: protobuf has a better design and overall higher code quality (generated as well as protobuf compiler and libraries) than thrift.

Development Process and Open-ness

Thrift is an Apache project, so arguably the thrift development is as open as it gets. Most of the development is driven by an open issue tracker; anyone is free to contribute patches; no person or organization has tight control over the project direction. In general, the Apache way is based on meritocracy and has worked very well for several highly successful projects (such as httpd, various XML parsers etc). Of course, the amorphous and evolutionary nature of this process has its shortcomings — there is a lot of feature creep, the review process is quite ad-hoc, there is no well defined roadmap or predictable release cycles and no single party has strong incentives to ensure that ALL of the code base is in top-shape.

Protocol Buffers, while open source, are hardly open when it comes to the development process. You are free to file issues and submit patches, but at the end of the day development is largely controlled and guided by Google. Of course, one is free to fork off protobuf at any point, if this process is unsatisfactory. So far, it hasn’t been an issue for many users. In fact, I feel that many protobuf users (especially businesses) probably like the fact that there is a strong ownership of the project.

Bottom line: Thrift is definitely more open, but protobuf is doing fine.

Who is using it?

Facebook is obviously using Thrift, and likewise Google for protobuf. But many other companies are fast adopting Thrift – here’s a list from the Thrift Wiki. Also note that many projects in the Hadoop ecosystem, such as Cassandra and Scribe, require or support Thrift, which also contributes to Thrift usage.

I couldn’t find a list of protobuf users anywhere. If you know of one, drop me a note!

Bottom line: Both systems have their users, but Thrift seems to have a larger and growing user base.

Documentation

Compare the protobuf documentation to the thrift wiki. Enough said.

That said, I’m doing my part in improving Thrift documentation. Check out Thrift: The Missing Guide.

Bottom line: Protobuf documentation is way better, but Thrift is catching up (and I’m helping!)

Summary

The following table summarizes the discussion:

Area                | Thrift                                                             | Protocol Buffers
--------------------+--------------------------------------------------------------------+----------------------------------------------------------------------------------
Features            | Richer feature set, but varies from language to language           | Fewer features but robust implementations
Performance         | Not a differentiator                                               | Marginally better than Thrift, if at all
Code Quality/Design | Haphazard development; design that works but not necessarily well  | Better designed, extensible and robust
Open-ness           | Apache project                                                     | Open mailing list, code base and issue tracker but Google still drives development
Documentation       | Severely lacking, but catching up                                  | Excellent documentation

19 comments

  1. Mandel

    An example of an application that uses protobuf is Ubuntu One from Canonical. Protocol Buffers are used for the communication between the different clients and the server side. You can find the client side implementation and the client on:

    lp:ubuntuone-storage-protocol
    lp:ubuntuone-client

  2. chris

    Another one to look at is: http://bsonspec.org/
    Basically a binary version of json, so it’s faster and more compact. But unlike Protobuf or Thrift does not require each object to be predefined (static typed? ). For some applications BSON is more appropriate.

    It’s ONLY a serialization technique.

    • Diwaker Gupta

      Sounds similar to Avro, at least conceptually. Avro also uses JSON and the objects are self-describing and schemaless. I’m not sure what the on-wire format is though.

    • cowtowncoder

      Unless you use MongoDB and need BSON, I would not recommend using it. Plain old JSON is not much more verbose, is MUCH more widely supported; and performance is not much worse — in fact, for Java at least, BSON is pretty slow (due to lack of high-perf parser/generator implementations).
      BSON is not even proper sub- or super-set of JSON: it does not support all JSON constructs (field name, value limitations), but extends it with a few new types. So can’t necessarily convert between the two reliably.

      And JSON has much much wider framework support as well, if one really wants RPC (as opposed to simple REST-style interaction using JAX-RS).

      But if you use Mongo, BSON may make sense.

  3. Amstel

    We manufacture industrial data acquisition systems and have chosen Protocol Buffers for all of our configuration storage and data transmission in new products.

    We love the beautiful simplicity of protobuf, the way it packs data almost as tightly as binary but keeps the extensibility of XML. Every byte matters to us because we are sending large amounts of data over low bandwidth links. It’s a real quality design.

    We’re using it in C (that’s right, not C++) on embedded devices and Java on web apps and we’re just about to get started on the C# port for our desktop products.

    • Bruno Rijsman

      Hi Amstel,

      I also have a project which requires using protobuf in C (regular C, not C++).

      How was your experience with using protobuf from C? Any particular issues or problems?

      Which C API are you using for protobuf? Is it protobuf-c (http://code.google.com/p/protobuf-c/)? Can you comment on the quality of that API?

      Thank you.

      • Dan

        Hi Amstel & Bruno,

        Curious to hear a follow up from either of you how your embedded system implementation worked out. Basically I am echoing Bruno’s post here, hoping for a 2013 update.

        Cheers

        Dan

    • Diwaker Gupta

      Very interesting, thanks Dirk! IIUC, they seem to be playing around with protobuf for implementing replication. I don’t think it is being used for actual data storage, right?

  4. huxi

    Oh wow, you are absolutely right…
    Thrift documentation is really lacking!

    I wanted to give it a quick look but that’s a rather nice try at the moment. ;)
    What interests me most is the JavaScript support of Thrift. Did I understand correctly that JS is only supporting JSON?

    Cheers & thanks for your effort,
    Joern.
