Some early Avro benchmarks

Avro is my current project. It’s a slightly different take on data serialization.

Most data serialization systems, like Thrift and Protocol Buffers, rely on code generation, which can be awkward with dynamic languages and datasets. For example, many folks write MapReduce programs in languages like Pig and Python, and generate datasets whose schema is determined by the script that generates them. One of the goals for Avro is to permit such applications to achieve high performance without forcing them to run external compilers.
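To make the idea concrete, here is a toy sketch in Python of schema-driven encoding with no generated classes. This is not Avro's actual API: the function names `encode_long` and `encode` and the `PageView` schema are made up for illustration, and only the `long`, `string` and `record` types are handled. It does follow Avro's binary encoding for those types (zigzag varints for longs, length-prefixed UTF-8 for strings, record fields written in schema order).

```python
import json

# A schema that a script could build or load at runtime; nothing is compiled.
# The record name and fields are hypothetical.
SCHEMA = json.loads("""
{"type": "record", "name": "PageView",
 "fields": [{"name": "url", "type": "string"},
            {"name": "hits", "type": "long"}]}
""")


def encode_long(n: int) -> bytes:
    """Encode a 64-bit integer as a zigzag varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode(schema, datum) -> bytes:
    """Walk the schema at runtime and encode a plain dict/str/int to match."""
    kind = schema if isinstance(schema, str) else schema["type"]
    if kind == "long":
        return encode_long(datum)
    if kind == "string":
        raw = datum.encode("utf-8")
        return encode_long(len(raw)) + raw
    if kind == "record":
        # Fields are written in schema order, with no names or tags.
        return b"".join(encode(f["type"], datum[f["name"]])
                        for f in schema["fields"])
    raise ValueError(f"type not handled in this sketch: {kind!r}")


payload = encode(SCHEMA, {"url": "a", "hits": 1})
```

Because the schema is ordinary data, a Pig or Python script can decide it on the fly and still get a compact binary encoding, which is the property the paragraph above is after.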

A few early Avro benchmarks are now in. A month ago, Johan Oskarsson (of Last.fm) ran his serialization size benchmark using Avro. And today, Sharad Agarwal (my Avro collaborator) ran an existing Java serialization benchmark using Avro, and the initial results look decent. Curiously, the results for Avro’s generic (no code generation) and specific (generated classes) APIs diverged significantly, which was unexpected given that they share much of their implementation. This suggests that both might be easily improved.

Tags: avro, hadoop, protobuf, thrift

This entry was posted on May 12, 2009 at 12:00 pm and is filed under Uncategorized.

6 Responses to “Some early Avro benchmarks”

Anne Says:
May 12, 2009 at 12:18 pm
Is a benchark more like a loanshark or an aardvark?
Doug Cutting Says:
May 12, 2009 at 12:27 pm
Typo fixed. Thanks, Anne!
Inductive Bias » Large Scalability - Papers and implementations Says:
June 23, 2009 at 4:08 am
[…] Protocol Buffers, Thrift, Avro, more traditional: Hessian, Java serialization, early benchmarks […]
Ron Says:
April 6, 2010 at 12:48 pm
Hi,
Can you please point me to some sample code on how to use Avro? The quick start here http://github.com/phunt/avro-rpc-quickstart is not much.

I’m trying to use Avro to define messages, schemas, etc., serializing on one side and deserializing on the other. I don’t want to use the provided HTTP and raw socket RPC mechanisms. Is it possible to get the serialized data as a byte stream, possibly a byte[]?

Thanks
James Abley Says:
August 28, 2010 at 3:17 am
I enjoyed the video of the tech talk recently given at Digg. In it, you answered a question (41:04) about not dynamically generating Java classes at runtime. Have you considered asm or similar to see if that might be useful?
Srinivasarao Daruna Says:
March 25, 2016 at 7:26 am
Hi Doug,

I have one question regarding Avro conversion.
We are converting JSON data to Avro and have defined the complete schema. However, we are facing issues with missing fields.
We have ensured the field is defined with a [null or field data] union, but it does not work if the field is not present in the JSON data at all. How should we handle fields that are missing from the data entirely?
