Apache Avro at Flurry

Last week we released a major update to our ad serving infrastructure, which provides enhanced targeting and dynamic control of inventory. As part of this effort we decided it was time to move away from our custom binary protocol to Apache Avro for our data serialization needs. Avro provides a large feature set out of the box, however, we settled on the use of a subset of its functionality and then made a few adjustments to best suit our needs in mobile. 

Why Avro?
 
There are many good alternatives for serialization frameworks and just as many good articles that compare them to one another across several dimensions. Given that our focus will be on what made Avro a good fit for us instead of our research into other frameworks.

Perhaps the most critical element was platform support. Even though we are currently only applying Avro for our ad serving through iOS and Android, we wanted an encoder that provides uniform support across all of our SDKs  (Windows Phone 7, HTML5, Blackberry, and JavaME) in the event we extend its use to all services. We receive over 1.4 billion analytics sessions per day, so it was critical that we maintain a binary protocol that is small and efficient. While this binary protocol is a necessity, Avro also supports a JSON encoder. This has dual benefits for us as it can be used for our HTML5 SDK and as a testing mechanism (we’ll show later how easy it is to curl an endpoint). Lastly, Avro provides the rich data structures we need with records, enums, lists, and maps.  

Our Subset

We are using Avro-C for our iOS SDK and Avro Java for our Android SDK and server. As mentioned earlier Avro has a large feature set. The first thing we had to decide was what components made sense for our environment. All language implementations of Avro support core, which includes Schema parsing from JSON and the ability to read/write binary data. The ability to write data files, compress information, and send it over embedded RPC/HTTP channels is not as uniform. Our SDKs have always stored information separately from the transport mechanism and likewise our server parses that data and stores in a format efficient for backend processing. Therefore, we did not have the need to write data files.  In addition to this we have a sophisticated architecture for the reliable transport of data and using Avro’s embedded RPC mechanisms would affect our deployment and failover strategy.  Given our environment we integrated only core from Avro.

Integrating core provided us a great deal of flexibility in how to use this efficient data serialization protocol. One of the diversions we made was in our treatment of schemas. Schemas are defined on the Avro site as:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.  

We maintain the invariant that the schema is always present when reading Avro data, however, unlike Avro’s RPC mechanism we never transport the schema with the data. While the Avro docs note this can be optimized to reduce the need to transport the schema in a handshake, such optimization could prove difficult and would likely not provide much added benefit for us. In mobile, we deal with many small requests from millions of devices that have transient connections. Fast, reliable transmissions with as little overhead as possible are our driving factors.

As an illustration of how this makes a large difference at scale, our AdRequest schema is roughly 1.5K while the binary-encoded datum is approximately 230 bytes on any request. At 50 million requests per day (we are actually well above that figure) this adds about 70GB of daily traffic across our band.

To meet our network goals we version our endpoints to correspond to the protocol. We feel it is more transparent and maintainable to have a one to one schema mapping instead of relying on Avro to resolve the differences between a varied reader and writer schema. The result is a single call without the additional communication for schema negotiation while transporting only datum that has no per-value overhead of field identifiers.

Mobile Matters

As a 3rd party library on mobile we had some additional areas we needed to address. First on Android, we needed to maintain support for Version 1.5+. Avro uses a couple of methods that were introduced in version 2.3, namely Arrays.copyOf() and  String.getBytes(Charset). Those were easily changed to use System.arrayCopy and String.getBytes(String), but it did require us to build from source and repackage.

On iOS, we need to avoid naming conflicts in case someone else integrates Avro in their library. We are able to do this by stripping the symbols. See this guide from Apple on that process.  On Android, we use ProGuard, which solves conflict issues through obfuscation. Alternatively, you can use the jarjar tool to change your package naming.

Sample Set Up

We’ve talked a lot about our Avro set up so now we’ll provide a sample of its use. These are just highlights, but we have made the full source available on our GitHub account at: https://github.com/flurry/avro-mobile. This repo includes a sample Java server, iOS client, and Android client.

1. Set up your schema. We use a protocol file (avpr) vs a schema file (avsc) simply because we can define all schemas for our protocol in a single file. The protocol file is typically used with Avro’s embedded RPC mechanisms, but it doesn’t have to be used exclusively in that manner.  Here are our sample schemas outlining a simple AdRequest and AdResponse:

AdRequest:

https://gist.github.com/3098447

AdResponse:

https://gist.github.com/3098472

2. Pre-compile the schemas into Java objects with the avro-tools jar. We have included an ant build file that will perform this action for you.

3. Build our sample self contained Avro Jetty server. The build file puts all necessary libs in a single jar (flurry-avro-server-1.0.jar) for ease of running. Run this from the command line with one argument for the port:
java -jar flurry-avro-server-1.0.jar 8080

4. Test. One of the things that distinguishes Avro from other serialization protocols is its support for dynamic typing. Since no code generation is required, we can construct a simple curl to verify our server implements the protocol as expected. This has proven useful for us since we have an independent verification measure for any issue that occurs in communication between our mobile clients and servers. We simply construct the identical request in JSON, send to our servers via curl, and hope we can blame the server guys for the issue. This also allows anyone working on the server to simulate requests without needing to instantiate mobile clients.

Sample Curl:

https://gist.github.com/3098754

Note: To simulate an error, type the AdSpace Name as “throwError”.

5. Setting up the iOS app takes a little more work than on the Java side simply because Avro-C does not support static compilation. The easiest way we have found to define the schemas is to go to the Java generated schema file and copy the JSON text from the SCHEMA$ definition:

https://gist.github.com/3098811

Paste that text as a definition into your header file, then get the avro schema from the provided method avro_schema_from_json. From that point write/read from the fields in the schema using the provided Avro-C methods as done in AdNetworking.m.

In the sample app provided, run AvroClient > Sim or Device and send your ad request to your server (defaults to calling http://localhost:8080). To simulate an error, type the AdSpace Name as “throwError”.

6. Since Android is based on Java, moving from the Avro Java Server to Android is a simple process. The same avpr and buildProtocol.xml files can be used to generate the Java objects. Constructing objects are equally as easy as seen here:

https://gist.github.com/3098834

Conclusion

We’re excited about the use of Apache Avro on our new Ad Serving platform. It provides consistency through structured schemas, but also allows for independent testing due to its support for dynamic typing and JSON. Our protocol is much easier to understand internally since the Avpr file is standard JSON and attaches data types to our fields. Once we understood the initial setup across all platforms it became very easy to iterate over new protocol versions. We would recommend anyone looking to simplify communication over multiple environments while maintaining high performance give Avro strong consideration.

 

 

Standard

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s