Monday, January 5, 2015

Comparing Apache Avro to Protocol Buffers, XML and JSON payloads

In a previous blog post I compared Mainframe COBOL payloads to Protocol Buffers, XML and JSON.

The idea was to compare the number of bytes needed to encode structured data in these various languages.

The results were:

Language    Bytes
Mainframe     422
XML          1960
JSON         1275
Protobuf      403

Since we have recently introduced legstar support for Apache Avro, I was curious to get a sense of how optimized Avro is. So I performed the same test and came out with 370 bytes.

Of course, you will get different results with different COBOL structures and a different data mix, but you can be practically certain that the Avro payload will be smaller than the mainframe payload.

This is because a lot of optimization has gone into the Avro binary encoder (testing was done using Avro 1.7.7).

Taking the same example as in the previous blog post, with this COBOL structure:

05 QUERY-DATA.
   10 CUSTOMER-NAME               PIC X(20).
   10 MAX-REPLIES                 PIC S9(4) COMP VALUE -1.

these 22 bytes of mainframe data:
0xe25c4040404040404040404040404040404040400005

get translated to just 4 Avro bytes:
0x04532A0A

The corresponding Avro schema being:
{
      "type":"record",
      "name":"QueryData",
      "fields":[{
          "name":"customerName",
          "type":"string"
        },
        {
          "name":"maxReplies",
          "type":"int"
        }
      ],
      "namespace":"com.legstar.avro.samples.lsfileaq"
    }
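To see why the payload shrinks so much, you can reproduce the encoding by hand. Avro's binary format writes a string as a varint length prefix followed by its UTF-8 bytes, and an int as a zigzag-encoded varint, so small values take a single byte and trailing padding disappears entirely. The sketch below is a minimal re-implementation of that encoding in plain Python (it is not the legstar converter itself), and it assumes the trailing EBCDIC spaces in CUSTOMER-NAME are trimmed during conversion, leaving the value "S*" and MAX-REPLIES = 5 as in the hex dump above.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as Avro's zigzag variable-length int.
    Zigzag maps small magnitudes (positive or negative) to small codes."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        if z > 0x7F:
            out.append((z & 0x7F) | 0x80)  # more bytes follow
            z >>= 7
        else:
            out.append(z)  # last byte, high bit clear
            return bytes(out)

def encode_query_data(customer_name: str, max_replies: int) -> bytes:
    """Binary-encode a QueryData record per the schema above:
    customerName (string) = varint length + UTF-8 bytes,
    maxReplies (int) = zigzag varint. No field tags, no padding."""
    name = customer_name.encode("utf-8")
    return zigzag_varint(len(name)) + name + zigzag_varint(max_replies)

print(encode_query_data("S*", 5).hex())  # 04532a0a
```

Reading the four bytes back: 0x04 is the zigzag varint for length 2, 0x53 0x2A are "S*" in UTF-8, and 0x0A is the zigzag varint for 5. Because the schema travels separately, none of the 20-byte fixed-width padding or any field identifiers need to be stored in the payload.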

This is interesting because it means that if you take a large mainframe file and convert it to Avro, you are likely to get a smaller file. About 12% smaller in the particular case used in my testing.

If you are using Hadoop and have a use case where the same mainframe file is read by several MapReduce jobs, then it could make sense to convert your mainframe file with legstar.avro and store it in HDFS in Avro-encoded form.

Reading that file later will be faster as the volume of data read off HDFS will be smaller.