Monday, June 18, 2012

Comparing Protocol Buffers to XML and JSON payloads

Following up on my last blog entry, I was interested in getting some figures comparing payload sizes for data encoded using protocol buffers against XML and JSON.

Since the major argument in favor of protocol buffers is the reduced network impact, I thought it would be interesting to measure that in a mainframe integration environment.

I built my test scenario with a CICS program called LSFILEAQ. LSFILEAQ is one of the LegStar test programs and uses a VSAM file that comes with an IBM demo application commonly found in CICS development partitions.

LSFILEAQ takes simple query parameters on input and sends back an array of replies. This is the COBOL structure that describes the LSFILEAQ commarea (its CICS communication area):

        01 DFHCOMMAREA.
           05 QUERY-DATA.
              10 CUSTOMER-NAME               PIC X(20).
              10 MAX-REPLIES                 PIC S9(4) COMP VALUE -1.
                  88 UNLIMITED     VALUE -1.
           05 REPLY-DATA.
              10 REPLY-COUNT                 PIC 9(8) COMP-3.
              10 CUSTOMER OCCURS 1 TO 100 DEPENDING ON REPLY-COUNT.
                  15 CUSTOMER-ID             PIC 9(6).
                  15 PERSONAL-DATA.
                     20 CUSTOMER-NAME        PIC X(20).
                     20 CUSTOMER-ADDRESS     PIC X(20).
                     20 CUSTOMER-PHONE       PIC X(8).
                  15 LAST-TRANS-DATE         PIC X(8).
                  15 FILLER REDEFINES LAST-TRANS-DATE.
                     20 LAST-TRANS-DAY       PIC X(2).
                     20 FILLER               PIC X.
                     20 LAST-TRANS-MONTH     PIC X(2).
                     20 FILLER               PIC X.
                     20 LAST-TRANS-YEAR      PIC X(2).
                  15 LAST-TRANS-AMOUNT       PIC $9999.99.
                  15 LAST-TRANS-COMMENT      PIC X(9).

I ran the program using a query yielding a result of 5 customers and measured the raw size of the commarea in bytes. I came up with 422 bytes. Note that this includes both the input and output parameters. These are untranslated bytes, in z/OS format.
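The 422 bytes can be cross-checked against the picture clauses in the structure above. The following is a minimal sketch of that arithmetic; the class and method names are made up for this illustration and are not part of LegStar.

```java
// Illustrative only: CommareaSize and its methods are hypothetical names,
// not part of LegStar. Sizes are derived from the COBOL picture clauses.
final class CommareaSize {

    /** Packed decimal (COMP-3): one nibble per digit plus one sign nibble. */
    static int packedDecimalBytes(int digits) {
        return digits / 2 + 1;
    }

    /** QUERY-DATA: CUSTOMER-NAME PIC X(20) plus MAX-REPLIES PIC S9(4) COMP (2 bytes). */
    static int queryDataBytes() {
        return 20 + 2;
    }

    /** One CUSTOMER occurrence. */
    static int customerBytes() {
        int customerId = 6;              // PIC 9(6), one byte per digit
        int personalData = 20 + 20 + 8;  // name, address, phone
        int lastTransDate = 8;           // the REDEFINES overlays these bytes, it adds none
        int lastTransAmount = 8;         // PIC $9999.99 edits to 8 characters
        int lastTransComment = 9;
        return customerId + personalData + lastTransDate + lastTransAmount + lastTransComment;
    }

    /** Whole commarea for a given number of replies. */
    static int commareaBytes(int replyCount) {
        int replyCounter = packedDecimalBytes(8); // REPLY-COUNT PIC 9(8) COMP-3
        return queryDataBytes() + replyCounter + replyCount * customerBytes();
    }

    public static void main(String[] args) {
        System.out.println(commareaBytes(5)); // prints 422
    }
}
```

With 5 replies this gives 22 bytes of query data, 5 bytes for the counter and 5 x 79 bytes of customer data.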

Using the LegStar transformers I then generated XML and JSON representations of that same z/OS data. This time, the data is translated to ASCII and formatted using XML or JSON. I then measured the sizes of the corresponding XML and JSON payloads and got:

XML:1960 bytes
JSON:1275 bytes

What these results mean is that if I choose to send XML to CICS instead of the raw z/OS data, I will increase the network load by 364% (the XML payload is about 4.6 times the raw data payload).

If I select the less greedy JSON to encode the payload, the network load increases by 202% (about 3 times the raw payload).
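For reference, here is a small sketch of how these percentages are derived from the measured sizes. The class and method names are hypothetical and only serve the illustration.

```java
import java.util.Locale;

// Illustrative only: PayloadOverhead is a hypothetical helper, not LegStar code.
final class PayloadOverhead {

    /** Percentage by which a payload exceeds the baseline size. */
    static double increasePercent(int payloadBytes, int baselineBytes) {
        return 100.0 * (payloadBytes - baselineBytes) / baselineBytes;
    }

    /** How many times larger the payload is than the baseline. */
    static double ratio(int payloadBytes, int baselineBytes) {
        return (double) payloadBytes / baselineBytes;
    }

    public static void main(String[] args) {
        int raw = 422;   // raw z/OS commarea, as measured above
        int xml = 1960;  // XML payload
        int json = 1275; // JSON payload
        System.out.println(String.format(Locale.ROOT, "XML:  +%.0f%% (%.1f times raw)",
                increasePercent(xml, raw), ratio(xml, raw)));
        System.out.println(String.format(Locale.ROOT, "JSON: +%.0f%% (%.1f times raw)",
                increasePercent(json, raw), ratio(json, raw)));
    }
}
```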

Now, what would be the equivalent protocol buffers payload?

To answer that, I first wrote a protocol buffer "proto" file using the protocol's Interface Description Language:

package customers;
option java_package = "com.example.customers";
option java_outer_classname = "CustomersProtos";

message CustomersQuery {
  required string customer_name_pattern = 1;
  optional int32 max_replies = 2;
}

message CustomersQueryReply {
  repeated Customer customers = 1;

  message Customer {
    required int32 customer_id = 1;
    required PersonalData personal_data = 2;
    optional TransactionDate last_transaction_date = 3;
    optional double last_transaction_amount = 4;
    optional string last_transaction_comment = 5;

    message PersonalData {
      required string customer_name = 1;
      required string customer_address = 2;
      required string customer_phone = 3;
    }

    message TransactionDate {
      required int32 transaction_year = 1;
      required int32 transaction_month = 2;
      required int32 transaction_day = 3;
    }
  }
}

This is close to the target LSFILEAQ commarea structure.

I then used protobuf-cobol to generate COBOL parsers and writers for each of the protocol buffers messages.

Rather than using the command line generation utility though, I used simple Java code that looks like this:

HasMaxSize maxSizeProvider = new HasMaxSize() {

    public Integer getMaxSize(String fieldName, Type fieldType) {
        if (fieldName.equals("customer_name_pattern")) {
            return 20;
        } else if (fieldName.equals("customer_name")) {
            return 20;
        } else if (fieldName.equals("customer_address")) {
            return 20;
        } else if (fieldName.equals("customer_phone")) {
            return 8;
        } else if (fieldName.equals("last_transaction_comment")) {
            return 9;
        }
        return null;
    }

    public Integer getMaxOccurs(String fieldName,
                                JavaType fieldType) {
        if (fieldName.equals("Customer")) {
            return 1000;
        }
        return null;
    }

};
new ProtoCobol()
  .setOutputDir(new File("target/generated-test-sources/cobol"))
  .setQualifiedClassName("com.example.customers.CustomersProtos")
  .addSizeProvider(maxSizeProvider)
  .run();

The "HasMaxSize" provider allows the generated COBOL code to implement the size limitations which are specific to COBOL.

Besides the parsers and writers, protobuf-cobol also generates copybooks for the various messages. This is what we get for the input and output messages:

       01  CustomersQuery.
           03  customer-name-pattern    PIC X(20) DISPLAY.
           03  max-replies PIC S9(9) COMP-5.

       01  CustomersQueryReply.
           03  OCCURS-COUNTERS--C.
             05  Customer--C PIC 9(9) COMP-5.
           03  Customer OCCURS 0 TO 1000 DEPENDING ON Customer--C.
             05  customer-id PIC S9(9) COMP-5.
             05  PersonalData.
               07  customer-name PIC X(20) DISPLAY.
               07  customer-address PIC X(20) DISPLAY.
               07  customer-phone PIC X(8) DISPLAY.
             05  TransactionDate.
               07  transaction-year PIC S9(9) COMP-5.
               07  transaction-month PIC S9(9) COMP-5.
               07  transaction-day PIC S9(9) COMP-5.
             05  last-transaction-amount COMP-2.
             05  last-transaction-comment PIC X(9) DISPLAY.

It is not exactly the same as the original commarea but is pretty close.

Using the protobuf-cobol parsers and writers and the same input and output data as for XML and JSON, I then obtained a protocol buffers payload size of 403 bytes (sum of input and output payloads).

The protocol buffers payload is even smaller than the raw z/OS data!

That might sound surprising but is the result of the extremely efficient way protocol buffers encodes data.

As an example, the COBOL group item QUERY-DATA size is 22 bytes on the mainframe. In my testing it contains hexadecimal:

"e25c4040404040404040404040404040404040400005".

The equivalent protocol buffers payload, though, is only 6 bytes long and contains hexadecimal:

"0a02532a1005".

This confirms that protocol buffers brings important benefits in terms of network traffic. For distributed applications and under heavy load, this is bound to make a big difference compared with XML and JSON based systems.
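The 6-byte query payload quoted above can be reproduced by hand, which shows where the savings come from. This is a minimal sketch of the protocol buffers wire format, not the code protobuf-cobol generates. The class and method names are hypothetical, and it assumes the trailing blanks of the PIC X(20) field are trimmed before encoding, as the quoted payload suggests.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hand-rolled encoder for the CustomersQuery message.
final class QueryWireFormat {

    /** Writes a value as a base-128 varint, least significant 7-bit group first. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Field key: (field number << 3) | wire type. */
    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    static byte[] encodeCustomersQuery(String namePattern, int maxReplies) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] name = namePattern.getBytes(StandardCharsets.UTF_8);
        writeTag(out, 1, 2);             // customer_name_pattern, length-delimited
        writeVarint(out, name.length);   // only the characters actually used
        out.write(name, 0, name.length);
        writeTag(out, 2, 0);             // max_replies, varint
        writeVarint(out, maxReplies);    // small positive values fit in one byte
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeCustomersQuery("S*", 5))); // prints 0a02532a1005
    }
}
```

The 20-byte name field shrinks to a 1-byte key, a 1-byte length and the 2 characters "S*", while the 2-byte binary integer becomes a 1-byte key and a 1-byte value.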

Saturday, May 5, 2012

Protocol Buffers for the mainframe


Protocol Buffers is a technology used internally at Google, which was made available as open source in 2008.

The idea is that XML and JSON are too verbose for communication-intensive systems.

XML and JSON bring two important benefits over binary protocols. They are:

  • Human readable (self documenting)
  • Resilient to changes (field order is usually not imposed, fields can be missing or partially filled)

But these benefits come at a price:

  • Network load (the ratio of overhead metadata over actual data is quite high)
  • Parsing and writing relatively complex structures is CPU intensive

For systems that exchange isomorphic (same structure) data, millions of times a day, this price is too high.

Not surprisingly, that same diagnosis has stopped XML and JSON from being widely used on mainframes, although IBM introduced an XML parser quite early on z/OS.

Protocol Buffers is a binary protocol. In that sense, human readability and self description are lost. But it is resilient to changes. That second property turns out to be very important in heterogeneous, distributed systems.

Mainframes, and particularly COBOL-written applications, are by now completely immersed in heterogeneous, distributed systems. All IT departments, even those that claim to be very mainframe centric, run dozens of Java or .NET applications alongside the mainframe (or sometimes even on the mainframe).

The rise of Enterprise mobile applications is bringing yet more heterogeneity in terms of Operating Systems and programming languages. Mainframe COBOL applications will necessarily have to interoperate with these newcomers too.

So after reading a lot about Protocol Buffers (and its competing sibling Thrift, developed originally at Facebook then donated to the Apache Foundation), I came to the conclusion that such protocols might be exactly what COBOL on the mainframe needs to better interoperate with Java, C++, Objective-C, etc.

To keep up with the spirit of LegStar, I started a new open source project called protobuf-cobol. I hope many of you will try it and let me know what you think.






Monday, February 20, 2012

z/OS float numerics and IEEE754


On z/OS, float and double numerics have been available for a long time but are seldom used with languages such as COBOL. The reason is that COBOL is primarily used for business rather than scientific applications and accountants prefer fixed decimal numerics.

In the Java and C/C++ worlds though, floats and doubles are often used, even in business applications.

For COBOL to Java integration it is therefore important to understand how such numerics can be interchanged.

z/OS does not encode float and double data items in the usual IEEE754 way expected by Java and C/C++ developers.

To illustrate the differences, let's use a COBOL program with a data item defined like this:

   01 W-ZOS-FLOAT  USAGE COMP-1 VALUE -375.256.

If we look at the content of this data item, its hexadecimal representation is:

   X'C3177419'

or in binary:

   11000011 00010111 01110100 00011001
   ^^-----^ ^------------------------^
   |   |               |-mantissa
   |   |-biased exponent
   |-sign

The sign bit is turned on as we have stored a negative number. This is pretty much the only thing that is common with IEEE754.

The biased exponent is stored on 7 bits. The bias is X'40' (decimal 64). In our case, the decimal value stored in the biased exponent is 67, therefore the exponent is 67 - 64 (bias) = 3. Beware that this is a hexadecimal exponent (a power of 16), not a decimal one.

Finally, the content of the last 3 bytes is the mantissa, in our case: X'177419'. Now, since the exponent is 3, the integer part is X'177' (375 decimal as expected). The fractional part, X'419' is trickier to convert back to decimal. This is because most calculators have decimal to hexadecimal converters that do not work on fractions.

X'419' should be interpreted as 4 * (16**-1) + 1 * (16**-2) + 9 * (16**-3). You can use the calculator again if you observe that multiplying by 16**3 gives 4 * (16**2) + 1 * (16**1) + 9 * (16**0). In other words, you can divide X'419' (decimal 1049) by 16**3 (decimal 4096) to get 0.256103515625, which is our fractional part in decimal.

In summary, z/OS float items are hexadecimal-based, with a 7-bit exponent and a 24-bit mantissa.
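The decoding steps above can be condensed into a few lines of Java. This is a minimal sketch under the layout just described (1 sign bit, 7-bit exponent with a bias of 64, 24-bit mantissa); the class and method names are hypothetical.

```java
// Illustrative only: HexFloat is a hypothetical helper that decodes
// a 4-byte z/OS hexadecimal float (COBOL COMP-1) held in an int.
final class HexFloat {

    static double decode(int bits) {
        int sign = (bits >>> 31) == 0 ? 1 : -1;
        int exponent = ((bits >>> 24) & 0x7F) - 64; // power of 16, bias 64
        int mantissa = bits & 0x00FFFFFF;           // 6 hex digits after the radix point
        // The mantissa is a fraction 0.xxxxxx in base 16, hence the extra 16**-6.
        return sign * mantissa * Math.pow(16, exponent - 6);
    }

    public static void main(String[] args) {
        System.out.println(decode(0xC3177419)); // prints -375.256103515625
    }
}
```

Note that the decoded value is close to, but not exactly, -375.256, since that number has no exact representation in this format.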

By contrast, IEEE754 floats are binary-based, with an 8-bit exponent and a 23-bit mantissa (called significand in the distributed world).

The IEEE754 internal representation of our -375.256 value is therefore:

   X'C3BBA0C5'

or in binary:

   11000011 10111011 10100000 11000101
   ^^-------^^-----------------------^
   |   |               |-mantissa
   |   |-biased exponent
   |-sign

The sign bit is the same as on z/OS, as already mentioned.

The 8-bit biased exponent is 10000111 or 135 decimal. Float exponents in IEEE754 have a 127 decimal bias, so the exponent is 135 - 127 (bias) = 8. Beware that this is a binary exponent (a power of 2), not a decimal one.

The 23 bit mantissa is 01110111010000011000101, BUT, there is an implied 1 as the most significant bit (msb). The real mantissa is therefore: 101110111010000011000101.

The integer part starts at the implicit msb. Since the exponent is 8, it extends over the next 8 bits, so the integer part is 101110111, or 375 decimal.

Now for the fractional part, 010000011000101, again you can use a regular calculator, which gives a decimal value of 8389, but you must divide that by 2**15 (23 mantissa bits minus the 8 bits used by the integer part): 8389 / 32768 = 0.256011962890625, which is our fractional part in decimal.

As you can see, although z/OS floats and IEEE754 floats are both 4 bytes long, they store numbers in quite a different way so don't attempt to push Java floats directly to COBOL buffers!
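To make the comparison concrete, here is a minimal Java sketch that decodes the IEEE754 bits by hand and shows one possible way to convert a z/OS float: decode it to a double, then narrow to a Java float. The class and method names are hypothetical, and special cases (zero, subnormals, NaN, infinities, out-of-range exponents) are deliberately ignored.

```java
// Illustrative only: IeeeFloat is a hypothetical helper, not LegStar code.
final class IeeeFloat {

    /** Decodes a normalized IEEE754 single by hand (special values are ignored). */
    static double decode(int bits) {
        int sign = (bits >>> 31) == 0 ? 1 : -1;
        int exponent = ((bits >>> 23) & 0xFF) - 127;     // power of 2, bias 127
        int mantissa = (bits & 0x007FFFFF) | 0x00800000; // restore the implied leading 1
        return sign * mantissa * Math.pow(2, exponent - 23);
    }

    /** Decodes a z/OS hexadecimal float: 7-bit exponent (power of 16, bias 64), 24-bit mantissa. */
    static double decodeHexFloat(int bits) {
        int sign = (bits >>> 31) == 0 ? 1 : -1;
        int exponent = ((bits >>> 24) & 0x7F) - 64;
        int mantissa = bits & 0x00FFFFFF;
        return sign * mantissa * Math.pow(16, exponent - 6);
    }

    /** One possible conversion: go through a double, then narrow to an IEEE754 float. */
    static float fromHexFloat(int zosBits) {
        return (float) decodeHexFloat(zosBits);
    }

    public static void main(String[] args) {
        // Java stores -375.256f exactly as described above.
        System.out.println(Integer.toHexString(Float.floatToIntBits(-375.256f))); // prints c3bba0c5
        System.out.println(decode(0xC3BBA0C5));
        // Converting the z/OS bytes yields a valid Java float.
        float converted = fromHexFloat(0xC3177419);
        System.out.println(Integer.toHexString(Float.floatToIntBits(converted))); // prints c3bba0c8
    }
}
```

The converted value ends in C8 rather than C5: each format rounded -375.256 to its own nearest representable number, so a round trip through z/OS does not necessarily give back the exact same Java float.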