Friday, November 7, 2014

Announcing LegStar for Apache Avro

LegStar has dealt with mapping COBOL structures to other languages for a long time now. Such a mapping is not always straightforward because COBOL has a number of constructs that set it apart, namely:

  • Deep hierarchies. Structures in COBOL tend to be several levels deep, with the parent hierarchy providing a namespace system (the same field name can appear in several places in a structure).
  • Variable size arrays (arrays whose dimension is given at runtime by another field in the structure).
  • Redefines, which are similar to C unions.
  • Decimal data types, which have a fixed number of digits in the fractional part (fractions whose denominator is a power of ten). I am surprised at how many languages try to get away with double/float as the only way to represent fractions.
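
To illustrate the last point, here is a minimal Java sketch (the class name is made up for illustration) showing why double arithmetic is a poor fit for a COBOL decimal field such as PIC S9(5)V99, and why a fixed-scale type like BigDecimal is a better match:

    import java.math.BigDecimal;

    // Binary floating point cannot represent most decimal fractions exactly,
    // while BigDecimal keeps the fixed number of fractional digits that a
    // COBOL picture clause implies.
    public class DecimalExample {
        public static void main(String[] args) {
            System.out.println(0.10 + 0.20);  // prints 0.30000000000000004
            System.out.println(new BigDecimal("0.10").add(new BigDecimal("0.20")));  // prints 0.30
        }
    }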

Data serialization and RPC

Over the last few years, a series of new data serialization languages for structured data has appeared. They derive mainly from the need to perform fast Remote Procedure Calls (RPC) over a network with complex data being passed back and forth. For high frequency calls, XML or JSON would be too expensive.

Among the RPC serialization languages I came across recently, Protocol Buffers (PB) is one for which we developed a LegStar translator, which was an interesting experience.

Data serialization and Big Data Analytics

Besides RPC, the Big Data Analytics domain is now generating a need for efficient data serialization as well.

One aspect of dealing with large amounts of data is that you need to process records as fast as possible. Traditionally, in the Business Intelligence domain, data was serialized as SQL records or CSV (comma separated) records.

Not surprisingly, all first generation ETL tools dealt primarily with SQL and CSV records.

The problem is that SQL and CSV are very bad at storing complex, structured data, particularly deep hierarchies, arrays and unions.

So when your Big Data is made of such structured data (for instance data originating from a Mainframe), traditional ETL tools are generally suboptimal.

Apache Hadoop, a popular Big Data system, does not impose SQL or CSV. It offers a native serialization mechanism called Writable which allows passing structured data efficiently between Map and Reduce jobs.
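
For instance, a record passed around with the Writable mechanism implements the org.apache.hadoop.io.Writable interface and handles its own binary encoding. The CustomerWritable class below is a hypothetical example, not part of Hadoop:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Hypothetical two-field record serialized with Hadoop's Writable mechanism.
    public class CustomerWritable implements Writable {

        private long customerId;
        private String customerName;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(customerId);
            out.writeUTF(customerName);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            customerId = in.readLong();
            customerName = in.readUTF();
        }
    }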

Even better, there is an alternative approach called Apache Avro, a standalone project similar to Protocol Buffers or Thrift but tightly integrated with Hadoop.
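
With Avro, the record layout is captured in a schema and the data itself is serialized in a compact binary form. As a rough sketch (the Customer schema below is made up for illustration), the Java generic API looks like this:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroSketch {

        public static void main(String[] args) throws IOException {
            // A deliberately simple schema; real schemas can nest records,
            // arrays and unions.
            String json = "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
                    + "{\"name\":\"customerId\",\"type\":\"long\"},"
                    + "{\"name\":\"customerName\",\"type\":\"string\"}]}";
            Schema schema = new Schema.Parser().parse(json);

            GenericRecord record = new GenericData.Record(schema);
            record.put("customerId", 12345L);
            record.put("customerName", "JOHN SMITH");

            // Serialize the record to Avro binary.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
            encoder.flush();
            System.out.println("Serialized " + out.size() + " bytes");
        }
    }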

Importing Mainframe data into Hadoop

In a scenario where you need to import very large volumes of data from a Mainframe and perform analytics on them, it would make sense to use the Avro format to pass the records between the various steps of the analytics process (Map and Reduce jobs in Hadoop parlance).

The reason is that Mainframe records, described by a COBOL copybook, can translate to Avro records while preserving the hierarchy, arrays and redefines. This saves a lot of processing otherwise needed to 'flatten' the COBOL structures to a columnar representation (SQL, CSV, ...).
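
To make this concrete, here is a sketch of the kind of mapping I have in mind. The copybook in the comment and the Java code are illustrations only and do not prejudge the actual rules that legstar.avro will implement: the nested group becomes a nested Avro record and the OCCURS DEPENDING ON array becomes an Avro array (a redefine would typically become an Avro union).

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    /*
     * Hypothetical copybook used for illustration:
     *
     *   01 CUSTOMER-RECORD.
     *      05 CUSTOMER-ID          PIC 9(9).
     *      05 CUSTOMER-ADDRESS.
     *         10 STREET            PIC X(30).
     *         10 CITY              PIC X(20).
     *      05 TRANSACTION-NBR      PIC 9(2).
     *      05 TRANSACTION-AMOUNT   PIC S9(7)V99 COMP-3
     *                              OCCURS 1 TO 10 DEPENDING ON TRANSACTION-NBR.
     */
    public class CopybookToAvroSketch {

        public static Schema customerRecordSchema() {
            // The CUSTOMER-ADDRESS group becomes a nested record.
            Schema address = SchemaBuilder.record("CustomerAddress").fields()
                    .name("street").type().stringType().noDefault()
                    .name("city").type().stringType().noDefault()
                    .endRecord();

            // The variable size array becomes an Avro array; the packed decimal
            // amounts are carried as doubles here purely to keep the sketch short.
            return SchemaBuilder.record("CustomerRecord").fields()
                    .name("customerId").type().longType().noDefault()
                    .name("customerAddress").type(address).noDefault()
                    .name("transactionAmounts").type().array().items().doubleType().noDefault()
                    .endRecord();
        }
    }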

What is left to be done is a COBOL to Avro translator, so I have started a new legstar.avro project.

The project is at an early stage of course, and I intend to describe it further in future installments of this blog. So if you are interested, please stay tuned.
