Tuesday, December 9, 2014

LegStar V2 is underway

LegStar has been around for a while now, 8 years since the first release in 2006.

Since then a lot of things have changed:

  • The Java SDK is now at version 8. It was at version 4 in 2006, and features like generics were still largely unheard of
  • Programming patterns have become common practice
  • Multi-threading techniques have improved and are better understood
  • Use cases that used to center on remote procedure calls to CICS programs now deal with massive imports of mainframe data for Big Data Analytics


Some parts of LegStar are showing their age and I have finally found some time to start rewriting some of the core features. The LegStar core project is the oldest in the product, so this is where I started.

The legstar-core2 project is where I placed the new developments.

You should not consider this a replacement for the current LegStar though:

  • There are far fewer features in legstar-core2 at the moment than in LegStar
  • The API V2 will not be backward compatible 


That second point may come as a surprise to mainframe users, but in the world of open source, breaking compatibility is an "art de vivre". A primary benefit is that the code you get is much cleaner and more readable when it does not need to deal with legacy. As the project evolves, though, we may want to work on a migration guide of some form.

The new legstar-core2 project contains a simplified version of the legstar-cob2xsd module, the venerable COBOL to XML schema translator. The changes in this module are minor so far; neither COBOL nor the XML Schema specification has changed much.

From an architecture standpoint, the major change is that JAXB is no longer central to the conversion process. Until now, conversion always went through the JAXB layer, even when the target was Talend, Pentaho, JSON, or some other format.

Now the conversion logic has been abstracted out into a legstar-base module, with an associated legstar-base-generator module that produces the artifacts legstar-base needs. The legstar-base module can be considered a complete, low-level solution for mainframe to Java conversions; it replaces the legstar-coxbapi and legstar-coxbrt modules.
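To give a flavor of the kind of low-level work involved in mainframe to Java conversion, here is a sketch of decoding a COBOL packed decimal (COMP-3) field into a Java BigDecimal. This is purely illustrative: the class and method names are mine, not the actual legstar-base API.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class PackedDecimalDemo {

    /**
     * Decode a COBOL COMP-3 (packed decimal) field into a BigDecimal.
     * Each byte holds two digit nibbles; the last nibble is the sign
     * (0xD means negative, 0xC or 0xF positive).
     */
    static BigDecimal fromComp3(byte[] bytes, int fractionDigits) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            digits.append((bytes[i] >> 4) & 0x0F);
            if (i < bytes.length - 1) {
                digits.append(bytes[i] & 0x0F); // last low nibble is the sign, skip it
            }
        }
        int sign = (bytes[bytes.length - 1] & 0x0F) == 0x0D ? -1 : 1;
        return new BigDecimal(new BigInteger(digits.toString()), fractionDigits)
                .multiply(BigDecimal.valueOf(sign));
    }

    public static void main(String[] args) {
        // A PIC S9(3)V99 COMP-3 value of -123.45 is encoded as 0x12 0x34 0x5D
        byte[] raw = { 0x12, 0x34, 0x5D };
        System.out.println(fromComp3(raw, 2)); // prints -123.45
    }
}
```

Note how the scale (number of fractional digits) comes from the copybook PICTURE clause, not from the bytes themselves; this is why conversion always needs the COBOL metadata at hand.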

JAXB is still supported, of course, through two new modules, legstar-jaxb and legstar-jaxb-generator, which cover the old legstar-coxbgen features.

Besides the architectural changes, there are two important changes you need to be aware of:

  • The legstar-core2 project is hosted on GitHub, not on Google Code, so source control has moved from SVN to Git.
  • The license is the GNU Affero GPL, which is not as business friendly as the LGPL used by LegStar


Again, this is just the beginning for this new project, and it is likely to make its way into the newest developments first (such as legstar-avro). Over time, I will describe the new features in more detail. In the meantime, please send any feedback.

Fady

Friday, November 7, 2014

Announcing LegStar for Apache Avro

LegStar has dealt with mapping COBOL structures to other languages for a long time now. Such a mapping has not always been straightforward because COBOL has a number of constructs that set it apart, namely:

  • Deep hierarchies. Structures in COBOL tend to be several levels deep with the parent hierarchy providing a namespace system (the same field name can appear in several places in a structure).
  • Variable size arrays (Arrays whose dimension is given at runtime by another field in the structure).
  • Redefines which are similar to C unions.
  • Decimal data types which have a fixed number of digits in the fractional part (fractions whose denominator is a power of ten). I am surprised at how many languages try to get away with double/float as the only way to represent fractions.
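For illustration, here is a small hypothetical copybook (all names invented) that exhibits these constructs: a nested hierarchy, a REDEFINES, a variable size array, and a fixed decimal:

```cobol
      * Hypothetical copybook exhibiting the constructs above
       01  CUSTOMER-RECORD.
           05  CUST-NAME                 PIC X(30).
           05  CUST-CONTACT.
               10  CUST-PHONE            PIC X(20).
      *        Redefines: the same 20 bytes viewed as an email address
               10  CUST-EMAIL REDEFINES CUST-PHONE
                                         PIC X(20).
           05  ACCT-COUNT                PIC 9(2).
      *    Variable size array: dimension given at runtime by ACCT-COUNT
           05  ACCT OCCURS 1 TO 10 DEPENDING ON ACCT-COUNT.
      *        Fixed decimal: 5 integer digits, 2 fractional digits
               10  ACCT-BALANCE          PIC S9(5)V99 COMP-3.
```

None of this has a direct equivalent in CSV or flat SQL rows, which is exactly why the mapping is interesting.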

Data serialization and RPC

Over the last few years, a series of new data serialization languages for structured data have appeared. These derive mainly from the need to perform fast Remote Procedure Calls (RPC) over a network with complex data being passed back and forth. For high-frequency calls, XML or JSON would be too expensive.

Among the RPC serialization languages I came across recently, Protocol Buffers (PB) is one for which we developed a LegStar translator, which was an interesting experience.

Data serialization and Big Data Analytics

Besides RPC, the Big Data Analytics domain is now generating the need for efficient data serialization as well.

One aspect of dealing with large amounts of data is that you need to process records as fast as possible. Traditionally, in the Business Intelligence domain, data was serialized as SQL records or CSV (comma separated) records.

Not surprisingly, all first-generation ETL tools dealt primarily with SQL and CSV records.

The problem is that SQL and CSV are very bad at storing complex, structured data, particularly deep hierarchies, arrays, and unions.

So when your Big Data is made of such structured data (for instance, data originating from a Mainframe), traditional ETL tools are generally suboptimal.

Apache Hadoop, a popular Big Data system, does not impose SQL or CSV. It offers a native serialization mechanism called Writable which allows passing structured data efficiently between Map and Reduce jobs.
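The Writable contract essentially asks a class to serialize itself to a DataOutput and rehydrate itself from a DataInput. Here is a self-contained sketch of the pattern using plain java.io so it runs without Hadoop on the classpath; in real code you would implement org.apache.hadoop.io.Writable instead of the local interface, and the record fields here are invented:

```java
import java.io.*;

public class WritableDemo {

    /** Same contract as Hadoop's org.apache.hadoop.io.Writable. */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** A record that knows how to serialize itself compactly. */
    static class CustomerRecord implements Writable {
        String name;
        long balanceCents;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(name);
            out.writeLong(balanceCents);
        }

        public void readFields(DataInput in) throws IOException {
            name = in.readUTF();
            balanceCents = in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        CustomerRecord original = new CustomerRecord();
        original.name = "ACME";
        original.balanceCents = 12345;

        // Serialize to a byte stream, as Hadoop does between map and reduce tasks
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Rehydrate a fresh instance from the bytes
        CustomerRecord copy = new CustomerRecord();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy.name + " " + copy.balanceCents); // prints ACME 12345
    }
}
```

The point is that the binary form is compact and structure-aware, with none of the parsing overhead of CSV or XML.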

Even better, there is an alternative approach called Apache Avro which is a standalone project similar to Protocol Buffers or Thrift but tightly integrated into Hadoop.

Importing Mainframe data into Hadoop

In a scenario where you need to import very large volumes of data from a Mainframe and perform analytics on them, it would make sense to use the Avro format to pass the records between the various steps of the analytics process (Map and Reduce jobs in Hadoop parlance).

The reason is that Mainframe records, described by a COBOL copybook, can translate to Avro records while preserving the hierarchy, arrays and redefines. This saves a lot of processing otherwise needed to 'flatten' the COBOL structures to a columnar representation (SQL, CSV, ...).
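Sketching the idea (all names invented for the example, and the actual mapping rules may differ): OCCURS can carry over to an Avro array, REDEFINES to an Avro union, and packed decimals to Avro's decimal logical type, so a customer record with a payments table and a phone/email redefine could map to a schema along these lines:

```json
{
  "type": "record",
  "name": "CustomerRecord",
  "fields": [
    { "name": "custName", "type": "string" },
    { "name": "payments", "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Payment",
          "fields": [
            { "name": "amount", "type": { "type": "bytes",
                "logicalType": "decimal", "precision": 7, "scale": 2 } }
          ]
        }
    }},
    { "name": "contact", "type": [
        { "type": "record", "name": "Phone",
          "fields": [ { "name": "phoneNumber", "type": "string" } ] },
        { "type": "record", "name": "Email",
          "fields": [ { "name": "emailAddress", "type": "string" } ] }
    ]}
  ]
}
```

Everything the copybook expresses survives: the nesting, the repeating group, the alternative layouts, and the exact decimal precision.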

What is left to be done is a COBOL to Avro Translator so I have started a new legstar.avro project.

The project is at an early stage, of course, and I intend to describe it further in future installments of this blog. So if you are interested, please stay tuned.