Monday, February 21, 2011

COBOL in a flat world

We just released a version of LegStar for Talend Open Studio. Talend is a well-known ETL tool that also extends into the MDM and ESB spaces.

This is the second ETL tool we interface with. The first one was Pentaho Data Integration, a.k.a. Kettle, for which we released legstar-pdi back in November 2010.

With these experiences behind us, it becomes clear that ETL tools are row-centric. This means that data flowing from one step to another needs to be modeled as a flat, fixed list of fields. This maps exactly to a classical database row. Not surprising, as an ETL tool must ultimately feed database tables somewhere.

When the starting point is COBOL though, where data structures tend to be very hierarchical, fitting into a flat model is challenging. Here are some considerations to keep in mind.

Name conflicts:

Flattening a simple hierarchy such as:

is relatively easy; it intuitively maps to: [ItemB:String, ItemD:Short].

Now what happens for this one (perfectly valid in COBOL):

In a flat model, field names must be unique, so [ItemB:String, ItemB:Short] does not work. You need to disambiguate names, for instance by prefixing the parent name, and produce something like [ItemB:String, ItemC_ItemB:Short].
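The disambiguation described above can be sketched in a few lines of Python. This is only an illustration, not the way LegStar does it: the nested dictionary standing in for the COBOL structure, the ItemA root name, and the `flatten` and `flat_names` helpers are all assumptions made for the example.

```python
def flatten(node, path=()):
    """Yield (path, type) for each elementary item of a nested dict."""
    for name, child in node.items():
        if isinstance(child, dict):
            # Group item: recurse into its children
            yield from flatten(child, path + (name,))
        else:
            # Elementary item: child holds the field type
            yield path + (name,), child


def flat_names(structure):
    """Return a flat list of (unique_name, type) pairs.

    A field keeps its own name when it is unique; otherwise parent
    names are prepended, one level at a time, until it becomes unique.
    """
    used = set()
    fields = []
    for path, ftype in flatten(structure):
        depth = 1
        name = "_".join(path[-depth:])
        while name in used and depth < len(path):
            depth += 1
            name = "_".join(path[-depth:])
        if name in used:
            raise ValueError("cannot disambiguate " + ".".join(path))
        used.add(name)
        fields.append((name, ftype))
    return fields


# Hypothetical stand-in for a COBOL group where ItemB appears twice
record = {"ItemA": {"ItemB": "String", "ItemC": {"ItemB": "Short"}}}
print(flat_names(record))
# [('ItemB', 'String'), ('ItemC_ItemB', 'Short')]
```

Prefixing only when a conflict occurs keeps names short in the common case, at the cost of names that depend on the order in which fields are encountered.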

Arrays:

Assuming a COBOL data item such as:

we have a new issue, since arrays are usually not handled by database schemas.

The solution is to expand each array item into its own field, something like [ItemA_0:String, ItemA_1:String, ItemA_2:String, ItemA_3:String, ItemA_4:String].

This is quite wasteful, but there are no real alternatives here.

In COBOL, the DEPENDING ON clause is often used to limit array sizes and processing time. Here we hit another limitation of flat models: they are usually fixed, in the sense that all declared fields must be present in each row.
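As a rough sketch of that expansion, assuming the array is described by a name, an element type and a number of occurrences (the `expand_array` helper is hypothetical, not part of LegStar):

```python
def expand_array(name, ftype, occurs):
    """Expand an array of `occurs` items into one flat field per item."""
    return [(f"{name}_{i}", ftype) for i in range(occurs)]


fields = expand_array("ItemA", "String", 5)
print([f"{n}:{t}" for n, t in fields])
# ['ItemA_0:String', 'ItemA_1:String', 'ItemA_2:String', 'ItemA_3:String', 'ItemA_4:String']
```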

Filling unused fields with null values is a common technique used to tell downstream steps that fields have no value.
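A minimal sketch of that padding, assuming the items actually present in a variable-size array arrive as a Python list and the flat model reserves a fixed maximum number of slots (the `pad_row` helper is hypothetical):

```python
def pad_row(values, max_occurs):
    """Pad a variable-size array to the fixed width of the flat model.

    Slots beyond the actual item count are filled with None so that
    downstream steps can tell they carry no value.
    """
    if len(values) > max_occurs:
        raise ValueError("more items than the flat model allows")
    return list(values) + [None] * (max_occurs - len(values))


# Only 2 of the 5 possible items are present in this row
print(pad_row(["first", "second"], 5))
# ['first', 'second', None, None, None]
```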

Redefines:

The COBOL REDEFINES clause, a cousin of the C union, is another interesting challenge. Since the flat model is fixed, it can't be dynamically changed depending on which REDEFINES alternative is present.

The best solution here is to manage a different set of flat fields for each combination of alternatives. This can be demonstrated with a simple example:

Here the COM-SELECT field value determines whether COM-DETAIL1 or COM-DETAIL2 is present in the COBOL data.

This would result in 2 field sets (schemas in ETL parlance):

  • set1: [ComSelect:Short, ComName:String]
  • set2: [ComSelect:Short, ComAmount:Decimal]

The number of field sets you need to contemplate depends on the number of alternatives in each REDEFINES group (an item followed by a set of items redefining its location). Furthermore, if a COBOL structure contains multiple such REDEFINES groups, then all combinations are possible. So let's say a COBOL structure has a first group of 3 alternatives and another of 2 alternatives: there are 6 (3 x 2) possible field sets.

Fortunately, the number of REDEFINES groups and the number of alternatives in each group are usually small.

What this all means is that COBOL structures somehow need to be shoehorned to fit the ETL data model. This is an important difference from ESBs, where the data model is usually a much more versatile Java object.
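The enumeration of field sets is a plain Cartesian product over the REDEFINES groups. The sketch below is an illustration under simplifying assumptions: the `field_sets` helper is hypothetical, and it simply places the common fields first and appends the chosen alternatives, ignoring where each group actually sits in the record.

```python
from itertools import product


def field_sets(common, redefines_groups):
    """Build one flat field set per combination of alternatives.

    `common` lists the fields present in every row; each entry of
    `redefines_groups` is a list of alternatives, and each alternative
    is itself a list of fields.
    """
    sets = []
    for combo in product(*redefines_groups):
        fields = list(common)
        for alternative in combo:
            fields.extend(alternative)
        sets.append(fields)
    return sets


# The COM-SELECT example: one group with two alternatives
group = [["ComName:String"], ["ComAmount:Decimal"]]
for fields in field_sets(["ComSelect:Short"], [group]):
    print(fields)
# ['ComSelect:Short', 'ComName:String']
# ['ComSelect:Short', 'ComAmount:Decimal']

# A group of 3 alternatives and a group of 2 give 3 x 2 field sets
three = [["A1"], ["A2"], ["A3"]]
two = [["B1"], ["B2"]]
print(len(field_sets([], [three, two])))
# 6
```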

Tuesday, February 8, 2011

See you at Maven central

Maven central has long been restricted to a few very large players such as the Apache Software Foundation.

This has changed recently thanks to a new free offering by Sonatype for OSS projects.

LegStar has been using Maven from the very beginning and more and more users rely on the availability of artifacts in the LegStar Maven repository.

This proprietary repository is not very secure and not always available. Now that it is possible to push artifacts to Maven central, I have been busy figuring out how to take advantage of this.

The major issue is that Maven central's policy is to host artifacts only if their dependencies are also in central. And of course LegStar has dependencies on OSS libraries which are not in central.

One of the bad players is the Eclipse Foundation. So far, no complete set of Eclipse bundles is available as Maven artifacts. There are lengthy discussions going on but no results yet.

Besides Eclipse, the only other annoying dependency we have in LegStar is WebSphere MQ. Of course, this one being proprietary, it will never make it to central.

In order to bootstrap the process of moving to Maven central, I have started to split the modules into separate release units. The first, very limited, release is now in Maven central. You can see the result at this location by entering the legstar keyword.

It feels good to be part of the big league.