Monday, February 21, 2011

COBOL in a flat world

We just released a version of LegStar for Talend Open Studio. Talend is a well-known ETL tool that also extends into the MDM and ESB spaces.

This is the second ETL tool we interface with. The first one was Pentaho Data Integration, a.k.a. Kettle, for which we released legstar-pdi back in November 2010.

With these experiences behind us, it becomes clear that ETL tools are row-centric. This means that data flowing from one step to another needs to be modeled as a flat, fixed list of fields. This maps exactly to a classical database row. That is not surprising, as an ETL tool must ultimately feed database tables somewhere.

When the starting point is COBOL, though, where data structures tend to be very hierarchical, fitting into a flat model is challenging. Here are some considerations to keep in mind.

Name conflicts:

Flattening a simple hierarchy such as:

is relatively easy; it maps intuitively to: [ItemB:String, ItemD:Short].

Now what happens for this one (perfectly valid in COBOL):

In a flat model, field names must be unique so [ItemB:String, ItemB:Short] does not work. You need to disambiguate names and produce something like [ItemB:String, ItemC_ItemB:Short].
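
The disambiguation can be sketched in a few lines of Python. This is only an illustration, not how LegStar does it: the nested dictionaries standing in for COBOL groups and the `leaves` and `flatten` helpers are hypothetical. A field keeps its bare name when it is free, and is otherwise prefixed with as many parent names as needed to become unique.

```python
def leaves(node, path=()):
    """Yield (path, type) pairs for every elementary item in a nested group."""
    for name, value in node.items():
        if isinstance(value, dict):
            # A group item: descend and remember the parent name.
            yield from leaves(value, path + (name,))
        else:
            yield path + (name,), value


def flatten(structure):
    """Return a flat list of (unique_name, type) pairs."""
    used = set()
    fields = []
    for path, type_name in leaves(structure):
        # Start with the bare item name, then prepend parents until unique.
        depth = 1
        name = path[-1]
        while name in used and depth < len(path):
            depth += 1
            name = "_".join(path[-depth:])
        if name in used:
            raise ValueError(f"cannot disambiguate {'.'.join(path)}")
        used.add(name)
        fields.append((name, type_name))
    return fields


# Hypothetical stand-ins for the two COBOL hierarchies discussed above.
simple = {"ItemB": "String", "ItemC": {"ItemD": "Short"}}
conflicting = {"ItemB": "String", "ItemC": {"ItemB": "Short"}}

print(flatten(simple))
print(flatten(conflicting))
```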

Arrays:

Assuming a COBOL data item such as:

we have a new issue, since arrays are usually not supported by database schemas.

The solution is to expand each array item into a separate field, something like [ItemA_0:String, ItemA_1:String, ItemA_2:String, ItemA_3:String, ItemA_4:String].

This is quite wasteful, but there are no real alternatives here.
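
As a rough sketch of that expansion (the `expand_array` helper is hypothetical and only mirrors the naming used in this post), each occurrence of the array becomes its own field with an index suffix:

```python
def expand_array(name, type_name, occurs):
    """Expand an array item into one flat field per occurrence."""
    return [(f"{name}_{i}", type_name) for i in range(occurs)]


# A 5-element array of strings named ItemA, as in the example above.
for field in expand_array("ItemA", "String", 5):
    print(field)
```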

In COBOL, the DEPENDING ON clause is often used to limit array sizes and processing time. Here we hit another limitation of flat models: they are usually fixed, in the sense that all declared fields must be present in each row.

Filling unused fields with null values is a common technique used to tell downstream steps that fields have no value.
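
A minimal sketch of that padding, assuming the row is a plain Python list and that `None` plays the role of null (the `pad_row` name is ours, for illustration only):

```python
def pad_row(values, max_occurs):
    """Pad a variable-size array to the fixed width of the flat model."""
    if len(values) > max_occurs:
        raise ValueError("more items than the declared maximum")
    # Unused trailing fields are set to None so downstream steps see nulls.
    return list(values) + [None] * (max_occurs - len(values))


# Only 2 of the 5 possible occurrences are present in this record.
print(pad_row(["first", "second"], 5))
```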

Redefines:

The COBOL REDEFINES clause, a cousin of the C union, is another interesting challenge. Since the flat model is fixed, it can't be changed dynamically depending on which REDEFINES alternative is present.

The best solution here is to manage a different set of flat fields for each combination of alternatives. This can be demonstrated with a simple example:

Here the COM-SELECT field value determines if COM-DETAIL1 or COM-DETAIL2 is present in the COBOL data.

This would result in 2 field sets (schemas in ETL parlance):

  • set1: [ComSelect:Short, ComName:String]
  • set2: [ComSelect:Short, ComAmount:Decimal]
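
To make this concrete, here is a small sketch of how a step could pick the field set for a record. The two sets are the ones listed above; which COM-SELECT value maps to which alternative is not stated here, so the 0 and 1 used below are purely an assumption, as is the `field_set_for` helper.

```python
FIELD_SETS = {
    "set1": [("ComSelect", "Short"), ("ComName", "String")],
    "set2": [("ComSelect", "Short"), ("ComAmount", "Decimal")],
}

# Assumption for illustration: 0 selects COM-DETAIL1, 1 selects COM-DETAIL2.
SELECTOR_TO_SET = {0: "set1", 1: "set2"}


def field_set_for(com_select):
    """Return the flat field set matching a COM-SELECT value."""
    try:
        return FIELD_SETS[SELECTOR_TO_SET[com_select]]
    except KeyError:
        raise ValueError(f"no field set for COM-SELECT={com_select}") from None


print(field_set_for(0))
print(field_set_for(1))
```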

The number of field sets you need to contemplate depends on the number of alternatives in each REDEFINES group (an item followed by a set of items redefining its location). Furthermore, if a COBOL structure contains multiple such REDEFINES groups, then all combinations are possible. So let's say a COBOL structure has a first group of 3 alternatives and another of 2 alternatives: there are 6 (3 x 2) possible field sets.
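
The combinatorics can be checked with a short sketch. The alternative names below are made up; only the 3 x 2 shape comes from the example above. Each combination picks one alternative per group and corresponds to one field set.

```python
from itertools import product


def field_set_combinations(groups):
    """Return every way of picking one alternative from each REDEFINES group."""
    return list(product(*groups))


# Hypothetical groups: one with 3 alternatives, one with 2.
groups = [
    ["A1", "A2", "A3"],
    ["B1", "B2"],
]

combinations = field_set_combinations(groups)
print(len(combinations))
for combination in combinations:
    print(combination)
```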

Fortunately, the number of REDEFINES groups and the number of alternatives in each group are usually small.

What this all means is that COBOL structures somehow need to be shoehorned into the ETL data model. This is an important difference from ESBs, where the data model is usually a much more versatile Java object.

2 comments:

  1. So, how do I tell the toHost transformer to use a particular redefine and avoid the error: P0ER: Host Transform Exception: "No alternative found for choice element P0ErTelephoneChoice"

  2. This is excellent information. I got some new ideas. Thank you for your post.
