Monday, October 25, 2010

LegStar for Pentaho Data Integration (Kettle)

I have spent the last few weeks digging into open source ETL tools.

The three most often mentioned products are:

  • CloverETL
  • Pentaho Data Integration (PDI)
  • Talend Open Studio

It turns out Clover is only partially open source. The GUI is proprietary. So I spent more time on PDI and Talend.

PDI is the oldest product and, perhaps as a consequence, has the largest community. You can get a sense of that by comparing the Ohloh page for PDI to the Ohloh page for Talend.

But if you compare new threads per day on the PDI forum to those on the Talend forum, you can see that Talend is doing well too.

I decided to try out PDI first and developed a proof-of-concept implementation of LegStar for PDI.

You can see the result on Google Code, as usual.

I have to say that I am very impressed with PDI, a product originally called Kettle and developed by Matt Casters.

PDI comes with a framework for people to develop additional plugins. For those who are interested, there is an excellent blog entry by Slawomir Chodnicki to get started.

I was able to reuse part of the PDI internal test framework to automate testing of my plugin. I have automated unit tests and integration tests.

It is also quite easy to deploy new plugins in PDI. It is a matter of packaging the plugin as a jar and dropping it into a particular location.
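For the LegStar plugin, that meant a layout roughly like the one below under the PDI installation folder (folder and jar names are just an example, and the exact location may vary with the PDI version):

    data-integration/                <- PDI installation folder
        plugins/
            steps/
                legstar-pdi/         <- one sub-folder per step plugin
                    legstar-pdi.jar  <- the plugin jar
                    lib/             <- third-party jars the plugin depends on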

As usual, it is extremely helpful that the product is open source. In particular, I could easily debug my plugin in Eclipse, stepping through PDI code as well as my code.

My only regrets with PDI are that there is little Maven support and that the code is often not commented. That said, this did not prevent me from using Maven for all lifecycle phases of my plugin, and I was able to find my way around the PDI code, which is usually readable enough.

PDI also has support for parallel processing and clustering that I have not explored yet. I am looking forward to playing with these features next.

Tuesday, October 12, 2010

VSAM to CSV using LegStar

I was wondering how hard it would be to translate a z/OS VSAM file content to a CSV file using LegStar.

So I started with a very trivial case:

CICS comes with a sample VSAM KSDS file called FILEA (see hlq.CICS.SDFHINST(DFHDEFDS)). FILEA is a fixed-size file with 80-character records.

The records in FILEA are described by a COBOL copybook called DFH0CFIL (in hlq.CICS.SDFHSAMP). The content of DFH0CFIL looks like this:
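(Roughly, from the standard CICS sample; check the member on your system for the exact level numbers and picture clauses.)

       01  FILEA.
           02  FILEREC.
               03  STAT          PIC X.
               03  NUMB          PIC X(6).
               03  NAME          PIC X(20).
               03  ADDRX         PIC X(20).
               03  PHONE         PIC X(8).
               03  DATEX         PIC X(8).
               03  AMOUNT        PIC X(8).
               03  COMMENT       PIC X(9).

The field lengths add up to 1 + 6 + 20 + 20 + 8 + 8 + 8 + 9 = 80 characters, which is exactly the FILEA record size.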

My first step was to extract the VSAM file content to a sequential file using a JCL like this one on z/OS:
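(A plain IDCAMS REPRO does the job; the data set names below are placeholders for whatever applies on your system.)

    //EXTRACT  JOB (ACCT),'FILEA TO SEQ',CLASS=A,MSGCLASS=X
    //*--- COPY THE FILEA VSAM KSDS TO A SEQUENTIAL DATA SET ---*
    //STEP1    EXEC PGM=IDCAMS
    //SYSPRINT DD  SYSOUT=*
    //VSAMIN   DD  DSN=HLQ.CICS.FILEA,DISP=SHR
    //SEQOUT   DD  DSN=HLQ.FILEA.SEQ,
    //             DISP=(NEW,CATLG,DELETE),
    //             UNIT=SYSDA,SPACE=(TRK,(1,1),RLSE),
    //             DCB=(RECFM=FB,LRECL=80,BLKSIZE=8000)
    //SYSIN    DD  *
      REPRO INFILE(VSAMIN) OUTFILE(SEQOUT)
    /*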

I then downloaded the sequential file to my workstation using regular FTP in binary mode:
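(Host name and data set name below are placeholders; the usual user/password prompts are not shown.)

    C:\> ftp zos.example.com
    ftp> binary
    ftp> get 'HLQ.FILEA.SEQ' filea.seq
    ftp> quit

Binary mode matters here: the records come down untouched, still in EBCDIC, exactly 80 bytes each, with no line separators added by the transfer.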

On the Java side now, I created a LegStar Transformer from the DFH0CFIL COBOL copybook. This results in a FilerecTransformers class.

The following code snippet is what I needed to write to get my CSV file:
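(A rough sketch of it: the toJava(byte[]) call and the Filerec getters below are assumptions about what the LegStar generator produces, and imports of the generated classes are omitted.)

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.FileWriter;
    import java.io.InputStream;
    import java.io.Writer;

    public class FileaToCsv {

        public static void main(String[] args) throws Exception {
            /* Transformer generated by LegStar from the DFH0CFIL copybook. */
            FilerecTransformers transformers = new FilerecTransformers();

            InputStream in = new BufferedInputStream(new FileInputStream("filea.seq"));
            Writer out = new FileWriter("filea.csv");
            byte[] hostRecord = new byte[80]; /* FILEA records are 80 bytes, fixed length. */

            /* A real version would handle partial reads and escape separators. */
            while (in.read(hostRecord) == hostRecord.length) {
                /* Turn the EBCDIC record into a Java value object. */
                Filerec filerec = (Filerec) transformers.toJava(hostRecord);

                /* One CSV line per record. */
                StringBuilder line = new StringBuilder();
                line.append(filerec.getStat().trim()).append(',');
                line.append(filerec.getNumb().trim()).append(',');
                line.append(filerec.getName().trim()).append(',');
                line.append(filerec.getAddrx().trim()).append(',');
                line.append(filerec.getPhone().trim()).append(',');
                line.append(filerec.getDatex().trim()).append(',');
                line.append(filerec.getAmount().trim()).append(',');
                line.append(filerec.getComment().trim()).append('\n');
                out.write(line.toString());
            }
            out.close();
            in.close();
        }
    }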

Of course, this is a very contrived example, both because the VSAM file has fixed-length records and because the record data is made of character fields only.

In a more realistic case, the records will contain all sorts of numerics (packed, zoned, or edited), and chances are that redefines and arrays will complicate the setting further. This is where LegStar should really become useful.
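Just to give an idea, a hypothetical copybook along these lines (invented for illustration, not one of the CICS samples) is where the byte-level conversion stops being trivial:

       01  CUSTOMER-REC.
      *    A packed (COMP-3) key and a signed amount with implied decimals.
           05  CUST-ID           PIC 9(8)       COMP-3.
           05  CUST-BALANCE      PIC S9(11)V99  COMP-3.
      *    The same 48 bytes read two different ways.
           05  CUST-DETAIL       PIC X(48).
           05  CUST-PERSON       REDEFINES CUST-DETAIL.
               10  LAST-NAME     PIC X(25).
               10  FIRST-NAME    PIC X(23).
      *    A small array.
           05  CUST-PHONES       OCCURS 3 TIMES PIC X(10).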

There are other interesting questions when you get to that point:

  • Assuming you need to regularly extract the content of a file, how do you automate a distributed process like this one?
  • How would you do that reliably, meaning you neither process the same data twice nor miss part of it?

ETL (Extract, Transform, Load) tools are typically focused on these issues. I became quite curious about them recently.