Sunday, November 27, 2011

LegStar for PDI, a primer

Introduction

Mainframes, typically IBM System z machines running z/OS, are still operating in many large corporations and will probably continue to do so for the foreseeable future.

Mainframe file systems contain huge amounts of data that are still waiting to be unlocked and made available to modern applications.

Traditionally, mainframe data is processed by batch programs often written in COBOL. Usually several batch programs are organized as sequential steps in a flow. On z/OS, the flow description language is JCL, a rather complex and proprietary language.

Assuming you would like to exploit mainframe data but would rather not write COBOL or JCL, this article shows an approach harnessing the power of ETL tools such as Pentaho Data Integration. With this type of solution, mainframe data can be made available to a very large and growing set of technologies.

The primary destinations of mainframe data are BI systems, data warehouses and so forth. Such systems impose constraints on data models in order to achieve the usability and performance levels that users expect when they run complex queries. Mainframe data, on the other hand, being organized for OLTP activity and storage efficiency, is rarely laid out in a BI-friendly way. Hence the need to transform it.

In this article we will walk you through a rather common use case where mainframe data is optimized to reduce storage and needs to be normalized with the help of ETL technology.

While this type of transformation has been possible in the past, the novelty here is that we can now achieve identical results for a fraction of the cost, using Open Source technologies.

Use case

In our use case the source data is stored in a mainframe sequential file (QSAM in mainframe parlance).

Records in such files are not delimited by a special character, such as a carriage return or line feed, as is common on distributed systems. Furthermore, a record's content is a mix of character and non-character data. Characters are usually encoded in EBCDIC, while non-character fields hold various forms of numeric data, often encoded in mainframe-specific formats such as packed decimal (COMP-3).

Mainframe file records are often variable in size. This was important to save storage resources at a time when these were very expensive.

Although there are yet more difficulties involved in interpreting mainframe data, it should be clear by now that you can’t do so without some metadata that describes the records. On mainframes, such metadata is often a COBOL copybook. A copybook is a fragment of COBOL code that describes a data structure (very similar to a C structure). Here is the sample copybook we will use throughout this article:

       01  CUSTOMER-DATA.
           05 CUSTOMER-ID                    PIC 9(6).
           05 PERSONAL-DATA.
              10 CUSTOMER-NAME               PIC X(20).
              10 CUSTOMER-ADDRESS            PIC X(20).
              10 CUSTOMER-PHONE              PIC X(8).
           05 TRANSACTIONS.
              10 TRANSACTION-NBR             PIC 9(9) COMP.
              10 TRANSACTION OCCURS 0 TO 5
                 DEPENDING ON TRANSACTION-NBR.
                 15 TRANSACTION-DATE         PIC X(8).
                 15 TRANSACTION-AMOUNT       PIC S9(13)V99 COMP-3.
                 15 TRANSACTION-COMMENT      PIC X(9).

Things to notice about this record description are:

  • This is a hierarchy four levels deep (levels 01, 05, 10 and 15)
  • PIC X(n) denotes text fields containing EBCDIC-encoded characters
  • CUSTOMER-ID, TRANSACTION-NBR and TRANSACTION-AMOUNT are three different numeric encodings: zoned decimal, binary and packed decimal respectively (decoded by hand in the sketch below)
  • The array described with OCCURS and DEPENDING ON is a variable-size array whose actual size is given by the TRANSACTION-NBR variable.
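
To get a feel for these encodings, here is a minimal, hand-rolled decoder for the three numeric pictures. This is only an illustration of the byte layouts, not the LegStar code (which handles many more cases); the test values are lifted from the hex dumps shown below:

    import java.math.BigDecimal;
    import java.math.BigInteger;
    import java.nio.charset.Charset;

    public class CobolNumerics {

        // PIC 9(6) DISPLAY: zoned decimal, one EBCDIC digit (X'F0'..X'F9') per byte
        static long zoned(byte[] bytes) {
            long value = 0;
            for (byte b : bytes) {
                value = value * 10 + (b & 0x0F); // the low nibble holds the digit
            }
            return value;
        }

        // PIC 9(9) COMP: big-endian binary integer on 4 bytes
        static long binary(byte[] bytes) {
            long value = 0;
            for (byte b : bytes) {
                value = (value << 8) | (b & 0xFF);
            }
            return value;
        }

        // PIC S9(13)V99 COMP-3: packed decimal, 2 digits per byte, sign in the last nibble
        static BigDecimal packed(byte[] bytes, int scale) {
            StringBuilder digits = new StringBuilder();
            for (int i = 0; i < bytes.length; i++) {
                digits.append((bytes[i] & 0xF0) >> 4);
                if (i < bytes.length - 1) {
                    digits.append(bytes[i] & 0x0F);
                }
            }
            int signNibble = bytes[bytes.length - 1] & 0x0F; // X'D' means negative
            BigDecimal value = new BigDecimal(new BigInteger(digits.toString()), scale);
            return signNibble == 0x0D ? value.negate() : value;
        }

        public static void main(String[] args) {
            // values lifted from the second record of the hex dump below
            System.out.println(zoned(new byte[] { (byte) 0xF0, (byte) 0xF0, (byte) 0xF0,
                    (byte) 0xF0, (byte) 0xF0, (byte) 0xF2 }));                 // 2
            System.out.println(binary(new byte[] { 0x00, 0x00, 0x00, 0x04 })); // 4
            System.out.println(packed(new byte[] { 0x00, 0x00, 0x00, 0x00, 0x00,
                    0x03, 0x68, 0x2C }, 2));                                   // 36.82
            // text fields simply decode with an EBCDIC charset (most JREs ship them)
            System.out.println(new String(new byte[] { (byte) 0xC6, (byte) 0xD9,
                    (byte) 0xC5, (byte) 0xC4 }, Charset.forName("IBM01140"))); // FRED
        }
    }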

Now, if we take a peek with a hexadecimal editor at a file record containing data described by this COBOL copybook, we see something like this:

    00000000h: F0 F0 F0 F0 F0 F1 C2 C9 D3 D3 40 E2 D4 C9 E3 C8 ;
    00000010h: 40 40 40 40 40 40 40 40 40 40 C3 C1 D4 C2 D9 C9 ;
    00000020h: C4 C7 C5 40 40 40 40 40 40 40 40 40 40 40 F3 F8 ; 
    00000030h: F7 F9 F1 F2 F0 F6 00 00 00 00                   ;

This first record is only 58 bytes long. This is because the TRANSACTION-NBR field contains a value of zero, hence there are no array items stored.

The second record, which starts at offset 0x3a (58), looks like this:

    0000003ah: F0 F0 F0 F0 F0 F2 C6 D9 C5 C4 40 C2 D9 D6 E6 D5 ; 
    0000004ah: 40 40 40 40 40 40 40 40 40 40 C3 C1 D4 C2 D9 C9 ;
    0000005ah: C4 C7 C5 40 40 40 40 40 40 40 40 40 40 40 F3 F8 ; 
    0000006ah: F7 F9 F1 F2 F0 F6 00 00 00 04 F3 F0 61 F1 F0 61 ; 
    0000007ah: F1 F0 00 00 00 00 00 03 68 2C 5C 5C 5C 5C 5C 5C ; 
    0000008ah: 5C 5C 5C F3 F0 61 F1 F0 61 F1 F0 00 00 00 00 00 ; 
    0000009ah: 17 59 3C 5C 5C 5C 5C 5C 5C 5C 5C 5C F3 F0 61 F1 ; 
    000000aah: F0 61 F1 F0 00 00 00 00 00 11 49 2C 5C 5C 5C 5C ; 
    000000bah: 5C 5C 5C 5C 5C F1 F0 61 F0 F4 61 F1 F1 00 00 00 ; 
    000000cah: 00 00 22 96 5C 5C 5C 5C 5C 5C 5C 5C 5C 5C       ; 

This one is 158 bytes long because the TRANSACTION-NBR field contains a value of 4. There are 4 items in the variable size array.

As you can see, there is not a single byte of excess storage in that file!
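
Put differently, a program that walks this file by hand (say, after a binary transfer to a workstation) has to know the copybook just to find where each record ends. Here is a minimal sketch of that bookkeeping, assuming the file strictly follows the copybook above:

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class RecordWalker {

        static final int FIXED_PART = 58;       // bytes up to and including TRANSACTION-NBR
        static final int TRANSACTION_SIZE = 25; // 8 (date) + 8 (COMP-3 amount) + 9 (comment)

        public static void main(String[] args) throws IOException {
            DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
            long offset = 0;
            while (true) {
                byte[] fixed = new byte[FIXED_PART];
                try {
                    in.readFully(fixed);
                } catch (EOFException e) {
                    break; // no more records
                }
                // TRANSACTION-NBR is a 4-byte big-endian binary starting at offset 54
                int transactions = ((fixed[54] & 0xFF) << 24) | ((fixed[55] & 0xFF) << 16)
                        | ((fixed[56] & 0xFF) << 8) | (fixed[57] & 0xFF);
                in.readFully(new byte[transactions * TRANSACTION_SIZE]); // skip the array items
                int length = FIXED_PART + transactions * TRANSACTION_SIZE;
                System.out.println("record at offset " + offset + ": " + transactions
                        + " transaction(s), " + length + " bytes");
                offset += length;
            }
            in.close();
        }
    }

Run against our sample file, this prints a 58-byte record at offset 0 followed by a 158-byte record at offset 58. This is essentially what the z/OS File Input step described below automates, driven by the copybook rather than by hand-coded offsets.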

Now let us assume this file was transferred in binary mode to our workstation and that we want to create an Excel worksheet out of its content.

Building a simple PDI Transform

To transform the mainframe file into an Excel worksheet, we will be using Pentaho Data Integration (PDI) and the LegStar plugin for PDI. LegStar for PDI provides the COBOL processing capabilities that are needed to transform the mainframe data into PDI rows.

The PDI community edition product is open source and freely available from this download link. This article was written using version 4.1.0.

LegStar for PDI is also an open source product, freely available at this download link. For this article, we used release 0.4.

Once you have downloaded legstar-pdi, you need to unzip the archive into the PDI plugins/steps folder. This adds the z/OS File Input plugin to PDI's standard plugins.

The PDI GUI designer is called Spoon; it is started with the spoon.bat or spoon.sh script.

On the first Spoon screen we create a new Transformation (using menu option File→New→Transformation).

From the designer palette’s Input folder, we drag and drop the z/OS file input step:

This will be the first step of our transformation process. Let us double click on it to bring up the settings dialog:

On the File tab, we pick the z/OS binary file that was transferred to our workstation. Since this is a variable length record file, we check the corresponding box on the dialog.

This file does not contain record descriptor words (RDWs). RDWs are 4 bytes that z/OS adds in front of each variable-length record; when they are present, LegStar can process the file more efficiently.
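
As a side note, when the RDWs have been preserved by the file transfer, a reader does not even need the copybook to frame records: each record announces its own length. A minimal reader sketch, assuming the conventional RDW layout (a big-endian halfword holding the record length, RDW included, followed by two bytes used for spanned records):

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class RdwReader {
        public static void main(String[] args) throws IOException {
            DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
            while (true) {
                int length;
                try {
                    length = in.readUnsignedShort(); // record length, RDW included
                } catch (EOFException e) {
                    break;
                }
                in.readUnsignedShort();              // reserved / spanned-record bytes
                byte[] record = new byte[length - 4];
                in.readFully(record);
                // hand the raw record bytes over to the COBOL-aware processing
            }
            in.close();
        }
    }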

The z/OS character set is the EBCDIC encoding used for text fields. For French EBCDIC, for instance, with accented characters and the euro sign, you would pick IBM01147.

We now select the COBOL tab and copy/paste the COBOL structure that describes our file records:

At this stage, we need to click the Get Fields button which will start the process of translating the COBOL structure into a PDI row.

A PDI row is a list of fields, similar to a database row. The row model is fundamental in ETL tools as it nicely maps to the RDBMS model.

The Fields tab shows the result:

As you can see, several things happened:

  • The COBOL hierarchy has been flattened (LegStar has a name conflict resolution mechanism)
  • Data items, according to their COBOL type, have been mapped to Strings, Integers or BigNumbers with the appropriate precision
  • The array items have been flattened using the familiar _n suffix where n is the item index in the array

We are now done setting up the z/OS file input step, but before we continue building our PDI Transformation it is a good idea to use one of the great features in PDI: the Preview capability. The Preview button should now be enabled; click it, select the number of rows you would like to preview, and you should see this result:

Time to go back to the PDI Transformation, add an Excel output step and create a hop between the z/OS file input step and the Excel output step:

You can now run the Transformation; this requires saving your work first. In our case we named our Transformation rcus-simple.ktr. A .ktr file is an XML document that completely describes the PDI Transformation.

On the launch dialog we simply click Launch, and the result shows up like this:

As you can see, 10000 records were read off the z/OS file and an identical number of rows were written in the Excel worksheet (plus a header row).

It is time to take a look at the Excel worksheet we created:

Everything is in there, but you might notice that the variable-size array results in a lot of columns and a lot of empty cells, since we need to fill all columns. Indexed column names and sparsely filled cells make for a worksheet that is hard to play with.

This reveals the fundamental issue with mainframe data models: they were not intended for end users to see. So putting such raw data in an Excel worksheet is unlikely to be satisfactory.

This is where ETL tools really show their value. To illustrate the point, we will next enhance our transformation to get rid of the variable-size array effect.

Enhancing the PDI Transformation

Our first enhancement is to reduce the number of columns. We will apply a normalization transformation that is best described with an example.

This is a partial view of our current result row:

CustomerId  CustomerName  TransactionAmount_0  TransactionAmount_1  TransactionAmount_2  TransactionAmount_3
2           FRED BROWN    36,82                175,93               114,92               229,65

The COBOL array flattening has had the effect of multiplying the number of columns. Here would be a more desirable, normalized, view of that same data:

CustomerId  CustomerName  TransactionIdx  TransactionAmount
2           FRED BROWN    0               36,82
2           FRED BROWN    1               175,93
2           FRED BROWN    2               114,92
2           FRED BROWN    3               229,65

What happened here is that columns were traded for rows. Instead of 5 TransactionAmount_n columns there is a single one, and the new TransactionIdx column identifies each transaction. The result is a normalized table in the sense of the first normal form in RDBMS theory.

Normalizing has an effect on volumes of course but the result is much easier to manipulate with traditional RDBMS semantics.
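
For readers who prefer code to screenshots, this is roughly what the normalization does, sketched in plain Java with only a couple of the columns shown (in PDI itself this is configured in the step dialog, no coding involved):

    import java.math.BigDecimal;
    import java.util.ArrayList;
    import java.util.List;

    public class NormalizeSketch {

        static class FlatRow {          // one row as produced by the z/OS file input step
            long customerId;
            String customerName;
            String[] transactionDates = new String[5];
            BigDecimal[] transactionAmounts = new BigDecimal[5];
        }

        static class NormalizedRow {    // one row per transaction slot
            long customerId;
            String customerName;
            int transactionIdx;
            String transactionDate;
            BigDecimal transactionAmount;
        }

        static List<NormalizedRow> normalize(FlatRow flat) {
            List<NormalizedRow> rows = new ArrayList<NormalizedRow>();
            for (int i = 0; i < 5; i++) {
                NormalizedRow row = new NormalizedRow();
                row.customerId = flat.customerId;
                row.customerName = flat.customerName;
                row.transactionIdx = i;
                row.transactionDate = flat.transactionDates[i];     // may be empty
                row.transactionAmount = flat.transactionAmounts[i];
                rows.add(row); // empty transaction slots still produce a row at this stage
            }
            return rows;
        }
    }

Note that at this stage every incoming row still yields 5 outgoing rows, empty transaction slots included, which is why the row count grows as shown below.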

Let us now modify our PDI Transformation and introduce a Normalizer step (Palette’s Transform category):

Setting up the Normalizer involves specifying the new TransactionIdx column and then mapping each indexed column to a TransactionIdx value and to the single column that replaces the repeated group:

If we now run our transformation, this is what the result looks like:

This is already much nicer and easier to manipulate. From the original 20 columns, we are now down to 9.

The PDI execution statistics should show that 50000 rows were created in the Excel worksheet out of the 10000 that we read from the z/OS file. This is a negative effect of normalizing that we should now try to alleviate.

You might notice that the Excel worksheet still contains a large number of empty cells corresponding to empty transactions.

Our next step will be to get rid of these empty transactions. For that purpose, we will use the PDI Filter Rows step (under the palette’s Flow category).

The Filter step will be set up to send empty transaction rows to the trash can and forward rows with transactions to the Excel worksheet. The PDI equivalent of a trash can is the Dummy step, also found under the Flow category, so we go ahead and add it to the canvas too:

Let us now double click on the Filter step to bring up its settings dialog:

Here we specify that the filter condition is true if the TransactionDate column is not empty. Back on the canvas, we can now create 2 hops: one for the true condition, which leads to the Excel worksheet, and one for the false condition, which leads to the Dummy step:

We are now ready to execute the PDI Transformation. The metrics should display something like this:

Now, only 25163 rows made it to the Excel worksheet while 24837 were trashed. The resulting Excel worksheet is finally much closer to what an end user might expect:


Conclusion

In this article we have seen how mainframe data, which is by construction obscure and hardly usable by non-programmers, can be transformed into something as easy to manipulate as an Excel worksheet.

Of course, the example given remains simplistic compared to real-life COBOL structures and mainframe data organizations, but we have seen a small part of the PDI capabilities, some of which are quite powerful.

Historically, the type of features you have seen in this article was only available from very expensive and proprietary products. The fact that you can now do a lot of the same things entirely with open source software will hopefully trigger many more opportunities to exploit the massive amount of untapped mainframe data.

We hope that readers with a mainframe background, as well as readers with an open systems background, will find this useful and come up with new ideas for Open Source mainframe integration solutions.

Friday, November 18, 2011

A LegStar Commercial License

Enterprises still making intensive use of mainframes these days tend to be very large. Although most of these companies are more and more comfortable using open source software, they are not comfortable at all with running unsupported code in production.

That creates a specific problem for LegStar since it is both Open Source and primarily of use to such large companies.

After receiving several requests from customers, we at LegSem started looking into building some kind of commercial offering around LegStar.

Of course, we are not the first open source company to bang our head against this thorny issue; InfoWorld has an honest and funny write-up about this.

So this is what we came up with:

Committed to Open Source:

  • First, LegStar is and will remain Open Source
  • The core features will stay under the permissive LGPL as they are today and the advanced features will be GPL
  • There won't be an enterprise product separate from the Open Source one. This means every single line of code will be available under an Open Source license

For Customers who need it:

  • Besides the Open Source licenses, there will be a Commercial License available. Yes, this is a dual-licensing scheme
  • The Commercial License describes the level of support that LegSem, or a business partner, provides
  • LegSem will also shield Commercial Licensees from backward compatibility issues

This last point requires a little bit of explanation:

As we add new features or fix bugs in LegStar we generally pay attention to backward compatibility but we don't necessarily test it. As a result, new releases are not guaranteed to be backward compatible with your own developments. With the commercial offering, we will do additional, less frequent, releases that we will explicitly check for backward compatibility. These extra releases will only be available to commercial licensees.

LegSem is a small company and will not cover the entire world with that commercial license. What we intend to do is work with business partners in territories where we are not present. So if your company would be interested in entering this type of deal with us, please send a mail to contact@legsem.com.




Thursday, October 13, 2011

Finally a LegStar JCA Connector

This article refers to JCA, the Java Connector Architecture, not the Java Cryptography Architecture.

When we considered J2EE several years ago as a potential target for mainframe integration, I was horrified and decided not to pursue that route. This proved to be the right choice; J2EE's reputation got so bad that the name was later changed to Java EE.

The idea that overly complex technologies provide great opportunities to sell services and consulting is not new. Some very large companies made a fortune this way. Some of these same companies were very active during the J2EE specification effort...

Of course J2EE completely failed to become the universal web engine it was meant to be, but it made it into most large enterprise systems. Or at least part of the technology was adopted by large IT departments, with some of the complexity alleviated thanks to frameworks like Spring.

So the reality today is that most companies still using mainframes also use Java EE.

Another interesting development concerns the latest Java EE specifications (5 and 6). I don't know if the J2EE designers were humbled by their past failures but I have to say they did a much better job this time. Today, the technology is almost usable without Spring. For instance, you can inject a JNDI resource with a single annotation which would have taken about 10 lines of Java code previously. If you want to learn more about the latest Java EE, I recommend reading Antonio Goncalves's blog.
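
To give an idea of the difference, here is the kind of code this saves. The snippet injects a JNDI resource with a single annotation and shows the manual lookup it replaces (the DataSource and its JNDI name are made up for the example; a JCA connection factory is injected the same way):

    import javax.annotation.Resource;
    import javax.naming.InitialContext;
    import javax.naming.NamingException;
    import javax.sql.DataSource;

    public class JndiInjectionExample {

        // Java EE 5 and later: the container injects the resource
        @Resource(name = "jdbc/SomeDataSource")
        private DataSource injected;

        // the old way: explicit JNDI plumbing and checked exceptions
        private DataSource lookedUp() throws NamingException {
            InitialContext ctx = new InitialContext();
            try {
                return (DataSource) ctx.lookup("java:comp/env/jdbc/SomeDataSource");
            } finally {
                ctx.close();
            }
        }
    }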

So with Java EE specs getting better and almost all mainframe shops using it, I thought it was time for LegStar to start supporting Java EE containers. The result is part of a series of extensions to LegStar that I am working on.

The first deliverable is a JCA Resource Adapter that uses CICS Sockets as its underlying connectivity.

It conforms to the JCA 1.0 specifications but does not implement some of the CCI chapter (which is optional anyway). In practice we do not support CCI Records, which are awkward indexed or mapped objects. In my view, Records are not suitable for representing COBOL structures. With the LegStar JCA connector, instead of Records, you simply use regular LegStar Transformers.

I am convinced that separating Resource Adapter functionalities (connection pooling, transactions and security) from transformation is a much better approach than what most JCA connectors available on the market do. For one thing, you can reuse the Transformers outside Java EE, which is good because, even if Java EE is largely used by mainframe shops, they also use many other Java-based technologies (ESBs, ETLs, ...).
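
To illustrate what I mean by this separation, here is a rough sketch of the principle. The interface and class names below are made up for the illustration; they are not the actual LegStar or generated classes:

    // illustrative only: these are not the actual LegStar interfaces
    interface HostTransformer<T> {
        byte[] toHost(T javaValue);    // Java object -> mainframe byte layout
        T fromHost(byte[] hostBytes);  // mainframe byte layout -> Java object
    }

    public class HostProgramInvoker<T> {

        private final javax.resource.cci.ConnectionFactory connectionFactory; // pooling, transactions, security
        private final HostTransformer<T> transformer;                         // COBOL <-> Java, reusable anywhere

        public HostProgramInvoker(javax.resource.cci.ConnectionFactory cf, HostTransformer<T> transformer) {
            this.connectionFactory = cf;
            this.transformer = transformer;
        }

        public T call(T request) {
            byte[] requestBytes = transformer.toHost(request);
            byte[] replyBytes = execute(requestBytes); // the resource adapter only moves bytes
            return transformer.fromHost(replyBytes);
        }

        private byte[] execute(byte[] requestBytes) {
            // connection acquisition and interaction omitted in this sketch
            throw new UnsupportedOperationException("sketch only");
        }
    }

Because the transformer has no dependency on the connector, the same COBOL binding code can be reused from an ESB, an ETL or a plain batch program.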

I intend to support the other LegStar transports (HTTP, WebSphere MQ) in the future as well as some other transports I have in mind. There will be a different JCA Resource Adapter for each transport.

One thing you might notice is that the license has changed for these extensions. While the core LegStar project remains LGPL for the time being, the extensions use the more restrictive GPL. LGPL is often used when OSS projects start and need to gain the widest acceptance possible, while GPL is for more mature OSS projects. LegStar is 7 years old now and if you look at what Google has to say about open source mainframe integration:

You will agree with me that it has gained sufficient traction to become part of the higher class GPL projects out there.

This is all thanks to you of course since I haven't devised a scheme to get Google to better rank my sites and never paid Google a dime either :-).

Saturday, July 23, 2011

Mapping Dates from COBOL to Java

Date (and time) types are difficult to map between COBOL and Java. Here are a few comments on why this is hard.

Dates in COBOL

For a very long time, the COBOL compiler on IBM mainframes didn't know anything about dates. Developers typically used plain structures and managed date semantics by hand. The Y2K issue was largely a consequence of this situation, since the compiler couldn't solve the problem all by itself.

Recently ("recently" on the mainframe time scale means in the last 15 years or so), IBM introduced the DATE FORMAT keyword. It is used to further qualify a regular numeric or alphanumeric data item. It introduces some restrictions on what you can store in the data item but, fundamentally, the data item remains a numeric (binary, packed or zoned decimal) or an alphanumeric.

The DATE FORMAT keyword must be followed by a pattern using Y and X characters (no relationship to chromosomes) such as YYXXXX or YYYYXXXX.

  • YY designates a "windowed" date. This is relative to a century window given by the YEARWINDOW compiler option.
  • YYYY is an "expanded" date. This is a regular century + year date.

A MOVE of a windowed date to an expanded date expands it as expected. There are also intrinsic functions such as DATEVAL and UNDATE to convert between date and non-date data items.

For more details, see the programming reference in IBM COBOL documentation.

Dates in Java

If dates are messy in COBOL, well, they are even messier in Java.

At the beginning there was an overly simple java.util.Date (with a weird java.sql.Date variation with 9 methods, 6 of which are deprecated!).

Then came java.sql.Timestamp which inherits from java.util.Date but about which the documentation states: "it is recommended that code not view Timestamp values generically as an instance of java.util.Date". It is as if to say: Timestamp inherits from Date, but please don't use that fact in your programs...

Worse still came the dreaded java.util.Calendar. In this article, Joshua Bloch, a former Sun engineer, says:

As an extreme example of what not to do, consider the case of java.util.Calendar. Very few people understand its state-space -- I certainly don't -- and it's been a constant source of bugs for years.

java.util.Calendar was designed at a time (long gone now) when Java was meant to take over the programming world, so this monster was born and several often-used java.util.Date methods were deprecated as a consequence.

To further complicate things, JAXB has introduced javax.xml.datatype.XMLGregorianCalendar in order to map the XML Schema date and time types.

All in all, there are at least 5 different ways of representing a date.
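
Just to make the inventory concrete, here is a snippet that builds the same instant in all five representations (standard JDK and JAXB APIs only, nothing LegStar specific):

    import java.util.Date;
    import java.util.GregorianCalendar;
    import javax.xml.datatype.DatatypeConfigurationException;
    import javax.xml.datatype.DatatypeFactory;
    import javax.xml.datatype.XMLGregorianCalendar;

    public class FiveWaysToSayToday {
        public static void main(String[] args) throws DatatypeConfigurationException {
            Date utilDate = new Date();
            java.sql.Date sqlDate = new java.sql.Date(utilDate.getTime());
            java.sql.Timestamp timestamp = new java.sql.Timestamp(utilDate.getTime());
            GregorianCalendar calendar = new GregorianCalendar();
            calendar.setTime(utilDate);
            XMLGregorianCalendar xmlCalendar =
                    DatatypeFactory.newInstance().newXMLGregorianCalendar(calendar);
            System.out.println(utilDate + " | " + sqlDate + " | " + timestamp
                    + " | " + calendar.getTime() + " | " + xmlCalendar);
        }
    }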

Mapping COBOL dates to Java Dates

So far, LegStar has taken the rather lame approach of not attempting to map COBOL dates to Java dates automatically.

When starting from COBOL, this means you will never get a java Date/Timestamp/Calendar property, even if the corresponding COBOL data item is marked with a DATE FORMAT. Date semantics are lost in translation...

When starting from Java though, chances are that some property is a Date/Timestamp/Calendar. Actually Date/Calendar, because with Timestamp you would get "error: java.sql.Timestamp does not have a no-arg default constructor" from JAXB.

LegStar maps Java Date/Calendar properties to a COBOL PIC X(32).

At runtime, the PIC X(32) content is assumed to follow the form: "yyyy-mm-dd hh:mm:ss.fffffffff" known as the JDBC timestamp escape format.
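
Conveniently, this is the format that java.sql.Timestamp itself produces and parses, so experimenting with it takes little code (this just illustrates the format, not necessarily how LegStar implements the conversion internally):

    import java.sql.Timestamp;
    import java.util.Date;

    public class JdbcEscapeFormat {
        public static void main(String[] args) {
            // parsing the PIC X(32) content received from the mainframe
            Timestamp fromHost = Timestamp.valueOf("2011-07-23 14:30:00.123456789");
            System.out.println(fromHost.getTime());

            // producing the PIC X(32) content from a java.util.Date property
            Date someDate = new Date();
            String toHost = new Timestamp(someDate.getTime()).toString();
            System.out.println(toHost); // e.g. 2011-07-23 14:30:00.123
        }
    }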

This is rudimentary, but following this conversation, the latest LegStar has introduced a new annotation, inspired by Patrick's comment: @CobolJavaTypeAdapter(value = SomeClass.class), where SomeClass is code that a developer implements. (Here is a sample).

An extension mechanism is probably the best way to handle COBOL to Java date binding.

Sunday, April 10, 2011

Migrating off the CTG?

CTG, the CICS Transaction Gateway, is an IBM product that has been around for a long time and is largely deployed in mainframe shops.

CTG is usually used from a Java client to make outbound, synchronous calls to a CICS program. In this case, the API offered to Java clients is known as ECI (External Call Interface). Because ECI is how most users access the CTG, some of them refer to the CTG itself as ECI.

CTG is a separate product from CICS itself. Some people confuse the two (probably because the names are so close: CICS Transaction Server and CICS Transaction Gateway).

CTG runs in a Java VM that could sit on z/OS (in the Unix System Services environment) but most frequently runs off a distributed server. It is also frequently used in conjunction with WebSphere and as such, can be thought of as middleware between WebSphere and CICS.

CTG’s major strengths are transaction support and J2EE integration. It supports the JCA connector architecture and provides 2-phase commit and XA support.

Although CTG might sound like the definitive solution for CICS/java integration, a number of users are considering moving away from it.

I can see 3 reasons why this is happening:

  • The emergence of integration standards such as Web Services (and the slow erosion of older standards such as J2EE/JCA)
  • Loose coupling as a preferred architecture pattern over tight coupling
  • The Open Source revolution which drives more and more enterprises to consider Open Source software as a valid alternative to closed products

An uncomfortable position within IBM

Today, CTG is competing internally with other IBM integration features particularly when it comes to SOA. You can of course create Web Services on top of JCA but there are more efficient alternatives within IBM.

CICS itself has built-in support for SOAP Web Services. This feature, known as CICS Web Services, offers programmers a standard way to call CICS programs without the need for CTG.

WebSphere, when it is installed on z/OS, also now offers CICS integration directly (an option known as OLA for Optimized Local Adapters).

WebSphere MQ can also be used to support Web Services.

CTG seems to have less and less room within the IBM offering.

Not a serious contender for loose coupling

CTG has support for asynchronous calls but it is hard to justify the heavyweight 2-phase commit support in such a setting.

Asynchronous calls can’t support 2-phase commit and therefore could run with a much more lightweight infrastructure.

Furthermore, IBM has WebSphere MQ as its preferred asynchronous programming interface so it is hard to see how CTG could be justified to support a loosely coupled architecture.

Not Open Source

CTG is not Open Source and is a complex black box that users struggle to operate.

It is hard to tell how much this costs companies but I have met with very frustrated customers.

CTG is not free of charge but it is often bundled with WebSphere, CICS and other products and it is difficult to isolate its cost from the rest.

The true cost becomes apparent though for customers migrating off WebSphere because the bundle offer does not apply anymore.

Poor and confusing tooling

So far we have discussed the runtime aspects of CTG, but what about the development-time tooling? How do you map the target COBOL data structures to Java beans?

Here there is some confusion. This mainly stems from IBM changing names and branding several times over the last few years.

The COBOL mapping tools are usually an option buried deep into a much larger developer product:

  • Enterprise Access Builder (EAB) within VisualAge for Java (no longer supported)
  • J2C within RAD (Rational Application Developer for WebSphere Software, which used to be named WSAD for WebSphere Studio Application Developer)
  • J2C(?) within RDz (Rational Developer for z/OS which seems to have been recently renamed Rational Developer for zEnterprise)

I am not sure about J2C being part of RDz. RDz seems to cover CICS Web Services with tooling that looks quite different from J2C. I know CICS has a “Web Services Assistant”, a sort of command-line utility to map COBOL to XML; RDz might have some GUI on top of that.

Although these tools are Eclipse based, they are not open source. There is usually a per seat license to be purchased.

There are several issues with J2C:

Users complain that the Java beans mapping COBOL structures that J2C produces are awkward: the morphology does not always match the original COBOL structure (flattening), property names derived from COBOL data items are unfriendly (use of double underscores), arrays are turned into Java arrays instead of the more flexible java.util.List, there is no support for REDEFINES, …

J2C also produces Java classes that intermingle Java-to-COBOL transformation with remote execution of CICS programs. This tight coupling of language translation features with the RPC mechanism was customary 10 years ago but has proven quite limiting since then. This is because integration today is a lot more than calling CICS programs. Mainframe data can originate from files, messages, non-IBM transaction monitors, etc. Also, data might have to be processed asynchronously, as part of a flow (BPEL) or even in batch, which was completely missed by the JCA specs.

If LegStar were architected on the same principles, it would be impossible to support ESBs or ETLs for instance.

LegStar might help

To summarize the CTG situation today: it has relatively limited tooling, is tightly tied to JCA and has a weak position within the IBM offering.

For users considering Open Source and loosely coupled alternatives to CTG, LegStar could come in handy.

At LegSem, with help from some of our pioneer users, we are building a service offering to help customers move away from RAD/CTG.

Our first step is to develop migration tools that can exploit metadata produced by tools such as RAD.

We are also working on an ECI transport for users who want to adopt LegStar COBOL Transformers (to replace the J2C beans) but would rather stick to CTG to get to their mainframe.

If you are interested, drop us a message at admin (at) legsem (dot) com or use this form.

Saturday, March 26, 2011

When it comes to Mainframes, nothing is simple

IEEE Software has published an article by Belgian researchers who attempted to re-engineer a mainframe application using automated tooling they knew worked in other environments.

Although the article mentions important lessons learned, it proved once more that it is extremely difficult to reconstruct mainframe application knowledge automatically from static code analysis.

They ran into an issue that I, for one, have encountered several times: an apparently autonomous set of COBOL programs actually calls some Assembler magic that transfers control to other, hidden programs identified only at runtime. There is little you can't do on a mainframe using Assembler. One thing you can do is load executables dynamically.

During the 80's, the heyday of mainframe programming, every IT shop had one or more Assembler gurus. With much looser budget controls than today, it was possible to spend considerable time developing in-house frameworks, optimizing performance and so forth. This is not to say that it only had negative effects: these optimized Assembler routines probably saved large amounts of money by reducing CPU consumption, a major parameter on IBM bills.

Most code analysis tools are COBOL centric. One reason for that is that COBOL is not that hard to parse. To my knowledge, there are no automatic code analysis tools for MVS Assembler (DataTek has an impressive tool for Assembler to COBOL but it usually requires some level of human assistance). That's probably because the level of complexity such a tool would require is several orders of magnitude higher than for COBOL.

For a typical mainframe shop, the volume of MVS Assembler programs is much smaller than that of COBOL. That might explain why vendors would have a hard time cost-justifying a very complex Assembler code analysis tool.

So the result of such a situation is that COBOL code analysis tools fail on multi-language boundaries. Of course there are other multi-language boundaries on mainframes, for instance: COBOL to BMS or MFS macros, COBOL to JCL, COBOL to SQL, COBOL to CICS, COBOL to IMS, not to mention all the third-party tools one can find on mainframes. One thing is for sure: I have never seen a pure COBOL application.

The problem is that without reliable code analysis tools, you can't reconstruct knowledge in a bottom-up approach. What you end up doing is taking the top-down approach, chasing the last application expert on site and hoping he hasn't retired yet.

The complexity revealed by this IEEE article explains a good part of the longevity of mainframe applications. I know IBM prefers the reliability, availability, security explanation. But I know many IT shops that would have happily migrated off their mainframes if it were easy.

LegStar is also affected by this complexity of course. I often have a hard time explaining to Java developers why the Java side of LegStar works so easily while the Mainframe side, which has much less code in it, is often much harder to get to work properly...

Monday, February 21, 2011

COBOL in a flat world

We just released a version of LegStar for Talend Open Studio. Talend is a well-known ETL tool that also extends into the MDM and ESB spaces.

This is the second ETL tool we interface with. The first one was Pentaho Data Integration, a.k.a. Kettle, for which we released legstar-pdi back in November 2010.

With these experiences behind us, it becomes clear that ETL tools are row centric. This means that data flowing from one step to another needs to be modeled as a flat, fixed list of fields. This maps exactly to a classical database row, which is not surprising since, ultimately, an ETL must feed database tables somewhere.

When the starting point is COBOL though, where data structures tend to be very hierarchical, fitting into a flat model is challenging. Here are some considerations to keep in mind.

Name conflicts:

Flattening a simple hierarchy, say a group where an elementary alphanumeric ItemB sits next to a subgroup ItemC containing a single numeric ItemD, is relatively easy: it intuitively maps to [ItemB:String, ItemD:Short].

Now what happens when the same name appears twice (perfectly valid in COBOL), say an alphanumeric ItemB at the top level and another, numeric, ItemB inside a group ItemC? In a flat model, field names must be unique, so [ItemB:String, ItemB:Short] does not work. You need to disambiguate names and produce something like [ItemB:String, ItemC_ItemB:Short].
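
LegStar has its own name conflict resolution mechanism; the sketch below only illustrates the general idea with one possible strategy, prefixing an item with its parent group name when its name has already been used:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class FieldNamer {

        // each item comes as a {parentGroupName, itemName} pair, in copybook order
        static List<String> flatten(List<String[]> items) {
            Set<String> used = new HashSet<String>();
            List<String> fieldNames = new ArrayList<String>();
            for (String[] item : items) {
                String name = item[1];
                if (!used.add(name)) {
                    name = item[0] + "_" + item[1]; // e.g. ItemC_ItemB
                    used.add(name);
                }
                fieldNames.add(name);
            }
            return fieldNames;
        }
    }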

Arrays:

Assuming a COBOL alphanumeric data item, say ItemA, declared with an OCCURS 5 clause, we have a new issue since arrays are usually not handled by database schemas.

The solution is to expand each occurrence into a different field, something like [ItemA_0:String, ItemA_1:String, ItemA_2:String, ItemA_3:String, ItemA_4:String].

This is quite wasteful but there is no real alternative here.

In COBOL, the DEPENDING ON clause is often used to limit array sizes and processing time. Here we hit another limitation of flat models: they are usually fixed, in the sense that all declared fields must be present in each row.

Filling unused fields with null values is a common technique used to tell downstream steps that fields have no value.

Redefines:

The COBOL REDEFINES clause, a cousin of the C union, is another interesting challenge. Since the flat model is fixed, it can't be dynamically changed depending on the REDEFINES alternative that is actually present.

The best solution here is to manage a different set of flat fields for each combination of alternatives. This can be demonstrated with a simple example: a record with a numeric COM-SELECT field followed by a group COM-DETAIL1 holding an alphanumeric COM-NAME, redefined by a group COM-DETAIL2 holding a packed-decimal COM-AMOUNT. Here, the COM-SELECT field value determines whether COM-DETAIL1 or COM-DETAIL2 is present in the COBOL data.

This would result in 2 field sets (schemas in ETL parlance):

  • set1: [ComSelect:Short, ComName:String]
  • set2: [ComSelect:Short, ComAmount:Decimal]

The number of field sets you need to contemplate depends on the number of alternatives in each REDEFINES group (an item followed by a set of items redefining its location). Furthermore, if a COBOL structure contains multiple such REDEFINES groups, then all combinations are possible. So let's say a COBOL structure has a first group with 3 alternatives and another with 2 alternatives: there are 6 (3 x 2) possible field sets.
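
To make the counting concrete, here is a small sketch that enumerates the field sets as the cartesian product of each group's alternatives (the alternative names are made up):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class RedefinesCombinations {

        // cartesian product of the alternatives of each REDEFINES group
        static List<List<String>> fieldSets(List<List<String>> groups) {
            List<List<String>> result = new ArrayList<List<String>>();
            result.add(new ArrayList<String>());
            for (List<String> alternatives : groups) {
                List<List<String>> expanded = new ArrayList<List<String>>();
                for (List<String> partial : result) {
                    for (String alternative : alternatives) {
                        List<String> fieldSet = new ArrayList<String>(partial);
                        fieldSet.add(alternative);
                        expanded.add(fieldSet);
                    }
                }
                result = expanded;
            }
            return result;
        }

        public static void main(String[] args) {
            // a first group with 3 alternatives and a second one with 2 gives 6 field sets
            List<List<String>> groups = Arrays.asList(
                    Arrays.asList("DETAIL-A", "DETAIL-B", "DETAIL-C"),
                    Arrays.asList("PAYMENT-CASH", "PAYMENT-CARD"));
            System.out.println(fieldSets(groups).size()); // 6
        }
    }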

Fortunately, the number of REDEFINES groups and the number of alternatives in each group are usually small.

What this all means is that COBOL structures somehow need to be shoehorned into the ETL data model. This is an important difference from ESBs, where the data model is usually a much more versatile Java object.

Tuesday, February 8, 2011

See you at Maven central

Maven central has long been restricted to a few very large players such as the Apache foundation.

This has changed recently thanks to a new free offering by Sonatype for OSS projects.

LegStar has been using Maven from the very beginning and more and more users rely on the availability of artifacts in the LegStar Maven repository.

This home-grown repository is not very secure and not always available. Now that it is possible to push artifacts to Maven central, I have been busy figuring out how to take advantage of this.

The major issue is that Maven central's policy is to host artifacts only if their dependencies are also in central. And of course LegStar has dependencies on OSS libraries which are not in central.

One of the bad players is the Eclipse foundation. So far, no complete sets of Eclipse bundles are available as Maven artifacts. There are lengthy discussions going on but no results yet.

Besides Eclipse, the only other annoying dependency we have in LegStar is WebSphere MQ. This one being proprietary, of course, it will never make it to central.

In order to bootstrap the process of moving to Maven central, I have started to split the modules into separate release units. The first, very limited, release is now in Maven central. You can see the result at this location by entering the legstar keyword.

It feels good to be part of the big league.

Thursday, January 6, 2011

Batch integration with CICS, ETL integration with ESB

On mainframes, integration between batch and online processes is not simple. In an IBM CICS environment for instance, if files and databases need to be shared between batch and CICS (and they almost always do), you have to deal with a number of issues such as:

  • Batch processes might update numerous records. It is usually inefficient to commit those changes after each record change, but a large number of uncommitted changes means a large number of locks, which in turn affects online activity by slowing down response times or even leading to time-out errors.
  • Batch processes often deal with files in addition to databases. The most widely used technique to restart a batch after a failure is to back up these files before the batch starts and restore them in case of failure so that the batch can be restarted. Any online activity that touched the same files between the batch start and the failure has then worked on uncommitted data.

To avoid such issues, many mainframe shops initially segregated batch and online activities in different time frames. Typically CICS systems were brought down in the evening while batch processes were running and restarted in the morning.

Because of this strong separation between online and batch activity, very little integration between batch and CICS systems was developed. There was some degree of code reuse with COBOL copybooks but you would never see binary reuse (a batch program calling a CICS program for instance). Actually this was not even possible before IBM introduced the EXCI technology.

The mainframe nightly batch window rapidly came under pressure though. As mainframes grew larger and merged with one another, they started serving populations across many time zones; it is not uncommon for users to be all over the world. This forced mainframe shops to rearchitect their batch processes and led IBM to develop new technologies such as EXCI and VSAM Record Level Sharing.

At the same time, the need for business logic reuse between batch and CICS became more important because systems became more complex. Mainframe developers resorted to database triggers, stored procedures or WebSphere MQ triggers, probably beyond the original intent of such technologies, because there was no other way of sharing logic at the binary level.

I am seeing some similarity with the ETL versus ESB situation in the Java world. These systems do very little to integrate with one another today.

By nature, ETL is similar to batch in the sense that it deals with large amounts of records and multiple file systems and databases. ESB is closer to online as it deals with smaller transactional events.

Both ETL and ESB products claim to be transformation engines though, and indeed the term "Transformation" is widely used in both types of product documentation. So you might wonder why it is almost impossible to reuse a transformation written for an ETL in an ESB.

If IBM was forced to introduce EXCI for Batch to CICS communication, I wouldn't be surprised if users forced ETL and ESB vendors to integrate with one another more closely.

I don't mean that ETL and ESB technologies need to be tightly integrated, after all they do different jobs, but it would be nice if some level of transformation reusability could be achieved.