A Generic Approach to EDI Using XML

I. Overview

EDI (Electronic Data Interchange) is the mechanism by which data is transferred electronically between systems. This usually takes the form of files packed in record format. Data records are typically fixed length data streams separated by control characters or record markers.

In most cases businesses solve the problem of bringing data into or out of their systems with custom programs, and in almost all cases those programs are completely specific to the data being transferred. Moreover, the BUSINESS LOGIC is almost always embedded within these programs: a data field that has a relationship to another field within the same record (or has to be a certain type) is operated on within the custom program. Worse, the relationship to the data storage device (typically a database) is hardcoded within the program, so if a program writes data to table MYTABLE, then adding, removing, or changing a column definition in MYTABLE requires a change to the program. Finally, most of these applications are plain, standard Windows programs, i.e. a human operator is required to click a button, initiate the transfer, etc. Besides being difficult to automate, this style of solution virtually eliminates cross-platform use as well as eventual paradigm changes such as web-based transaction processing.

There is a much easier way to solve EDI issues.

A better approach to EDI should have the following attributes by definition:

  1. Ability to change with new requirements without reprogramming.
  2. Centralized business logic.
  3. Ability to modify business logic without programming.
  4. Full automation capability.

The processing approach outlined in this document meets these requirements. Alstonlabs.com has created a series of field proven applets and techniques that can reduce your EDI development time to practically zero.


II. The Approach

This approach relies on dual asynchronous state machines and (possibly) multiple process timers to allow full automation. Any data file that comes to the business can be handled with this setup; each of the steps the data undergoes is handled by applets (small applications that handle a single function). The thrust of this approach is twofold. First, none of the data needs to be handled by humans; processing can be completely automated with built-in error recovery and handling. Second, the method is multi-tier by design, allowing web-based implementation and other web-centric uses beyond the original scope of the design. This processing method meets the overall goal of keeping all business rules on the target database side and all non-target-specific processing generic.

In general what happens in this process is as follows:

(Not business specific)

  • 1. Recognize the data file; separate the records to like fixed-length units.
  • 2. Turn the fixed length data to XML.
  • 3. Load the XML to a temp table on the database.
(Business specific)
  • 4. Execute a stored procedure to handle the data; optionally generate report.
(Not business specific)
  • 5. Seek report data and create HTML report.
  • 6. Transmit the report (email, fax, web, etc.)

Note that the only business rules are in the stored procedures that handle the data and create reports (step 4); even the post-process phases aren't business specific.

The key to the process is the use of XML: data in files is parsed generically and assigned to tags, and these tags allow a generic loader to populate specific temp table columns on the target. After the load, what's sitting on the target is now a DB friendly version of the original file contents; this allows stored procedures etc. to operate on the data. Until then the meaning of the data has been ignored, thus the data itself is actually "handled" solely at the target. In this manner business specific logic is separated from the processing and handled from a central location. If you have n file processes operating simultaneously, each works the same way, and all business logic resides on the target.
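The generic loader step can be sketched as follows. This is a minimal illustration in Python, not the alstonlabs.com loader: the function names `parse_record` and `load_records`, the table name `tmp_load`, and the use of SQLite as a stand-in for the target database are all assumptions made for the example. The point it tries to show is that the loader maps tags to columns without knowing what the data means.

```python
import re
import sqlite3

# Matches one <tag>value</tag> pair; \1 requires the closing tag to match the opening tag.
FIELD = re.compile(r"<(\w+)>(.*?)</\1>")

def parse_record(line):
    """Return the tagged fields of one record as a {tag: value} dict."""
    return dict(FIELD.findall(line))

def load_records(conn, table, lines):
    """Insert each tagged record into `table`, using the tags as column names."""
    loaded = 0
    for line in lines:
        fields = parse_record(line)
        if not fields:
            continue  # skip blank or untagged lines
        # Table and tag names are interpolated directly, so they must come from
        # a trusted configuration; the values themselves are bound as parameters.
        columns = ", ".join(fields)
        marks = ", ".join("?" for _ in fields)
        conn.execute(
            f"INSERT INTO {table} ({columns}) VALUES ({marks})",
            list(fields.values()),
        )
        loaded += 1
    conn.commit()
    return loaded
```

Once the rows are in the temp table, a stored procedure (or, in this SQLite stand-in, ordinary SQL) would take over and apply the business rules.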


III. AUTOMATION: Dual Asynchronous State Machines

The handling of any data is controlled by state machines. There are two types: load-process and post-process. The load-process state machine takes care of everything from data file recognition through data file load. The post-process state machine handles any reporting or other post-processing needs. The load-process state machine is essentially a timer applet that, via scripting, starts programs when it detects files. This state machine does not communicate with the database. The post-process state machine is similar but is tied to the database via an EDI_STATE table. This table contains a snapshot of the current state, and the state machine contains the scripting necessary to control the state. The completion of a state is recorded either automatically (on completion of a program startup) or programmatically (by a stored procedure, if one is executed).

The asynchronous nature of the state machines allows the load-process to do data handling for processes not tied to a database server, as well as handle processes for multiple database servers simultaneously. In this manner the post-process state control is handled by the data target such that each implementation can be specific to the target type. In other words, it's possible to use a single load-process state machine to send data to different data servers without being "tied" to a specific server; each server would have a post-process state machine to handle its specific requirements. In addition the nature of this state machine implementation allows for cross-platform use.

Since these state machines are also applets it is possible to easily create new and custom state machine controllers. For instance a post process state machine could be required in situations where there is no data server, or a post-process state machine may be coded to run on a UNIX server. Therefore a load-process state machine could well be loading data to SQL Server running on NT as well as Oracle running on POSIX, and each post-process state machine would be specific to its environment.

These state machines are not 100% implemented by the timer applets; the timer functionality can execute a "macroscopic" state that can in turn cause other states to execute when completed. Refer to section VI.

A WORD ABOUT XML:

XML (eXtensible Markup Language) is a way of marking data with tags similar to HTML code. It is ASCII text. Records from a data file that are written as XML would have defining tags surrounding the bytefields that denote a field within the record as follows:

record 1: <col1>xyz</col1><col2>abc</col2><col3>333</col3>etc.
record 2: <col1>xde</col1><col2>abb</col2><col3>233</col3>etc.

It is rather straightforward. Strictly speaking, XML as defined by standards bodies such as the W3C is narrower than this: a well-formed XML document has a single root element, and characters such as ampersands and less-than signs must be escaped when they appear in data. Since the data discussed here is only being passed via disk files between the processing steps, we use the tagging in this looser form and allow all legal ASCII characters in field values. The trade-off is that such files are not guaranteed to be accepted by a strict, standards-conforming XML parser.
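As a rough sketch of how fixed-length bytefields become tagged data, the Python below wraps each slice of a record in its tag. The layout table, the field widths, and the name `record_to_xml` are hypothetical and chosen only to reproduce the sample records above; field values are written as-is, without escaping.

```python
# Hypothetical layout: (tag, start offset, length) for each bytefield in a record.
LAYOUT = [("col1", 0, 3), ("col2", 3, 3), ("col3", 6, 3)]

def record_to_xml(record, layout=LAYOUT):
    """Wrap each fixed-length bytefield of `record` in its configured tag."""
    return "".join(
        f"<{tag}>{record[start:start + length]}</{tag}>"
        for tag, start, length in layout
    )

print(record_to_xml("xyzabc333"))
```

Changing the layout table is all it takes to handle a different record format, which is the sense in which the conversion step is not business specific.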


IV. AUTOMATION DETAIL: Load-Process State Machine

The load-process state machine is file-centric; that is, it detects file existence based on extension. Both how often it wakes and what it does when a file is located are scriptable. The reason the load process is file-centric is that as a data file comes in, it may need to undergo a number of steps to prepare it for the generic loader, and once ready, it will need to be loaded to a known table via a given copy of the generic loader. As each of the file steps is taken, the original file is erased after the newer one is written. This is because the "data preparation" applets are simple I/O programs that read data, manipulate it, and output it.

A typical process might appear as follows:

  1. separate records
  2. turn records to XML
  3. load the XML to the database
  4. execute a stored proc to handle the data

Fig 1. Front End Automation

Refer to Fig 1. In this generic example:

  • State 1 is the "Record Separator" step, which gleans out all records of the desired processing type.
  • State 2 is the "Convert to XML" step, which converts the record data, field by field, into XML tagged data.
  • State 3 is the "Load XML to Database" step, which takes the tagged data and loads a temp table in the database.
  • State 4 is the S.P. step (stored procedure), which takes the data from the temp table, applies the necessary business rules/conversions/etc. to the data, and loads the results to the permanent tables.
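State 1 can be pictured with a very small sketch. It assumes, purely for illustration, that records are separated by a newline marker and that the records of interest are recognized by their length; the real RECSEP-style applets may use other criteria, and the name `separate_records` is hypothetical.

```python
def separate_records(data, record_length, marker="\n"):
    """Split raw file text on the record marker and keep records of the wanted length."""
    return [rec for rec in data.split(marker) if len(rec) == record_length]

raw = "HDR\nxyzabc333\nxdeabb233\nTRAILER\n"
print(separate_records(raw, 9))
```

Header, trailer, and empty records fall away because they do not have the expected length, leaving like fixed-length units for the XML conversion step.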

Now while it's possible to wake up the state machine every few seconds and execute these steps in order of appearance, this could be problematic if any process takes longer than the wake interval. The preferred method of scripting a file process is in Reverse Order, which helps prevent timing problems. In other words, the FIRST thing that happens when waking up is to try the stored procedure, then the data load, then the XML conversion, and so on. This way each step acts only on a file that already exists from an earlier step, rather than on one that may still be in the middle of being written.

For most purposes you will want to set the wakeup interval to longer than the longest processing time. Typically the loading of data takes the longest time (network traffic, inherent inefficiency of DB technology, etc.) so you would set the wakeup interval at longer than the worst case you're likely to encounter.

Here's an example scripting for a "cleared check file" process:

1) c:\test\files;*.PCC;c:\test\fl2xml.exe
2) c:\test\incoming;*.CCF;c:\test\recsep2.exe

In line 1 the "c:\test\files" directory is searched for any file matching a *.PCC file extension. If a file is found (for argument let's say that it finds "myfile.PCC") then the program "c:\test\fl2xml.exe" is executed with the *.PCC filename as an argument. Here's what the command line would look like:

c:\test\fl2xml.exe c:\test\files\myfile.PCC

Line 2 works the same way. Entries are executed in order of appearance in the script (in other words, line 1 is done first). Overall, any *.CCF file (CCF = "cleared check file") found in the "incoming" directory is processed by the RECSEP2 program (RECSEP = RECord SEParator). The resulting *.PCC file that RECSEP2 outputs to the "files" directory is then turned into XML by the FL2XML program. Since these are the only two states in the script, the states following FL2XML are "tightly coupled" (see section VI).
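One wakeup of a load-process timer driven by such a script might look like the sketch below. It is an illustration only: it assumes the `1)` and `2)` labels above are for discussion and not part of the script file, it builds the command lines without running them, and the names `parse_script` and `pending_commands` are hypothetical.

```python
import fnmatch
import os

def parse_script(text):
    """Parse 'directory;pattern;program' lines into tuples, keeping script order."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            directory, pattern, program = line.split(";")
            entries.append((directory, pattern, program))
    return entries

def pending_commands(entries, listdir=os.listdir):
    """Return the command lines one wakeup would run, in script order."""
    commands = []
    for directory, pattern, program in entries:
        for name in sorted(listdir(directory)):
            # Compare case-insensitively, as Windows file names are.
            if fnmatch.fnmatchcase(name.upper(), pattern.upper()):
                commands.append(f"{program} {directory}\\{name}")
    return commands
```

Because the later step (FL2XML) is listed first, a *.PCC file written by RECSEP2 during this wakeup is not picked up until the next one. A real timer would hand each command line to the operating system and then sleep for the wakeup interval.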

Notes:

Since a state machine by definition is simply a "dumb" device (i.e. if the state is 1 then do this, if the state is 2 then do that) and is not aware of what a given state is trying to accomplish, it can handle variation more easily. For example, if the first 2 states of a process are normally devoted to turning a data file into XML and the file provider switches to XML format, the state machine would run as before except that state 3 (load XML to server) is now the first state; the original preceding states are now moot.

It is not always necessary to define each state via the timer. The timer part of the load-process state machine controls a macroscopic function; any of the programs executed can actually chain "micro" states as needed. For instance in the above example (cleared check process) the FL2XML "fixed-length to XML" applet is set to call the XML2DB "xml to database" applet when it completes and the XML2DB applet is set to call the "xmlProcessClearedCheques" stored procedure when it completes. Each of these states is still discrete and could be fired via the state machine timer if need be. Therefore the load-process state machine is in practice a combination of state controls by the timer and process applets.


V. AUTOMATION DETAIL: Post-Process State Machine

This state machine is "target sensitive" in that to operate efficiently it is tied to a Server/Database. The typical implementation uses a table (EDI_STATE) that contains merely a process ID and a state integer. The integer is a flag bitfield allowing up to 32 states, each of which is user defined. The process ID field is also user defined; e.g. process 1 might be the cleared check file process, process 2 deals with files from the Alaska Eskimo Fishing School, and so on. In most cases 8 flags is more than sufficient:

Example Bitfields: cleared check file process

0x01 -- generate HTML report
0x02 -- report disposition (if needed)
0x04 -- undefined for now
0x08
0x10
0x20
0x40
0x80

As the timer portion wakes up, the lowest state whose flag is set is executed first, and then the remaining states for that process ID are ignored until the next wakeup.
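Picking the state to run from the flag bitfield can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes that a completed state is recorded by clearing its flag, which is one plausible reading of how completion is updated.

```python
def lowest_pending_state(flags):
    """Return the 1-based number of the lowest set flag, or 0 if no flag is set."""
    if flags == 0:
        return 0
    # flags & -flags isolates the lowest set bit; bit_length() gives its position.
    return (flags & -flags).bit_length()

def clear_state(flags, state):
    """Return the bitfield with the flag for `state` switched off."""
    return flags & ~(1 << (state - 1))

# Flags 0x01 (generate HTML report) and 0x02 (report disposition) are both set.
flags = 0x03
state = lowest_pending_state(flags)  # state 1 runs on this wakeup
flags = clear_state(flags, state)    # 0x02 is left for the next wakeup
```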

Here's part of the script for the timer that generates the HTML report. The first entry is the process ID (1 = cleared check process.) The data between the semicolons is (in order) state 1 through 32. In other words, each line in the script is a delimited list of state actions, with each state defined as the ordinal column position.

  |<-                              STATE 1                              ->| 

1;c:\test\htmrpt.exe c:\test\ccf.qry c:\test\files\ccfr c:\test\emailer.exe;;;
^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^
procid STATE 1 STATE 2 - UP

In this case, at state 1 the HTML reporter applet HTMRPT is started and told to use CCF.QRY as the query and to name any output file with a prefix of CCFR ("c:\test\files\ccfr"). HTMRPT keeps a counter; if the counter was set to 3, this results in a file named CCFR4.HTM. This file is then handed to the EMAILER applet, which reads its own INI file to determine who the report is to be emailed to. As with the discussion of macroscopic states above, we could break these steps up to be discrete if required.
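Reading one line of the post-process script can be sketched as below. The function name `parse_state_line` is hypothetical; the sketch only shows how the ordinal column position becomes the state number, with empty columns treated as undefined states.

```python
def parse_state_line(line):
    """Split 'procid;action1;action2;...' into (procid, {state number: action})."""
    procid, *actions = line.split(";")
    # The column position after the process ID is the state number (1-based).
    defined = {n: action for n, action in enumerate(actions, start=1) if action}
    return int(procid), defined

line = "1;c:\\test\\htmrpt.exe c:\\test\\ccf.qry c:\\test\\files\\ccfr c:\\test\\emailer.exe;;;"
procid, actions = parse_state_line(line)
print(procid, sorted(actions))
```

For the cleared check process only state 1 has an action, so the resulting dictionary has a single entry keyed by 1.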


VI. Macroscopic vs. Microscopic States: Coupling Issues

As discussed this EDI process allows states to be "loosely" coupled to any following states (macroscopic) or "tightly" coupled as required. The type of coupling depends on the nature of the problem to solve and somewhat on the target.

For example, SQL Server 6.5 locks pages even for a SELECT statement, so it's usually not a good idea to have two processes that could be intensively selecting or writing to the same tables simultaneously: they could collide. As a result processes that could conceivably collide should not use tight coupling. Of course the reverse is true of processes that do not write to often-accessed tables.

In the ongoing cleared check file process example, tight coupling is used because the data in this file will be run against a seldom used table and therefore unlikely to collide with other processes.

In other cases, the decision to use tight coupling will depend on what you are trying to accomplish at that state. For example processes that write to often used tables may have a tightly coupled stored procedure that runs directly after the dataload temp table is loaded; the procedure may do nothing more than set a flag in the EDI_STATE table to indicate a successful load.

In addition, processes that aren't necessarily loading data to a database or manipulating new data can be tightly coupled. For example, emailing of reports can be initiated by the report generator directly after the report is written to disk. Reports that need to be sent as a group, however, would probably be best served by loose coupling.


VII. EDI_STATE Table

An EDI_STATE table in the simplest form is merely a Process ID and State integer. In the creation of the table the assumption is that states apply to entire processes, but this is not required. States can also be used with discrete objects inside the system. For example you can apply an internal ID as the ProcessID and track/control the processing of entire related objects (e.g. a credit card tracking system would typically have information in multiple tables yet still maintain an entity as "a card.")

The most useful variation would likely be the addition of "next anticipated state" and "last used state." When used properly this allows you to peek at a process and see what it is doing.

For example, if using EDI_STATE to control flow of a credit card application and in state 5 a credit check is run, this is the state configuration:

last_used = 4
current = 5
next_expected = 6

In the case of a process variation, such as credit already being on file, state 5 is not required. An application under this condition would then show up as:

last_used = 4
current = 6
next_expected = 7

This "signature" difference can be used to make it simpler to track process flow.
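The three-column variation can be modeled with a small Python sketch. The class and method names are hypothetical, and the sketch assumes states are numbered sequentially as in the credit card example, so that a gap between last_used and current reveals which states were skipped.

```python
from dataclasses import dataclass

@dataclass
class ProcessState:
    last_used: int
    current: int
    next_expected: int

    def advance(self, new_state):
        """Enter `new_state`; the state after it is assumed to be expected next."""
        self.last_used = self.current
        self.current = new_state
        self.next_expected = new_state + 1

    def skipped_states(self):
        """Return the states jumped over between last_used and current."""
        return list(range(self.last_used + 1, self.current))

normal = ProcessState(last_used=3, current=4, next_expected=5)
normal.advance(5)    # credit check is run
on_file = ProcessState(last_used=3, current=4, next_expected=5)
on_file.advance(6)   # credit already on file, state 5 not required
```

Here `normal.skipped_states()` is empty while `on_file.skipped_states()` contains 5, which is the "signature" difference in a form a monitoring query or report could test for.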


VIII. Generic Process Construction Issues

The applets discussed thus far are designed to be used in as many copies as required. For example, there is but one (1) FL2XML program, whose job is to convert fixed-length data to XML. To use it, you copy the EXE and its associated configuration files into the directory structure that will be used to process a given file type. When executed, FL2XML looks at the configuration file, which details how the fixed-length bytefields are to be broken up and what tags to associate with them. This configuration differs from one file type to the next. Which copy of FL2XML runs at a given time is controlled by the startup: c:\dir1\fl2xml.exe will start the copy of FL2XML in the c:\dir1\ directory, whereas c:\data21\fl2xml.exe will start the copy in the c:\data21\ directory. The proper place to put the configuration file for each copy is the directory that contains the copy; thus even if the two copies are executed simultaneously, the copy in c:\dir1\ will be doing a different job than the one in c:\data21\.
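The "configuration lives beside the copy" rule amounts to deriving the configuration path from the path the applet was started with. A minimal sketch, in which the file name `fl2xml.ini` and the function `config_path` are assumptions (the actual configuration file name is not given here):

```python
import ntpath  # Windows path rules, usable on any platform

def config_path(applet_path, name="fl2xml.ini"):
    """Return the configuration file located in the same directory as the applet copy."""
    return ntpath.join(ntpath.dirname(applet_path), name)

print(config_path("c:\\dir1\\fl2xml.exe"))
print(config_path("c:\\data21\\fl2xml.exe"))
```

Each copy resolves to its own configuration file, so the same executable does a different job depending only on where it was started from.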

Since all of the applets constructed to date are done in this same manner, process control is implemented via multiple copies and location control of the individual copies.

Note also that control of applet instantiation in this manner is not only very network friendly but also cross-platform friendly. Applets do not have to be located in clusters; the only hard and fast rule is that a given applet needs to have the associated configuration file located in the same directory as the applet. Admittedly the logistics of clustering applet copies in one single directory devoted to processing a file type are simpler, but this is not required for the process to work.


IX. Process Change -- Applet Construction

The processing discussed in this document has already been implemented and is working. However, please note that this process is engineered to be as flexible as possible and to grow as needed. Because the processing elements are all applets, the proper course of action at any given time is to create applets as needed that do specific tasks. For example, there is no existing applet specifically meant for FTPing data. Should the need arise, the new applet should be limited strictly to the functionality of FTP and not be specific to what is being transferred.

As this process is broken into component parts, it is NOT required that all applets run on the same machine, against the same database, or even the same operating system.

In general, applets should follow these rules:

  1. Applets must be command line capable and never require user input. By definition automation is the process making decisions; as soon as you require human input there is no automation. The alstonlabs.com applets also are coded such that for testing purposes they can present a user interface, but the UI is not required for them to run.
  2. Applets should be able to handle either multiple command line parameters or have an associated INI file that can be scripted. This allows an applet to be flexible.
  3. Applets must clean up after themselves. If an applet wakes up to find and read file A and writes file B then it should do something with file A so that A isn't processed again. Delete it or archive it, whatever, just so long as the process isn't duplicated.
  4. Applets must attempt to be minimally recoverable. When possible this means that deletion or removal of an input file happens at the same time as the closing of an output file. If the power plug gets pulled in the middle of what an applet is creating, when the state resumes the applet will still find the input file. If however the input file is deleted after it is read and the applet does something and the power plug is pulled before the output is written, then when the state is resumed, there is not only no output file, but the input file is missing as well leaving no way to recreate this output.
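Rules 3 and 4 can be illustrated with a short Python sketch of a "data preparation" applet. It is not taken from the alstonlabs.com applets: the function name, the `.tmp` suffix, and the write-then-rename technique are assumptions chosen to show one way of removing the input only after the output is complete.

```python
import os

def process_file(src, dst, transform):
    """Write transform(line) for each line of src to dst; remove src only when dst is complete."""
    tmp = dst + ".tmp"  # a name the next state does not look for
    with open(src, "r") as fin, open(tmp, "w") as fout:
        for line in fin:
            fout.write(transform(line))
        fout.flush()
        os.fsync(fout.fileno())  # ask the OS to push the output to disk
    os.replace(tmp, dst)  # the finished output appears under its final name in one step
    os.remove(src)        # only now is the input no longer needed
```

If the applet is interrupted before the rename, the input file is still in place and the state can simply be run again; a leftover `.tmp` file is overwritten on the next attempt.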


Copyright ©2000 by AlstonLabs.com - ALL RIGHTS RESERVED