I. Overview
EDI (Electronic Data Interchange) is the mechanism by which data is transferred electronically between systems. This usually takes the form of files packed in record format. Data records are typically fixed length data streams separated by control characters or record markers.
In most cases businesses solve the problem of bringing data into or out of their system via custom programs or applications. In almost all cases the programs are completely custom to the data being transferred. Moreover, in almost all cases the BUSINESS LOGIC is embedded within these programs. What this usually means is that a data field that has a relationship to another field within the same record (or has to be a certain type) is operated on within the custom program. Worse, the relationship to the data storage device (typically a database) is hardcoded within the program. Ultimately what this means is that if a program writes data to table MYTABLE, then adding, subtracting, or changing a column definition in MYTABLE requires a change to the program. Finally, most of these applications are also plain and standard Windows programs -- i.e. a human operator is required to click a button, initiate transfer, etc. Besides being difficult to automate, this style of program solution virtually eliminates cross platform use as well as eventual paradigm changes like web based transaction processing.
There is a much easier way to solve EDI issues.
A better approach to EDI should have the following attributes by definition:
The processing approach outlined in this document meets these requirements.
Alstonlabs.com has created a series of field proven applets and
techniques that can reduce your EDI development time to practically zero.
This approach relies on dual asynchronous state
machines and (possibly) multiple process timers to allow full automation.
Any data file
that comes to the business can be handled with this setup; each of the
steps data undergoes is handled by applets (small applications that handle
a single function.) The thrust of this approach is twofold: First, none of
the data needs to be handled by humans and can be completely automated with
built-in error recovery / handling. Second, this method is multi-tier by
design thus allowing web-based implementation and other web-centric uses
beyond the original scope of the design. This processing method meets the
overall goal of keeping all business rules at the target database side and
all non target-specific processing generic.
In general what happens in this process is as follows:
(Not business specific)
Note that the only business rules are in the stored procedures that handle the
data and create reports (step 4); even the post-process phases aren't business
specific.
The key to the process is the use of XML: data in files is parsed generically
and assigned to tags, and these tags allow a generic loader to populate
specific temp table columns on the target. After the load, what's sitting on
the target is now a DB friendly version of the original file contents; this
allows stored procedures etc. to operate on the data. Until then the meaning
of the data has been ignored, thus the data itself is actually "handled"
solely at the target. In this manner business specific logic is separated
from the processing and handled from a central location. If you have n
file processes operating simultaneously, each works the same way, and all
business logic resides on the target.
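As a rough illustration of the generic load described above, the following Python sketch reads tagged records and inserts each one into a temp table whose column names match the tags. It is not the actual loader applet: the table name tmp_load, the regular expression, and the use of an in-memory SQLite database as the "target" are all assumptions made so the example can run on its own.

```python
import re
import sqlite3

# Matches one <tag>value</tag> pair; tag names are assumed to be simple words.
FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_record(line):
    """Return the tagged fields of one record as a {tag: value} dict."""
    return {tag: value for tag, value in FIELD.findall(line)}

def load_records(conn, table, lines):
    """Insert each tagged record into `table`, using tag names as column names.

    Table and tag names are interpolated into the SQL text, so they are
    assumed to come from trusted configuration; values are bound as parameters.
    """
    count = 0
    for line in lines:
        fields = parse_record(line)
        if not fields:
            continue  # skip blank or untagged lines
        columns = ", ".join(fields)
        marks = ", ".join("?" for _ in fields)
        conn.execute(
            f"INSERT INTO {table} ({columns}) VALUES ({marks})",
            list(fields.values()),
        )
        count += 1
    return count

# Stand-in target: an in-memory database with a hypothetical temp table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_load (col1 TEXT, col2 TEXT, col3 TEXT)")
records = [
    "<col1>xyz</col1><col2>abc</col2><col3>333</col3>",
    "<col1>xde</col1><col2>abb</col2><col3>233</col3>",
]
loaded = load_records(conn, "tmp_load", records)
```

Note that the loader never interprets the values; anything business specific would happen afterwards, in procedures running against the temp table.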
The handling of any data is controlled by state machines. There are two
types: load-process and post-process. The load-process state machine takes
care of data file recognition through data file load. Post-process handles
any reporting or other post-processing needs. Generally the
load-process state machine is a timer applet that, via scripting,
starts programs when it detects matching files. This state machine does not
communicate with the database. The
post-process state machine is similar but tied to the database via an
EDI_STATE table. This table contains a snapshot of the current state and the
state machine contains the scripting necessary to control the state. The
completion of a state is updated either automatically (completion of a program
startup) or programmatically (by a stored proc if this is executed.)
The asynchronous nature of the state machines allows the load-process to do
data handling for processes not tied to a database server, as well as handle
processes for multiple database servers simultaneously. In this manner the
post-process state control is handled by the data target such that each
implementation can be specific to the target type. In other words, it's
possible to use a single load-process state machine to send data to different
data servers without being "tied" to a specific server; each server would have
a post-process state machine to handle its specific requirements. In addition
the nature of this state machine implementation allows for cross-platform
use.
Since these state machines are also applets it is possible to easily create
new and custom state machine controllers. For instance a post process state
machine could be required in situations where there is no data server, or a
post-process state machine may be coded to run on a UNIX server. Therefore a
load-process state machine could well be loading data to SQL Server running
on NT as well as Oracle running on POSIX, and each post-process state machine
would be specific to its environment.
These state machines are not 100% implemented by the timer applets; the timer
functionality can execute a "macroscopic" state that can in turn cause other
states to execute when completed. Refer to section VI.
XML (eXtensible Markup Language) is a way of marking data
with tags similar to HTML code. It is ASCII text. Records from a data file that
are written as XML would have defining tags surrounding the bytefields that
denote a field within the record as follows:
It is rather straightforward. Some may point out that XML is more narrowly
defined by standards bodies such as the W3C, whose specification reserves certain
ASCII characters (ampersands, angle brackets, etc.) for markup and requires them
to be escaped. Since the data discussed here is passed via disk files between
applets that all follow the same tagging convention, rather than to general-purpose
XML parsers, we are free to use XML-style tagging in its simplest form
and allow passage of all legal ASCII characters.
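To make the tagging concrete, here is a minimal Python sketch that cuts a fixed-length record into bytefields and wraps each one in a tag. The layout table (tag, start offset, length) is a made-up example and does not represent the configuration format of any existing applet.

```python
# Hypothetical layout: (tag, start offset, length) for each bytefield.
LAYOUT = [("col1", 0, 3), ("col2", 3, 3), ("col3", 6, 3)]

def fixed_to_tagged(record, layout=LAYOUT):
    """Wrap each bytefield of a fixed-length record in its tag."""
    parts = []
    for tag, start, length in layout:
        value = record[start:start + length]
        # Values are written as-is, with no character escaping.
        parts.append(f"<{tag}>{value}</{tag}>")
    return "".join(parts)

tagged = fixed_to_tagged("xyzabc333")
```

Changing the file format then means changing only the layout table, not the code that does the tagging.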
The load-process state machine is file-centric, that is, it detects file
existence based on extension. It is scriptable in how often it wakes as well
as what to do when a file is located. The reason the load process is file-centric
is that as a data file comes in, it may need to undergo a number of
steps to prepare it for the generic loader, and once ready, it will need to
be loaded to a known table via a given copy of the generic loader. As each
file step is taken, the original file is erased after the newer one is
written. This is because the "data preparation" applets are simple I/O
programs that read data, manipulate it, and output it.
A typical process might appear as follows:
II. The Approach
(Business specific)
(Not business specific)
III. AUTOMATION: Dual Asynchronous State Machines
A WORD ABOUT XML:
record 1: <col1>xyz</col1><col2>abc</col2><col3>333</col3>etc.
record 2: <col1>xde</col1><col2>abb</col2><col3>233</col3>etc.
IV. AUTOMATION DETAIL: Load-Process State Machine
Refer to Fig 1. In this generic example:
Now while it's possible to wake up the state machine every few seconds and execute one of these steps in order of appearance, this could be problematic if any process takes longer than the wake interval. The preferred method of scripting a file process is in reverse order, which helps prevent timing problems. In other words, the FIRST thing that happens when waking up is to try the stored procedure, then the data load, then the XML conversion, and so on. This ensures that the file in question actually exists before a step acts on it.
For most purposes you will want to set the wakeup interval to longer than the longest processing time. Typically the loading of data takes the longest time (network traffic, inherent inefficiency of DB technology, etc.) so you would set the wakeup interval at longer than the worst case you're likely to encounter.
Here's an example scripting for a "cleared check file" process:
1) c:\test\files;*.PCC;c:\test\fl2xml.exe
2) c:\test\incoming;*.CCF;c:\test\recsep2.exe
In line 1 the "c:\test\files" directory is searched for any file matching a *.PCC file extension. If a file is found (for argument let's say that it finds "myfile.PCC") then the program "c:\test\fl2xml.exe" is executed with the *.PCC filename as an argument. Here's what the command line would look like:
c:\test\fl2xml.exe c:\test\files\myfile.PCC
Line 2 works the same way. Execution follows the order of appearance in the script (in other words, line 1 is done first). Overall, any *.CCF file (CCF = "cleared check file") found in the "incoming" directory is processed by the RECSEP2 program (RECSEP = RECord SEParator). The resulting *.PCC file that RECSEP2 outputs to the "files" directory is then turned into XML via the FL2XML program. Since these are the only two states in the script, the states following FL2XML are "tightly coupled" (see section VI).
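The way a script line is interpreted can be sketched in Python as follows. This is only an illustration of the "directory;pattern;program" format shown above, not the timer applet itself; it assumes the leading "1)" and "2)" are labels for this document rather than part of the script, and it uses a throwaway directory in place of c:\test\files so that it can run anywhere.

```python
import glob
import os
import tempfile

def parse_script_line(line):
    """Split 'directory;pattern;program' into its three parts."""
    directory, pattern, program = line.strip().split(";")
    return directory, pattern, program

def pending_commands(script_lines):
    """Return the command lines one wakeup would run, in script order."""
    commands = []
    for line in script_lines:
        directory, pattern, program = parse_script_line(line)
        # Every file matching the pattern becomes one program invocation,
        # with the full file path as the only argument.
        for path in sorted(glob.glob(os.path.join(directory, pattern))):
            commands.append([program, path])
    return commands

# Demonstration: one empty *.PCC file in a temporary "files" directory.
with tempfile.TemporaryDirectory() as files_dir:
    open(os.path.join(files_dir, "myfile.PCC"), "w").close()
    script = [f"{files_dir};*.PCC;fl2xml.exe"]
    commands = pending_commands(script)
```

A real timer would hand each command to the operating system (for example with subprocess.run) and then sleep until the next wakeup.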
Notes:
Since a state machine by definition is simply a "dumb" device (i.e. if the state is 1 then do this, if the state is 2 then do that) and not aware of what a given state is trying to accomplish, it can handle variation more easily. For example, if normally the first 2 states of any process are devoted to turning a data file into XML and the file provider switches to XML format, the state machine would run as before except that state 3 (load XML to server) is now the first state; the original preceding states are now moot.
It is not always necessary to define each state via the timer. The timer part
of the load-process state machine controls a macroscopic function; any of the
programs executed can actually chain "micro" states as needed. For instance in
the above example (cleared check process) the FL2XML "fixed-length to XML"
applet is set to call the XML2DB "xml to database" applet when it completes
and the XML2DB applet is set to call the "xmlProcessClearedCheques" stored
procedure when it completes. Each of these states is still discrete and could
be fired via the state machine timer if need be. Therefore the load-process
state machine is in practice a combination of state controls by the timer and
process applets.
V. AUTOMATION DETAIL: Post-Process State Machine
This state machine is "target sensitive" in that to operate efficiently it is tied to a Server/Database. The typical implementation uses a table (EDI_STATE) that contains merely a process ID and a state integer. The integer is a flag bitfield allowing up to 32 states, each of which is user defined. The process ID field is also user defined; e.g. process 1 might be the cleared check file process, process 2 deals with files from the Alaska Eskimo Fishing School, and so on. In most cases 8 flags are more than sufficient:
Example Bitfields: cleared check file process
0x01 -- generate HTML report
0x02 -- report disposition (if needed)
0x04 -- undefined for now
0x08
0x10
0x20
0x40
0x80
As the timer portion wakes up, the lowest nonzero state is executed first, and then the states for that process ID are ignored until the next wakeup.
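One way to pick "the lowest nonzero state" out of a flag bitfield is to isolate its lowest set bit. The Python sketch below assumes that a set bit means the state is pending; the function names are illustrative only.

```python
def lowest_pending_state(flags):
    """Return the lowest set bit of a 32-bit flag field, or 0 if none is set."""
    flags &= 0xFFFFFFFF          # keep only the 32 flags the table allows
    return flags & -flags        # two's-complement trick: isolates the lowest 1 bit

def state_number(bit):
    """Convert a single-bit flag (0x01, 0x02, 0x04, ...) to its 1-based state number."""
    return bit.bit_length()

# Report generation (0x01) and report disposition (0x02) are both pending.
pending = 0x01 | 0x02
first = lowest_pending_state(pending)
```

Here first is 0x01, so the HTML report would be generated on this wakeup and the disposition state would wait for the next one.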
Here's part of the script for the timer that generates the HTML report. The first entry is the process ID (1 = cleared check process.) The data between the semicolons is (in order) state 1 through 32. In other words, each line in the script is a delimited list of state actions, with each state defined as the ordinal column position.
|<- STATE 1 ->|
1;c:\test\htmrpt.exe c:\test\ccf.qry c:\test\files\ccfr c:\test\emailer.exe;;;
^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^
procid STATE 1 STATE 2 - UP
In this case at state 1 HTML reporter applet HTMRPT is started and told to use
CCF.QRY as the query, then name any output file with a prefix of CCFR
("c:\test\files\ccfr.") Since HTMRPT has a counter, if it was set to 3 then
this will result in a file named CCFR4.HTM. This file is then handed to the
EMAILER applet which will then read its own INI file to determine who the
report is to be emailed to. As above with the discussion of macroscopic states
we could break these steps up to be discrete if required.
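The layout of a post-process script line can be illustrated with a short Python sketch that splits the line on semicolons and looks up the action by ordinal position. The function names are hypothetical; only the line format comes from the example above.

```python
def parse_state_line(line):
    """Split 'procid;action1;action2;...' into (process id, list of actions)."""
    fields = line.rstrip("\n").split(";")
    return int(fields[0]), fields[1:]

def action_for_state(actions, state_number):
    """Return the command for a 1-based state number, or None if none is scripted."""
    if 1 <= state_number <= len(actions) and actions[state_number - 1]:
        return actions[state_number - 1]
    return None

line = r"1;c:\test\htmrpt.exe c:\test\ccf.qry c:\test\files\ccfr c:\test\emailer.exe;;;"
proc_id, actions = parse_state_line(line)
report_command = action_for_state(actions, 1)   # the HTMRPT command line
unused = action_for_state(actions, 2)           # None: state 2 has no action yet
```

Empty positions between semicolons simply mean that no action is scripted for that state.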
VI. Macroscopic vs. Microscopic States: Coupling Issues
As discussed this EDI process allows states to be "loosely" coupled to any following states (macroscopic) or "tightly" coupled as required. The type of coupling depends on the nature of the problem to solve and somewhat on the target.
For example, SQL Server 6.5 locks pages even for a SELECT statement, so it's usually not a good idea to have two processes that could be intensively selecting or writing to the same tables simultaneously: they could collide. As a result processes that could conceivably collide should not use tight coupling. Of course the reverse is true of processes that do not write to often-accessed tables.
In the ongoing cleared check file process example, tight coupling is used because the data in this file will be run against a seldom used table and therefore unlikely to collide with other processes.
In other cases, the decision to use tight coupling will depend on what you are trying to accomplish at that state. For example processes that write to often used tables may have a tightly coupled stored procedure that runs directly after the dataload temp table is loaded; the procedure may do nothing more than set a flag in the EDI_STATE table to indicate a successful load.
In addition, processes that aren't necessarily loading data to a database or
manipulating new data can be tightly coupled. For example emailing of reports
can be initiated by the report generator directly after the report is written
to disk. Reports that need to be sent as a group, however, would probably be
best served with loose coupling.
VII. EDI_STATE Table
An EDI_STATE table in the simplest form is merely a Process ID and State integer. In the creation of the table the assumption is that states apply to entire processes, but this is not required. States can also be used with discrete objects inside the system. For example you can apply an internal ID as the ProcessID and track/control the processing of entire related objects (e.g. a credit card tracking system would typically have information in multiple tables yet still maintain an entity as "a card.")
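A minimal sketch of such a table, using Python with an in-memory SQLite database as a stand-in for the real target, might look like this. The column names process_id and state and the helper functions are assumptions; on a real server the updates would more likely live in stored procedures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EDI_STATE (process_id INTEGER PRIMARY KEY, state INTEGER NOT NULL)"
)
# Process 1 (e.g. the cleared check file process) starts with no flags set.
conn.execute("INSERT INTO EDI_STATE (process_id, state) VALUES (1, 0)")

def set_state(conn, process_id, bit):
    """Turn one flag bit on for a process."""
    conn.execute(
        "UPDATE EDI_STATE SET state = state | ? WHERE process_id = ?",
        (bit, process_id),
    )

def clear_state(conn, process_id, bit):
    """Turn one flag bit off for a process (e.g. when that state completes)."""
    conn.execute(
        "UPDATE EDI_STATE SET state = state & ? WHERE process_id = ?",
        (~bit, process_id),
    )

def current_state(conn, process_id):
    """Read the flag integer for a process."""
    row = conn.execute(
        "SELECT state FROM EDI_STATE WHERE process_id = ?", (process_id,)
    ).fetchone()
    return row[0]

set_state(conn, 1, 0x01)      # report generation requested
set_state(conn, 1, 0x02)      # report disposition requested
clear_state(conn, 1, 0x01)    # report generation finished
remaining = current_state(conn, 1)
```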
The most useful variation would likely be the addition of "next anticipated state" and "last used state." When used properly this allows you to peek at a process and see what it is doing.
For example, if using EDI_STATE to control flow of a credit card application and in state 5 a credit check is run, this is the state configuration:
last_used = 4
current = 5
next_expected = 6
In the case of a process variation, such as when credit is already on file, state 5 is not required. An application under this condition would then show up as:
last_used = 4
current = 6
next_expected = 7
This "signature" difference can be used to make it simpler to track process
flow.
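The "signature" idea can be sketched in Python by comparing last_used with current. This assumes states are numbered sequentially, as in the credit card example; the StateRow class and skipped_states function are illustrative names, not part of any existing applet.

```python
from dataclasses import dataclass

@dataclass
class StateRow:
    last_used: int
    current: int
    next_expected: int

def skipped_states(row):
    """States between last_used and current that were never entered."""
    return list(range(row.last_used + 1, row.current))

normal = StateRow(last_used=4, current=5, next_expected=6)
on_file = StateRow(last_used=4, current=6, next_expected=7)

normal_skips = skipped_states(normal)     # nothing skipped
on_file_skips = skipped_states(on_file)   # state 5 (credit check) was skipped
```

A monitoring query could use the same comparison to list every application that bypassed the credit check.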
VIII. Generic Process Construction Issues
Applets discussed thus far in this process are designed to be used in as many copies as required. For example, there is only one FL2XML program, whose job is to convert fixed length data to XML. To use it, you copy the EXE and the associated configuration files into the directory structure that will be used to process a given file type. When executed, FL2XML looks at the configuration file that details how the fixed length bytefields are to be broken up and what tags to associate with these bytefields. This configuration changes from one file type to the next. The particular copy of FL2XML that is run at a given time is controlled by the startup: c:\dir1\fl2xml.exe will start the copy of FL2XML in the c:\dir1\ directory, whereas c:\data21\fl2xml.exe will start the copy in the c:\data21\ directory. The proper place to put the configuration file for each copy is the directory that contains the copy; thus even if the two copies are executed simultaneously, the copy in c:\dir1\ will be doing a different job than the one in c:\data21\.
Since all of the applets constructed to date are done in this same manner, process control is implemented via multiple copies and location control of the individual copies.
Note also that control of applet instantiation in this manner is not only very
network friendly but also cross-platform friendly. Applets do not have to be
located in clusters; the only hard and fast rule is that a given applet needs
to have the associated configuration file located in the same directory as the
applet. Admittedly the logistics of clustering applet copies in one single
directory devoted to processing a file type are simpler, but this is not
required for the process to work.
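The "configuration lives beside the applet" rule can be illustrated with a few lines of Python. The assumption that the configuration file shares the applet's base name and uses an .ini extension is made only for this sketch; ntpath is used so that the Windows-style paths are handled the same way on any platform.

```python
import ntpath

def config_path_for(applet_path, extension=".ini"):
    """Return the path of the configuration file that sits beside an applet."""
    directory, filename = ntpath.split(applet_path)
    stem, _ = ntpath.splitext(filename)
    return ntpath.join(directory, stem + extension)

# Two copies of the same applet resolve to two different configurations.
first_copy = config_path_for(r"c:\dir1\fl2xml.exe")
second_copy = config_path_for(r"c:\data21\fl2xml.exe")
```

Because each copy derives its configuration from its own location, starting a different copy is all it takes to run a different job.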
IX. Process Change -- Applet Construction
The processing discussed in this document has already been implemented and is working. However, please note that this process is engineered to be as flexible as possible and to grow as needed. Because the processing elements are all applets, the proper course of action at any given time is to create applets as needed that do specific tasks. For example, there is no existing applet specifically meant to FTP data; should the need arise, the new applet should be limited strictly to FTP functionality and not be specific to what is being transferred.
As this process is broken into component parts, it is NOT required that all applets run on the same machine, against the same database, or even the same operating system.
In general, applets should follow these rules: