Team LiB

Validating an EDI Document

This next example is taken from a posting on the Sun site. A programmer needs help validating an Electronic Data Interchange (EDI) document. He needs to make sure that the String ISA always occurs before the String IEA, and that each occurs only once. He provided the sample input ISA*XX*XXXXXXXXXXXXXXX*XX*XXXXXXXXXXXXXXX*030130*0912*~IEA*1*000005900~.
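Before any regex is involved, the requirement itself can be pinned down in plain code, which gives a baseline to compare a pattern against. What follows is a minimal sketch and not the approach this section develops: the class name EdiOrderCheck and its methods are hypothetical, and it relies only on String.indexOf and String.lastIndexOf.

```java
final class EdiOrderCheck {

    /** Returns true when token occurs exactly once in text. */
    static boolean occursOnce(String text, String token) {
        int first = text.indexOf(token);
        // Exactly one occurrence means the first and last hits coincide.
        return first >= 0 && first == text.lastIndexOf(token);
    }

    /** Returns true when ISA and IEA each occur once and ISA comes first. */
    static boolean isaOnceBeforeIea(String doc) {
        return occursOnce(doc, "ISA")
                && occursOnce(doc, "IEA")
                && doc.indexOf("ISA") < doc.indexOf("IEA");
    }

    public static void main(String[] args) {
        // Sample input quoted in the posting.
        String sample =
            "ISA*XX*XXXXXXXXXXXXXXX*XX*XXXXXXXXXXXXXXX*030130*0912*~IEA*1*000005900~";
        System.out.println(isaOnceBeforeIea(sample));          // true
        // Made-up input with the two markers in the wrong order.
        System.out.println(isaOnceBeforeIea("IEA*1~ISA*XX~")); // false
    }
}
```

The check is deliberately literal: it treats ISA and IEA as plain substrings and knows nothing about EDI segment structure.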

This problem is a candidate for the push technique, because it's fairly clear that I'll have to push the data into a pattern. To simplify the problem, I decide to deal in the abstract a bit. Instead of the strings ISA and IEA, I decide to use the @ sign and the # sign. Furthermore, I decide that everything (all the stuff in between @ and #) is a number. These are just logical placeholders, for my own benefit. I want to be able to abstract away some of the messy details.

Note

If you happened to like mathematics in school, you'll notice that this is similar to the algebraic technique of factoring out messy subexpressions and referring to them using a simple variable.

Now I'll see if I can take this anywhere with the reasoning in Table 5-6.

Table 5-6: Pulling a General Regex Pattern from @45#78

| Step | What I Did | Why I Did It | Justification | Resulting Pattern |
|---|---|---|---|---|
| Step 1 | Nothing | Initial state | N/A | @45#78 |
| Step 2 | Substituted [^@] for 4 | To get a more generic description | The only distinguishing feature of 4 is that it's not @, hence [^@]. | @[^@]5#78 |
| Step 3 | Substituted [^@] for 5 | To get a more generic description | The only distinguishing feature of 5 is that it's not @. | @[^@][^@]#78 |
| Step 4 | Swapped in [^@]* for [^@][^@] | To get a more generic description | [^@]* is a superset of [^@][^@]. | @[^@]*#78 |
| Step 5 | Swapped in ([^@][^#]) for 7 | To get a more generic description | The only distinguishing feature of 7 is that it's not @ or #. | @[^@]*#([^@][^#])8 |
| Step 6 | Swapped in ([^@][^#]) for 8 | To get a more generic description | The only distinguishing feature of 8 is that it's not @ or #. | @[^@]*#([^@][^#])([^@][^#]) |
| Step 7 | Swapped in ([^@][^#])* for ([^@][^#])([^@][^#]) | To get a more generic description | ([^@][^#])* is a superset of ([^@][^#])([^@][^#]). | @[^@]*#([^@][^#])* |
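A quick way to see what the Step 7 pattern accepts is to run it through java.util.regex. The sketch below is only an illustration: the class name AbstractPatternDemo and the extra test strings are made up, and the pattern is copied verbatim from the last row of the table.

```java
import java.util.regex.Pattern;

final class AbstractPatternDemo {

    // Step 7 pattern from Table 5-6, copied verbatim.
    static final Pattern STEP_7 = Pattern.compile("@[^@]*#([^@][^#])*");

    /** Returns true when the entire input matches the Step 7 pattern. */
    static boolean matchesStep7(String input) {
        return STEP_7.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesStep7("@45#78"));   // true: the original example
        System.out.println(matchesStep7("@45@#78"));  // false: a second @ before #
        System.out.println(matchesStep7("@45#789"));  // false: odd number of characters after #
    }
}
```

The third case is worth noticing. The group ([^@][^#]) consumes two characters per repetition (one that is not @, followed by one that is not #), so Matcher.matches() only succeeds when the tail after # has an even length. A single class such as [^@#] would instead consume one character that is neither @ nor #.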

I think I've taken that about as far as I can. Now I'll start stepping away from the abstract and heading back toward what I actually wanted. Table 5-7 breaks down my reasoning.

Table 5-7: Pulling an EDI Regex out of @[^@]*#([^@][^#])*

| Step | What I Did | Why I Did It | Justification | Resulting Pattern |
|---|---|---|---|---|
| Step 8 | Nothing | Initial state | N/A | @[^@]*#([^@][^#])* |
| Step 9 | Substituted ISA for @ | To get a more specific description | @ was always just a stand-in for ISA. | ISA[^ISA]*#([^ISA][^#])* |
| Step 10 | Substituted IEA for # | To get a more specific description | # was always just a stand-in for IEA. | ISA[^ISA]*IEA([^ISA][^IEA])* |
| Step 11 | Added ?: inside ([^ISA][^IEA]) | To improve efficiency | I don't need a capturing group. | ISA[^ISA]*IEA(?:[^ISA][^IEA])* |
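The Step 11 pattern can be tried against the sample input from the posting. This is a sketch for experimentation, not a finished validator: the class name EdiPatternDemo, its helper methods, and the second input string are hypothetical, while the pattern and the sample are copied from the text.

```java
import java.util.regex.Pattern;

final class EdiPatternDemo {

    // Step 11 pattern from Table 5-7, copied verbatim.
    static final Pattern STEP_11 =
        Pattern.compile("ISA[^ISA]*IEA(?:[^ISA][^IEA])*");

    // Sample input quoted in the posting.
    static final String SAMPLE =
        "ISA*XX*XXXXXXXXXXXXXXX*XX*XXXXXXXXXXXXXXX*030130*0912*~IEA*1*000005900~";

    /** Returns true when the pattern matches somewhere in the input. */
    static boolean foundIn(String input) {
        return STEP_11.matcher(input).find();
    }

    /** Returns true when the pattern matches the entire input. */
    static boolean matchesWhole(String input) {
        return STEP_11.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(foundIn(SAMPLE));      // true
        // false: 13 characters follow IEA, and the trailing group
        // consumes them two at a time, leaving the final ~ unmatched.
        System.out.println(matchesWhole(SAMPLE));
        // Made-up input whose body contains the letter S.
        System.out.println(foundIn("ISA*SENDER*~IEA*1~")); // false
    }
}
```

Two behaviors of java.util.regex explain these results. First, Matcher.find() looks for a match anywhere in the input, while Matcher.matches() requires the whole input to match, so the choice between them changes the outcome for the sample. Second, [^ISA] is a negated character class: it matches any one character other than I, S, or A, rather than "anything that is not the string ISA". That is why the made-up input is rejected even though ISA and IEA each appear once and in the right order, and it suggests testing the pattern against realistic data before relying on it.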

