5.3 HTML-Related Examples

In Chapter 2, we saw an extended example that converted raw text to HTML (see Section 2.3.6), including regular expressions to pluck out email addresses and http URLs from the text. In this section, we'll do a few other HTML-related tasks.

5.3.1 Matching an HTML Tag

It's common to see <[^>]+> used to match an HTML tag. It usually works fine, such as in this snippet of Perl that strips tags:

$html =~ s/<[^>]+>//g;

However, it matches improperly if the tag has '>' within it, as with this perfectly valid HTML: <input name=dir value=">">. Although it's not common or recommended, HTML allows a raw '<' and '>' to appear within a quoted tag attribute. Our simple <[^>]+> doesn't allow for that, so we must make it smarter.
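The failure is easy to reproduce. The following is a minimal sketch in Python rather than Perl (the pattern itself is unchanged); the names NAIVE_TAG, html, and stripped, and the sample string, are assumptions made for illustration:

```python
import re

# The naive tag pattern from the text: "<", then anything but ">", then ">".
NAIVE_TAG = re.compile(r"<[^>]+>")

# Sample input with a ">" inside a quoted attribute value.
html = 'Dir: <input name=dir value=">"> <b>done</b>'

# The match for the <input> tag stops at the first ">", the one inside
# the quotes, so the tail of the tag is left behind in the output.
stripped = NAIVE_TAG.sub("", html)
print(repr(stripped))  # 'Dir: "> done'
```

The stray `">` in the result is the part of the tag that the naive pattern never saw.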

Allowed within the '<···>' are quoted sequences, and "other stuff" characters that may appear unquoted. This includes everything except '>' and quotes. HTML allows both single- and double-quoted strings. It doesn't allow embedded quotes to be escaped, which allows us to use simple regexes "[^"]*" and '[^']*' to match them.

Putting these together with the "other stuff" regex [^'">], we get:

<("[^"]*"|'[^']*'|[^'">])*>

That may be a bit confusing, so how about the same thing shown with comments in a free-spacing mode:

     <                    #  Opening "<"
        (                 #    Any amount of . . .
           "[^"]*"        #      double-quoted string,
           |              #      or . . .
           '[^']*'        #      single-quoted string,
           |              #      or . . .
           [^'">]         #      "other stuff"
        )*                #
     >                    #  Closing ">"

The overall approach is quite elegant, as it treats each quoted part as a unit, and clearly indicates what is allowed at any point in the match. Nothing can be matched by more than one part of the regex, so there's no ambiguity, and hence no worry about unintended matches "sneaking in," as with some earlier examples.
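To try the pattern out, here is a minimal sketch in Python, where re.VERBOSE plays the role of free-spacing mode. The names TAG and html and the sample string are assumptions for illustration, and the group is written as non-capturing because the captured text isn't needed:

```python
import re

TAG = re.compile(r"""
    <                  # opening "<"
    (?:                # any amount of ...
        "[^"]*"        #   double-quoted string,
      | '[^']*'        #   or single-quoted string,
      | [^'">]         #   or one "other stuff" character
    )*
    >                  # closing ">"
""", re.VERBOSE)

html = 'Dir: <input name=dir value=">"> <b>done</b>'

# Each quoted string is consumed as a unit, so the ">" inside the
# attribute value no longer ends the tag early.
print(TAG.findall(html))
print(repr(TAG.sub("", html)))  # 'Dir:  done'
```

The quoted ">" is now treated as part of the attribute value, so the whole <input> tag is removed in one piece.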

Notice that * rather than + is used within the quotes of the first two alternatives? A quoted string may be empty (e.g., 'alt=""'), so * is used within each pair of quotes to reflect that. But don't use * or + in the third alternative, as the [^'">] is already directly subject to a quantifier via the wrapping (···)*. Adding another quantifier, yielding an effective ([^'">]+)*, could cause a very rude surprise that I don't expect you to understand at this point; it's discussed in great detail in the next chapter (see Section 6.1.4).

One thought about efficiency when used with an NFA engine: since we don't use the text captured by the parentheses, we can change them to non-capturing parentheses (see Section 3.4.5.2). And since there is indeed no ambiguity among the alternatives, if it turns out that the final > can't match when it's tried, there's no benefit going back and trying the remaining alternatives. Where one of the alternatives matched before, no other alternative can match now from the same spot. So, it's okay to throw away any saved states, and doing so affords a faster failure when no match can be had. This can be done by using (?>···) atomic grouping instead of the non-capturing parentheses (or a possessive star to quantify whichever parentheses are used).

5.3.2 Matching an HTML Link

Let's say that now we want to match sets of URL and link text from a document, such as pulling the URL and the link text from:

     ···<a href="http://www.oreilly.com">O'Reilly And Associates</a>···

Because the contents of an <A> tag can be fairly complex, I would approach this task in two parts. The first is to pluck out the "guts" of the <A> tag, along with the link text; the second is to pluck the URL itself from those <A> guts.

A simplistic approach to the first part is a case-insensitive, dot-matches-all application of <a\b([^>]+)>(.*?)</a>, which features the lazy star quantifier. This puts the <A> guts into $1 and the link text into $2. Of course, as earlier, instead of [^>]+ I should use what we developed in the previous section. Having said that, I'll continue with this simpler version, for the sake of keeping that part of the regex shorter and cleaner for the discussion.

Once we have the <A> guts in a string, we can inspect them with a separate regex. In them, the URL is the value of the href attribute. HTML allows spaces on either side of the equal sign, and the value can be quoted or not, as described in the previous section. A solution is shown as part of this Perl snippet to report on links in the variable $Html:

     # Note: the regex in the while(...) is overly simplistic; see text for discussion
     while ($Html =~ m{<a\b([^>]+)>(.*?)</a>}ig)
     {
       my $Guts = $1; # Save results from the match above, to their own . . .
       my $Link = $2; # . . . named variables, for clarity below.

       if ($Guts =~ m{
                      \b HREF           #  "href" attribute
                      \s* = \s*         #  "=" may have whitespace on either side
                      (?:               #  Value is···
                        "([^"]*)"       #    double-quoted string,
                        |               #    or···
                        '([^']*)'       #    single-quoted string,
                        |               #    or···
                        ([^'">\s]+)     #    "other stuff"
                      )                 #
                     }xi)
       {
         my $Url = $+; # Gives the highest-numbered actually-filled $1, $2, etc.
         print "$Url with link text: $Link\n";
       }
     }

Some notes about this:

This time, I added parentheses to each value-matching alternative, to capture the exact value matched.
Because I'm using some of the parentheses to capture, I've used non-capturing parentheses where I don't need to capture, both for clarity and efficiency.
This time, the "other stuff" component excludes whitespace in addition to quotes and '>', as whitespace separates "attribute=value" pairs.
This time, I do use + in the "other stuff" alternative, as it's needed to capture the whole href value. Does this cause the same "rude surprise" as if we used + in the "other stuff" alternative in Section 5.3.1? No, because there's no outer quantifier that directly influences the class being repeated. Again, this is covered in detail in the next chapter.

Depending on the text, the actual URL may end up in $1, $2, or $3. The others will be empty or undefined. Perl happens to support a special variable $+ which is the value of the highest-numbered $1, $2, etc. that actually captured text. In this case, that's exactly what we want as our URL.

Using $+ is convenient in Perl, but other languages offer other ways to isolate the captured URL. Normal programming constructs can always be used to inspect the captured groups, using the one that has a value. If supported, named capturing (see Section 3.4.5.3) is perfect for this, as shown in the VB.NET example in Section 5.3.4. (It's good that .NET offers named capture, because its $+ is broken; see Section 9.3.2.1.)
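As one illustration of the "normal programming constructs" route, here is a rough Python sketch of the same two-step approach. Python has no $+, so the code simply picks whichever of the three groups took part in the match. The names A_TAG, HREF, find_links, and sample are assumptions for this sketch, and the first regex is the same simplistic one discussed above:

```python
import re

# Step 1: the <a> "guts" and the link text (case-insensitive, dot-matches-all).
A_TAG = re.compile(r"<a\b([^>]+)>(.*?)</a>", re.IGNORECASE | re.DOTALL)

# Step 2: the href value, with one capturing group per alternative.
HREF = re.compile(r"""
    \b href            # "href" attribute
    \s* = \s*          # "=" may have whitespace on either side
    (?:
        "([^"]*)"      # double-quoted string,
      | '([^']*)'      # or single-quoted string,
      | ([^'">\s]+)    # or "other stuff"
    )
""", re.IGNORECASE | re.VERBOSE)

def find_links(html):
    """Return a list of (url, link_text) pairs found in html."""
    links = []
    for a in A_TAG.finditer(html):
        guts, text = a.group(1), a.group(2)
        m = HREF.search(guts)
        if m:
            # Only one alternative can have matched; use its group.
            url = next(g for g in m.groups() if g is not None)
            links.append((url, text))
    return links

sample = (
    '<a href="http://www.oreilly.com">O\'Reilly And Associates</a> '
    "<A HREF = 'x.html'>two\nlines</A> "
    "<a name=top href=/index.html>home</a>"
)
links = find_links(sample)
for url, text in links:
    print(f"{url} with link text: {text!r}")
```

Testing each group against None (rather than for truthiness) keeps an empty href="" from being mistaken for "no value."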

5.3.3 Examining an HTTP URL

Now that we've got a URL, let's see if it's an http URL, and if so, pluck it apart into its hostname and path components. Since we know we have something intended to be a URL, our task is made much simpler than if we had to identify a URL from among random text. That much more difficult task is investigated a bit later in this chapter.

So, given a URL, we merely need to be able to recognize the parts. The hostname is everything after ^http:// but before the next slash (if there is another slash), and the path is everything else: ^http://([^/]+)(/.*)?$

Actually, a URL may have an optional port number between the hostname and the path, with a leading colon: ^http://([^/:]+)(:(\d+))?(/.*)?$

Here's a Perl snippet to report about a URL:

     if ($url =~ m{^http://([^/:]+)(:(\d+))?(/.*)?$}i)
     {
       my $host = $1;
       my $port = $3 || 80;  # Use $3 if it exists; otherwise default to 80.
       my $path = $4 || "/"; # Use $4 if it exists; otherwise default to "/".
       print "host: $host\n";
       print "port: $port\n";
       print "path: $path\n";
     } else {
       print "not an http url\n";
     }
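For comparison, here is roughly the same logic sketched in Python. The names HTTP_URL and split_http_url are assumptions for illustration. In this version the colon sits inside a non-capturing group, so the three capturing groups are simply host, port, and path; that grouping is a choice made for this sketch:

```python
import re

HTTP_URL = re.compile(r"^http://([^/:]+)(?::(\d+))?(/.*)?$", re.IGNORECASE)

def split_http_url(url):
    """Return (host, port, path) for an http URL, or None for anything else."""
    m = HTTP_URL.match(url)
    if m is None:
        return None
    host, port, path = m.groups()
    # Missing pieces default to port 80 and path "/", as in the Perl snippet.
    return (host, int(port) if port else 80, path or "/")

print(split_http_url("http://www.oreilly.com:8080/catalog/regex/"))
print(split_http_url("http://www.oreilly.com"))
print(split_http_url("ftp://ftp.oreilly.com/"))
```

The first call yields the explicit port and path, the second falls back to the defaults, and the third returns None because the URL isn't an http URL.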

5.3.4 Validating a Hostname

In the previous example, we used [^/:]+ to match a hostname. Yet, in Chapter 2 (see Section 2.3.6.7), we used the more complex [-a-z]+(\.[-a-z]+)*\.(com|edu|···|info). Why the difference in complexity for finding ostensibly the same thing?

Well, even though both are used to "match a hostname," they're used quite differently. It's one thing to pluck out something from a known quantity (e.g., from something you know to be a URL), but it's quite another to accurately and unambiguously pluck out that same type of something from among random text. Specifically, in the previous example, we made the assumption that what comes after the 'http://' is a hostname, so the use of [^/:]+ merely to fetch it is reasonable. But in the Chapter 2 example, we use a regex to find a hostname in random text, so it must be much more specific.

Link Checker in VB.NET

This program reports on links within the HTML in the variable Html:

     Imports System.Text.RegularExpressions
     · · ·
     ' Set up the regular expressions we'll use in the loop
     Dim A_Regex as Regex = New Regex( _
        "<a\b(?<guts>[^>]+)>(?<Link>.*?)</a>", _
        RegexOptions.IgnoreCase)
     Dim GutsRegex as Regex = New Regex( _
        "\b HREF              (?# 'href' attribute             )" & _
        "\s* = \s*            (?# '=' with optional whitespace )" & _
        "(?:                  (?# Value is ...                 )" & _
        "  ""(?<url>[^""]*)"" (?# double-quoted string,        )" & _
        "  |                  (?# or ...                       )" & _
        "  '(?<url>[^']*)'    (?# single-quoted string,        )" & _
        "  |                  (?# or ...                       )" & _
        "  (?<url>[^'"">\s]+) (?# 'other stuff'                )" & _
        ")                    (?#                              )", _
        RegexOptions.IgnoreCase OR RegexOptions.IgnorePatternWhitespace)

     ' Now check the 'Html' Variable . . .
     Dim CheckA as Match = A_Regex.Match(Html)
     ' For each match within . . .
     While CheckA.Success
        ' We matched an <a> tag, so now check for the URL.
        Dim UrlCheck as Match = _
           GutsRegex.Match(CheckA.Groups("guts").Value)
        If UrlCheck.Success
           ' We've got a match, so have a URL/link pair
           Console.WriteLine("Url " & UrlCheck.Groups("url").Value & _
                             " WITH LINK " & CheckA.Groups("Link").Value)
        End If
        CheckA = CheckA.NextMatch
     End While

A few things to notice:

VB.NET programs using regular expressions require that first Imports line to tell the compiler what object libraries to use.
I've used (?#···) style comments because it's inconvenient to get a newline into a VB.NET string, and normal '#' comments carry on until the next newline or the end of the string (which means that the first one would make the entire rest of the regex a comment). To use normal #··· comments, add &chr(10) at the end of each line (see Section 9.3.1.2).
Each double quote in the regex requires '""' in the literal string (see Section 3.3.1.1).
Named capturing is used in both expressions, allowing the more descriptive Groups("url") instead of Groups(1), Groups(2), etc.

Now, for a third angle on matching a hostname, we can consider validating hostnames with regular expressions. In this case, we want to check whether a string is a well-formed, syntactically correct hostname. Officially, a hostname is made up of dot-separated parts, where each part can have ASCII letters, digits, and hyphens, but a part can't begin or end with a hyphen. Thus, one part can be matched with a case-insensitive application of [a-z0-9]|[a-z0-9][-a-z0-9]*[a-z0-9]. The final suffix part ('com', 'edu', 'uk', etc.) has a limited set of possibilities, mentioned in passing in the Chapter 2 example. Using that here, we're left with the following regex to match a syntactically valid hostname:

     ^
       (?i)   # apply this regex in a case-insensitive manner.
       # One or more dot-separated parts···
       (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]*[a-z0-9]\. )+
       # Followed by the final suffix part···
       (?: com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z] )
     $

Something matching this regex isn't necessarily valid quite yet, as there's a length limitation: individual parts may be no longer than 63 characters. That means that the [-a-z0-9]* in there should be [-a-z0-9]{0,61}.

There's one final change, just to be official. Officially, a name consisting of only one of the suffixes (e.g., 'com', 'edu', etc.) is also syntactically valid. Current practice seems to be that these "names" don't actually have a computer answer to them, but that doesn't always seem to be the case for the two-letter country suffixes. For example, Anguilla's top-level domain 'ai' has a web server: http://ai/ shows a page. A few others like this that I've seen include cc, co, dk, mm, ph, tj, tv, and tw.

So, if you wish to allow for these special cases, change the central (?:···)+ to (?:···)*. These changes leave us with:

     ^
       (?i)   # apply this regex in a case-insensitive manner.
       # Zero or more dot-separated parts···
       (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]{0,61}[a-z0-9]\. )*
       # Followed by the final suffix part···
       (?: com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z] )
     $

This now works just dandy to validate a string containing a hostname. Since this is the most specific of the three hostname-related regexes we've developed, you might think that if you remove the anchors, it could be better than the regex we came up with earlier for plucking out hostnames from random text. That's not the case. This regex matches any two-letter word, which is why the less-specific regex from Chapter 2 is better in practice. But, it still might not be good enough for some purposes, as the next section shows.
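Both claims can be checked with a small Python sketch. The names HOSTNAME_BODY, HOSTNAME, and is_valid_hostname are assumptions for illustration; fullmatch stands in for the ^ and $ anchors, and each part ends in [a-z0-9], in keeping with the rule that a part can't end with a hyphen:

```python
import re

# The body of the validation regex, without anchors.
HOSTNAME_BODY = r"""
    # Zero or more dot-separated parts ...
    (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]{0,61}[a-z0-9]\. )*
    # ... followed by the final suffix part
    (?: com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z] )
"""
HOSTNAME = re.compile(HOSTNAME_BODY, re.IGNORECASE | re.VERBOSE)

def is_valid_hostname(name):
    """True if the whole string is a syntactically valid hostname."""
    return HOSTNAME.fullmatch(name) is not None

print(is_valid_hostname("www.oreilly.com"))     # True
print(is_valid_hostname("ai"))                  # True
print(is_valid_hostname("bad-.com"))            # False
print(is_valid_hostname("a" * 64 + ".com"))     # False

# Without anchors, every two-letter word looks like a country-code suffix.
print(HOSTNAME.findall("go to it"))             # ['go', 'to', 'it']
```

The last line shows why the unanchored form is a poor choice for scanning prose: ordinary two-letter words match as if they were top-level domains.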

5.3.5 Plucking Out a URL in the Real World

Working for Yahoo! Finance, I write programs that process incoming financial news and data feeds. News articles are usually provided to us in raw text, and my programs convert them to HTML for a pleasing presentation. (Read financial news at http://finance.yahoo.com and see how I've done.) It's often a daunting task due to the random "formatting" (or lack thereof) of the data we receive, and because it's much more difficult to recognize things like hostnames and URLs in raw text than it is to validate them once you've got them. The previous section alluded to this; in this section, I'll show you code we actually use at Yahoo! to solve the issues we've faced.

We look for several types of URLs to pluck from the text: mailto, http, https, and ftp URLs. If we find 'http://' in the text, we're pretty certain that's the start of a URL, so we can use something simple like http://[-\w]+(\.\w[-\w]*)+ to match up through the hostname part. We're using the knowledge of the text (raw English text provided as ASCII) to realize that it's probably okay to use [-\w] instead of [-a-z0-9]. \w also matches an underscore, and in some systems also matches the whole of Unicode letters, but we know that neither of these really matters to us in this particular situation.

However, often, a URL is given without the http:// or mailto: prefix, such as:

     ···visit us at www.oreilly.com or mail to orders@oreilly.com.

In this case, we need to be much more careful. What we use is quite similar to the regex from the previous section, but it differs in a few ways:

     (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
     # Now ending .com, etc. For these, we require lowercase
     (?-i: com\b
         | edu\b
         | biz\b
         | org\b
         | gov\b
         | in(?:t|fo)\b  # .int or .info
         | mil\b
         | net\b
         | name\b
         | museum\b
         | coop\b
         | aero\b
         | [a-z][a-z]\b  # two-letter country codes
     )

In this regex, (?i:···) and (?-i:···) are used to explicitly enable and disable case-insensitivity for specific parts of the regex (see Section 3.4.4.2). We want to match a URL like 'www.OReilly.com', but not a stock symbol like 'NT.TO' (the stock symbol for Nortel Networks on the Toronto Stock Exchange; remember, we process financial news and data, which has a lot of stock symbols). Officially, the ending part of a URL (e.g., '.com') may be upper case, but we simply won't recognize those. That's the balance we've struck among matching what we want (pretty much every URL we're likely to see), not matching what we don't want (stock symbols), and simplicity. I suppose we could move the (?-i:···) to wrap only the country codes part, but in practice, we just don't get uppercased URLs, so we've left this as it is.
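Python's re module also supports these mode-modified spans, so the effect can be sketched there. The name HOSTNAME_IN_TEXT and the sample sentences are assumptions for illustration:

```python
import re

HOSTNAME_IN_TEXT = re.compile(r"""
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+   # sub domains, any case
    (?-i: com\b | edu\b | biz\b | org\b | gov\b   # suffix, lowercase only
        | in(?:t|fo)\b | mil\b | net\b | name\b
        | museum\b | coop\b | aero\b
        | [a-z][a-z]\b )
""", re.VERBOSE)

# Mixed case in the leading parts is accepted ...
m = HOSTNAME_IN_TEXT.search("visit www.OReilly.com today")
print(m.group())  # www.OReilly.com

# ... but an uppercase "suffix", as in a stock symbol, is not.
print(HOSTNAME_IN_TEXT.search("shares of NT.TO rose"))  # None
```

No global case-insensitive flag is set here, so (?-i:···) is strictly redundant in this sketch; it is kept to mirror the regex as written.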

Here's a framework for finding URLs in raw text, into which we can insert the subexpression to match a hostname:

     \b
     # Match the leading part (proto://hostname, or just hostname)
     (
         # ftp://, http://, or https:// leading part
         (ftp|https?)://[-\w]+(\.\w[-\w]*)+
       |
         # or, try to find a hostname with our more specific sub-expression
         full-hostname-regex
     )

     # Allow an optional port number
     ( : \d+ )?

     # The rest of the URL is optional, and begins with / . . .
     (
         / path-part
     )?

I haven't talked yet about the path part of the regex, which comes after the hostname (e.g., the '/catalog/regex/' part of http://www.oreilly.com/catalog/regex/). The path part turns out to be the most difficult text to match properly, as it requires some guessing to do a good job. As discussed in Chapter 2, what often comes after a URL in the text is also allowed as part of a URL. For example, with

     Read his comments at http://www.oreilly.com/ask_tim/index.html. He ...

we can look and realize that the period after 'index.html' is English punctuation and should not be considered part of the URL, yet the period within 'index.html' is part of the URL.

Although it's easy for us humans to differentiate between the two, it's quite difficult for a program, so we've got to come up with some heuristics that get the job done as best we can. The approach taken with the Chapter 2 example is to use negative lookbehind to ensure that a URL can't end with sentence-ending punctuation characters. What we've been using at Yahoo! Finance was originally written before negative lookbehind was available, and so is more complex than the Chapter 2 approach, but in the end it has the same effect. It's shown in the listing below. The approach taken for the path part is different in a number of respects, and the comparison with the Chapter 2 example in Section 2.3.6.6 should be interesting. In particular, the Java version of this regex in the sidebar in Section 5.4.1 provides some insight as to how it was built.

In practice, I doubt I'd actually write out a full monster like this, but instead I'd build up a "library" of regular expressions and use them as needed. A simple example of this is shown with the use of $HostnameRegex in Section 2.3.6.7, and also in the sidebar in Section 5.4.1.

Regex to pluck a URL from financial news

     \b
     # Match the leading part (proto://hostname, or just hostname)
     (
          # ftp://, http://, or https:// leading part
          (ftp|https?)://[-\w]+(\.\w[-\w]*)+
       |
          # or, try to find a hostname with our more specific sub-expression
          (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
          # Now ending .com, etc. For these, require lowercase
          (?-i: com\b
              | edu\b
              | biz\b
              | gov\b
              | in(?:t|fo)\b # .int or .info
              | mil\b
              | net\b
              | org\b
              | [a-z][a-z]\b # two-letter country codes
          )
     )

     # Allow an optional port number
     ( : \d+ )?

     # The rest of the URL is optional, and begins with / . . .
     (
          /
          # The rest are heuristics for what seems to work well
          [^.,?;"'<>()\[\]{}\s\x7F-\xFF]*
          (?:
              [.,?]+ [^.,?;"'<>()\[\]{}\s\x7F-\xFF]+
          )*
     )?
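The "library of regular expressions" idea can be sketched in Python by keeping the hostname and path pieces in their own strings and assembling them. The names HOSTNAME, PATH, and URL and the sample sentences are assumptions for illustration, and all groups are written as non-capturing. Note that the path classes here exclude '.', ',' and '?', which the heuristic needs so that punctuation is accepted only when more URL text follows it:

```python
import re

# Hostname without a protocol prefix: any-case parts, lowercase suffix.
HOSTNAME = r"""
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+
    (?-i: com\b | edu\b | biz\b | gov\b | in(?:t|fo)\b
        | mil\b | net\b | org\b | [a-z][a-z]\b )
"""

# Path heuristic: punctuation is allowed only when followed by more path text.
PATH = r"""
    /
    [^.,?;"'<>()\[\]{}\s\x7F-\xFF]*
    (?: [.,?]+ [^.,?;"'<>()\[\]{}\s\x7F-\xFF]+ )*
"""

URL = re.compile(rf"""
    \b
    (?: (?:ftp|https?)://[-\w]+(?:\.\w[-\w]*)+   # proto://hostname
      | {HOSTNAME}                               # or a bare hostname
    )
    (?: : \d+ )?                                 # optional port
    (?: {PATH} )?                                # optional path
""", re.VERBOSE)

text = "Read his comments at http://www.oreilly.com/ask_tim/index.html. He ..."
print(URL.search(text).group())
# http://www.oreilly.com/ask_tim/index.html

print(URL.search("visit us at www.oreilly.com today").group())
# www.oreilly.com
```

The period inside 'index.html' is kept because path text follows it, while the sentence-ending period is left out of the match.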
