Team LiB

FAQs

Q: 

What does the pattern <((?i)TITLE>)(.*?)</\1 break down to?

Q: 

How do I know if my regex is too complex?

Answers

A: 

The answer is given in Table 4-1. Of particular interest is the subgroup (.*?). Notice that this uses a reluctant quantifier, so it matches as little as possible before reaching the first closing </title> tag. The difference shows up with input such as <title>first title</title><title>second title</title>: the pattern extracts only first title. Without the reluctant quantifier, it would extract first title</title><title>second title.
Table 4-1: The Pattern <((?i)TITLE>)(.*?)</\1

Regex   Description

<       The character < followed by
(       A group consisting of
(?i)    A case-insensitive comparison of
T       The character T followed by
I       The character I followed by
T       The character T followed by
L       The character L followed by
E       The character E followed by
>       The character >
)       Close group
(       Followed by a group consisting of
.       Any character
*       Repeated any number of times
?       Matched reluctantly
)       Close group, followed by
<       The character < followed by
/       The character / followed by
\1      The first group, which matched (?i)TITLE>

* In English: Extract the contents of the first occurrence of the TITLE element, and be willing to match any case version of TITLE, including Title, title, and so on.
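
To see the breakdown in action, here is a minimal sketch using java.util.regex. The class name TitleDemo, the method firstTitle, and the greedy variant of the pattern are my own assumptions, added only for comparison; they do not come from the table. Note that the backreference \1 sits outside the group that carries (?i), so it must repeat the opening tag's text in exactly the same case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class TitleDemo {
    // The pattern from Table 4-1, with the backslash doubled for a Java string literal.
    static final Pattern RELUCTANT = Pattern.compile("<((?i)TITLE>)(.*?)</\\1");

    // Assumed variant for comparison: the same pattern with a greedy .* instead of .*?
    static final Pattern GREEDY = Pattern.compile("<((?i)TITLE>)(.*)</\\1");

    // Returns the text captured by group 2 of the first match, or null if there is no match.
    static String firstTitle(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<title>first title</title><title>second title</title>";

        // Reluctant: stops at the first closing tag.
        System.out.println(firstTitle(RELUCTANT, html)); // prints first title

        // Greedy: runs on to the last closing tag.
        System.out.println(firstTitle(GREEDY, html));
        // prints first title</title><title>second title

        // Any case of TITLE is accepted, as long as both tags use the same case.
        System.out.println(firstTitle(RELUCTANT, "<Title>mixed case</Title>")); // prints mixed case

        // The closing tag must repeat the opening tag's case exactly.
        System.out.println(firstTitle(RELUCTANT, "<Title>mismatch</TITLE>")); // prints null
    }
}
```

If the opening and closing tags may differ in case, one option is to compile the whole pattern with Pattern.CASE_INSENSITIVE rather than scoping (?i) to the first group.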

A: 

The first goal of any regex pattern is, of course, that it works accurately and efficiently enough. The second goal is that it be legible. How do you know if it's legible? My advice is to comment it in as much detail as you feel it needs, and then pass it to a few developers who are likely to have to decipher it. If they can follow it (or better yet, modify it), then it's probably clear enough. If not, you may want to consider refactoring.
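
One way to comment a pattern in detail is the Pattern.COMMENTS flag, which tells the engine to ignore whitespace in the pattern and to treat everything from # to the end of the line as a comment. The sketch below applies it to the TITLE pattern from the first question; the class name CommentedTitlePattern and the method extract are hypothetical, chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the names here are illustrative only.
final class CommentedTitlePattern {
    // Each line of the pattern carries its own explanation.
    // In COMMENTS mode, whitespace is ignored and # starts a comment.
    static final Pattern TITLE = Pattern.compile(
          "<                 # opening angle bracket\n"
        + "( (?i) TITLE > )  # group 1: the tag name in any case, plus the closing bracket\n"
        + "( .*? )           # group 2: the contents, matched reluctantly\n"
        + "< /               # start of the closing tag\n"
        + "\\1               # the exact text captured by group 1\n",
        Pattern.COMMENTS);

    // Returns the contents of the first TITLE element, or null if none is found.
    static String extract(String input) {
        Matcher m = TITLE.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("<TITLE>Regex FAQ</TITLE>")); // prints Regex FAQ
    }
}
```

The trade-off is that a literal space or # in such a pattern has to be escaped, so this style suits patterns that are mostly symbols rather than literal text.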

