Advanced Regular Expression Tips

Get to know named groups and backreferences.

asp:FeatureCompanion

LANGUAGES: VB .NET

TECHNOLOGIES: Regular Expressions | Named Groups | Backreferences

 

By Steven Smith

 

In the September 2002 issue of asp.netPRO, I covered the essentials of using regular expressions in ASP.NET in the article "Create No-Sweat Regular Expressions." Regular expressions offer far more functionality than a single article can cover; in fact, several books have been written on the subject. Two of the more advanced features that bear mentioning here are named groups and backreferences.

 

Any time parentheses are used in a regular expression, their contents denote a group, and each group is assigned an integer index. A group is simply a standalone portion of a regular expression. The entire match is always the first group, numbered zero (0). The .NET Framework classes also support named groups using the syntax (?<name>...). When named groups and unnamed groups both appear in the same expression, the numbering can get tricky: unnamed groups are numbered first, from left to right, and named groups are numbered after them (see the .NET Framework SDK documentation for more details on this). Using groups, you can extract a particular substring from a match. The Match object exposes a Groups property, of type GroupCollection, which you can use to access every group in a match, either by name or by integer index.
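
To make the group lookups concrete, here is a minimal console sketch of reading groups by name and by index. The module name GroupDemo, the helper GetAreaCode, and the group name "area" are made up for illustration and do not come from the original listing; the pattern is only one possible way to write a phone number expression.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GroupDemo
    ' One named group (area) followed by two unnamed groups.
    ' Assumption: this pattern is illustrative only.
    Private Const PhonePattern As String = _
        "\((?<area>\d{3})\)\s*(\d{3})-(\d{4})"

    ' Hypothetical helper: returns the text captured by the
    ' named group "area", or an empty string if nothing matches.
    Function GetAreaCode(input As String) As String
        Dim m As Match = Regex.Match(input, PhonePattern)
        If Not m.Success Then Return String.Empty
        Return m.Groups("area").Value
    End Function

    Sub Main()
        Dim r As New Regex(PhonePattern)
        Dim m As Match = r.Match("(555) 123-4567")

        ' Group 0 is always the entire match.
        Console.WriteLine(m.Groups(0).Value)

        ' A named group can be read by name...
        Console.WriteLine(m.Groups("area").Value)

        ' ...while unnamed groups are read by number. Because
        ' unnamed groups are numbered before named ones, group 1
        ' here is the first unnamed group, not the area code.
        Console.WriteLine(m.Groups(1).Value)

        ' GroupNumberFromName reports the number assigned
        ' to a named group.
        Console.WriteLine(r.GroupNumberFromName("area"))
    End Sub
End Module
```

Reading groups by name sidesteps the numbering question entirely, which is usually the safer choice once an expression mixes named and unnamed groups.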

 

As a simple example, Figure 1 demonstrates how to extract the area codes from a listing of telephone numbers. In this case, the list is a hard-coded string, but you can see how this easily could be extended to apply to a text file or the contents of a Web page. The expression names the area code as group number one (1) with this piece of syntax: (?<1>\(\d{3}\)). Everything between (?<1> and the last parenthesis, ), is considered to be group one (1). Also note the use of group zero (0) in this example - remember, this matches the entire expression. In this case, each entire phone number at the start of a line should be matched by group zero.

 

<%@ Page language="vb" %>
<%@ Import Namespace="System.Text.RegularExpressions" %>
<script runat="server">
Protected Sub Page_Load()
  'Populate our phone number data
  Dim strPhoneNumbers As String
  strPhoneNumbers = "Customer Phone Numbers:" & _
   System.Environment.NewLine
  strPhoneNumbers += "(555) 123-4567" & _
   System.Environment.NewLine
  strPhoneNumbers += "(555)345-2345"
  Response.Write("Input Data:<br /><pre>")
  Response.Write(strPhoneNumbers & "</pre><hr>")
  Response.Write("Output Data:<br />")
  'Parse the data and output results
  ListAreaCodes(strPhoneNumbers)
End Sub
 
'Extracts and Response.Writes all area codes,
'their index within the input string,
'and the phone number they are extracted from
'for a given input string.
Sub ListAreaCodes(inputString As String)
  Dim r As Regex
  Dim m As Match
 
  'Match phone numbers and make area code group #1.
  'Multiline lets ^ match at the start of every line.
  'Compiled trades a slower startup for faster matching
  'in the loop below.
  r = New Regex("^(?<1>\(\d{3}\))\s*\d{3}-\d{4}", _
   RegexOptions.Compiled Or RegexOptions.Multiline)
 
  m = r.Match(inputString)
  'Loop through each match and write it to the screen
  While m.Success
    Response.Write("Found area code " & _
      m.Groups(1).Value _
     & " at " & m.Groups(1).Index.ToString() & _
     " for phone number " & _
     m.Groups(0).Value & "<br />")
    m = m.NextMatch()
  End While
End Sub
</script>

Figure 1. This is an example of using groups to extract a substring from a match. The area code of each phone number is captured in group 1, and the matches are walked one at a time with Match.NextMatch and written to the screen.
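
As an alternative to stepping through matches one at a time, the same expression can be run through Regex.Matches, which returns a MatchCollection. The sketch below is a console version under a few assumptions: the module name AreaCodeDemo and the helper GetAreaCodes are made up for illustration, and the generic List(Of String) requires .NET Framework 2.0 or later.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module AreaCodeDemo
    ' Hypothetical helper: collects group 1 (the area code,
    ' including its parentheses) from every phone number found
    ' at the start of a line.
    Function GetAreaCodes(inputString As String) As List(Of String)
        Dim result As New List(Of String)()
        Dim r As New Regex("^(?<1>\(\d{3}\))\s*\d{3}-\d{4}", _
            RegexOptions.Multiline)

        ' Regex.Matches returns every match in a MatchCollection.
        For Each m As Match In r.Matches(inputString)
            result.Add(m.Groups(1).Value)
        Next
        Return result
    End Function

    Sub Main()
        Dim data As String = "Customer Phone Numbers:" & _
            Environment.NewLine & "(555) 123-4567" & _
            Environment.NewLine & "(555)345-2345"

        ' Writes one area code per line.
        For Each code As String In GetAreaCodes(data)
            Console.WriteLine(code)
        Next
    End Sub
End Module
```

Either approach visits the same matches; the collection form is convenient when the number of matches is needed up front through its Count property.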

 

Backreferences provide a way for an expression to refer back to an already matched item within the expression. For instance, to match a repeated word character, you would use an expression such as (\w)\1. This would match the "bb" in "bubble," for example. Another powerful use of named groups is in rearranging text within a match using the Regex.Replace method, where the replacement string refers back to each group by name, as shown here:

 

Function MDYToDMY(input As String) As String

  Return Regex.Replace(input, _

   "\b(?<month>\d{1,2})/(?<day>\d{1,2})/" & _

   "(?<year>\d{2,4})\b", "${day}-${month}-${year}")

End Function

 

In this function, the Regex.Replace method converts dates in text from U.S. to European format (for example, MM/DD/YYYY to DD-MM-YYYY). This is accomplished easily by capturing the month, day, and year in named groups (instead of numbered groups) and referencing them as ${day}, ${month}, and ${year} in the replacement string passed to Regex.Replace.
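
The two techniques above can be tried together in a small console sketch. The module name BackreferenceDemo, the helper FirstDoubledChar, the group name "ch", and the sample sentence are made up for illustration; MDYToDMY is copied from the function shown earlier. The \k<name> syntax is the named-group counterpart of a numbered backreference such as \1.

```vb
Imports System
Imports System.Text.RegularExpressions

Module BackreferenceDemo
    ' Hypothetical helper: returns the first doubled word
    ' character in the input, or an empty string if none.
    Function FirstDoubledChar(input As String) As String
        ' (?<ch>\w) captures one word character;
        ' \k<ch> requires that same character again.
        Dim m As Match = Regex.Match(input, "(?<ch>\w)\k<ch>")
        If m.Success Then Return m.Value
        Return String.Empty
    End Function

    ' Same function as in the article text.
    Function MDYToDMY(input As String) As String
        Return Regex.Replace(input, _
            "\b(?<month>\d{1,2})/(?<day>\d{1,2})/" & _
            "(?<year>\d{2,4})\b", "${day}-${month}-${year}")
    End Function

    Sub Main()
        ' Numbered backreference: prints bb
        Console.WriteLine(Regex.Match("bubble", "(\w)\1").Value)

        ' Named backreference: prints bb
        Console.WriteLine(FirstDoubledChar("bubble"))

        ' Prints: Shipped 15-9-2002, billed 01-10-2002
        Console.WriteLine( _
            MDYToDMY("Shipped 9/15/2002, billed 10/01/2002"))
    End Sub
End Module
```

Note that the expression only rearranges digits; it does not check that the captured values form a valid date, so input such as 99/99/2002 would be rewritten as well.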

 

Steven A. Smith is president of ASPAlliance.com and head trainer at ASPSmith.com, which provides .NET training. He is a co-author of ASP.NET By Example (Que) and a speaker at several .NET conferences each year. E-mail him at [email protected].

 

 

 
