C# Regular Expressions - the basics


This is a brief overview of the main classes and methods in the System.Text.RegularExpressions namespace. It doesn't cover regex patterns in any depth, but it gives an introduction to the power of regular expressions in C#.

The C# Regex Class - System.Text.RegularExpressions

The Regex class contains the regular expression pattern and has a number of methods. The most commonly used of these are:

  • IsMatch(string) - Returns true or false to indicate whether the pattern is matched in the string passed as an argument
  • Match(string) - Returns a single Match object for the first match in the string. If nothing matches, the object's Success property is false and its Value is an empty string.
  • Matches(string) - Returns a MatchCollection object containing zero or more Match objects, one for each match found in the string passed as an argument
  • Replace(string, replacement) - Replaces every match of the pattern in the string with the replacement string
  • Split(string) - Uses the pattern as a delimiter and returns an array of strings

Most of these methods also have static overloads that take the pattern as an extra argument, so a new Regex object doesn't necessarily need to be created each time.

Other classes in System.Text.RegularExpressions

In addition to Regex, Match and MatchCollection, six other types are worth knowing:

  • Capture - Represents the text captured by a single set of parentheses (....) surrounding a subexpression
  • CaptureCollection - Represents a collection of Capture objects
  • Group - Represents the result of a single capturing group of paired parentheses
  • GroupCollection - Represents a collection of Group objects
  • RegexCompilationInfo - Provides information for the compiler to use to compile the regular expression to an assembly.
  • RegexOptions - An enumeration (rather than a class) that specifies which of the available options are set.
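
To make the relationship between Match, Group and Capture a little more concrete, here is a minimal sketch. The input string and the pattern are my own hypothetical choices, and Console.WriteLine stands in for Response.Write so that the snippet can run outside a web page. The capturing group is repeated, so it captures more than once within a single match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input and pattern, chosen only for illustration.
string input = "A12 B34";

// (?: ... )+ repeats a non-capturing wrapper; the inner parentheses capture.
Match m = Regex.Match(input, @"(?:([A-Z]\d+)\s?)+");

// Groups[0] is the whole match; Groups[1] is the first set of capturing parentheses.
Group g = m.Groups[1];
Console.WriteLine(m.Value);   // the whole match
Console.WriteLine(g.Value);   // the last thing the group captured

// Captures holds every capture the group made, in order.
foreach (Capture c in g.Captures)
{
    Console.WriteLine(c.Index + ": " + c.Value);
}
```

The Group's Value only reports its most recent capture, which is why the CaptureCollection is needed when a group sits inside a repeated part of the pattern.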

RegexOptions

These are the most common options:

  • None - specifies that no options are set
  • IgnoreCase - matching is case-insensitive (default is case-sensitive)
  • Multiline - Changes the meaning of ^ and $ so that they match at the beginning and end of each line, rather than only at the beginning and end of the whole string
  • Compiled - Specifies that the pattern is compiled to MSIL code rather than interpreted. Slower start-up, but faster for repeated use.
  • IgnorePatternWhitespace - Ignores unescaped whitespace in the pattern, and allows comments preceded by #
  • ExplicitCapture - Only explicitly named or numbered groups, such as (?<name>...), capture. Plain parentheses group without capturing.
  • Singleline - Forces the period character to match every character. Default behaviour is that it does not match \n.
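
The effect of several of these options is easier to see side by side. The following is a rough sketch rather than anything definitive: the sample text and the group name "num" are assumptions of mine, and Console.WriteLine is used in place of Response.Write so that it runs as a console program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-line sample text.
string text = "first line\nsecond line";

// Without Multiline, ^ matches only at the very start of the input.
int plain = Regex.Matches(text, @"^\w+").Count;
int multi = Regex.Matches(text, @"^\w+", RegexOptions.Multiline).Count;
Console.WriteLine(plain + " vs " + multi);

// Without Singleline, the period stops at \n.
string dotDefault = Regex.Match(text, @"first.*").Value;
string dotSingle = Regex.Match(text, @"first.*", RegexOptions.Singleline).Value;
Console.WriteLine(dotDefault.Length + " vs " + dotSingle.Length);

// IgnorePatternWhitespace allows layout and # comments; options combine with |.
Regex re = new Regex(@"
    [a-z]   # a letter
    \d      # a digit",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
bool spaced = re.IsMatch("A12");
Console.WriteLine(spaced);

// ExplicitCapture: the plain parentheses no longer capture, the named group does.
Match m = Regex.Match("A12", @"([A-Z])(?<num>\d+)", RegexOptions.ExplicitCapture);
Console.WriteLine(m.Groups.Count + " groups, num = " + m.Groups["num"].Value);
```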

Regular Expression Patterns

Patterns are the key to successful use of regular expressions. Depending on the complexity of the problem that regex is used to solve, these can vary from a straightforward human-readable string to a complex combination of literal characters, metacharacters, and grouping and capturing constructs. A cheat sheet of the characters used in C# regular expressions can be found here. The real skill is in creating a pattern that does the job, so a thorough understanding of the metacharacters, and in particular the quantifiers, is essential to efficient use of regular expressions. There is no real shortcut to learning this aspect of the discipline, although I have found that unpicking patterns I find employed successfully elsewhere is a good start. One place where a lot of these can be found is regexlib.com, a repository for patterns donated by people bitten by the Regex bug.

I have also found that employing an iterative testing methodology is a useful tool. Some people call this good old-fashioned trial and error. It amounts to the same thing. Basically this revolves around building a complex pattern in small steps - testing each step against some sample data. That way, I can see the effect each change to the pattern has, and adjust accordingly in small steps, rather than starting over each time the test doesn't give me the desired result.
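
As an illustration of that small-steps approach, here is one way it might look in code. The sample strings and the three pattern steps are hypothetical, picked only to show the idea, and the output goes to the console rather than the browser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample data: some strings should match the final pattern, some should not.
string[] samples = { "A12", "b7", "12A", "AB" };

// Each step refines the previous pattern a little.
string[] steps =
{
    @"[a-z]",        // step 1: any letter anywhere
    @"[a-z]\d",      // step 2: a letter followed by a digit
    @"^[a-z]\d+$",   // step 3: the whole string is a letter, then one or more digits
};

// Run every step against every sample and print the result.
foreach (string pattern in steps)
{
    Console.WriteLine("Pattern: " + pattern);
    foreach (string sample in samples)
    {
        bool ok = Regex.IsMatch(sample, pattern, RegexOptions.IgnoreCase);
        Console.WriteLine("  " + sample + " -> " + ok);
    }
}
```

Printing the result for every sample at every step makes it obvious which change to the pattern started rejecting (or accepting) a given string.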

IsMatch() and Match() methods

The following sample employs the pattern [a-z]\d, which looks for a letter in the range a-z followed by a digit. If there is a match (tested with IsMatch), the match, which has been captured in the Match object m, has its value written to the browser. The result of IsMatch (True or False) is also written regardless.

string input = "A12";

//Instantiating Regex object
Regex re = new Regex(@"[a-z]\d", RegexOptions.IgnoreCase);
Match m = re.Match(input);
if (re.IsMatch(input))
{
  Response.Write(m.Value + "<br />");
}

Response.Write(re.IsMatch(input) + "<br />");

//Static method - no regex object instantiated
Match m2 = Regex.Match(input, @"[a-z]\d");
if (Regex.IsMatch(input, @"[a-z]\d"))
{
  Response.Write(m2.Value + "<br />");
}

Response.Write(Regex.IsMatch(input, @"[a-z]\d") + "<br />");

Both approaches are shown here: instantiating an object and using the static method. The first attempt is successful, and A1 is written, along with True. The second attempt results in only False being written, because the IgnoreCase option was omitted from the static calls. The input string is in uppercase, while the pattern is in lowercase.
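
Because Match() always hands back a Match object, its Success property can be tested directly, which avoids running the pattern a second time through IsMatch(). A minimal sketch of that alternative, using Console.WriteLine instead of Response.Write so it can run on its own:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "A12";

// One pass over the input: match, then inspect the result.
Match m = Regex.Match(input, @"[a-z]\d", RegexOptions.IgnoreCase);
if (m.Success)
{
    Console.WriteLine(m.Value + " found at index " + m.Index);
}

// Without IgnoreCase there is no match, but a Match object is still returned.
Match miss = Regex.Match(input, @"[a-z]\d");
Console.WriteLine(miss.Success);        // False
Console.WriteLine(miss.Value.Length);   // 0, because Value is an empty string
```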

Matches()

The Matches method returns a MatchCollection containing zero or more Match objects. Again, both approaches are shown (including the overloaded static method), but this time, the result is the same for both.

string input = "A12 B34 C56 D78";

//Instantiating Regex Object
Regex re = new Regex(@"[a-z]\d", RegexOptions.IgnoreCase);
MatchCollection mc = re.Matches(input);
foreach (Match mt in mc)
{
  Response.Write(mt.ToString() + "<br />");
}
Response.Write(mc.Count.ToString() + "<br />");


//Static overload - no Regex Object instantiated
MatchCollection mc2 = Regex.Matches(input, @"[a-z]\d", RegexOptions.IgnoreCase);
foreach (Match mt in mc2)
{
  Response.Write(mt.ToString() + "<br />");
}
Response.Write(mc2.Count.ToString() + "<br />");

Both samples use the same pattern as the Match and IsMatch examples, and will collect each successful match in a Match object, which is held in the MatchCollection. The collection is iterated over in a foreach loop, with the contents of each Match written to the browser. The MatchCollection.Count property is also referenced, and its value output to the browser. In both cases, the output is:

A1
B3
C5
D7
4

Replace()

Keeping with the same pattern, which matches a letter then a digit, Replace() is used to swap every matched value for the replacement string, which in the example below is "XX". Again, the overloaded static method is shown.

string input = @"A12 B34 C56 D78 E12 F34 G56 H78";

//Instantiating Regex Object
Regex re = new Regex(@"[a-z]\d", RegexOptions.IgnoreCase);
string newString = re.Replace(input, "XX");
Response.Write(newString);

//Static overload - no Regex Object instantiated
string newString2 = Regex.Replace(input, @"[A-Z]\d", "XX");
Response.Write(newString2);

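Both will result in a changed string showing XX2 XX4 XX6 XX8 XX2 XX4 XX6 XX8. The static call uses [A-Z] rather than the IgnoreCase option, which gives the same result here because the input is in uppercase.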
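
Replace() can do a little more than swap in a fixed string. The sketch below is only an illustration with a shortened, hypothetical input: it refers back to captured groups with $1 and $2 in the replacement, and it uses the instance overload that takes a count to limit how many matches are replaced. Output goes to the console rather than the browser.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened, hypothetical input for illustration.
string input = "A12 B34";

// $1 and $2 refer to the first and second capturing groups, so this swaps them.
string swapped = Regex.Replace(input, @"([a-z])(\d)", "$2$1", RegexOptions.IgnoreCase);
Console.WriteLine(swapped);    // 1A2 3B4

// The instance overload with a count replaces only that many matches.
Regex re = new Regex(@"[a-z]\d", RegexOptions.IgnoreCase);
string firstOnly = re.Replace(input, "XX", 1);
Console.WriteLine(firstOnly);  // XX2 B34
```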

Split()

Split() will break a string apart at the positions of the delimiter, which is specified as the pattern. In the case below, the pattern is \s, which matches any single whitespace character (here, a space). This could just as easily have been done using the String.Split() method. The result is the same with both: an array of elements, each containing a character sequence that was separated by the delimiter. By default, the delimiter is dispensed with.

string input = "AB1 DE2 FG3 HI4 JK5 LM6 NO7 PQ8 RS9";

//Instantiating Regex Object
Regex re = new Regex(@"\s");
string[] parts = re.Split(input);
foreach (string part in parts)
{
	Response.Write(part + "<br />");
}

//Static overload - no Regex Object instantiated
string[] parts2 = Regex.Split(input, @"\s");
foreach (string part2 in parts2)
{
	Response.Write(part2 + "<br />");
}

A foreach loop is used to iterate over the elements of the array and will result in the following output:
AB1
DE2
FG3
HI4
JK5
LM6
NO7
PQ8
RS9
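
Two details of Regex.Split() are worth a quick sketch: a single \s produces empty elements when the input contains runs of whitespace, and wrapping the delimiter in capturing parentheses keeps it in the result. The inputs here are hypothetical, and the results are written to the console rather than the browser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input with a double space and a tab between the parts.
string input = "AB1  DE2\tFG3";

// \s splits on each whitespace character, so the double space yields an empty element.
string[] single = Regex.Split(input, @"\s");
Console.WriteLine(single.Length);   // 4

// \s+ treats a run of whitespace as one delimiter.
string[] runs = Regex.Split(input, @"\s+");
Console.WriteLine(runs.Length);     // 3

// A capturing group around the delimiter keeps the delimiter in the array.
string[] kept = Regex.Split("AB1-DE2", @"(-)");
Console.WriteLine(string.Join("|", kept));   // AB1|-|DE2
```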