<?xml version="1.0"?>
<?xml-stylesheet type="text/xml" href="presenter.xsl"?>
<slideshow>
	<title>Regular Expressions</title>
	<slide>
		<title>Regular Expressions</title>
		<body>
			<p>By Karl Voelker</p>
		</body>
	</slide>
	<slide>
		<title>A Crucial Note</title>
		<body>
			<ul>
				<li>Many variations on regular expression syntax exist.</li>
				<li>Most of the syntax you see here is commonly supported, 
					but I make no guarantees.</li>
			</ul>
		</body>
	</slide>
	<slide>
		<title>What is a Regular Expression?</title>
		<body>
			<p>A pattern that:</p>
			<ul>
				<li>Matches strings</li>
				<li>Extracts parts of strings</li>
				<li>Alters strings</li>
			</ul>
			<p>A concept that:</p>
			<ul>
				<li>Is old, within Computer Science</li>
				<li>Completely changes how you deal with - 
					<em>and think about</em> - text</li>
				<li>Is often abbreviated to "regex" or "regexp"</li>
			</ul>
		</body>
	</slide>
	<slide>
		<title>Regular Expression Basics</title>
		<body>
			<p>Each character in a regex either:</p>
			<ul>
				<li>Matches <em>something</em>, usually zero or more 
					characters</li>
				<li>Modifies how the preceding character or group is matched</li>
			</ul>
		</body>
	</slide>
	<slide>
		<title>A Simple Match</title>
		<body>
			<p>Regex: <code>A</code></p>
			<ul>
				<li>This matches the first letter "A" found in the input</li>
				<li>We only find out one fact: whether or not the input 
					matched the regex</li>
				<li>Later, we learn why simply <em>matching</em> is useful</li>
			</ul>
		</body>
	</slide>
	<slide>
		<title>The Star</title>
		<body>
			<p>Regex: <code>B*</code></p>
			<ul>
				<li>This matches zero or more consecutive "B"s</li>
				<li>Do these example inputs match? If so, how many 
					characters do they match?
					<ul>
						<li>"B"</li>
						<li>"BBBBBBB"</li>
						<li>"ABBBCDE"</li>
						<li>"ABCB"</li>
						<li>"Q"</li>
						<li>""</li>
					</ul>
				</li>
			</ul>
		</body>
	</slide>
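	<slide>
		<title>The Star: Trying It Out</title>
		<body>
			<p>One way to check your answers to the star examples is to 
				run them. This is only a sketch: it assumes Python's 
				<tt>re</tt> module, and the helper name 
				<tt>star_match_length</tt> is made up for this slide. 
				<tt>re.search</tt> reports the first (leftmost) match it 
				finds.</p>
			<code>
```python
import re

def star_match_length(text):
    """Return the length of the first match of B* in text (hypothetical helper)."""
    match = re.search(r"B*", text)
    # B* can match zero characters, so search always finds a match.
    return len(match.group(0))

for text in ["B", "BBBBBBB", "ABBBCDE", "ABCB", "Q", ""]:
    print(repr(text), "matches", star_match_length(text), "character(s)")
```
			</code>
			<p>Every input matches, but "ABBBCDE" matches zero characters: 
				the engine succeeds with an empty match at the very start 
				of the input, before it ever reaches the "B"s.</p>
		</body>
	</slide>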
	<slide>
		<title>More Quantifiers</title>
		<body>
			<p>These are applied like the star.</p>
			<table>
				<tr><td><tt>+</tt></td>
					<td>One or more</td></tr>
				<tr><td><tt>?</tt></td>
					<td>Zero or one</td></tr>
				<tr><td><tt>{x}</tt></td>
					<td>Exactly X</td></tr>
				<tr><td><tt>{x,}</tt></td>
					<td>X or more</td></tr>
				<tr><td><tt>{,x}</tt></td>
					<td>Zero through X (not supported everywhere; 
					<tt>{0,x}</tt> is the portable form)</td></tr>
				<tr><td><tt>{x,y}</tt></td>
					<td>X through Y</td></tr>
			</table>
		</body>
	</slide>
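	<slide>
		<title>More Quantifiers: Trying It Out</title>
		<body>
			<p>The quantifier table can be spot-checked with a few inputs. 
				A minimal sketch, assuming Python's <tt>re</tt> module; the 
				helper <tt>matches_whole</tt> and the <tt>checks</tt> list 
				are invented for illustration. <tt>re.fullmatch</tt> 
				requires the whole input to match, which makes the counts 
				easy to see.</p>
			<code>
```python
import re

def matches_whole(pattern, text):
    """True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Each entry: (pattern, input, expected result)
checks = [
    (r"B+", "", False),          # + needs at least one "B"
    (r"B+", "BBB", True),
    (r"B?", "", True),           # ? allows zero or one "B"
    (r"B?", "BB", False),
    (r"B{3}", "BBB", True),      # exactly three
    (r"B{2,}", "B", False),      # two or more
    (r"B{2,4}", "BBBB", True),   # two through four
    (r"B{2,4}", "BBBBB", False),
]

for pattern, text, expected in checks:
    print(pattern, repr(text), matches_whole(pattern, text), "expected", expected)
```
			</code>
		</body>
	</slide>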
	<slide>
		<title>Character Classes</title>
		<body>
			<p>These expressions each match any of a set of characters:</p>
			<table>
				<tr><td><tt>.</tt></td><td>Anything but newlines</td></tr>
				<tr><td><tt>\d</tt></td><td>Digit</td></tr>
				<tr><td><tt>\s</tt></td>
					<td>Whitespace (<tt>\n</tt>, <tt>\r</tt>, <tt>\t</tt>, 
					space)</td></tr>
				<tr><td><tt>\w</tt></td>
					<td>Word (alphabetics, digits, underscore)</td></tr>
				<tr><td><tt>\D</tt>, <tt>\S</tt>, <tt>\W</tt></td>
					<td>Opposite of <tt>\d</tt>, <tt>\s</tt>, 
					and <tt>\w</tt>, respectively</td></tr>
			</table>
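			<p>Note that in many engines these classes are Unicode-aware: 
				given Unicode input, they may also match non-ASCII digits, 
				whitespace, and letters. Other engines stick to ASCII 
				unless told otherwise.</p>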
		</body>
	</slide>
	<slide>
		<title>More Character Classes</title>
		<body>
			<p>Some of the most powerful character classes are the ones 
				you create for yourself. Surround the characters to match 
				with square brackets:</p>
			<table>
				<tr><td><tt>[abcd]</tt></td>
					<td>Match "a", "b", "c", or "d"</td></tr>
				<tr><td><tt>[\d\s]</tt></td>
					<td>Match any digit or whitespace</td></tr>
			</table>
			<p>You can also use character ranges, or invert the entire 
				character class:</p>
			<table>
				<tr><td><tt>[A-Zaeiou]</tt></td>
					<td>Match any uppercase letter or lowercase English vowel</td></tr>
				<tr><td><tt>[^_]</tt></td>
					<td>Match anything but an underscore</td></tr>
				<tr><td><tt>[^0-9A-Fa-f]</tt></td>
					<td>Match any character that is not a hexadecimal digit</td></tr>
			</table>
		</body>
	</slide>
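	<slide>
		<title>Character Classes: Trying It Out</title>
		<body>
			<p>To see which characters a class picks out, list every match 
				in a sample string. This sketch assumes Python's 
				<tt>re</tt> module; the sample text and variable names are 
				made up.</p>
			<code>
```python
import re

text = "Regex Fun_2024"

# [A-Zaeiou]: any uppercase letter or lowercase vowel
caps_or_vowels = re.findall(r"[A-Zaeiou]", text)
# [^0-9A-Fa-f]: any character that is not a hexadecimal digit
non_hex = re.findall(r"[^0-9A-Fa-f]", text)
# [\d\s]: any digit or whitespace character
digits_or_space = re.findall(r"[\d\s]", text)

print(caps_or_vowels)
print(non_hex)
print(digits_or_space)
```
			</code>
			<p>Note that "e" and "F" are missing from the second list: both 
				are hexadecimal digits.</p>
		</body>
	</slide>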
	<slide>
		<title>Pop Quiz</title>
		<body>
			<p>Have you been paying attention? What will these match?</p>
			<ul>
				<li><code>\w+ \w+</code></li>
				<li><code>a.c.e*</code></li>
				<li><code>dir?e</code></li>
				<li><code>\d{1,3},\d{3},\d{3}</code></li>
				<li><code>a?\s+a?</code></li>
				<li><code>[A-Z][a-z]+</code></li>
			</ul>
		</body>
	</slide>
	<slide>
		<title>Boundary Matches</title>
		<body>
			<p>These special patterns match positions rather than 
				characters:</p>
			<table>
				<tr><td><tt>^</tt></td>
					<td>Match the beginning of the input</td></tr>
				<tr><td><tt>$</tt></td>
					<td>Match the end of the input</td></tr>
				<tr><td><tt>\b</tt></td>
					<td>Match a "word boundary" - between a word character 
						and a non-word character</td></tr>
				<tr><td><tt>\B</tt></td>
					<td>Match a "non-word boundary" - between two word or 
						two non-word characters</td></tr>
			</table>
		</body>
	</slide>
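	<slide>
		<title>Boundary Matches: Trying It Out</title>
		<body>
			<p>Boundaries are easiest to understand by counting matches 
				with and without them. A small sketch, assuming Python's 
				<tt>re</tt> module; the sample text is invented.</p>
			<code>
```python
import re

text = "cat concatenate bobcat cat"

# Without boundaries, "cat" is found inside longer words too.
anywhere = re.findall(r"cat", text)
# \b on both sides keeps only the standalone word.
whole_words = re.findall(r"\bcat\b", text)
# ^ anchors the match to the beginning of the input.
at_start = re.findall(r"^cat", text)

print(len(anywhere), len(whole_words), len(at_start))
```
			</code>
			<p>The plain pattern finds four matches, the bounded one two, 
				and the anchored one a single match.</p>
		</body>
	</slide>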
	<slide>
		<title>Options</title>
		<body>
			<p>These options are hacks that alter the regular expression 
				language:</p>
			<table>
				<tr><th>Option</th><th>Description</th></tr>
				<tr><td><tt>m</tt></td>
					<td><tt>^</tt> and <tt>$</tt> match the beginning and 
						end of any line, not the entire string</td></tr>
				<tr><td><tt>s</tt></td>
					<td><tt>.</tt> does not exclude newlines</td></tr>
			</table>
		</body>
	</slide>
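	<slide>
		<title>Options: Trying It Out</title>
		<body>
			<p>How options are switched on varies by language. In Python, 
				which this sketch assumes, the <tt>m</tt> and <tt>s</tt> 
				options are the flags <tt>re.MULTILINE</tt> and 
				<tt>re.DOTALL</tt>. The sample text and variable names are 
				made up.</p>
			<code>
```python
import re

text = "first line\nsecond line"

# Without the m option, ^ only matches at the start of the whole input.
default_starts = re.findall(r"^\w+", text)
# With the m option, ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Without the s option, . stops at the newline.
default_dot = re.match(r".*", text).group(0)
# With the s option, . matches the newline as well.
dotall_dot = re.match(r".*", text, re.DOTALL).group(0)

print(default_starts, multiline_starts)
print(repr(default_dot), repr(dotall_dot))
```
			</code>
		</body>
	</slide>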
	<slide>
		<title>Extracting Sub-Patterns</title>
		<body>
			<ul>
				<li>The part of the input that matched a sub-pattern can be 
					extracted</li>
				<li>Usually, sub-patterns come out as an array</li>
			</ul>
			<p>Example:</p>
			<ul>
				<li>Regex: 
					<tt>href="([^"]+)"&gt;([^&lt;]+)&lt;</tt>
					</li>
				<li>Input: <tt>&lt;a href="http://karlv.net/"&gt;Click 
					Here&lt;/a&gt;</tt></li>
				<li>Sub-pattern matches: "<tt>http://karlv.net/</tt>", 
					"<tt>Click Here</tt>"</li>
			</ul>
		</body>
	</slide>
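	<slide>
		<title>Extracting Sub-Patterns: Trying It Out</title>
		<body>
			<p>Here is the link example as a runnable sketch, assuming 
				Python's <tt>re</tt> module. The pattern uses 
				<tt>[^"]+</tt> (one or more non-quote characters) for the 
				address and <tt>[^&lt;]+</tt> for the link text. The 
				variable names are made up.</p>
			<code><![CDATA[
```python
import re

html = '<a href="http://karlv.net/">Click Here</a>'
pattern = re.compile(r'href="([^"]+)">([^<]+)<')

match = pattern.search(html)
if match:
    # groups() returns the sub-pattern matches as a tuple
    url, label = match.groups()
    print(url, label)
```
]]></code>
			<p>In Python the sub-pattern matches come out as a tuple rather 
				than an array, and <tt>match.group(0)</tt> holds the text 
				matched by the whole pattern.</p>
		</body>
	</slide>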
	<slide>
		<title>Escaping Metacharacters</title>
		<body>
			<ul>
				<li>Metacharacters are those that do something special</li>
				<li>Escape them with <tt>\</tt></li>
				<li>In engines that support it (Perl, Java, and PCRE, for 
					example), you can escape a whole run of characters by 
					surrounding it with <tt>\Q</tt> and <tt>\E</tt></li>
			</ul>
			<p>For example:</p>
			<ul>
				<li><tt>\.</tt> matches a period</li>
				<li><tt>\+</tt> matches a plus sign</li>
				<li><tt>\Q129.21.60.1\E</tt> matches 
					<tt>"129.21.60.1"</tt></li>
			</ul>
		</body>
	</slide>
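	<slide>
		<title>Escaping Metacharacters: Trying It Out</title>
		<body>
			<p>Python's <tt>re</tt> module does not understand 
				<tt>\Q</tt> and <tt>\E</tt>, but its <tt>re.escape</tt> 
				function does a similar job for a literal string. A sketch 
				under that assumption; the variable names are made up.</p>
			<code>
```python
import re

address = "129.21.60.1"

# Unescaped, each "." matches any character, so this pattern is too loose.
loose = re.compile(address)
# re.escape backslash-escapes the metacharacters in a literal string.
strict = re.compile(re.escape(address))

print(bool(loose.fullmatch("129x21y60z1")))   # the dots matched x, y and z
print(bool(strict.fullmatch("129x21y60z1")))  # escaped dots only match "."
print(bool(strict.fullmatch("129.21.60.1")))
```
			</code>
		</body>
	</slide>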
	<slide>
		<title>Greed</title>
		<body>
			<ul>
				<li>By default, quantifiers are <em>greedy</em>: each one 
				matches as many characters as it can while still letting 
				the overall pattern match.</li>
				<li>Put <tt>?</tt> after a quantifier to make it 
					non-greedy.</li>
			</ul>
			<p>For example: suppose you want to extract the first paragraph 
				in an HTML document.</p>
			<ul>
				<li><tt>&lt;p&gt;(.*)&lt;/p&gt;</tt> won't work if there 
					are multiple paragraphs</li>
				<li><tt>&lt;p&gt;(.*?)&lt;/p&gt;</tt> stops at the first 
					closing tag (add the <tt>s</tt> option if the paragraph 
					spans several lines)</li>
			</ul>
			<p>The greedy match covers all the characters from the 
				opening tag of the first paragraph to the closing tag of 
				the last paragraph.</p>
		</body>
	</slide>
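	<slide>
		<title>Greed: Trying It Out</title>
		<body>
			<p>The difference between the greedy and non-greedy star shows 
				up as soon as there are two paragraphs. A minimal sketch, 
				assuming Python's <tt>re</tt> module and a made-up, 
				single-line HTML fragment.</p>
			<code><![CDATA[
```python
import re

html = "<p>First paragraph.</p><p>Second paragraph.</p>"

greedy = re.search(r"<p>(.*)</p>", html).group(1)
lazy = re.search(r"<p>(.*?)</p>", html).group(1)

print(greedy)  # runs on through to the last closing tag
print(lazy)    # stops at the first closing tag
```
]]></code>
			<p>For real-world HTML, with attributes on tags and nesting, 
				an HTML parser is usually a safer tool than a regex.</p>
		</body>
	</slide>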
	<slide>
		<title>Alternation and Grouping</title>
		<body>
			<p>Use the alternation operator to separate two patterns, 
				either of which may match:</p>
			<dl>
				<dt><tt>^(a+b+c+|d{4,6})$</tt></dt>
				<dd>Matches "abc" and "dddd" but not "abcdddd"</dd>
				<dt><tt>^(cat|dog)$</tt></dt>
				<dd>Matches "cat" or "dog"</dd>
			</dl>
			<p>Use parentheses to surround sub-patterns. This also affects 
				the reach of the alternation operator:</p>
			<dl>
				<dt><tt>^(a|b)(c|d)$</tt></dt>
				<dd>Matches "ac", "ad", "bc", or "bd"</dd>
				<dt><tt>^(cat|dog)s$</tt></dt>
				<dd>Matches "cats" or "dogs"</dd>
			</dl>
		</body>
	</slide>
	<slide>
		<title>Validation</title>
		<body>
			<p>Whenever your program accepts input:</p>
			<ul>
				<li>You should know what inputs are valid.</li>
				<li>A regex can easily check inputs for validity.</li>
				<li>Your validation regex must:
					<ol>
						<li>Match any valid input</li>
						<li>Not match any invalid input</li>
					</ol>
				</li>
				<li>Sometimes, it is easier to match all invalid inputs 
					instead</li>
			</ul>
			<p><em>Do not risk the security of your software 
				on my regular expressions! I take no responsibility.</em></p>
		</body>
	</slide>
	<slide>
		<title>Validation Example: Integers</title>
		<body>
			<table>
				<tr><th>Valid inputs</th><th>Regex</th></tr>
				<tr><td>Non-negative integers</td>
					<td><tt>^[0-9]+$</tt></td></tr>
				<tr><td>Positive integers</td>
					<td><tt>^[1-9]+[0-9]*$</tt></td></tr>
				<tr><td>Integers</td>
					<td><tt>^-?[0-9]+$</tt></td></tr>
				<tr><td>Integers (better)</td>
					<td><tt>^(-?[1-9]+[0-9]*|0)$</tt></td></tr>
			</table>
			<ul>
			<li>Why is the last example better?</li>
				<!-- it prevents leading zeroes -->
			<li>What do these examples have in common?</li>
				<!-- the answer: ^ ... $ -->
			</ul>
		</body>
	</slide>
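	<slide>
		<title>Validation Example: Trying It Out</title>
		<body>
			<p>The four integer regexes can be compared side by side on a 
				handful of inputs. This is a sketch, assuming Python's 
				<tt>re</tt> module; the constant names and the 
				<tt>is_valid</tt> helper are invented for illustration.</p>
			<code>
```python
import re

NON_NEGATIVE = re.compile(r"^[0-9]+$")
POSITIVE = re.compile(r"^[1-9]+[0-9]*$")
INTEGER = re.compile(r"^-?[0-9]+$")
STRICT_INTEGER = re.compile(r"^(-?[1-9]+[0-9]*|0)$")

def is_valid(pattern, text):
    """True if text passes the given validation regex (hypothetical helper)."""
    return pattern.search(text) is not None

for text in ["42", "0", "-7", "007", "-0", "4x2", ""]:
    print(repr(text), is_valid(INTEGER, text), is_valid(STRICT_INTEGER, text))
```
			</code>
			<p>One Python-specific caveat: <tt>$</tt> also matches just 
				before a trailing newline, so "42" followed by a newline 
				passes these checks. Using <tt>re.fullmatch</tt> is 
				stricter.</p>
		</body>
	</slide>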
	<slide>
		<title>Substitution</title>
		<body>
			<ul>
				<li>Substitution replaces the matching part of the input</li>
				<li>When giving the replacement, you can refer to 
					sub-pattern matches by number: <tt>\1</tt>, <tt>\2</tt>, 
					and so on (some languages write <tt>$1</tt> instead)</li>
			</ul>
			<p>For example:</p>
			<ul>
				<li>Regex: <tt>\b(cat|dog|bird|lunatic)\b</tt></li>
				<li>Replacement: <tt>\1s</tt></li>
				<li><tt>"cat"</tt> becomes <tt>"cats"</tt></li>
				<li><tt>"lunatic"</tt> becomes <tt>"lunatics"</tt></li>
			</ul>
			<p>Another example:</p>
			<ul>
				<li>Regex: <tt>\b(\w+)\s+(\w+)\b</tt></li>
				<li>Replacement: <tt>\2, \1</tt></li>
				<li><tt>"Karl Voelker"</tt> becomes 
					<tt>"Voelker, Karl"</tt></li>
			</ul>
		</body>
	</slide>
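	<slide>
		<title>Substitution: Trying It Out</title>
		<body>
			<p>Both substitution examples can be run as written in a 
				language that uses backslash references. A sketch assuming 
				Python's <tt>re.sub</tt>; the input sentence and variable 
				names are made up.</p>
			<code>
```python
import re

# \1 in the replacement refers to the first parenthesized sub-pattern.
plural = re.sub(r"\b(cat|dog|bird|lunatic)\b", r"\1s", "one cat, one lunatic")

# \2 and \1 swap the two captured words.
swapped = re.sub(r"\b(\w+)\s+(\w+)\b", r"\2, \1", "Karl Voelker")

print(plural)
print(swapped)
```
			</code>
			<p><tt>re.sub</tt> replaces every match in the input, not just 
				the first one.</p>
		</body>
	</slide>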
	<slide>
		<title>In Perl</title>
		<body>
			<code>
my $input = "some data from somewhere";
my @extracts = $input =~ /regex (with) (parens)/;
			</code>
			<p>How to learn more:</p>
			<ul>
				<li>Run <tt>perldoc perlreref</tt></li>
				<li>Read <a 
				href='http://search.cpan.org/~rgarcia/perl/pod/perlreref.pod'>
				this page</a></li>
			</ul>
		</body>
	</slide>
	<slide>
		<title>In Ruby</title>
		<body>
			<code>
input = "some data from somewhere"
extracts = /regex (with) (parens)/.match(input)
			</code>
			<p>Learn more at <a 
			href='http://www.ruby-doc.org/core/classes/Regexp.html'>
			this page</a></p>
		</body>
	</slide>
	<slide>
		<title>In Python</title>
		<body>
			<code>
import re
input = "some data from somewhere"
extracts = re.compile("regex (with) (parens)").search(input)
			</code>
			<p>Learn more at <a 
			href='http://docs.python.org/lib/module-re.html'>
			this page</a></p>
		</body>
	</slide>
	<slide>
		<title>In C++</title>
		<body>
			<p>You need to have the Boost.Regex library installed 
				to use this example.</p>
			<code>
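#include &lt;boost/regex.hpp&gt;
using namespace boost;
const char *input = "some input from somewhere";
cmatch m;
regex r("regex (with) (parens)");
// regex_search finds a match anywhere in the input;
// regex_match would require the whole input to match.
regex_search(input, m, r);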
			</code>
			<p>Learn more at <a 
			href='http://www.boost.org/libs/regex/doc/index.html'>
			this page</a></p>
		</body>
	</slide>
	<slide>
		<title>In Java</title>
		<body>
			<code>
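import java.util.regex.*;
String input = "some data from somewhere";
Pattern p = Pattern.compile("regex (with) (parens)");
Matcher m = p.matcher(input);
boolean found = m.find(); // if true, m.group(1) and m.group(2) hold the extracts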
			</code>
			<p>Learn more at <a href='http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html'>this page</a></p>
		</body>
	</slide>
	<slide>
		<title>In C#</title>
		<body>
			<code>
using System.Text.RegularExpressions;
String input = "some data from somewhere";
Match extracts = Regex.Match(input, "regex (with) (parens)");
			</code>
			<p>Learn more at <a href=
			'http://msdn2.microsoft.com/en-us/library/aa719739(VS.71).aspx'>
			this page</a></p>
		</body>
	</slide>
	<slide>
		<title>In JavaScript</title>
		<body>
			<code>
var input = "some data from somewhere";
var extracts = /regex (with) (parens)/.exec(input);
			</code>
			<p>Learn more at <a 
			href='http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Regular_Expressions'>this page</a></p>
		</body>
	</slide>
	<slide>
		<title>In UNIX</title>
		<body>
			<code>
echo "some data from somewhere" | grep -E 'regex (with) (parens)'
			</code>
			<p>To learn more, run <tt>man grep</tt>.</p>
		</body>
	</slide>
	<slide>
		<title>In PHP :(</title>
		<body>
			<code>
$input = "some data from somewhere";
preg_match("/regex (with) (parens)/", $input, $extracts);
			</code>
			<p>Learn more at <a 
			href='http://us2.php.net/manual/en/ref.pcre.php'>this page</a></p>
		</body>
	</slide>
	<slide>
		<title>The End</title>
		<body>
			<p>Please check around your seats for brain matter before 
				leaving.</p>
		</body>
	</slide>
</slideshow>
