Regex and sed

Regular expressions (or regexes) are a concise way to describe patterns of text. While they have a more precise meaning in theoretical CS, software engineers tend to use the term quite broadly. First, we’ll learn about how to use regex to find matches using grep.

grep syntax

To use the regular expression syntax described here in grep, we’ll want the -E flag, which turns on extended regular expressions. You can still use other flags (e.g. -i for case-insensitive matches).

Many characters in grep have special meanings - these are called metacharacters. One that we’ve already learned about is ., which matches any character; this will highlight all characters in a file:

grep -E "." file.txt

In contrast, a longer pattern only matches parts of lines that match the entire pattern. For example, this regular expression matches parts of lines that consist of hello, then any single character, then world.

grep -E "hello.world" file.txt

You can escape metacharacters with \; to match a literal period (followed by com), try:

grep -E "\.com" file.txt
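As a quick check of the difference between . and \., here is a small sketch. The sample lines are made up for illustration, and the patterns are written in single quotes so the shell passes them to grep untouched.

```shell
# Made-up sample input (not from the lecture).
sample='example.com
example-com
telecom'

# Unescaped "." matches any character, so ".com", "-com" and "ecom" all match.
any_char=$(printf '%s\n' "$sample" | grep -E '.com')

# Escaped "\." only matches a literal period.
literal_dot=$(printf '%s\n' "$sample" | grep -E '\.com')

printf 'any character:\n%s\n\nliteral period:\n%s\n' "$any_char" "$literal_dot"
```

The first pattern should keep all three lines, while the second should keep only example.com.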

Anchors

Next, we learned about “anchors”: special characters that match the beginning or end of a line (^ and $) or a word (\< and \>).

Anchor	What it matches	Example
`^`	Start of line	`^cat` matches `cat` and `caterpillar`, but not `orange cat` or `(cat)`
`$`	End of line (not including newline)	`cat$` matches `cat` and `tomcat`, but not `cat.` or `cat!`
`\<`	Start of word	`\<cat` matches `caterpillar fur` and `here cat here`, but not `tomcats`
`\>`	End of word	`cat\>` matches `brown tomcat` and `muscat`, but not `tomcats rock`

You can combine anchors together; using ^ and $ together is helpful for matching entire lines.
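To see the anchors side by side, the sketch below runs four patterns over the same made-up word list. It assumes GNU grep, since \< and \> are GNU extensions.

```shell
# Made-up sample input (not from the lecture).
words='cat
caterpillar
orange cat
tomcats rock'

starts=$(printf '%s\n' "$words" | grep -E '^cat')         # cat, caterpillar
ends=$(printf '%s\n' "$words" | grep -E 'cat$')           # cat, orange cat
whole_line=$(printf '%s\n' "$words" | grep -E '^cat$')    # cat
whole_word=$(printf '%s\n' "$words" | grep -E '\<cat\>')  # cat, orange cat

printf '^cat:\n%s\n\ncat$:\n%s\n\n^cat$:\n%s\n\n\\<cat\\>:\n%s\n' \
  "$starts" "$ends" "$whole_line" "$whole_word"
```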

Alternating and repeating characters

Syntax	What it matches	Example
`|`	Either pattern (to left or right)	`com|edu` matches either `com` or `edu`
`*`	0 or more copies of the item before it (a character or a group)	`0*` matches the empty string, `0`, `00`, `000`, …
`+`	1 or more copies of the item before it	`1+` matches `1`, `11`, `111`, …
`?`	0 or 1 copies of the item before it	`2?` matches the empty string or `2`
`()`	Group characters together as one item (capture group)	`(01)+` matches `01`, `0101`, …

Note that using * is dangerous: because it allows zero copies, a pattern like 0* also matches the empty string, so on its own it matches every line (including lines you may not want to match).
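The sketch below tries each of these operators on made-up input, assuming GNU grep with -E (where alternation is written as a plain |). The last command counts matching lines to show how a starred pattern with nothing else around it matches everything.

```shell
# Made-up sample input (not from the lecture).
lines='uw.edu
example.com
example.org
color
colour
colouur'

alt=$(printf '%s\n' "$lines" | grep -E 'com|edu')         # uw.edu, example.com
optional=$(printf '%s\n' "$lines" | grep -E '^colou?r$')  # color, colour
one_plus=$(printf '%s\n' "$lines" | grep -E '^colou+r$')  # colour, colouur

# The group (01) is repeated as a unit, so 0110 does not match.
grouped=$(printf '0101\n0110\n' | grep -E '^(01)+$')      # 0101

# "z*" allows zero copies of z, so every one of the 6 lines matches.
zero_plus=$(printf '%s\n' "$lines" | grep -cE 'z*')       # 6

printf '%s\n---\n%s\n---\n%s\n---\n%s\n---\n%s\n' \
  "$alt" "$optional" "$one_plus" "$grouped" "$zero_plus"
```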

Character sets

The [] syntax creates a character set, which matches any one of the characters between the [ and ]. For example, the following two commands are equivalent:

grep -E "(a|b|c|d|e)"

grep -E "[abcde]"

Character sets support special syntax with - (“ranges”) and ^ (negation):

Character set	Description
`[A-Z]`	All uppercase alphabet characters
`[a-z]`	All lowercase alphabet characters
`[0-9]`	All digits
`[A-Za-z]`	All uppercase or lowercase alphabet characters
`[^a]`	All characters that are not `a`
`[^a-z]`	All characters that are not lowercase alphabet characters

Note that ^ here has a different meaning than the start anchor ^. In addition, outside of ^ and -, regex metacharacters do not have their special meanings inside []; for example, [.?!] matches one of ., ?, and !, not any character.

Using ^ and - in character sets is a bit tricky. To quote grep’s man page:

To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.
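Here is a small sketch of ranges, negation, and literal ^, - and ] inside a set. The sample tokens are made up, and the script sets LC_ALL=C as an assumption so that ranges like [a-z] follow plain ASCII order regardless of the machine's locale.

```shell
# Assumption: force the C locale so [a-z] means exactly the ASCII lowercase letters.
export LC_ALL=C

# Made-up sample input (not from the lecture).
tokens='a1
B2
c-3
[x]
d^4'

lower_digit=$(printf '%s\n' "$tokens" | grep -E '^[a-z][0-9]$')  # a1
not_lower=$(printf '%s\n' "$tokens" | grep -E '^[^a-z]')         # B2, [x]

# "." is literal inside [], "^" is literal because it is not first,
# and "-" is literal because it is last.
punct=$(printf '%s\n' "$tokens" | grep -E '[.^-]')               # c-3, d^4

# A literal "]" goes first in the set.
bracket=$(printf '%s\n' "$tokens" | grep -E '[]]')               # [x]

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
  "$lower_digit" "$not_lower" "$punct" "$bracket"
```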

Occurrence ranges

The {} syntax matches the previous character (or group) a specific number of times.

Syntax	Description
`{n}`	Matches the previous character exactly `n` times
`{,n}`	Matches the previous character up to `n` times, inclusive
`{a,b}`	Matches the previous character between `a` and `b` times, inclusive
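As an illustration, the sketch below filters a made-up list of numbers by how many digits they have. It assumes GNU grep, where the {,n} form is available as an extension.

```shell
# Made-up sample input (not from the lecture).
numbers='7
42
206
98195
2065551234'

exactly_three=$(printf '%s\n' "$numbers" | grep -E '^[0-9]{3}$')    # 206
up_to_two=$(printf '%s\n' "$numbers" | grep -E '^[0-9]{,2}$')       # 7, 42
three_to_five=$(printf '%s\n' "$numbers" | grep -E '^[0-9]{3,5}$')  # 206, 98195

printf '%s\n---\n%s\n---\n%s\n' "$exactly_three" "$up_to_two" "$three_to_five"
```

The ^ and $ anchors matter here: without them, [0-9]{3} would also match any three digits inside a longer number.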

Backreferences

Backreferences let you capture patterns and look for them later. They work with capture groups (()) and are one-indexed.

For example, if we wanted to match lines containing any three characters, a space, and then the same three characters again (such as a repeated three-letter word), we would do:

grep -E "(...) \1"

Backreferences only match the exact same characters as before; the above example does not match any three-letter word followed by a space and then another three-letter word.
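To make this concrete, here is a sketch on made-up phrases. Because ... matches any three characters, the first pattern also accepts the repeated "12:"; the second pattern is one possible stricter variant (our own, not from the lecture) that only accepts a repeated three-letter lowercase word, assuming GNU grep for \< and \>.

```shell
# Made-up sample input (not from the lecture).
phrases='bye bye birdie
hot dog stand
the the end
12: 12: ok'

# Any three characters, a space, then the same three characters.
loose=$(printf '%s\n' "$phrases" | grep -E '(...) \1')
# bye bye birdie, the the end, 12: 12: ok

# Hypothetical stricter variant: a whole three-letter lowercase word, repeated.
strict=$(printf '%s\n' "$phrases" | grep -E '\<([a-z]{3}) \1\>')
# bye bye birdie, the the end

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

Neither pattern matches "hot dog stand", since the second word is a different three-letter word.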

sed

If grep is a fancy “find” of the command line, sed (stands for stream editor) is the “find-and-replace” of the command line.

We will always use sed with the -r flag. The general syntax looks like

sed -r 's/REGEX/TEXT/'

REGEX is a pattern or regular expression that we want to match
- this is the same syntax as grep -E, which makes grep helpful to test with!
TEXT stands for the text that we want to replace the matched text with
- outside of backreferences (and &, which stands for the whole match), special characters here are interpreted literally: they do not have their regular expression meaning
the -r flag turns on extended regular expressions (the same syntax as grep -E)
sed takes its input from one or more files or from standard input

For example,

sed -r 's/UW/University of Washington/' schools.txt

would replace the first instance of UW on each line of schools.txt with University of Washington.

You can add a g after the last / to do a “global” replace, which replaces every instance - not just the first one per line.

sed -r 's/UW/University of Washington/g' schools.txt

Since / separates the parts of the s command, a literal / inside REGEX or TEXT must be escaped as \/.
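The following sketch compares the plain and global forms and shows an escaped /. It reads from standard input rather than schools.txt so it can be run anywhere; the input lines are made up, and GNU sed is assumed for -r.

```shell
# Made-up sample input (not from the lecture).
line='UW and UW Bothell'

# Without g, only the first UW on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -r 's/UW/University of Washington/')
# University of Washington and UW Bothell

# With g, every UW on the line is replaced.
every=$(printf '%s\n' "$line" | sed -r 's/UW/University of Washington/g')
# University of Washington and University of Washington Bothell

# A literal / in the pattern is written as \/.
path=$(printf '%s\n' 'home/user/notes' | sed -r 's/\// > /g')
# home > user > notes

printf '%s\n%s\n%s\n' "$first_only" "$every" "$path"
```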

In-place changes

By default, sed outputs changes to standard output but does not edit the original file.

You can change this behaviour by using the -i flag, which changes the file in place. -i takes a file extension as its argument (optional in GNU sed, required in BSD/macOS sed); sed first saves a backup copy of the original file under that extension, then makes your changes.

For example,

sed -ri.bak 's/cats/dogs/' best_animals.txt

first, makes a backup file called best_animals.txt.bak
then, replaces the first instance of cats with dogs in each line of best_animals.txt
does not output anything to standard output
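A way to try this safely is sketched below: it creates a throwaway copy of best_animals.txt with made-up contents in a temporary directory, runs the same command, and then inspects the edited file and the backup. It assumes GNU sed, where -ri.bak is read as -r plus -i with the .bak extension.

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
file="$workdir/best_animals.txt"

# Made-up file contents (not from the lecture).
printf 'cats are great, cats rule\nI like cats\n' > "$file"

# Edit in place, keeping the original as best_animals.txt.bak.
sed_output=$(sed -ri.bak 's/cats/dogs/' "$file")

edited=$(cat "$file")      # first "cats" on each line is now "dogs"
backup=$(cat "$file.bak")  # untouched original

printf 'stdout: [%s]\n\nedited:\n%s\n\nbackup:\n%s\n' "$sed_output" "$edited" "$backup"

rm -r "$workdir"
```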

sed Backreferences

sed becomes particularly powerful with backreferences: we can now edit lines depending on what we captured. In the pre-lecture, we saw the example artists.txt:

Duckworth, Kendrick Lamar
Swift, Taylor Alison
Grande-Butera, Ariana
Ma, Yo-Yo
Bryan, Zachary Lane
Cottrill, Claire Elizabeth
Graham, Aubrey Drake
Amstutz, Kayleigh Rose
Jónsdóttir, Laufey Lín Bing

We can reformat this to put each artist’s first and middle names before their last name with:

sed -r 's/^(.*), (.*)$/\2 \1/' artists.txt

Giving us:

Kendrick Lamar Duckworth
Taylor Alison Swift
Ariana Grande-Butera
Yo-Yo Ma
Zachary Lane Bryan
Claire Elizabeth Cottrill
Aubrey Drake Graham
Kayleigh Rose Amstutz
Laufey Lín Bing Jónsdóttir
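The same idea works with more than two capture groups. As a further sketch (the date format and sample lines are our own, not from the lecture), this rewrites MM/DD/YYYY dates as YYYY-MM-DD, combining occurrence ranges, escaped slashes, and backreferences. Lines that do not match the pattern pass through unchanged.

```shell
# One line from artists.txt, piped in directly.
swapped=$(printf '%s\n' 'Ma, Yo-Yo' | sed -r 's/^(.*), (.*)$/\2 \1/')
# Yo-Yo Ma

# Made-up sample input: \1 = month, \2 = day, \3 = year.
dates=$(printf '03/14/2024\nnot a date\n' \
  | sed -r 's/^([0-9]{2})\/([0-9]{2})\/([0-9]{4})$/\3-\1-\2/')
# 2024-03-14
# not a date

printf '%s\n%s\n' "$swapped" "$dates"
```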