Regex and sed
Regular expressions (or regexes) are a concise way to describe patterns of text. While they have a more precise meaning in theoretical CS, software engineers tend to use the term quite broadly. First, we’ll learn how to use regexes to find matches with grep.
grep syntax
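To use regular expressions in grep, we’ll want to use the -E flag, which turns on extended regular expression syntax (the syntax used throughout these notes). You can still combine it with other flags (e.g. -i for case-insensitive matches).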
Many characters in grep have special meanings; these are called metacharacters. One that we’ve already learned about is the period (.), which matches any single character, so this will highlight every character in a file:
grep -E "." file.txt
In contrast, a longer pattern matches only the parts of lines that fit the entire pattern. For example, this regular expression matches parts of lines that start with hello, continue with any single character, and end with world (such as hello world or hello-world).
grep -E "hello.world" file.txt
You can escape metacharacters with a backslash (\); to match a literal period (followed by com), try:
grep -E "\.com" file.txt
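To see why the escape matters, here is a small sketch that runs both patterns over three made-up lines (the sample words are assumptions for illustration, not taken from any course file). Only the escaped pattern is limited to a literal period.

```shell
# Hypothetical sample input: only the first line contains a literal ".com".
input=$(printf '%s\n' 'example.com' 'telecom' 'welcome')

# Unescaped: "." matches any character, so "ecom" and "lcom" match too.
unescaped=$(printf '%s\n' "$input" | grep -E '.com')

# Escaped: "\." matches only a literal period.
escaped=$(printf '%s\n' "$input" | grep -E '\.com')

echo "unescaped:"; echo "$unescaped"
echo "escaped:";   echo "$escaped"
```

The patterns are in single quotes here so the shell passes the backslash through untouched; the double-quoted form used above behaves the same for these patterns.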
Anchors
Next, we learned about “anchors”: special characters that match the beginning or end of a line (^ and $) or of a word (\< and \>).
Anchor | What it matches | Example |
---|---|---|
^ | Start of line | ^cat matches cat and caterpillar, but not orange cat or (cat) |
$ | End of line (not including newline) | cat$ matches cat and tomcat, but not cat. or cat! |
\< | Start of word | \<cat matches caterpillar fur and here cat here, but not tomcats |
\> | End of word | cat\> matches brown tomcat and muscat, but not tomcats rock |
You can combine anchors; using ^ and $ together is helpful for matching entire lines.
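As a rough illustration of combining anchors, the sketch below filters four made-up lines (assumed sample data) with a whole-line pattern and a whole-word pattern. It assumes GNU grep, which supports the \< and \> word anchors.

```shell
# Hypothetical sample lines.
input=$(printf '%s\n' 'cat' 'caterpillar' 'orange cat' 'tomcats rock')

# ^cat$ matches only lines that are exactly "cat".
whole_line=$(printf '%s\n' "$input" | grep -E '^cat$')

# \<cat\> matches "cat" only when it stands alone as a word.
whole_word=$(printf '%s\n' "$input" | grep -E '\<cat\>')

echo "whole line:"; echo "$whole_line"
echo "whole word:"; echo "$whole_word"
```

The first pattern keeps only the line cat; the second also keeps orange cat, but not caterpillar or tomcats rock.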
Alternating and repeating characters
Syntax | What it matches | Example |
---|---|---|
\| | Either pattern (to left or right) | com\|edu matches either com or edu |
* | 0 or more copies of the character or group before it | 0* matches the empty string, 0, 00, 000, … |
+ | 1 or more copies of the character or group before it | 1+ matches 1, 11, 111, … |
? | 0 or 1 copies of the character or group before it | 2? matches the empty string or 2 |
() | Group characters together as a single unit (capture group) | (01)+ matches 01, 0101, … |
Note that using * is dangerous: because it also accepts zero copies, a pattern such as 0* matches the empty string, and therefore matches every line (including lines you may not want to match).
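The difference between * and + is easiest to see by counting matching lines. The sketch below uses five made-up lines (assumed sample data) and grep's -c flag, which prints the number of matching lines instead of the lines themselves.

```shell
# Hypothetical sample lines.
input=$(printf '%s\n' 'abc' '100' 'site.edu' 'site.com' 'site.org')

# 0* can match zero characters, so every line counts as a match.
star_count=$(printf '%s\n' "$input" | grep -E -c '0*')

# 0+ needs at least one 0, so only "100" matches.
plus_count=$(printf '%s\n' "$input" | grep -E -c '0+')

# Grouping plus alternation: lines ending in .com or .edu.
domains=$(printf '%s\n' "$input" | grep -E '\.(com|edu)$')

echo "lines matching 0*: $star_count"
echo "lines matching 0+: $plus_count"
echo "$domains"
```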
Character sets
The [] syntax creates a character set, which matches any one of the characters between the [ and ]. For example, the following two commands are equivalent:
grep -E "(a|b|c|d|e)"
grep -E "[abcde]"
Character sets support special syntax with - (“ranges”) and ^ (negation):
Character set | Description |
---|---|
[A-Z] | All uppercase alphabet characters |
[a-z] | All lowercase alphabet characters |
[0-9] | All digits |
[A-Za-z] | All uppercase or lowercase alphabet characters |
[^a] | All characters that are not a |
[^a-z] | All characters that are not lowercase alphabet characters |
Note that ^ inside a character set has a different meaning than the start-of-line anchor ^. In addition, apart from ^ and -, regex metacharacters do not have their special meanings inside []; for example, [.?!] matches one of ., ?, or !, not any character.
Using ^ and - in character sets is a bit tricky. To quote grep’s man page:
To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.
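The placement rules above can be tried out with a short sketch. The sample lines are made up for illustration, and setting LC_ALL=C is an assumption added here so that a range like a-z means plain ASCII lowercase letters regardless of the system locale.

```shell
# Assumption: force the C locale so ranges follow ASCII order.
export LC_ALL=C

# Hypothetical sample lines.
input=$(printf '%s\n' 'Wow!' 'ok' 'a-b' 'x^y' 'Really?' '2024')

# Inside [], ".", "?" and "!" are literal: lines ending in one of them.
ends_with_punct=$(printf '%s\n' "$input" | grep -E '[.?!]$')

# Literal ^ (not first) and literal - (last) inside a set.
caret_or_dash=$(printf '%s\n' "$input" | grep -E '[+^-]')

# Negated range: lines made up only of characters that are not a-z.
no_lowercase=$(printf '%s\n' "$input" | grep -E '^[^a-z]+$')

echo "$ends_with_punct"
echo "$caret_or_dash"
echo "$no_lowercase"
```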
Occurrence ranges
The {} syntax matches the previous character (or group) a specific number of times.
Syntax | Description |
---|---|
{n} | Matches the previous character exactly n times |
{,n} | Matches the previous character up to n times, inclusive |
{a,b} | Matches the previous character between a and b times, inclusive |
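Here is a minimal sketch of the three forms in the table, run over made-up input (assumed sample data). The patterns are anchored with ^ and $ because an unanchored a{2,3} would also match inside a longer run such as aaaa. The {,n} form is a GNU extension, so this assumes GNU grep.

```shell
# {n}: exactly three digits, a dash, exactly four digits.
exact=$(printf '%s\n' '555-1234' '55-1234' '555-12345' |
  grep -E '^[0-9]{3}-[0-9]{4}$')

# {a,b}: between two and three copies of "a".
between=$(printf '%s\n' 'a' 'aa' 'aaa' 'aaaa' | grep -E '^a{2,3}$')

# {,n}: at most two copies of "a" (GNU extension).
at_most=$(printf '%s\n' 'a' 'aa' 'aaa' 'aaaa' | grep -E '^a{,2}$')

echo "$exact"
echo "$between"
echo "$at_most"
```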
Backreferences
Backreferences let you capture patterns and look for them later. They work with capture groups (()) and are one-indexed: \1 refers to the text matched by the first group, \2 to the second, and so on.
For example, if we wanted to match lines containing three characters (such as a three-letter word), a space, and then the same three characters again, we would do:
grep -E "(...) \1"
Backreferences only match the exact same characters that the group captured; the above example does not match a three-letter word followed by a space and then a different three-letter word.
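Because . matches any character, the pattern above is looser than "a repeated three-letter word". The sketch below compares it with a stricter variant that is an assumption of this example, not part of the lecture: it restricts the group to letters and adds the GNU word anchors. The input lines are made up.

```shell
# Hypothetical sample lines.
input=$(printf '%s\n' 'the cat cat sat' 'the cat dog sat' 'abcd bcd')

# Loose: any three characters, a space, then the same three characters.
loose=$(printf '%s\n' "$input" | grep -E '(...) \1')

# Stricter (illustrative): a whole three-letter word repeated as a whole word.
strict=$(printf '%s\n' "$input" | grep -E '\<([A-Za-z]{3}) \1\>')

echo "loose:";  echo "$loose"
echo "strict:"; echo "$strict"
```

The loose pattern also accepts abcd bcd, since bcd is followed by a space and bcd; the stricter one keeps only the cat cat sat. Neither matches the cat dog sat.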
sed
If grep is a fancy “find” of the command line, sed (short for stream editor) is the “find-and-replace” of the command line.
We will always use sed with the -r flag. The general syntax looks like
sed -r 's/REGEX/TEXT/'
- REGEX is a pattern or regular expression that we want to match
  - this is the same syntax as with grep -E, which makes grep helpful to test with!
- TEXT stands for the text that we want to replace the matched text with
  - outside of backreferences (and &, which stands for the whole matched text), special characters here are interpreted literally: they do not have their regular expression meaning
- the -r flag turns on extended regular expressions
- sed takes its input from one or more files or from standard input
For example,
sed -r 's/UW/University of Washington/' schools.txt
would replace the first instance of UW with University of Washington on each line of schools.txt.
You can add a g after the last / to do a “global” replace, which replaces every instance on each line, not just the first one.
sed -r 's/UW/University of Washington/g' schools.txt
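To compare the two forms side by side, this sketch feeds one made-up line (an assumption; it is not the contents of schools.txt) through sed on standard input, with and without g.

```shell
# Hypothetical input line with two occurrences of UW.
line='UW and UW'

# Without g: only the first occurrence on the line is replaced.
first_only=$(echo "$line" | sed -r 's/UW/University of Washington/')

# With g: every occurrence on the line is replaced.
every=$(echo "$line" | sed -r 's/UW/University of Washington/g')

echo "$first_only"
echo "$every"
```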
Since / separates the parts of the s command, a literal / inside REGEX or TEXT has to be escaped as \/.
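A short sketch of escaping slashes, using a made-up path as input. The second command goes beyond the lecture: sed lets the s command use a different delimiter character, which avoids the escaping altogether; # is chosen here only for illustration.

```shell
# Hypothetical input line containing slashes.
line='path: /usr/bin'

# Escape each literal / as \/ when / is the delimiter.
escaped=$(echo "$line" | sed -r 's/\/usr\/bin/\/usr\/local\/bin/')

# Alternative (not covered above): use # as the delimiter instead of /.
alt_delim=$(echo "$line" | sed -r 's#/usr/bin#/usr/local/bin#')

echo "$escaped"
echo "$alt_delim"
```

Both commands print path: /usr/local/bin.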
In-place changes
By default, sed writes the changed text to standard output and does not edit the original file.
You can change this behaviour with the -i flag, which edits the file in place. We will always give -i a file extension as its argument (BSD and macOS sed require one; GNU sed treats it as optional); sed saves a backup of the original file, with this extension added to its name, before making your changes.
For example,
sed -ri.bak 's/cats/dogs/' best_animals.txt
This will:
- first, make a backup file called best_animals.txt.bak
- then, replace the first instance of cats with dogs in each line of best_animals.txt
- not output anything to standard output
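The three steps can be checked in a throwaway directory. This is a sketch assuming GNU sed (for the combined -ri.bak form); the file contents are made up, and mktemp -d is used only to avoid touching real files.

```shell
# Work in a temporary directory with a hypothetical two-line file.
workdir=$(mktemp -d)
printf '%s\n' 'cats are great' 'cats chase cats' > "$workdir/best_animals.txt"

# Edit in place, keeping a .bak backup; capture anything sent to stdout.
stdout=$(sed -ri.bak 's/cats/dogs/' "$workdir/best_animals.txt")

edited=$(cat "$workdir/best_animals.txt")
backup=$(cat "$workdir/best_animals.txt.bak")

echo "stdout was: '$stdout'"
echo "edited file:"; echo "$edited"
echo "backup file:"; echo "$backup"

# Clean up the temporary directory.
rm -r "$workdir"
```

The second line of the edited file still contains one cats, because there is no g flag.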
sed Backreferences
sed becomes particularly powerful with backreferences: we can now edit lines depending on what we captured. In the pre-lecture, we saw the example artists.txt:
Duckworth, Kendrick Lamar
Swift, Taylor Alison
Grande-Butera, Ariana
Ma, Yo-Yo
Bryan, Zachary Lane
Cottrill, Claire Elizabeth
Graham, Aubrey Drake
Amstutz, Kayleigh Rose
Jónsdóttir, Laufey Lín Bing
We can reformat this to put each artist’s first and middle names before their last name with:
sed -r 's/^(.*), (.*)$/\2 \1/' artists.txt
Giving us:
Kendrick Lamar Duckworth
Taylor Alison Swift
Ariana Grande-Butera
Yo-Yo Ma
Zachary Lane Bryan
Claire Elizabeth Cottrill
Aubrey Drake Graham
Kayleigh Rose Amstutz
Laufey Lín Bing Jónsdóttir
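The same idea extends to more than two groups. As a further sketch, with made-up dates that are not part of artists.txt, the command below reorders year-month-day into month/day/year using three capture groups, and escapes the literal / in the replacement.

```shell
# Hypothetical input: ISO-style dates, one per line.
dates=$(printf '%s\n' '2024-01-15' '1999-12-31')

# \1 = year, \2 = month, \3 = day; each literal / is escaped as \/.
reformatted=$(printf '%s\n' "$dates" |
  sed -r 's/^([0-9]{4})-([0-9]{2})-([0-9]{2})$/\2\/\3\/\1/')

echo "$reformatted"
```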