CSE 374, Lecture 7: sed

Summary of grep

We use the program "grep" to search for patterns in files.
The 'pattern' used in grep is a regular expression, a particular formal grammar.
Regular expressions use special characters to express particular types of patterns: . * ? + ^ $ ( ) [ ] { } \ |
The output of grep is the set of lines that contain one or more substrings matching the pattern.
There are also a few options in grep that might be useful to you:
- -v to reverse the output (print non-matching lines instead of matching lines)
- -c to count the number of matching lines
There are multiple "flavors" of regular expressions, including basic (plain grep) and extended ("grep -E"). Different flavors require you to escape different special characters. You may use whichever variant you like, but on exams you should be consistent.
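
As a quick refresher, here is a small sketch of the two options and of the escaping difference between the flavors. The file desserts.txt and its contents are made up for illustration.

```shell
# Hypothetical sample data, written to a temporary directory.
workdir=$(mktemp -d)
printf 'apple pie\nbanana split\ncherry tart\napple tart\n' > "$workdir/desserts.txt"

# -v prints the lines that do NOT match the pattern.
non_apple=$(grep -v 'apple' "$workdir/desserts.txt")
echo "$non_apple"

# -c prints the number of matching lines instead of the lines themselves.
tart_count=$(grep -c 'tart' "$workdir/desserts.txt")
echo "$tart_count"

# The same pattern ("p" repeated twice) in both flavors:
# basic grep needs the braces escaped, grep -E does not.
basic_count=$(grep -c 'p\{2\}' "$workdir/desserts.txt")
extended_count=$(grep -E -c 'p{2}' "$workdir/desserts.txt")
echo "$basic_count $extended_count"
```

Both flavors find the same two "apple" lines; only the escaping differs.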

But what if we actually want to CHANGE or ADD some input based on the pattern? We can use a program called "sed" to accomplish this task.

Intro to sed

The "sed" program gets its name from "stream editor". sed processes its input one line at a time and performs basic text transformations. Multi-line transformations are possible but painful with sed, so we don't recommend them.

To run sed, give it any options you would like (see the man page for options), a command that tells sed which transformations to make on the input, and a file name. If no file name is given, sed reads from stdin instead.

    $ sed [OPTIONS] [COMMAND] [FILE]

While sed can do a wide variety of interesting and powerful transformations (check out the man page, or search Google for a sed tutorial), we'll use it today to do substitutions: replacing one piece of text with another. The substitution command looks like 's/original/replacement/', where you can specify 'original' as a regular expression.

    $ echo "The original copy is the original" > test.txt
    $ sed 's/original/replacement/' test.txt
    The replacement copy is the original

    # Alternatively, you can redirect the input stream from echo instead of a file.
    $ echo "The original copy is the original" | sed 's/original/replacement/'
    The replacement copy is the original

Note that only the first instance of "original" per line was replaced. If you add "g" onto the end of the command, which stands for "global", then you will substitute ALL of the instances of the pattern on the line. The most common way you will use sed is with the 's/.../.../g' command.

    $ echo "The original copy is the original" | sed 's/original/replacement/g'
    The replacement copy is the replacement

This example only has a single line, but sed will run the substitution command on every line of the input and print out all lines (regardless of whether they matched the pattern) to the output.
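
To see the per-line behavior, here is a small sketch with three made-up input lines, only two of which contain the pattern:

```shell
# Three hypothetical input lines; the middle one does not contain "original".
result=$(printf 'the original\nno match here\noriginal and original\n' |
  sed 's/original/copy/g')

# All three lines come out; the middle one is passed through unchanged.
echo "$result"
```

The non-matching line still appears in the output, which is the main difference from grep.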

Also note that by default, sed writes its output to stdout. This means that the original file is NOT modified by the sed command. If you do want to replace the original file with the substituted version, you can use the "-i" option (stands for "in-place"). Be VERY careful with -i: just like the mv or rm commands, you can't undo it if you get it wrong.

    $ sed -i 's/original/replacement/g' test.txt
    $ cat test.txt
    The replacement copy is the replacement
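
One way to make -i less risky is to attach a backup suffix to it. This is a sketch, assuming your sed accepts a suffix written directly after -i (GNU and BSD sed both do, as far as we know; check your man page). The suffix ".bak" is just a conventional choice.

```shell
# Work in a temporary directory so nothing important is touched.
workdir=$(mktemp -d)
echo "The original copy is the original" > "$workdir/test.txt"

# -i.bak edits test.txt in place and saves the untouched file as test.txt.bak.
sed -i.bak 's/original/replacement/g' "$workdir/test.txt"

edited=$(cat "$workdir/test.txt")
backup=$(cat "$workdir/test.txt.bak")
echo "$edited"
echo "$backup"
```

If the substitution turns out to be wrong, the original text is still in test.txt.bak.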

Exercise: phone numbers

In the last section, we learned to write a regular expression to match any format of phone number. What if we want to rewrite the file to put all phone numbers in a standard format?

Let's say phone numbers are stored in a file people.txt:

    M, Joe        4253921211
    P, Tina       (206) 123-4567
    V, Sue        310-459-1094
    J, Tom        206 772 7341
    A, Anne       206.858.0109

I want to put all numbers in the format (xxx) xxx-xxxx.

First, we can just make sure we match all phone numbers and replace with the word "test".

    $ sed 's/(\?[0-9]\{3\})\?[- .]\?[0-9]\{3\}[- .]\?[0-9]\{4\}/test/g' people.txt
    M, Joe        test
    P, Tina       test
    V, Sue        test
    J, Tom        test
    A, Anne       test

Then we can use "capture groups" to capture the strings that represent each group of numbers. Then we can use backreferences to those capture groups on the "replacement" side of the command:

                  first capture           2nd capture         3rd capture     replacement
                        v                      v                  v                v
    $ sed 's/(\?\([0-9]\{3\}\))\?[- .]\?\([0-9]\{3\}\)[- .]\?\([0-9]\{4\}\)/(\1) \2-\3/g' people.txt
    M, Joe        (425) 392-1211
    P, Tina       (206) 123-4567
    V, Sue        (310) 459-1094
    J, Tom        (206) 772-7341
    A, Anne       (206) 858-0109
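
The same substitution can also be written in the extended flavor, which needs far fewer backslashes. This is a sketch that assumes your sed supports the -E option (recent GNU and BSD versions do); the temporary directory is only there to keep the example self-contained.

```shell
# Recreate the people.txt file from above in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'M, Joe        4253921211' \
  'P, Tina       (206) 123-4567' \
  'V, Sue        310-459-1094' \
  'J, Tom        206 772 7341' \
  'A, Anne       206.858.0109' > "$workdir/people.txt"

# With -E, ( ) { } ? are special by default, so only the LITERAL
# parentheses around the area code need a backslash.
formatted=$(sed -E 's/\(?([0-9]{3})\)?[- .]?([0-9]{3})[- .]?([0-9]{4})/(\1) \2-\3/g' \
  "$workdir/people.txt")
echo "$formatted"
```

The backreferences \1, \2 and \3 work the same way in both flavors.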

What if we want to remove the phone numbers? We can use an empty replacement:

    $ sed 's/(\?[0-9]\{3\})\?[- .]\?[0-9]\{3\}[- .]\?[0-9]\{4\}//g' people.txt
    M, Joe        
    P, Tina       
    V, Sue        
    J, Tom        
    A, Anne

How about swapping the order of the last initial and the first name at the beginning of each line? We can use backreferences but swap the order (\2 and then \1).

    $ sed 's/^\([A-Z][a-zA-Z]*\), \([A-Z][a-zA-Z]*\) /\2 \1 /g' people.txt
    Joe M        4253921211
    Tina P       (206) 123-4567
    Sue V        310-459-1094
    Tom J        206 772 7341
    Anne A       206.858.0109

More details

Besides substitution, sed has a number of other commands, such as "p" (print) and "d" (delete). Use the man page or a tutorial to look into them (Google for "sed tutorial").
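As a small taste of "p" and "d", here is a sketch on three made-up lines. It also uses the -n option, which turns off sed's automatic printing of every line.
```shell
# Hypothetical input: three lines, one of which contains "drop".
input=$(printf 'keep this line\ndrop this line\nkeep this too\n')

# "d" deletes every line matching the pattern; other lines print as usual.
without_drop=$(printf '%s\n' "$input" | sed '/drop/d')
echo "$without_drop"

# With -n nothing prints automatically, so "p" prints only the matching lines.
only_drop=$(printf '%s\n' "$input" | sed -n '/drop/p')
echo "$only_drop"
```
The second command behaves much like grep, and the first much like grep -v.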
If you prefix each command with "-e", you can provide multiple substitutions. Those substitutions are applied to every line in the order given.
```
$ sed -e 's/orig1/replacement1/g' -e 's/orig2/replacement2/g' file.txt
```
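The order matters because each substitution sees the output of the one before it. A small sketch with a made-up input line:
```shell
# First cat becomes dog, then EVERY dog (including the new one) becomes bird.
forward=$(echo "cat dog" | sed -e 's/cat/dog/g' -e 's/dog/bird/g')
echo "$forward"

# Reversed: dog becomes bird first, then cat becomes dog.
reversed=$(echo "cat dog" | sed -e 's/dog/bird/g' -e 's/cat/dog/g')
echo "$reversed"
```
The first command prints "bird bird" and the second prints "dog bird".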
You can use a different delimiter than the default slash "/". You might do this because you usually need to "escape" the delimiter in your regular expression in order to match the literal delimiter character. So if your pattern has a bunch of slashes in it, you might use something like an underscore as a delimiter instead.
```
$ sed 's_original_replacement_g' file.txt
```
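File paths are the classic case. Here is a sketch with a made-up path, written once with escaped slashes and once with underscores as the delimiter:
```shell
# Hypothetical path used only for illustration.
path="/usr/local/bin/tool"

# With "/" as the delimiter, every literal slash must be escaped.
escaped=$(echo "$path" | sed 's/\/usr\/local/\/opt/')

# With "_" as the delimiter, the slashes need no escaping.
underscore=$(echo "$path" | sed 's_/usr/local_/opt_')

echo "$escaped"
echo "$underscore"
```
Both commands print /opt/bin/tool; the second is just easier to read.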
Sed is really good at one-line replacements, but it is possible (although painful) to do multi-line replacements. You can look up sed's "hold space" (sometimes called the hold buffer), which it uses for preserving information across lines. However, if you really want to do more complicated multi-line things, look into the "awk" program, which is much better at that.
Sed is very powerful but we will only be doing the basics in HW3 and exams. A one-liner is plenty for our purposes.