We highly recommend downloading the files for this Q&A and following along with us:

wget https://courses.cs.washington.edu/courses/cse391/24au/lectures/6/questions6.zip
unzip questions6.zip

For many of these problems, you may find it helpful to refer to the Regex Syntax on our reference page.

  1. Suppose that we have a file named words.txt with the following contents:

    These are some words
    a11 0f the5e w0rds c0n7ain number5
    S0me 0f these w0rds do
    

    • Write a command that identifies all words which are exactly four characters long.

    • Write a command that identifies all words which are exactly four characters long and contain only letters (both upper and lowercase).

    • Write a command that identifies all words which are at least four characters long and contain only letters (both upper and lowercase).

    Solutions

    grep -E "\<....\>" words.txt
    
    grep -E "\<[a-zA-Z]{4}\>" words.txt
    
    grep -E "\<[a-zA-Z]{4,}\>" words.txt
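
Note that grep prints whole lines by default. If we want to see just the matching words, one option is the -o flag, which prints each match on its own line. The following is a minimal sketch, assuming GNU grep (where the \< and \> word anchors are available) and using a scratch copy of the file named words_demo.txt (a name made up for this example, so the downloaded words.txt is left untouched):

```shell
# Scratch copy of the file from the question (words_demo.txt is a
# hypothetical name chosen for this sketch).
printf '%s\n' \
  'These are some words' \
  'a11 0f the5e w0rds c0n7ain number5' \
  'S0me 0f these w0rds do' > words_demo.txt

# -o prints only the matched text, one match per line.
four_any=$(grep -oE "\<....\>" words_demo.txt)
four_letters=$(grep -oE "\<[a-zA-Z]{4}\>" words_demo.txt)
four_plus_letters=$(grep -oE "\<[a-zA-Z]{4,}\>" words_demo.txt)

printf 'exactly four characters:\n%s\n' "$four_any"
printf 'exactly four letters:\n%s\n' "$four_letters"
printf 'at least four letters:\n%s\n' "$four_plus_letters"
```

Keep in mind that . also matches a space, so on other input \<....\> could match text such as "ab c" that spans two words. In GNU grep, \<\w{4}\> is a stricter alternative.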
    

  2. Suppose that we have the following file named vegetables.txt: 🥦

    broccoli
    asparagus
    potato
    lettuce
    zucchini
    brocccccccoli
    

    • Come up with a grep command that correctly identifies all vegetables that have two or more consecutive c’s in their name.

    • Come up with a grep command that correctly identifies all vegetables that have at least two instances of c anywhere in their name.

    Solutions

    grep -E "cc+" vegetables.txt
    # or
    grep -E "c{2,}" vegetables.txt
    
    grep -E ".*c.*c.*" vegetables.txt
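
On vegetables.txt as given, both solutions happen to print the same three lines, so the difference between them is easy to miss. The sketch below adds one hypothetical entry, cucumber (not in the original file), whose two c's are not adjacent, and writes everything to a scratch file named veg_demo.txt (also a made-up name). It also shows one possible reading of "exactly two c's", which is our own assumption about the wording and not part of the original solution:

```shell
# Scratch file: the original six vegetables plus a hypothetical "cucumber".
printf '%s\n' broccoli asparagus potato lettuce zucchini brocccccccoli cucumber \
  > veg_demo.txt

# Two or more consecutive c's.
consecutive=$(grep -E "c{2,}" veg_demo.txt)

# At least two c's anywhere; "c.*c" selects the same lines as ".*c.*c.*"
# because grep already searches anywhere within the line.
anywhere=$(grep -E "c.*c" veg_demo.txt)

# Assumed stricter reading: exactly two c's in the whole name.
exactly_two=$(grep -E "^[^c]*c[^c]*c[^c]*$" veg_demo.txt)

printf 'consecutive:\n%s\n' "$consecutive"
printf 'anywhere:\n%s\n' "$anywhere"
printf 'exactly two:\n%s\n' "$exactly_two"
```

Here cucumber shows up in the "anywhere" and "exactly two" results but not in the "consecutive" one, and brocccccccoli drops out of "exactly two".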
    

  3. Using the file from Q2: Come up with a grep command that correctly identifies all vegetables that have two or more consecutive repeated letters in their name.

    Solutions
    grep -E "([a-z])\1+" vegetables.txt
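
To see which letters the backreference \1 actually matched, we can again lean on -o. This is a small sketch, assuming GNU grep (backreferences inside -E patterns are an extension there, not something POSIX requires) and a scratch file named veg_repeat_demo.txt, a name invented for this example:

```shell
# Scratch copy of vegetables.txt from Q2 (hypothetical file name).
printf '%s\n' broccoli asparagus potato lettuce zucchini brocccccccoli \
  > veg_repeat_demo.txt

# Lines containing some letter immediately followed by itself.
repeat_lines=$(grep -E "([a-z])\1+" veg_repeat_demo.txt)

# Only the repeated runs themselves, one per line.
repeat_runs=$(grep -oE "([a-z])\1+" veg_repeat_demo.txt)

printf 'lines:\n%s\n' "$repeat_lines"
printf 'runs:\n%s\n' "$repeat_runs"
```

The group ([a-z]) captures a single letter and \1+ requires one or more further copies of that same letter, which is why lettuce (tt) is found even though the pattern never mentions t.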
    
  4. Suppose we have a file kitkats.txt with the following contents:

    kit kat
    kat kit
    my favorite part of the kit is the kat
    cats do not like kit kats
    this line only has kit
    this line only has kat
    
    Write a command that finds all lines which contain kit and kat in any order.

    Solutions
    grep -E "kit" kitkats.txt | grep -E "kat"
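
The pipeline works because the second grep only ever sees lines that already contain kit. As a sketch of an alternative, the same lines can be found with a single pattern that spells out both orders using alternation. The whole-word variant at the end uses the -w flag and is our own addition, in case substrings such as kats should not count. The file name kitkats_demo.txt is made up for this example:

```shell
# Scratch copy of kitkats.txt (hypothetical file name).
printf '%s\n' \
  'kit kat' \
  'kat kit' \
  'my favorite part of the kit is the kat' \
  'cats do not like kit kats' \
  'this line only has kit' \
  'this line only has kat' > kitkats_demo.txt

# Original approach: filter twice.
piped=$(grep -E "kit" kitkats_demo.txt | grep -E "kat")

# Single pattern: kit before kat, or kat before kit.
single=$(grep -E "kit.*kat|kat.*kit" kitkats_demo.txt)

# Assumed stricter variant: both must appear as whole words.
whole_words=$(grep -w "kit" kitkats_demo.txt | grep -w "kat")

printf 'piped:\n%s\n' "$piped"
printf 'single pattern:\n%s\n' "$single"
printf 'whole words only:\n%s\n' "$whole_words"
```

The piped and single-pattern versions select the same four lines here. The whole-word variant drops "cats do not like kit kats", since kats is not the word kat.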
    
  5. Suppose that we have the following file named emails.txt. Each line of this file contains a user’s first and last name, followed by a comma, and then their email address. What is a grep command that determines which users have exactly their last name as the username (the part before the @) of their email address?

    larry ruzzo, ruzzo@cs.washington.edu
    zorah fung, zfung@yahoo.com
    hunter schafer, hschafer@uw.edu
    bennet goeckner, goeckner@math.uw.edu
    ruth anderson, andersonr@gmail.com
    
    In other words, our command should correctly identify that Larry Ruzzo and Bennet Goeckner use exactly their last names as the part of their email address before the @.

    Solutions
    grep -E "[a-z]+ ([a-z]+), \1@[a-z]+\.[a-z]+" emails.txt
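
The key piece is the group ([a-z]+), which captures the last name so that \1 can demand the very same text right before the @. Below is a minimal sketch that runs the command on a scratch copy of the file (emails_demo.txt, a made-up name) and then uses cut to keep only the names. It assumes GNU grep, since backreferences with -E are an extension:

```shell
# Scratch copy of emails.txt (hypothetical file name).
printf '%s\n' \
  'larry ruzzo, ruzzo@cs.washington.edu' \
  'zorah fung, zfung@yahoo.com' \
  'hunter schafer, hschafer@uw.edu' \
  'bennet goeckner, goeckner@math.uw.edu' \
  'ruth anderson, andersonr@gmail.com' > emails_demo.txt

# Lines where the captured last name reappears directly before the @.
matches=$(grep -E "[a-z]+ ([a-z]+), \1@[a-z]+\.[a-z]+" emails_demo.txt)

# Keep only the text before the comma, i.e. the user's name.
names=$(printf '%s\n' "$matches" | cut -d, -f1)

printf '%s\n' "$names"
```

Neither zfung nor andersonr matches: in both cases the text between ", " and "@" is not identical to the captured last name.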
    
  6. The backend team at faang needs your help - we have lots of new products and they’re flying off the shelves like crazy (apparently you can sell happiness). In order to track all these transactions, each sale is assigned a unique ticket id. A ticket id is defined by the following properties:

    • It must contain exactly 16 alphanumeric characters: letters (upper or lowercase) and digits.
    • To improve readability, the characters may optionally be grouped into segments whose lengths are multiples of four, delimited by dashes. However, the string may not end with a dash.

    The following are valid ticket ids:

    1234567891011112
    1234-4567-8910-1112
    aBcD-Ef79-8122-fd01
    aBcDEf798122-fd01
    
    The following are not valid ticket ids:
    12345                            #too short
    1233333333333333333333333333     #too long
    1234-4567-8910-11?2              #illegal character
    1234567891011112-                #ends with dash
    

    • Come up with a grep command that identifies valid ticket id’s in the file tickets.txt

    • Write a command that identifies how many unique valid ticket id’s are in the file tickets.txt.

    • Challenge: Come up with a grep command that identifies valid ticket id’s with the added constraint that if any dash is present, every group of four must be separated by a dash (i.e., aBcDEf798122-fd01 is no longer a valid ticket id).

    Solutions

    grep -E "^([a-zA-Z0-9]{4}-?){3}[a-zA-Z0-9]{4}$" tickets.txt
    
    grep -E "^([a-zA-Z0-9]{4}-?){3}[a-zA-Z0-9]{4}$" tickets.txt | sort | uniq | wc -l
    
    grep -E "^[a-zA-Z0-9]{4}(-?)[a-zA-Z0-9]{4}\1[a-zA-Z0-9]{4}\1[a-zA-Z0-9]{4}$" tickets.txt
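
To check these patterns against the examples from the question, here is a small sketch. The contents of tickets.txt are not shown in this Q&A, so we build a hypothetical tickets_demo.txt from the valid and invalid ids listed above, plus one repeated id so that the uniqueness count has something to do. It assumes GNU grep for the backreference in the challenge pattern:

```shell
# Hypothetical test file built from the example ids in the question.
# The last line repeats the first id on purpose.
printf '%s\n' \
  '1234567891011112' \
  '1234-4567-8910-1112' \
  'aBcD-Ef79-8122-fd01' \
  'aBcDEf798122-fd01' \
  '12345' \
  '1233333333333333333333333333' \
  '1234-4567-8910-11?2' \
  '1234567891011112-' \
  '1234567891011112' > tickets_demo.txt

ticket_re="^([a-zA-Z0-9]{4}-?){3}[a-zA-Z0-9]{4}$"
strict_re="^[a-zA-Z0-9]{4}(-?)[a-zA-Z0-9]{4}\1[a-zA-Z0-9]{4}\1[a-zA-Z0-9]{4}$"

# Number of lines holding a valid id, duplicates included.
valid_lines=$(grep -cE "$ticket_re" tickets_demo.txt)

# Number of distinct valid ids (sort -u is shorthand for sort | uniq).
unique_valid=$(grep -E "$ticket_re" tickets_demo.txt | sort -u | wc -l)

# Distinct ids that also satisfy the all-or-nothing dash rule.
strict_ids=$(grep -E "$strict_re" tickets_demo.txt | sort -u)

echo "valid lines: $valid_lines"
echo "unique valid ids: $unique_valid"
printf 'strict ids:\n%s\n' "$strict_ids"
```

In the challenge pattern, (-?) captures either a dash or nothing, and each \1 must then repeat that exact choice. That is what rules out aBcDEf798122-fd01 while still accepting both the fully dashed and the dash-free forms.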