Due Tuesday 11/14 at 1:00 pm. No late submissions accepted.
This assignment focuses on using regular expressions and related commands such as
grep. A set of files you will need for this assignment are available in the file
hw6.zip, found on the Homework page.
wget https://courses.cs.washington.edu/courses/cse391/23au/homework/hw6/hw6.zip unzip hw6.zip
Submit your answers to Gradescope. For all tasks, determine a single bash shell statement that will perform the operation(s) requested, not the output that the command produces. Each solution must be a one-line shell statement, but you may use input/output redirection operators such as
|. Please be sure to write the entire command (including the command name and the input file) not just the regular expression required for the problem. Remember that the
man pages are your friend!
Task 1: Warm-up Bash Shell Commands with
The following are warm-up exercises to get some practice with regular expressions and
grep -E or
egrep. For these exercises you’ll be searching from a file called
names.txt that contains both single-word usernames as well as full names. You may assume the file contains one name per line. Write your commands on the indicated lines in the
task1.sh file in the
Write the command that finds all usernames from
names.txt that contain at least one numeric character.
Write the command that finds all usernames from
names.txt that are exactly 4 characters long and consist only of alphabetic characters.
Write the command that finds all first and last names from
names.txt. A name is an uppercase letter followed by one or more lowercase letters and first and last names are separated by a single space. Your command should not match against names that contain a middle name or initials.
This problem is very simplified, since writing a regex to capture all possible names is quite the difficult task. In real life, you probably wouldn’t try to write a regex for this and just let any text count.
For this problem, you should only accept names using the pattern we described.
Task 2: Bash Shell Commands with
After the few weeks at FAANG, management has discovered that we need to start actually selling products to stay in business. You’ll be helping develop regular expressions to help in customer account creation and billing. For each item below, determine a single bash shell statement that will perform the operation(s) requested. Write your commands on the indicated lines in the
task2.sh file in the
Problem 4: Email Validation¶
As part of account creation, customers must provide a valid email address. Write a command that prints out all valid email addresses from
emails.txt. For the purposes of this assignment, we will use as simplified version of valid email address is defined as follows:
- Between 1 and 16 lowercase letters, uppercase letters, or digits
- A domain (e.g.,
gmail) that consists of any number of lowercase letters
- A period
- A tld (top level domain, e.g.,
.com) that consists of 2 or more lowercase letters
You may assume that
emails.txt contains one email per line.
Problem 5: Password Validation¶
Customers also must choose a strong password when creating an account. Write a command that prints out all valid passwords in the file
passwords.txt. A valid password at FAANG is defined as follows:
- Must be at least 8 characters long
- Must contain at least one uppercase character
- Must contain at least one lowercase character
- Must contain at least one digit.
You may assume that
passwords.txt contains one password per line.
Problem 6: Credit Card validation¶
In order to process payments for orders, FAANG needs to validate credit cards entered on the site. Since we are a small company, we will only be accepting Visa and Mastercard (Sorry American Express, your fees are too high!) Write a command that identifies all well-formatted credit cards in a file called
cards.txt that FAANG accepts. The format of these cards are as follows:
- Master Card
- Begins with a 5
- Exactly 16 digits long, grouped into sets of 4 separated by a space
- Begins with a 4
- Between 13 and 16 digits long, grouped into sets of 4 (last may be shorter) separated by a space
You may assume that cards.txt contains one credit card per line.
True credit card validation also requires using a formula to validate the contents of the credit card number, in addition to basic formatting. You do not need to do this for our assignment.
Task 3: Bash Shell Commands with
grep to process genetic data¶
The board is concerned about FAANG’s product-market fit. As it turns out, dirt and peas (especially those not in a pod) don’t have high market value. Therefore, in addition to continuing to sell its pre-existing products, FAANG has decided to enter the biotech space and process genomic data.
For these problems, you will be processing DNA data from the file
dna.txt. Data is printed in the file in pairs of lines. The first line in the pair is the name of the DNA sequence and the second line is the DNA sequence itself. The following provides you with some context for the task, but an understanding of DNA is not required for this assignment.
DNA consists of long chains of chemical compounds called nucleotides. Four nucleotides are present in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Certain regions of the DNA are called genes. Most genes encode instructions for building proteins (they’re called “protein-coding” genes). These proteins are responsible for carrying out most of the life processes of the organism. Nucleotides in a gene are organized into codons. Codons are groups of three nucleotides and are written as the first letters of their nucleotides (e.g., TAC or GGA). Each codon uniquely encodes a single amino acid, a building block of proteins.
For these problems you will be identifying protein-encoding genes as well as other attributes of the genetic data in the file. Note that all matches for these problems will be case-insensitive. Write your commands on the indicated lines in the
task3.sh file in the
Print all of the DNA sequences in the file (all of the non-names).
Print all of the names of DNA sequences in the file.
Hint: This will require use of a
grep option, in addition to regular expressions.
Print the full DNA sequences that contain the word “CAT”, preceded by the name of the sequence.
Hint: man grep to look for options that alter what gets printed for a match.
Print all of the DNA sequences that start and end with the same codon. (For this case, do not worry about matching start and end codons that don’t have the same casing as each other.)
Print all of the valid genes in the file. That is, all of the DNA sequences that can be divided into codons. (i.e. A sequence of A, C, G and Ts and has a length that is a multiple of 3).
Protein-coding genes are genes that begin with ATG, end with one of TAA, TAG, or TGA, and are at least 5 codons long. Print all protein-encoding genes in the file.
Recall that the regular expression syntax* shown in class. You can reference our provided glossaries for features you will need: