CSE 154

Lecture 25: Validation and Regular Expressions

Administrivia

HW6 is due Wednesday

CP8 is due today - consider opting in to the CP8 showcase! Even if it's a neat dataset setup.sql file you've created :)

For CP8, remember to omit any Cloud9 links - all links to your files should be relative. Also make sure your files are zipped correctly - some students have not been correctly zipping files from Cloud9. To check, try unzipping your .zip folder locally.

Cookies on Wednesday! (Real ones :))

Validation

Many websites offer features that allow users to interact with the page. Unfortunately, not all users will behave as expected.

Why is it important to spend the time prioritizing validation as web developers?

Which website features do you think most commonly validate user input? (Think of your own experience as a user of various websites.)

User Input Validation

User input validation is the process of ensuring that any user input is well-formed and correct (valid).

What are some examples of validation on the web you can think of? Consider general user input use cases as well as specific websites where you have provided user input.

  • Preventing blank values (e-mail address)
  • Ensuring the type of values (e.g. integer, real number, currency, phone number, Social Security Number, postal address, email address, date, credit card number, ...)
  • Ensuring the format and range of values (ZIP code must be a 5-digit integer)
  • Ensuring that values fit together (user types email twice, and the two must match)

A Real-World Example Form that Uses Validation

Validation Form Example

Client vs. Server-Side Validation

Validation can be performed:

  • Client-side (before any user input is submitted to a server)
    • Can lead to a better user experience, but not secure (why not?)
  • Server-side (in PHP code, after user input is sent to the server)
    • Needed for truly secure validation, but slower
  • Both
    • Best mix of convenience and security, but requires most effort to program

Remember This?

(Lab06 PHP Array Mystery) Consider the following PHP code:


function array_mystery($arr) {
  for ($i = 1; $i < count($arr); $i++) {
    $arr[$i] = $arr[$i] + $arr[$i - 1];
  }
  return $arr;
}
            

PHP

Indicate in the right-hand column what values would be stored in the returned array after the function array_mystery executes if the array in the left-hand column is passed as a parameter to array_mystery. Include your answers in the format of [a, b, c] where a, b, and c are numbers in the array result (for a 3-element array).

  1. [8] :
  2. [6, 3] :
  3. [1, 2, 3, 4] :
  4. [7, 10, 12, 12, 17] :

Another Example of User Input Validation


City:
State:
ZIP:

HTML

City:

State:

ZIP:

output

Let's validate this form's data on the server...

Basic Server-Side Validation


$city = $_POST["city"];
$state = $_POST["state"];
$zip = $_POST["zip"];
# reject a blank city, a state that is not 2 characters, or a ZIP that is not 5 characters
if (!$city || strlen($state) != 2 || strlen($zip) != 5) {
  print "Error, invalid city/state/zip submitted.";
}
          

PHP

Basic idea: Examine parameter values, and if they are bad, show an error message and abort. But there are some limitations given what we've learned so far in this course.

  • How do you test for integers vs. real numbers vs. strings?
  • How do you test for a valid credit card number?
  • How do you test that a person's name has a middle initial?
  • (How do you test whether a given string matches a particular complex format?)
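To make these limitations concrete, here is a minimal sketch. The helper name basic_check is hypothetical (it is not from the lecture); it simply wraps the same length-based condition used above.

```php
<?php
# Hypothetical helper wrapping the length-based checks from the slide above.
# Returns true when the input passes those checks.
function basic_check($city, $state, $zip) {
  return !(!$city || strlen($state) != 2 || strlen($zip) != 5);
}

$ok  = basic_check("Seattle", "WA", "98195");  # true: well-formed input
$bad = basic_check("Seattle", "WA", "abcde");  # also true: "abcde" is 5 characters long
$odd = basic_check("Seattle", "!!", "98195");  # also true: "!!" is 2 characters long
```

Length alone cannot tell digits from letters, which is the gap that pattern matching fills.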

Regular Expressions


          /^[a-zA-Z_\-]+@(([a-zA-Z_\-])+\.)+[a-zA-Z]{2,4}$/
          

Regular expression ("regex"): a description of a pattern of text

  • Can test whether a string matches the expression's pattern
  • Can use a regex to search/replace characters in a string

Regular expressions are extremely powerful but tough to read (the above regular expression matches email addresses).
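As a rough sketch of how the pattern above behaves, the snippet below tries it on a few strings with preg_match. The sample addresses and variable names are made up for illustration.

```php
<?php
# The email pattern from the slide, in single quotes so PHP leaves the backslashes alone.
$email_regex = '/^[a-zA-Z_\-]+@(([a-zA-Z_\-])+\.)+[a-zA-Z]{2,4}$/';

$a = preg_match($email_regex, "student@uw.edu");  # 1 (match)
$b = preg_match($email_regex, "not an email");    # 0 (no match)
$c = preg_match($email_regex, "user1@uw.edu");    # 0: digits are not in the character sets
```

The last case shows that this particular pattern is stricter than real email rules, so treat it as a teaching example rather than a complete validator.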

Regular expressions occur in many places:

  • Java: Scanner, String's split method (CSE 143 random grammar generator)
  • Supported by PHP, JavaScript, and other languages
  • Many text editors (TextPad, Sublime, Vim, etc.) allow regexes in search/replace
  • The site Rubular is useful for testing a regex

Regular Expressions... Take Some Practice

regex comic

XKCD 1171

Notecard Directions

As we review some of the different ways we can construct regular expressions, write two of your own "regular expression" problems. No need to write answers, but give it a shot! We may feature some creative ideas in section tomorrow :)

Basic Regular Expressions

/abc/

In PHP, regexes are strings that begin and end with /

The simplest regexes simply match a particular substring

The above regular expression matches any string containing "abc"

  • Match: "abc", "abcdef", "defabc", ".=.abc.=.", ...
  • Don't Match: "fedcba", "ab c", "PHP", ...
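A small sketch that runs the strings above through preg_match (the $matches array is just a name chosen for this example):

```php
<?php
# preg_match returns 1 when the pattern is found anywhere in the string, 0 otherwise.
$matches = [];
foreach (["abc", "abcdef", "defabc", "fedcba", "ab c", "PHP"] as $s) {
  $matches[$s] = preg_match("/abc/", $s);
}
# $matches: "abc" => 1, "abcdef" => 1, "defabc" => 1, "fedcba" => 0, "ab c" => 0, "PHP" => 0
```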

Wildcards: .

A . matches any character except a \n line break

  • /.ow.l./ matches "Mowgli", "Powell", etc.

A trailing i at the end of a regex (after the closing /) signifies a case-insensitive match

  • /cal/i matches "Pascal", "California", "GCal", etc.
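A quick sketch of the wildcard and the i flag in PHP (variable names are arbitrary):

```php
<?php
$w1 = preg_match("/.ow.l./", "Mowgli");    # 1
$w2 = preg_match("/.ow.l./", "Powell");    # 1
$w3 = preg_match("/.ow.l./", "owl");       # 0: each . needs a character to match
$i1 = preg_match("/cal/i", "California");  # 1: "Cal" matches case-insensitively
$i2 = preg_match("/cal/", "California");   # 0 without the i flag
```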

Special Characters: |, (), \

| means OR

  • /abc|def|g/ matches "abc", "def", or "g"

() are for grouping

  • /iP(ad|hone)/ matches "iPad" or "iPhone"

\ starts an escape sequence

  • Many characters must be escaped to match them literally: /\$.[]()^*+?
  • /<br \/>/ matches lines containing <br /> tags
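The sketch below tries grouping and escaping in PHP. The dollar-amount pattern is an extra example chosen for illustration; single-quoted strings are used so PHP passes the backslashes and $ through unchanged.

```php
<?php
$g1 = preg_match('/iP(ad|hone)/', 'my iPhone');   # 1
$g2 = preg_match('/iP(ad|hone)/', 'iPod');        # 0
$e1 = preg_match('/<br \/>/', 'line one<br />');  # 1: \/ matches a literal /
$e2 = preg_match('/\$5\.00/', 'It costs $5.00');  # 1: \$ and \. are literal
$e3 = preg_match('/\$5\.00/', 'It costs $5x00');  # 0: the escaped . no longer matches any character
```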

Quantifiers: *, +, ?

* means 0 or more occurrences

  • /abc*/ matches "ab", "abc", "abcc", "abccc", ...
  • /a(bc)*/ matches "a", "abc", "abcbc", "abcbcbc", ...
  • /a.*a/ matches "aa", "aba", "a8qa", "a!?xyz__9a", ...

+ means 1 or more occurrences

  • /Hi!+ there/ matches "Hi! there", "Hi!!! there!", ...
  • /a(bc)+/ matches "abc", "abcbc", "abcbcbc", ...

? means 0 or 1 occurrences

  • /a(bc)?/ matches only "a" or "abc"
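To see exactly which part of a string a quantifier consumes, this sketch uses preg_match's optional third argument, which receives the matched text in index 0. The test strings are made up.

```php
<?php
# The optional third argument of preg_match receives the matched text in $m[0].
preg_match('/abc*/', "abccc!", $m);
$star = $m[0];                         # "abccc": c* takes as many c's as it can
preg_match('/a(bc)*/', "abcbcb", $m);
$group = $m[0];                        # "abcbc": (bc)* repeats the whole group
$plus = preg_match('/a(bc)+/', "ab");  # 0: + needs at least one "bc"
preg_match('/a(bc)?/', "abcbc", $m);
$opt = $m[0];                          # "abc": ? allows at most one "bc"
```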

More Quantifiers: {min,max}

{min,max} (written without spaces) means between min and max occurrences (inclusive)

  • /a(bc){2,4}/ matches "abcbc", "abcbcbc", or "abcbcbcbc"

min or max may be omitted to specify any number

  • {2,} means 2 or more
  • {0,6} means up to 6 (spell out the 0: many versions of PHP's regex engine treat {,6} as literal text)
  • {3} means exactly 3
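A short sketch of counted quantifiers, again reading the matched text from $m[0] (test strings chosen for illustration):

```php
<?php
preg_match('/a(bc){2,4}/', "abcbcbcbcbcbc", $m);
$capped = $m[0];                               # "abcbcbcbc": at most 4 repetitions of "bc"
$too_few = preg_match('/a(bc){2,4}/', "abc");  # 0: fewer than 2 repetitions
preg_match('/o{3}/', "Goooogle", $m);
$three = $m[0];                                # "ooo": exactly 3
preg_match('/o{2,}/', "Goooogle", $m);
$many = $m[0];                                 # "oooo": 2 or more, as many as possible
```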

Practice Exercise

When you search Google, it shows the number of pages of results as the number of "o"s in the word "Google".

What regex matches such words with an even number of 'o's ("Google", "Goooogle", "Goooooogle", ...)?

Your regex should not match strings with fewer than two o's and should be case-sensitive (only the first letter should be capitalized) (try it)

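Solution: G(oo)+gle or G(o{2})+gle both work! (The group matters in PHP: without it, Go{2}+gle is read as a possessive "exactly two o's" and matches only "Google".)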
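A sketch that checks the grouped solution in PHP, with the ungrouped Go{2}+gle form included for comparison. Under PHP's regex engine a + directly after {2} makes the quantifier possessive instead of repeating it.

```php
<?php
$even = '/G(oo)+gle/';
$t1 = preg_match($even, "Google");    # 1: two o's
$t2 = preg_match($even, "Goooogle");  # 1: four o's
$t3 = preg_match($even, "Gooogle");   # 0: three o's
$t4 = preg_match($even, "Gogle");     # 0: one o
$t5 = preg_match($even, "google");    # 0: lowercase g

# Without the parentheses, PHP reads {2}+ as "exactly two, possessive".
$p = preg_match('/Go{2}+gle/', "Goooogle");  # 0 in PHP
```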

Anchors: ^ and $

^ represents the beginning of the string or line; $ represents the end

  • /Doggy/ matches all strings that contain Doggy
  • /^Doggy/ matches all strings that start with Doggy
  • /Doggy$/ matches all strings that end with Doggy
  • /^Doggy$/ matches the exact string "Doggy" only
  • /^Mo.*Doggy$/ matches "MoDoggy", "Mowgli Doggy", "Mowgli is my Doggy", ... but not "Doggy Mowgli is my Doggy", "Mowgli" or "my Doggy"

(On the other slides, when we say /PATTERN/ matches "text", we really mean that it matches any string that contains that text.)
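The anchor examples above can be checked with a sketch like this one (variable names are arbitrary):

```php
<?php
$pattern = '/^Mo.*Doggy$/';
$r1 = preg_match($pattern, "Mowgli is my Doggy");        # 1
$r2 = preg_match($pattern, "Doggy Mowgli is my Doggy");  # 0: does not start with "Mo"
$r3 = preg_match($pattern, "Mowgli");                    # 0: does not end with "Doggy"
$r4 = preg_match('/^Doggy$/', "Doggy");                  # 1
$r5 = preg_match('/^Doggy$/', "My Doggy");               # 0: extra text before "Doggy"
$r6 = preg_match('/Doggy/', "My Doggy");                 # 1: unanchored, so "contains" is enough
```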

Character Sets: []

[] groups characters into a character set; will match any single character from the set

  • /[bcd]art/ matches strings containing "bart", "cart", or "dart"
  • equivalent to /(b|c|d)art/ but shorter

Inside [], many special characters act as normal characters

  • /what[!*?]*/ matches "what", "what!", "what?**!", "what??!", etc.
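A small sketch of the two character-set examples above, using made-up test strings:

```php
<?php
$c1 = preg_match('/[bcd]art/', "cart");   # 1
$c2 = preg_match('/[bcd]art/', "part");   # 0: p is not in the set
preg_match('/what[!*?]*/', "what?**! ok", $m);
$punct = $m[0];                           # "what?**!": *, ?, and ! are ordinary characters inside []
```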

Practice: What regex matches strings containing a lowercase vowel? (try it!)

Practice: What regex matches strings containing consecutive vowels? (try it!)

Character ranges: [start-end]

Inside a character set, specify a range of characters with -

  • /[a-z]/ matches any lowercase letter
  • /[a-zA-Z0-9]/ matches any lowercase or uppercase letter or digit

An initial ^ inside a character set negates it

  • /[^abcd]/ matches any character other than a, b, c, or d

Inside a character set, - must be escaped to be matched

  • /[+\-]?[0-9]+/ matches an optional + or -, followed by at least one digit
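The range, negation, and escaped-hyphen examples above, sketched in PHP with illustrative inputs:

```php
<?php
$lower = preg_match('/[a-z]/', "CSE");      # 0: no lowercase letter
$not1 = preg_match('/[^abcd]/', "abba");    # 0: every character is a, b, c, or d
$not2 = preg_match('/[^abcd]/', "abbey");   # 1: e and y are outside the set
preg_match('/[+\-]?[0-9]+/', "temp: -42 degrees", $m);
$number = $m[0];                            # "-42": optional sign, then digits
```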

Practice: What regular expression matches UW Student ID numbers? (non-negative 7 digit numbers) (try it!)

Practice: What regular expression matches camelCasing? (match only strings with at least one capital letter; only alphabetical characters are allowed) (try it!)

Practice: What regular expression matches a sequence of only consonants (non-vowel letters) assuming that the string consists only of lowercase letters? (try it!)

Escape Sequences

Special escape sequence character sets

  • \d matches any digit (same as [0-9]); \D any non-digit ([^0-9])
  • \w matches any word character (same as [a-zA-Z0-9_]); \W any non-word character
  • \s matches any whitespace character ( , \t, \n, etc.); \S any non-whitespace character
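A brief sketch of these shorthand sets in PHP (inputs are made up; single quotes keep the backslashes intact):

```php
<?php
preg_match('/\d{3}/', "CSE 154", $m);
$digits = $m[0];                            # "154"
preg_match('/\w+/', "snake_case!", $m);
$word = $m[0];                              # "snake_case": \w also matches _
$space = preg_match('/\s/', "no_spaces");   # 0: no whitespace in the string
preg_match('/\S+/', "  hi there", $m);
$first = $m[0];                             # "hi": first run of non-whitespace
```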

Practice: What regular expression matches any string that contains a tab (\t) character?

Practice: What regular expression matches names in a "Last, First M." format, with any number of spaces?

Useful Regex Quick Reference

regex quick reference

We will pass out reference sheets in section tomorrow!

Regular Expressions in PHP

Regex syntax: strings that begin and end with /, such as "/[AEIOU]+/"

Function: description

  • preg_match(regex, string): returns 1 if string matches regex and 0 if it does not (both work as expected in an if test)
  • preg_replace(regex, replacement, string): returns a new string with all substrings that match regex replaced by replacement
  • preg_split(regex, string): returns an array of strings from given string broken apart using given regex as delimiter (like explode but more powerful)

PHP Validation with Regexes

$state = $_POST["state"];
if (!preg_match("/^[A-Z]{2}$/", $state)) {
   print "Error, invalid state submitted.";
} 

PHP

preg_match and regexes help you to validate parameters

Websites often don't want to give a descriptive error message here (why?)
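Putting the pieces together, here is one possible sketch of regex-based validation for the earlier City/State/ZIP form. The function invalid_fields and the rule chosen for the city field are assumptions made for this example, not part of the lecture code.

```php
<?php
# Hypothetical helper (not from the lecture): returns a list of the invalid field names.
function invalid_fields($city, $state, $zip) {
  $errors = [];
  if (!preg_match('/\S/', $city)) {            # city must contain a non-whitespace character
    $errors[] = "city";
  }
  if (!preg_match('/^[A-Z]{2}$/', $state)) {   # state: exactly two uppercase letters
    $errors[] = "state";
  }
  if (!preg_match('/^\d{5}$/', $zip)) {        # ZIP: exactly five digits
    $errors[] = "zip";
  }
  return $errors;
}

$none = invalid_fields("Seattle", "WA", "98195");  # []
$two  = invalid_fields("Seattle", "wa", "abcde");  # ["state", "zip"]
```

Returning the field names keeps the checks separate from the decision about how much detail to show the user.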

Regular Expression PHP Examples

# replace vowels with stars
$str = "the quick brown fox";
$str = preg_replace("/[aeiou]/", "*", $str);
                  # "th* q**ck br*wn f*x"

# break apart into words
$words = preg_split("/[ ]+/", $str);
               # ("th*", "q**ck", "br*wn", "f*x")

# capitalize words that had 2+ consecutive vowels
for ($i = 0; $i < count($words); $i++) {
   if (preg_match("/\*{2,}/", $words[$i])) {
      $words[$i] = strtoupper($words[$i]);
   }
}      # ("th*", "Q**CK", "br*wn", "f*x")

PHP

Additional Resources and Regex Fun

HTML Form Validation (MDN): A neat overview of the different features offered in HTML5 for client-side form validation!

RegexOne: A helpful interactive regex tutorial

Regex Crossword Game: A super fun way to practice regex for puzzle-lovers :)