CSE 341: Scripting, Ruby, and regular expression matching

Scripting

What is "scripting"?

Ruby is sometimes called a scripting language. The term "scripting language" is often bandied about, but I have never seen or heard anybody describe a precise technical distinction between scripting languages and other languages. (Indeed, Dave Thomas and Andy Hunt, authors of the first major English-language Ruby textbook, seem to agree that it's not a very useful term.)

The thing that seems to lead people to call a language a "scripting language" is the environment in which the major implementation(s) of that language run. Generally, a "scripting language" is one that

The idea is that the language is meant to be used for writing little "scripts" that manipulate its environment. For example, Unix shell scripting languages are designed for controlling the Unix filesystem, commands, and processes. Other large applications --- e.g., office suites, multimedia authoring tools, etc. --- often contain an embedded scripting language for automating small tasks in the application.

Because scripts are usually small, scripting languages tend to be dynamically typed --- maintaining type declarations and annotations is (allegedly) a waste of time for such small programs.

It is sometimes claimed that scripting languages should also be impoverished --- i.e., that you can get away without having powerful but "complicated" constructs like objects, lambdas, and a module system. This claim does not hold up in the face of the evidence: every scripting language that has been around for several years has accumulated many ad hoc constructs to do the jobs of objects, lambdas, etc., and usually has ended up being more complicated than Smalltalk or Scheme in the end.

You should know by now that simplicity and power are not mutually exclusive design goals. Making your language impoverished in the name of simplicity is foolish economy.

Unix shell scripting

When you log into a Unix prompt, you're actually using a program called a shell --- on modern Unices, most often bash, the "Bourne Again SHell".

On modern operating systems (including Unix and Windows), shells actually contain a programming language --- you can set variables called "environment variables", write conditional statements and loops, read input and write output, and write to the filesystem.

Much like a read-eval-print loop, a Unix shell generally reads and executes one statement at a time. For example, when you type, at a Unix shell:

emacs myfile.txt

you are actually feeding a line of input (called a command line) to the shell. The shell parses the line into tokens, finds the program named emacs and sends it a single command line argument, the string myfile.txt.

You can also instruct a Unix shell to read and execute an entire file --- see the section below on running scripts in Unix.

The language of Unix shells manages to be weak, but not simple: it is simultaneously impoverished and baroque, filled with arbitrary quirks and limitations that the user must learn. For example:

Given this heinous precedent, it is perhaps not surprising that Perl was welcomed with open arms by Unix shell hackers. Perl is a programming language that's better than shell, but provides many of the same features.

And, indeed, Perl is better than shell, but that is not a very high standard. There is no inherent reason why a programming language for manipulating Unix programs, files, and processes should be even remotely as awful as the shell. Partly in reaction to Perl, people have developed various alternative scripting languages for Unix. Two of the more popular are Python (designed by Guido van Rossum) and Ruby (designed by Matz) --- both are cleaner than Perl, and are gaining in popularity.

Running scripts in Unix

In Unix, any text file can be an executable program. When you invoke a program from the command line, Unix does the following:

  1. Searches for the file with that name in the current executable search path (by convention, this is stored in the PATH environment variable; type echo $PATH in the Unix shell to get the current path).
  2. Checks whether the file has execute permissions. If not, prints an error and halts.
  3. Otherwise, looks at the first two bytes of the file. If they are #! ("shebang"), then the file is assumed to be a text file containing input for some program (e.g., a shell, or a programming language interpreter). In this case, the remainder of the first line names the program to run. That program is then executed with the current file as its input.
  4. Otherwise, executes the current file as a regular (binary) program.

Step 3 allows you to write scripts using any language of your choice (or, at least any language for which having #! on the first line of input does not break the program). For example, here's a simple bash script that creates a new directory and places an empty README file inside it:

#!/bin/bash

mkdir "$1"
cd "$1"
touch README

If you save this file as myscript.sh, give it execute permissions using chmod +x myscript.sh, and run it using ./myscript.sh dirName, then it will execute these commands in bash with $1 bound to dirName --- i.e., it will create a directory dirName, enter that directory, and create ("touch") an empty file named README.

Likewise, if you want to write a Unix script using Ruby, you can create a file called hello.rb containing the following text:

#!/usr/bin/ruby

puts("Hello, world!\n")

As before, give the file executable permissions using chmod +x, and do ./hello.rb. Your terminal should print Hello, world!.
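For comparison, here is a rough Ruby counterpart of the bash script above. This is only a sketch: the helper name make_project is my own invention, and it builds the README path directly instead of changing directory.

```ruby
#!/usr/bin/ruby
require 'fileutils'

# Hypothetical helper: create a directory and an empty README inside it.
def make_project(dir_name)
  Dir.mkdir(dir_name)                             # like: mkdir "$1"
  FileUtils.touch(File.join(dir_name, 'README'))  # like: cd "$1"; touch README
end

# Run only when invoked as a script with exactly one argument,
# e.g. ./mkproject.rb dirName (ARGV[0] plays the role of $1).
make_project(ARGV[0]) if __FILE__ == $0 && ARGV.length == 1
```

As with the bash version, the file needs execute permissions (chmod +x) before it can be run directly.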

Unix processes as a programming model

We noted before that most shell languages don't provide any way to define procedures. Ruby, and all real scripting languages, do provide procedures and other abstractions, but it's worth taking a few moments to explore what it's like to program in a world where you invoke external programs to instantiate reusable tasks. (This discussion will use Unix as an example, but most modern operating systems have similar features.)

In the world of "normal" programming, you invoke procedures with arguments, and you obtain results. When a procedure's running, it has an activation; you can view a procedure activation as a black box, with inputs, outputs, and access to shared spaces, as in Fig. 1:

[Procedure activation, with input (input parameters),
          output (return values, output parameters), and shared spaces
          (heap and globals)]
Fig. 1: Inputs, outputs, and shared spaces of a procedure activation

The inputs to a procedure are its parameter values. When the procedure returns, it can send output through its return value, or through its output parameters (e.g., in ML, a function with ref cell parameters can update their contents). While it's running, the activation can also use the heap and globals, which are shared by all activations; these spaces can be used for both input and output, as well as scratch space.

A procedure activation also has internal state --- local variables --- which can be used while it is evaluating, but which are inaccessible to the outside world.

In a quite analogous manner, when you run a program in Unix, the operating system "activates" the program and creates a process, with its own memory space, its own inputs and outputs, and access to shared spaces. See Fig. 2.

A process's communication channels include:

[Process activation, with input (command line
          arguments, standard input), output (return code, standard
          output, standard error), and shared spaces (filesystem,
          network, and pipes/IPC)]
Fig. 2: Inputs, outputs, and shared spaces of a Unix process
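To make the channels in Fig. 2 concrete, here is a small sketch of a program body written as a Ruby method, with each channel passed in explicitly. The program ("shout") and its behavior are made up for illustration.

```ruby
# A hypothetical "shout" program. Its channels are explicit parameters:
#   args   - command line arguments (ARGV in a real script)
#   stdin  - standard input, stdout - standard output, stderr - standard error
# The return value plays the role of the process's return code.
def shout(args, stdin, stdout, stderr)
  if args.empty?
    stderr.puts("usage: shout SUFFIX")   # diagnostics go to standard error
    return 1                             # nonzero return code signals failure
  end
  stdin.each_line do |line|
    stdout.puts(line.chomp.upcase + args[0])
  end
  0                                      # zero return code signals success
end

# In a real script, the last line would wire up the actual channels:
#   exit(shout(ARGV, $stdin, $stdout, $stderr))
```

Writing the body this way also makes it easy to exercise with in-memory streams such as StringIO.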

In this universe, the shared spaces are the entire host environment. Most "normal" programs are meant to "do something to the world", and therefore we usually think of processes as "communicating" by changing things in these shared spaces. For example, when you run emacs, you're usually trying to create or update some file in the filesystem. However, you can use all of a program's inputs and outputs for communication.

An important difference between process activations and most languages' procedure activations is that processes run concurrently --- at any given time, there are many active processes in a system, all doing something. By contrast, in a standard single-threaded programming language, potentially many procedure activations are on the stack, but only the topmost one is "currently executing".

One interesting thing you can do because of this concurrency is pipe results of one program's standard output to another program's standard input. In Unix shell languages, the pipe operator is |. For example, the following command runs ls ("list files"), and passes its output to grep ("global regular expression print"):

ls | grep 'a*.txt'

This will print all the listed files whose names match the regular expression a*.txt. In terms of our process diagram, we have set up a pipeline like the one in Fig. 3.

Notice that the standard output of the ls activation is fed into grep's standard input. grep and ls actually execute concurrently --- ls will execute for a while and produce a "chunk" of output, which will be fed to grep, which will process that "chunk" and then wait for ls to feed it more. Eventually, ls will stop running; grep will stop after finishing the last chunk that ls sends to it.

[ls reads from filesystem and sends standard output to
           grep's standard input; grep reads 'a*.txt' from command
           line arguments; terminal receives standard error stream
           from both processes; shell receives result code.]
Fig. 3: Activations of ls and grep in pipeline, with shell and terminal processes shown for completeness.
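The same kind of pipeline can be set up from Ruby. The sketch below uses Open3.pipeline_r from Ruby's standard library and assumes a Unix-like system with printf and grep on the PATH; printf stands in for ls so that the output does not depend on the current directory, and the file names are invented.

```ruby
require 'open3'

# printf plays the role of ls: it writes three made-up file names.
producer = ['printf', 'notes.txt\nmain.rb\nata.txt\n']
# grep reads the producer's standard output on its standard input.
filter   = ['grep', 'a*.txt']

matches = nil
Open3.pipeline_r(producer, filter) do |last_stdout, _wait_threads|
  # last_stdout is the standard output of the final process (grep).
  matches = last_stdout.read.split("\n")
end
# matches is ["notes.txt", "ata.txt"]
```

Note that notes.txt matches too: a* can match zero a's, the dot matches any single character, and grep searches anywhere in the line.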

Unix shells' process control supports many sophisticated redirection patterns, of which this is only a relatively simple example. Some common/useful operations:

Running processes in various other languages

C

C has the grandparent of all Unix process control primitives, system and related calls (popen, fork, exec) in the C and Unix standard libraries. C's system takes a string argument that names the command, e.g.:

int resultCode = system("ls");

Standard ML

SML's OS.Process structure provides a system function that works like C's system:

structure OS.Process : OS_PROCESS = struct .. end
signature OS_PROCESS = sig
   eqtype status
   val success : status
   val failure : status
   val system : string -> status
   ...
end

There is also a Unix structure that allows more detailed control over processes under Unix-compatible operating systems.

Java

The class Runtime provides an exec method, which starts a process and returns a handle of type Process for manipulating that running process (for example, it allows you to send text to the running process's standard input stream). To get an instance of Runtime, use the static method Runtime.getRuntime().

class Runtime {
    static Runtime getRuntime()  { ... }
    Process exec(String command) { ... }
}

Ruby

Like shell, Ruby supports backquote syntax:

x = `command arg1 ... argN`

It also has a system call, which resembles the C system call, and IO.popen, which returns a representation of the process similar to Java's Process handle:

system("command arg1 ... argN")
IO.popen("programName", mode)
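Here is a small sketch that exercises all three Ruby mechanisms. It assumes a Unix-like system with echo and tr on the PATH; the commands themselves are arbitrary.

```ruby
# 1. Backquotes run the command and capture its standard output as a string.
captured = `echo hello`                        # "hello\n"

# 2. system runs the command and returns true if it exited with status 0.
succeeded = system("echo hello > /dev/null")
status = $?.exitstatus                         # return code of that command

# 3. IO.popen returns an IO object connected to the process. In "r+" mode
#    we can write to its standard input and read its standard output.
upcased = IO.popen("tr a-z A-Z", "r+") do |pipe|
  pipe.write("hello")
  pipe.close_write     # signal end of input so tr can finish
  pipe.read
end
```

With a block, IO.popen closes the pipe when the block returns and evaluates to the block's value.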

A Brief Tour of Ruby

Ruby is a purely object-oriented language descended primarily from Smalltalk (for its semantics) and Perl (for its syntax, libraries, and programming environment). Ruby was originally designed by Yukihiro Matsumoto (a.k.a. "Matz"), in the mid-1990's. Ruby 1.0 was released in the late 1990's. By release 1.6 (ca. 2000) the language had become popular in Japan, but it was used relatively rarely outside of Japan. However, Ruby has been steadily gaining in popularity worldwide as an alternative to uglier languages like Perl.

Many Ruby distributions ship with irb, "interactive Ruby" (in contrast with the usual batch-mode ruby executable, which simply processes an entire file at once). irb provides an interactive read-eval-print loop.

If your distribution does not come with irb, you may be able to find a file named eval.rb, which provides a read-eval-print loop with fewer features. On Linux systems, try locate eval.rb to find it; on our attu server, it's currently at /usr/share/doc/ruby-1.6.8/sample/eval.rb. Then, to run this file, I type

ruby /usr/share/doc/ruby-1.6.8/sample/eval.rb

at the command line.

Expressions and values

Ruby is a pure object-oriented language in which all expressions (except assignment, as in Smalltalk) are message sends. Therefore, its core expression evaluation semantics are quite similar to Smalltalk's, but its syntax differs quite a bit...

Ruby expression examples

"hi"             # double quoted string literal
'hi\n'           # single quoted string literal; equiv. to "hi\\n"
%q{'hi,' I said} # %q-quoted string literal; note uninterpreted quote

:hi            # a symbol
24_000         # a small number (instance of Fixnum; Integer since Ruby 2.4)
10_000_000_000 # a large number (instance of Bignum; Integer since Ruby 2.4)

-1                          # prefix unary send
'hello,' + ' world'         # infix send
'hello, world'.slice(0, 5)  # keyword send: returns 'hello'
3.to_s                      # no-argument keyword send: returns string '3'

[3, 4, 17, -2]      # literal array
[3, 4, 17, -1].sort # keyword send to literal array

x = 'hi'       # assignment; evaluates to the assigned value
a = b = c = x  # chained assignments (assignments are expressions)

# conditional expression
i = if x == 'hi' then 3 else 7 end

# block-based control structure
3.times { puts("hi!\n") }

# the upto message of integers takes a block parameter
1.upto(10) { |x| puts("#{x}!\n") }

# the each message on arrays --- takes a block
['hi','hola','bonjour'].each { |x|
    puts(x + ', world')
}

Ruby syntax summary

A subset of the Ruby grammar:

expr ::= atom | binding
       | prefixSend | infixSend | keywordSend
       | syntacticSugar
       | ( expr )

atom ::= ID | @ID | @@ID | $ID
       | literal | firstClassBlock
literal ::= INTEGER | STRING | REGEXP | ...

firstClassBlock ::= lambda plainBlock | proc plainBlock
plainBlock ::=
    { [| ID* |] stmt* }
    do [| ID* |] stmt* end

binding     ::= name = expr

infixSend   ::= expr OPERATOR expr
keywordSend ::= expr.ID [( expr*\, )] [plainBlock]

syntacticSugar ::= ifExpr | caseExpr | ...

The chief syntactic differences (which are not all reflected in the BNF above; many of the differences are in terminals such as STRING) are:

Classes and methods

The syntax of class declarations is as follows:

class ClassName < SuperclassName
    body
end

As usual, class bodies include methods, whose syntax is as follows:

def methodName
    body
end

For example:

class Person < Object
  def greeting
    "Hello, world!"
  end
end

The < SuperclassName clause is optional when the superclass is Object. Receiver instance variables are named inside a method by a preceding @ sign:

class Person
  def greet(personName)
    "Hello, " + personName + ", my name is " + @name
  end
end

Notice that I re-opened the Person declaration in the above example. Unlike classes in many object-oriented languages, Ruby's classes are open: you can add members to any class anywhere in the source text simply by reopening the class declaration. Later definitions override earlier ones. Existing instances will be updated to reflect any additional code.
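A minimal sketch of open classes in action, using a made-up class Greeter so as not to disturb Person:

```ruby
class Greeter
  def greeting
    "Hello"
  end
end

g = Greeter.new   # created before the class is reopened

class Greeter     # reopening: overrides one method and adds another
  def greeting
    "Howdy"
  end

  def farewell
    "Goodbye"
  end
end

g.greeting   # "Howdy": the later definition wins
g.farewell   # "Goodbye": the existing instance sees the new method
```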

In Squeak's Morphic library, the base class Morph overrides new to call initialize on freshly created instances before returning them. In Ruby, this is a language-wide convention. In fact, Ruby's default new class method takes a variable number of arguments (Ruby has Scheme-style varargs) and it passes all its arguments on to initialize. Therefore, you can define "constructor"-like code by writing an instance method initialize:

class Person
  def initialize(name)
    @name = name
  end
end

Notice that the class declaration does not contain any declarations for instance variables (unlike in Smalltalk and most other object-oriented languages). You get an instance variable whenever you assign to an @ variable --- if the variable already has a value, it is updated, and if it does not then a new variable is created.

Similarly, reading an instance variable that has never been assigned simply evaluates to nil:

class Person
  def accessUndefinedVariable
    @someVariableNameWeHaveNotDefined
  end
end

(Person.new).accessUndefinedVariable   # Evaluates to nil
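Putting these pieces together, here is a sketch with a made-up class Pet. It also shows that the nil default applies to instance variables only; an unassigned local name raises an error instead.

```ruby
class Pet
  def initialize(name)
    @name = name        # first assignment creates @name
  end

  def name
    @name
  end

  def nickname
    @nickname           # never assigned: evaluates to nil
  end

  def broken
    undefined_local     # not an instance variable: raises NameError
  end
end

pet = Pet.new("Rex")    # new passes "Rex" on to initialize
pet.name                # "Rex"
pet.nickname            # nil
```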

Miscellaneous features and syntactic quirks

Varargs

Recall that in Scheme, we could define variable-argument functions by using . (dot):

(lambda (first second . rest) ...)

In Ruby, we can define variable-argument lambdas or methods by using *:

lambda { |first, second, *rest| ... }  # Vararg anon function
def f(first, second, *rest)            # Vararg method
    ....
end
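A runnable sketch of both forms; the names split_args and count_extras are invented for this example:

```ruby
# Vararg lambda: rest collects any arguments after the first two.
split_args = lambda { |first, second, *rest| [first, second, rest] }

# Vararg method: same idea with def.
def count_extras(first, second, *rest)
  rest.length
end

split_args.call(1, 2, 3, 4)   # [1, 2, [3, 4]]
split_args.call(1, 2)         # [1, 2, []]: rest may be empty
count_extras(:a, :b, :c)      # 1
```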

Parallel assignment

Ruby has (a primitive form of) dynamically typed pattern matching for assignments:

arr = [1, 2, 3, 4, 5]
a, b, c = arr  # executes a=1, b=2, c=3
d, e = a, b    # executes d=a, e=b

Notice that there is no data constructor around the comma-separated names on the left-hand side, and that the pattern need not have the same number of parts as the value.
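A few more cases, as a sketch. The variable names are arbitrary, and the * on the left-hand side is the same splat used for varargs.

```ruby
# Fewer values than names: the leftover name gets nil.
p1, p2, p3 = [1, 2]            # p1 = 1, p2 = 2, p3 = nil

# A splat on the left collects the remaining values into an array.
head, *tail = [1, 2, 3]        # head = 1, tail = [2, 3]

# The right-hand side is evaluated first, so this swaps two variables.
left, right = 10, 20
left, right = right, left      # left = 20, right = 10
```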

Unary prefix messages

To define a unary prefix message, you write a method name with op@, where op is the prefix operator you want to define. However, Ruby only (syntactically) allows a handful of prefix message names, including + and -:

class Complex
    def initialize(r, i)
        @r = r
        @i = i
    end

    def -@
        Complex.new(-@r, -@i)
    end
end
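The same idea as a self-contained sketch, using a made-up class Vec2 so that the prefix operator can actually be exercised:

```ruby
class Vec2
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  def -@                 # invoked by the expression -v
    Vec2.new(-@x, -@y)
  end

  def +@                 # invoked by the expression +v
    self
  end
end

v = Vec2.new(1, -2)
negated = -v             # a new Vec2 with x = -1, y = 2
same = +v                # the original object
```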

Assignment operators

As in Smalltalk, an object's instance variables are not accessible outside the methods of the object's class --- simply because there is no such thing as a field access expression. In Smalltalk, you defined accessor methods, e.g., for Point:

x
    ^ x

x: aNumber
    x := aNumber

Ruby doesn't allow field access either, but Ruby's designer was fond of the

expr.fieldName = expr'

syntax found in languages that do permit direct field access. Ruby provides assignment methods, which are method names that begin with a name or operator and end in =:

class Complex
    def r=(aNumber)
        @r = aNumber
    end

    def i=(aNumber)
        @i = aNumber
    end

    def []=(anIndex, aNumber)
        if anIndex == 0 then
            @r = aNumber
        else
            @i = aNumber
        end
    end
end

a = Complex.new(1, 2)
a.r = 3
a[1] = 4

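Ruby also treats compound assignment as syntactic sugar built on these methods: a.r += 3 is expanded into a.r = a.r + 3, so it works whenever the class defines both the reader method r and the assignment method r= (Ruby does not define a separate r+= method). So, for example, assuming Complex also defines a reader r: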

a = Complex.new(1, 2)
a.r += 3   # equivalent to a.r = a.r + 3
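The sketch below, with two made-up classes, shows that a compound assignment like += calls the reader and then the assignment method, and that it fails if the reader is missing.

```ruby
class Counter
  def initialize
    @count = 0
    @writes = 0
  end

  def count
    @count
  end

  def count=(n)
    @writes += 1       # record each call to the assignment method
    @count = n
  end

  def writes
    @writes
  end
end

c = Counter.new
c.count += 3           # calls count, then count=
c.count += 2
# c.count is now 5, and count= was called twice

class WriteOnly
  def value=(n)        # assignment method only, no reader
    @value = n
  end
end

w = WriteOnly.new
# w.value += 1 would raise NoMethodError, because there is no value method
```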

"Attributes": syntactic sugar for accessors

Ruby has syntactic sugars for defining accessor methods:

class Complex
    attr_accessor :r, :i   # Automatically defines r, r=, i, i=
end

To define read-only or write-only attributes, use attr_reader or attr_writer instead:

class Odd
    def initialize(a, b)
        @a = a
        @b = b
    end
    attr_reader :a   # defines only method a
    attr_writer :b   # defines only method b=
end
anOdd = Odd.new('a', 'b')
anOdd.a = 'foo'     # illegal
x = anOdd.b         # illegal
anOdd.b = anOdd.a   # ok

Mixins

All classes have a single superclass, but Ruby has "mixins", which allow you to copy a module's methods into the current class by using include. (A module is declared like a class, but it cannot be instantiated and has no superclass.) For example, the module Comparable is defined roughly as follows:

module Comparable
    def <(other)
        (self <=> other) == -1
    end
    def <=(other)
        (self <=> other) == -1 \
        or (self <=> other) == 0
    end
    ... # in the same vein, but does not include <=>
end

You can implement the entire Comparable interface simply by defining <=> and "mixing in" the interface of Comparable by using the include construct:

class Complex
    def <=>(other)
        mySize    = Math.sqrt(self.r * self.r + self.i * self.i)
        otherSize = Math.sqrt(other.r * other.r + other.i * other.i)
        return mySize <=> otherSize
    end
    include Comparable
end

c1 = Complex.new(0, 1)
c2 = Complex.new(1, 2)
less = c1 < c2
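Here is a self-contained sketch of the same pattern with a made-up class Version, which defines only <=> and gets the rest from Comparable:

```ruby
class Version
  include Comparable           # mixes in <, <=, ==, >, >=, between?

  attr_reader :major, :minor

  def initialize(major, minor)
    @major = major
    @minor = minor
  end

  # The only comparison we write ourselves; arrays compare element by element.
  def <=>(other)
    [major, minor] <=> [other.major, other.minor]
  end
end

older = Version.new(1, 6)
newer = Version.new(1, 8)

older < newer                  # true
newer.between?(older, newer)   # true
[newer, older].min             # older (Array#min also uses <=>)
```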

String interpolation

Expressions can be substituted into double-quoted strings by using the #{expr} syntax:

x = 'something'
y = "#{x} interpolated"   # evaluates to "something interpolated"

Notice that this is highly unusual by most programming language standards --- the value of a literal depends on the values of expressions in its context. One way to think about this is that a double-quoted string is syntactic sugar for a call to sprintf (Ruby's formatted string function, which borrows its syntax from C's sprintf):

# An alternative way of obtaining the same effect
y = sprintf('%s interpolated', x)
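A short sketch comparing interpolation with the formatting alternatives; the variable names are arbitrary:

```ruby
x = 'something'

interpolated = "#{x} interpolated"            # "something interpolated"
formatted    = format('%s interpolated', x)   # format is an alias of sprintf
with_percent = '%s interpolated' % x          # String#% does the same job

# Single-quoted strings are not interpolated: the #{x} stays as written.
single = '#{x} interpolated'

# Any expression may appear inside the braces.
sum = "1 + 2 = #{1 + 2}"                      # "1 + 2 = 3"
```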

Regular expression pattern matching

We mentioned regular expressions when discussing language grammar definitions. Many programming activities involve manipulating text, and regular expressions are useful for text processing applications in general, not just programming language lexers.

Regular expressions as a programming tool were probably pioneered by the Unix utilities grep, sed, and awk. Perl added many extensions to regular expressions, and syntactic sugar for writing regular expressions and matching strings against them. Ruby adopted many of Perl's syntactic conventions, including regular expression literals and a regular expression matching operator.

In Ruby, a regular expression literal is either written between two forward slashes, or enclosed in %r{ and }:

# Equivalent ways of constructing a regular expression object
r1 = /[a-z]+.*/
r2 = %r{[a-z]+.*}

Both of these are syntactic sugar for passing a string to the Regexp.new class method:

r3 = Regexp.new('[a-z]+.*')

To match a regular expression against a value, you call the match method, which returns an instance of the MatchData class if the match succeeds:

aMatch = /h.*,.*!/.match('I say, "hello, world!"')

By default, a regular expression searches for a match anywhere in the string, so the above match will succeed. Failed matches return nil.

Regular expressions in Ruby have many features beyond the automata-theoretic formulation of regular expressions. One important feature is parenthesized submatching: any part of the regular expression enclosed in parentheses will be "saved" for later extraction via the MatchData object. The ith parenthesized subexpression can be extracted from the object using the array subscript operator []. For example:

aMatch = /(h.*),(.*)!/.match('I say, "hello, world!"')
puts(aMatch[1]) # prints 'hello'
puts(aMatch[2]) # prints ' world' (note the leading space)

Index zero produces the whole match:

puts(aMatch[0])   # prints 'hello, world!'

MatchData objects understand various other messages, including:
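As a sketch, here are a few messages that MatchData understands in current Ruby (my own selection, applied to the match from the example above):

```ruby
m = /(h.*),(.*)!/.match('I say, "hello, world!"')

m.pre_match    # 'I say, "'  (text before the match)
m.post_match   # '"'         (text after the match)
m.begin(0)     # 8           (index where the whole match starts)
m.captures     # ["hello", " world"]
m.to_a         # ["hello, world!", "hello", " world"]

# A failed match returns nil rather than a MatchData object.
no_match = /xyz/.match('abc')
```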

Regexp examples and miscellaneous features

By default, regular expressions match anywhere in a string. ^ indicates that an expression should match only at the beginning of a line, and $ means the expression should match only at the end of a line. The following expression matches any line with at least one equals sign; it saves two submatches: (1) everything from the beginning of a line up to the last equals sign, and (2) everything else:

^(.*)=(.*)$

For example:

m = /^(.*)=(.*)$/.match("FOOBAR='baz bif'")
m[1]    # equals "FOOBAR"
m[2]    # equals "'baz bif'"
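As a sketch of how this expression might be used, the made-up helper below collects NAME=value lines into a hash and skips lines without an equals sign. The third input line shows the effect of the greedy first group.

```ruby
# Hypothetical helper: parse lines of the form NAME=value into a hash.
def parse_settings(text)
  settings = {}
  text.each_line do |line|
    m = /^(.*)=(.*)$/.match(line)
    settings[m[1]] = m[2] if m     # m is nil when the line has no "="
  end
  settings
end

parsed = parse_settings("FOOBAR='baz bif'\n# a comment\nPATH=a=b\n")
# parsed is {"FOOBAR" => "'baz bif'", "PATH=a" => "b"}
# The first group is greedy, so it runs up to the LAST equals sign.
```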

Recall that regular expressions often support character classes, like [A-Za-z], which match all listed characters. There are some predefined character classes:

For example:

# one or more word characters at the beginning of a line, followed
# by a space, followed by one or more digits, saving the word and digits
re1 = /^(\w+)\s(\d+)/

# skip any initial whitespace, then match a run of non-whitespace
# characters that extends to the end of a line, and save that run
re2 = /^\s*(\S*)$/

Regular expressions can also be used for string substitution and splitting:

# evaluates to "goodbye, world"
'hello, world'.sub(/^\w*/, 'goodbye')

'banana'.sub(/a/, 'o')   # evaluates to "bonana"
'banana'.gsub(/a/, 'o')  # evaluates to "bonono"

# evaluates to array ['apples', 'bananas', 'pears']
'apples,bananas,pears'.split(/,/)

The difference between sub and gsub is that the latter substitutes all matches, whereas the former substitutes only the first match.

The legacy of Perl: the match operator =~

There is an alternative way to do regexp matching: string values support the infix message =~, which takes a regular expression object and attempts to match the regular expression against the string:

aString = 'I say, "hello, world!"'
matchPos = aString =~ /.*,.*!/

Unlike the match message, the =~ message returns an integer indicating the index at which the match begins (i.e., the equivalent of begin on the MatchData object). The primary way that =~ communicates with its caller is by implicitly setting a large number of variables. These are:

In my opinion, this interface is clearly as ugly as sin. For one thing, the variable names are absurdly cryptic --- why not at least choose a name like $lastMatch instead of $~? For another, these variables get set whenever you do a regexp match. It is all-too-easy to accidentally lose these values by matching another regular expression. Third, even if you don't make a mistake, code that relies on implicit side effects to hidden variables is not "self-documenting" --- the flow of values is not immediately obvious to the programmer.
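A small sketch of the hazard, using $~ (the last MatchData) and $1 (the first submatch of the last match), followed by the explicit alternative:

```ruby
pos = 'key=value' =~ /(\w+)=(\w+)/   # returns 0, and sets $~, $1, $2 behind the scenes
first_key = $1                       # "key"
whole     = $~[0]                    # "key=value"

# Any later match silently replaces those values.
'unrelated' =~ /(un)/
clobbered = $1                       # now "un", no longer "key"

# With match, the results live in an ordinary object that we name ourselves.
m = /(\w+)=(\w+)/.match('key=value')
key, value = m[1], m[2]              # "key", "value"
```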

What is this interface doing in a nice OO language? The answer is that Ruby has another parent --- Perl --- and this is roughly the way Perl regular expressions work.

Avoid using the match operator whenever possible. I discuss it here mostly in order to note in passing that the entire Perl language is designed this way.

Ruby: Lessons

There isn't much that's truly novel in Ruby. All the major language design ideas were present in Smalltalk-80, decades earlier. In fact, Smalltalk is arguably an improvement upon its successor, to paraphrase what Tony Hoare said of Algol-60.

But Ruby is a good example of how someone with taste can succeed by combining great ideas of the past with several good engineering decisions:

  1. standard libraries (regexp matching, process control, etc.) which had previously proven useful in the target application domain
  2. a syntax more familiar to the average late-1990's programmer.*
  3. a programming system that looks and acts more like systems that were previously used by other languages (bash and Perl) that were clearly inferior, yet successful.

The result was a language that succeeded in becoming popular and making programmers' lives better. It's extraordinarily difficult to design a really novel programming language, and even harder to get the world to adopt what you invent. But there's lots of room in the world for little languages like Ruby, which take great ideas and make useful tools out of them.

* We shouldn't fault the Smalltalk designers for not using a syntax that feels more familiar to late-1990's programmers. In the mid-1970's, when Smalltalk's syntax was designed, "C style" syntax (curly braces etc.) was not firmly entrenched. If there was any "standard" syntax back then, it was probably the "Algol style" used by Algol, Pascal, and Simula (and later Modula and Ada). Even if Smalltalk's designers had wanted to use the most popular style of syntax at the time, it would not have been C-like. This is an example of how the fates of languages depend heavily on the cultural context in which they appear.