Ruby is sometimes called a scripting language. The term "scripting language" is often bandied about, but I have never seen or heard anybody describe a precise technical distinction between scripting languages and other languages. (Indeed, Dave Thomas and Andy Hunt, authors of the first major English-language Ruby textbook, seem to agree that it's not a very useful term.)
The thing that seems to lead people to call a language a "scripting language" is the environment in which the major implementation(s) of that language run. Generally, a "scripting language" is one that
The idea is that the language is meant to be used for writing little "scripts" that manipulate its environment. For example, Unix shell scripting languages are designed for controlling the Unix filesystem, commands, and processes. Other large applications --- e.g., office suites, multimedia authoring tools, etc. --- often contain an embedded scripting language for automating small tasks in the application.
Because scripts are usually small, scripting languages tend to be dynamically typed --- maintaining type declarations and annotations is (allegedly) a waste of time for such small programs.
It is sometimes claimed that scripting languages should also be impoverished --- i.e., that you can get away without having powerful but "complicated" constructs like objects, lambdas, and a module system. This claim does not hold up in the face of the evidence: every scripting language that has been around for several years has accumulated many ad hoc constructs to do the jobs of objects, lambdas, etc., and usually has ended up being more complicated than Smalltalk or Scheme in the end.
You should know by now that simplicity and power are not mutually exclusive design goals. Making your language impoverished in the name of simplicity is foolish economy.
When you log into a Unix prompt, you're actually using a program called a shell --- on modern Unices, most often bash, the "Bourne Again SHell".
On modern operating systems (including Unix and Windows), shells actually contain a programming language --- you can set variables called "environment variables", write conditional statements and loops, read input, write output, and modify the filesystem.
Much like a read-eval-print loop, a Unix shell generally reads and executes one statement at a time. For example, when you type, at a Unix shell:
emacs myfile.txt
you are actually feeding a line of input (called a command line) to the shell. The shell parses the line into tokens, finds the program named emacs and sends it a single command line argument, the string myfile.txt.
You can also instruct a Unix shell to read and execute an entire file --- see the section below on running scripts in Unix.
The language of Unix shells manages to be weak, but not simple: it is simultaneously impoverished and baroque, filled with arbitrary quirks and limitations that the user must learn. For example:
VARIABLENAME="world"
However, when you want to use a variable value then you must prefix its name with $:
OTHERVAR="Hello, $VARIABLENAME"
$@.
The command line argument number n is stored in a variable named $n, e.g. $1 holds the first command line argument. There are many implicitly defined variables that shell programmers must remember to get work done --- almost any punctuation mark preceded by a dollar sign is a special variable in some shell.
Given this heinous precedent, it is perhaps not surprising that Perl was welcomed with open arms by Unix shell hackers. Perl is a programming language that's better than shell, but provides many of the same features.
And, indeed, Perl is better than shell, but that is not a very high standard. There is no inherent reason why a programming language for manipulating Unix programs, files, and processes should be even remotely as awful as the shell. Partly in reaction to Perl, people have developed various alternative scripting languages for Unix. Two of the more popular are Python (designed by Guido van Rossum) and Ruby (designed by Matz) --- both are cleaner than Perl, and are gaining in popularity.
In Unix, any text file can be an executable program. When you invoke a program from the command line, Unix does the following:
echo $PATH in the Unix shell to get the current path).
#! ("shebang"), then the file is assumed to be a text file containing input for some program (e.g., a shell, or a programming language interpreter). In this case, the remainder of the first line is searched for the name of the program to run. When this program is found, it is executed with the current file as its input.
Step 3 allows you to write scripts using any language of your choice (or, at least, any language for which having #! on the first line of input does not break the program). For example, here's a simple bash script that creates a new directory and places an empty README file inside it:
#!/bin/bash
mkdir $1
cd $1
touch README
If you save this file as myscript.sh, give it execute permissions using chmod +x myscript.sh, and run it using ./myscript.sh dirName, then it will execute these commands in bash with $1 bound to dirName --- i.e., it will create a directory dirName, enter that directory, and create ("touch") an empty file named README.
Likewise, if you want to write a Unix script using Ruby, you can create a file called hello.rb containing the following text:
#!/usr/bin/ruby
puts("Hello, world!\n")
As before, give the file executable permissions using chmod +x, and do ./hello.rb. Your terminal should print Hello, world!.
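To connect the two scripts above, here is a minimal sketch of what myscript.sh might look like in Ruby. The helper name make_project_dir and the file name mkproject.rb are my own inventions for illustration; ARGV plays the role that $1, $2, ... play in the shell.

```ruby
#!/usr/bin/env ruby
# Hypothetical Ruby counterpart of myscript.sh: create a directory
# and put an empty README file inside it.
require 'fileutils'

# make_project_dir is an illustrative name, not something from the notes.
def make_project_dir(dir_name)
  Dir.mkdir(dir_name)                     # like: mkdir $1
  readme = File.join(dir_name, 'README')
  FileUtils.touch(readme)                 # like: cd $1; touch README
  readme                                  # return the path of the new file
end

# Act only when run as a script with an argument, e.g. ./mkproject.rb dirName
make_project_dir(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```

Unlike the bash version, this sketch never changes the current directory; it builds the path to README explicitly.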
We noted before that most shell languages don't provide any way to define procedures. Ruby, and all real scripting languages, do provide procedures and other abstractions, but it's worth taking a few moments to explore what it's like to program in a world where you invoke external programs to instantiate reusable tasks. (This discussion will use Unix as an example, but most modern operating systems have similar features.)
In the world of "normal" programming, you invoke procedures with arguments, and you obtain results. When a procedure's running, it has an activation; you can view a procedure activation as a black box, with inputs, outputs, and access to shared spaces, as in Fig. 1:
The inputs to a procedure are its parameter values. When the procedure returns, it can send output through its return value, or through its output parameters (e.g., in ML, a function with ref cell parameters can update their contents). While it's running, the activation can also use the heap and globals, which are shared by all activations; these spaces can be used for both input and output, as well as scratch space.
A procedure activation also has internal state --- local variables --- which can be used while it is evaluating, but which are inaccessible to the outside world.
In a quite analogous manner, when you run a program in Unix, the operating system "activates" the program and creates a process, with its own memory space, its own inputs and outputs, and access to shared spaces. See Fig. 2.
A process's communication channels include:
standard input (in Java, System.in)
standard output and standard error (in Java, System.out and System.err)
the exit code (in Java, set via System.exit()).
In this universe, the shared spaces are the entire host environment. Most "normal" programs are meant to "do something to the world", and therefore we usually think of processes as "communicating" by changing things in these shared spaces. For example, when you run emacs, you're usually trying to create or update some file in the filesystem. However, you can use all of a program's inputs and outputs for communication.
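As a rough illustration of a process as a black box, the sketch below starts a second Ruby interpreter and talks to it through its command line argument, an environment variable, its standard streams, and its exit code. The child program, the GREETING variable, and the helper name run_child are all made up for this example; Open3.capture3 comes from Ruby's standard library.

```ruby
require 'open3'
require 'rbconfig'

# A tiny child program, handed to a second Ruby interpreter with -e.
# Everything in it is invented for this illustration.
CHILD_SOURCE = <<~'RUBY'
  name = ARGV[0]                                # command line argument
  greeting = ENV['GREETING']                    # environment variable
  line = STDIN.gets.chomp                       # standard input
  STDOUT.puts("#{greeting}, #{name}: #{line}")  # standard output
  STDERR.puts('this is a diagnostic')           # standard error
  exit(3)                                       # exit code
RUBY

# Runs the child and returns [stdout text, stderr text, Process::Status].
def run_child
  Open3.capture3({ 'GREETING' => 'Hello' },
                 RbConfig.ruby, '-e', CHILD_SOURCE, 'world',
                 stdin_data: "some input\n")
end

out, err, status = run_child
puts "stdout: #{out}"
puts "stderr: #{err}"
puts "exit code: #{status.exitstatus}"
```

Nothing here touches the filesystem: the parent and child communicate purely through the process's own inputs and outputs.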
An important difference between process activations and most languages' procedure activations is that processes run concurrently --- at any given time, there are many active processes in a system, all doing something. By contrast, in a standard single-threaded programming language, potentially many procedure activations are on the stack, but only the topmost one is "currently executing".
One interesting thing you can do because of this concurrency is pipe results of one program's standard output to another program's standard input. In Unix shell languages, the pipe operator is |. For example, the following command runs ls ("list files"), and passes its output to grep ("global regular expression print"):
ls | grep 'a*.txt'
This will print all the listed files whose names match the regular expression a*.txt. In terms of our process diagram, we have set up a pipeline like the one in Fig. 3.
Notice that the standard output of the ls activation is fed into grep's standard input. grep and ls actually execute concurrently --- ls will execute for a while and produce a "chunk" of output, which will be fed to grep, which will process that "chunk" and then wait for ls to feed it more. Eventually, ls will stop running; grep will stop after finishing the last chunk that ls sends to it.
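The producer/consumer behavior of a pipeline can be imitated inside a single Ruby program. This is only a sketch: it uses IO.pipe and two threads in place of two real processes, and FILE_NAMES and filter_through_pipe are hypothetical names with made-up sample data.

```ruby
# Producer/consumer over an in-process pipe, mimicking `ls | grep`.
# FILE_NAMES is made-up sample data standing in for the output of ls.
FILE_NAMES = %w[notes.txt a.txt photo.png aa.txt readme.md].freeze

def filter_through_pipe(names, pattern)
  reader, writer = IO.pipe

  producer = Thread.new do            # plays the role of ls
    names.each { |name| writer.puts(name) }
    writer.close                      # tells the reader there is no more input
  end

  matches = []
  consumer = Thread.new do            # plays the role of grep
    reader.each_line do |line|        # waits whenever the pipe is empty
      matches << line.chomp if line =~ pattern
    end
    reader.close
  end

  [producer, consumer].each(&:join)
  matches
end

puts filter_through_pipe(FILE_NAMES, /a*.txt/)
```

Because a*.txt is a regular expression rather than a filename wildcard, notes.txt matches as well as a.txt and aa.txt: the pattern only requires some character followed by txt.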
Unix shells' process control supports many sophisticated redirection patterns, of which this is only a relatively simple example. Some common/useful operations:
command1 | command2: as in the example above, this operator (called "pipe") sends the standard output of command1 to the standard input of command2.
command > filename redirects standard output to a file, e.g. ls > dirListing.txt.
command < filename runs command with filename's contents sent to standard input.
`command` (command in backquotes) runs command and captures its standard output as a string value, e.g.
LS_OUTPUT=`ls *.sml`
captures the directory listing as a string and assigns it to the shell variable LS_OUTPUT. You can then do things like loop over the words in the variable using for:
for file in $LS_OUTPUT
do
    echo "a file: $file"
done
system and related calls (popen, fork, exec) in the C and Unix standard libraries. C's system takes a string argument that names the command, e.g.:
int resultCode = system("ls");
OS.Process structure provides a system function that works like C's system:
structure OS.Process : OS_PROCESS = struct .. end
signature OS_PROCESS = sig
  eqtype status
  val success : status
  val failure : status
  val system : string -> status
  ...
end
There is also a Unix structure that allows more detailed control over processes under Unix-compatible operating systems.
Runtime provides exec method, which starts a process and returns a handle of type Process for manipulating that running process (for example, it allows you to send text to the running process's standard input stream). To get an instance of Runtime, use the static method Runtime.getRuntime().
class Runtime {
    static Runtime getRuntime() { ... }
    Process exec(String command) { ... }
}
x = `command arg1 ... argN`
It also has a system call, which resembles the C system call, and IO.popen, which returns a representation of the process similar to Java's Process handle:
system("command arg1 ... argN")
IO.popen("programName", mode)
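Here is a small sketch of system and IO.popen in use. To keep it independent of which Unix commands happen to be installed, the child program is simply another copy of the Ruby interpreter (located via RbConfig.ruby); the helper names exit_status_of and shout_via_child are hypothetical.

```ruby
require 'rbconfig'

CHILD_RUBY = RbConfig.ruby   # path of the running interpreter, used as the child program

# system runs a command and waits for it; the exit code is then in $?.
def exit_status_of(code)
  system(CHILD_RUBY, '-e', "exit #{code}")
  $?.exitstatus
end

# IO.popen starts the command and yields an IO object connected to it.
def shout_via_child(text)
  IO.popen([CHILD_RUBY, '-e', 'print STDIN.read.upcase'], 'r+') do |child|
    child.write(text)        # goes to the child's standard input
    child.close_write        # the child now sees end-of-file
    child.read               # collect the child's standard output
  end
end

puts exit_status_of(0)
puts exit_status_of(2)
puts shout_via_child('hello')
```

Passing the command as separate strings (or as an array, for IO.popen) bypasses the shell, so no quoting of arguments is needed.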
Ruby is a purely object-oriented language descended primarily from Smalltalk (for its semantics) and Perl (for its syntax, libraries, and programming environment). Ruby was originally designed by Yukihiro Matsumoto (a.k.a. "Matz"), in the mid-1990's. Ruby 1.0 was released in the late 1990's. By release 1.6 (ca. 2000) the language had become popular in Japan, but it was used relatively rarely outside of Japan. However, Ruby has been steadily gaining in popularity worldwide as an alternative to uglier languages like Perl.
Many Ruby distributions ship with irb, "interactive Ruby" (in contrast with the usual batch-mode ruby executable, which simply processes an entire file at once). irb provides an interactive read-eval-print loop.
If your distribution does not come with irb, you may be able to find a file named eval.rb, which provides a read-eval-print loop with fewer features. On Linux systems, try locate eval.rb to find it; on our attu server, it's currently at /usr/share/doc/ruby-1.6.8/sample/eval.rb. Then, to run this file, I type
ruby /usr/share/doc/ruby-1.6.8/sample/eval.rb
at the command line.
Ruby is a pure object-oriented language in which all expressions (except assignment, as in Smalltalk) are message sends. Therefore, its core expression evaluation semantics are quite similar to Smalltalk's, but its syntax differs quite a bit...
"hi"                        # double quoted string literal
'hi\n'                      # single quoted string literal; equiv. to "hi\\n"
%q{'hi,' I said}            # %q-quoted string literal; note uninterpreted quote
:hi                         # a symbol
24_000                      # a small number (instance of Fixnum)
10_000_000_000              # a large number (instance of Bignum)
-1                          # prefix unary send
'hello,' + ' world'         # infix send
'hello, world'.slice(0, 5)  # keyword send: returns 'hello'
3.to_s                      # no-argument keyword send: returns string '3'
[3, 4, 17, -2]              # literal array
[3, 4, 17, -1].sort         # keyword send to literal array
x = 'hi'                    # assignment; returns the assigned value
a = b = c = x               # chained assignments (assignments are expressions)

# conditional expression
i = if x == 'hi' then 3 else 7 end

# block-based control structure
3.times { puts("hi!") }

# the upto message of integers takes a block parameter
1.upto(10) { |x| puts(x.to_s + "!") }

# the each message on arrays --- takes a block
['hi','hola','bonjour'].each { |x| puts(x + ', world') }
A subset of the Ruby grammar:
expr            ::= atom | binding | prefixSend | infixSend | keywordSend | syntacticSugar | ( expr )
atom            ::= ID | @ID | @@ID | $ID | literal | firstClassBlock
literal         ::= INTEGER | STRING | REGEXP | ...
firstClassBlock ::= lambda plainBlock | proc plainBlock
plainBlock      ::= { [| ID* |] stmt* }
                  | do [| ID* |] stmt* end
binding         ::= name = expr
infixSend       ::= expr OPERATOR expr
keywordSend     ::= expr.ID [( expr*\, )] [plainBlock]
syntacticSugar  ::= ifExpr | caseExpr | ...
The chief syntactic differences (which are not all reflected in the BNF above; many of the differences are in terminals such as STRING) are:
lambda and proc. Blocks of either flavor can be delimited either by curly braces or by do/end. So, these all mean the same thing:
lambda { |x| x + 1 }
proc { |x| x + 1 }
lambda do |x| x + 1 end
proc do |x| x + 1 end
All of these can be called by sending the call or square bracket [] messages:
(lambda {|x| x + 1}).call(2)   # evaluates to 3
(proc do |x| x + 1 end)[2]     # same as above
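Since these blocks are first-class values, they can be returned from methods and passed around like any other object. A minimal sketch follows; make_adder is a hypothetical helper name.

```ruby
# make_adder is a hypothetical helper: it returns a block that remembers n.
def make_adder(n)
  lambda { |x| x + n }
end

add_one = make_adder(1)
puts add_one.call(2)                   # the call message
puts add_one[2]                        # the square bracket message, same result
puts [1, 2, 3].map(&add_one).inspect   # & passes the lambda where a plain block is expected
```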
receiver.methodName(args). Keyword sends may have no arguments, in which case the argument list and parens are optional.
lambda or proc keyword. Such blocks implement iterators in the style of CLU. The iterator protocol is somewhat complex, and we won't discuss it further.
-x instead of x negated.
name is an "ordinary" variable (local or parameter name).
@name is an instance variable.
@@name is a class variable.
$name is a global variable.
\ character at the end of the line that you wish to continue.
if, case, and while. These structures are implemented using blocks, as in Smalltalk, but the programmer does not have to use the block syntax when writing them.
The syntax of class declarations is as follows:
class ClassName < SuperclassName
  body
end
As usual, class bodies include methods, whose syntax is as follows:
def methodName
  body
end
For example:
class Person < Object
  def greeting
    "Hello, world!"
  end
end
The < SuperclassName clause is optional when the superclass is Object. Receiver instance variables are named inside a method by a preceding @ sign:
class Person
  def greet(personName)
    "Hello, " + personName + ", my name is " + @name
  end
end
Notice that I re-opened the Person declaration in the above example. Unlike classes in many object-oriented languages, Ruby's classes are open: you can add members to any class anywhere in the source text simply by reopening the class declaration. Later definitions override earlier ones. Existing instances will be updated to reflect any additional code.
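A short sketch of what "open" means in practice: an object created before the class is reopened still picks up the methods added afterwards. The farewell method is made up for this example.

```ruby
class Person
  def greeting
    'Hello, world!'
  end
end

someone = Person.new          # created before the class is reopened

class Person                  # reopening adds to the existing class
  def farewell                # farewell is a made-up method for this sketch
    'Goodbye, world!'
  end

  def greeting                # a later definition replaces the earlier one
    'Hello again, world!'
  end
end

puts someone.greeting         # uses the newer definition
puts someone.farewell         # the existing instance sees the new method
```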
In Squeak's Morphic library, the base class Morph overrides new to call initialize on freshly created instances before returning them. In Ruby, this is a language-wide convention. In fact, Ruby's default new class method takes a variable number of arguments (Ruby has Scheme-style varargs) and it passes all its arguments on to initialize. Therefore, you can define "constructor"-like code by writing an instance method initialize:
class Person
  def initialize(name)
    @name = name
  end
end
Notice that the class declaration does not contain any declarations for instance variables (unlike in Smalltalk and most other object-oriented languages). You get an instance variable whenever you assign to an @ variable --- if the variable already has a value, it is updated, and if it does not then a new variable is created.
Similarly, any reference to an undefined instance variable will simply evaluate to nil:
class Person
  def accessUndefinedVariable
    @someVariableNameWeHaveNotDefined
  end
end
(Person.new).accessUndefinedVariable   # Evaluates to nil
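The following sketch puts the last few points together: new forwards its argument to initialize, instance variables come into existence on first assignment, and reading one that was never assigned gives nil. The nickname methods are hypothetical additions; instance_variables is a standard Object method that lists the variables an object currently has.

```ruby
class Person
  def initialize(name)
    @name = name                 # creates @name on this instance
  end

  def nickname
    @nickname                    # nil until something assigns it
  end

  def adopt_nickname(value)      # hypothetical method for this sketch
    @nickname = value            # creates @nickname on first assignment
  end
end

pat = Person.new('Pat')          # new passes 'Pat' on to initialize
p pat.instance_variables         # only @name exists so far
p pat.nickname
pat.adopt_nickname('P')
p pat.instance_variables         # now @nickname exists too
```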
Recall that in Scheme, we could define variable-argument functions by using . (dot):
(lambda (first second . rest) ...)
In Ruby, we can define variable-argument lambdas or methods by using *:
lambda { |first, second, *rest| ... }   # Vararg anon function
def f(first, second, *rest)             # Vararg method
  ...
end
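A runnable sketch of both forms; split_args and SUM_ALL are illustrative names of my own. The extra arguments arrive as an ordinary array.

```ruby
# split_args is a made-up name: it shows how the arguments are divided up.
def split_args(first, second, *rest)
  [first, second, rest]          # rest is an Array of any extra arguments
end

# A vararg lambda that adds up however many numbers it is given.
SUM_ALL = lambda { |*numbers| numbers.inject(0) { |sum, n| sum + n } }

p split_args(1, 2)               # rest is empty
p split_args(1, 2, 3, 4)         # rest holds the extras
p SUM_ALL.call(1, 2, 3)
p SUM_ALL.call
```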
Ruby has (a primitive form of) dynamically typed pattern matching for assignments:
arr = [1, 2, 3, 4, 5]
a, b, c = arr    # executes a=1, b=2, c=3
d, e = a, b      # executes d=a, e=b
Notice that there is no data constructor around the comma-separated names on the left-hand side, and that the pattern need not have the same number of parts as the value.
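To see what happens when the two sides differ in length, here is a small sketch; the method names are hypothetical. Extra values are dropped, missing values become nil, and a * on the left collects whatever remains.

```ruby
def first_three(values)
  a, b, c = values        # extra elements are ignored; missing ones become nil
  [a, b, c]
end

def head_and_tail(values)
  head, *tail = values    # * on the left-hand side collects the remainder
  [head, tail]
end

def swap(x, y)
  x, y = y, x             # both right-hand values are read before assigning
  [x, y]
end

p first_three([1, 2, 3, 4, 5])
p first_three([1, 2])
p head_and_tail([1, 2, 3])
p swap(:left, :right)
```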
To define a unary prefix message, you write a method name with op@, where op is the prefix operator you want to define. However, Ruby only (syntactically) allows a handful of prefix message names, including + and -:
class Complex
  def initialize(r, i)
    @r = r
    @i = i
  end
  def -@
    Complex.new(-@r, -@i)
  end
end
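A runnable variant of the same idea follows. Current Ruby versions ship a built-in Complex class, so this sketch uses the hypothetical name MyComplex to avoid clashing with it; the parts method is also my own addition, there only so the result can be inspected.

```ruby
class MyComplex
  def initialize(r, i)
    @r = r
    @i = i
  end

  def -@                        # invoked by the prefix expression -z
    MyComplex.new(-@r, -@i)
  end

  def +@                        # invoked by the prefix expression +z
    self
  end

  def parts                     # hypothetical helper, for inspection only
    [@r, @i]
  end
end

z = MyComplex.new(1, -2)
p((-z).parts)
p((+z).parts)
```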
As in Smalltalk, an object's instance variables are not accessible outside the methods of the object's class --- simply because there is no such thing as a field access expression. In Smalltalk, you defined accessor methods, e.g., for Point:
x
    ^ x
x: aNumber
    x := aNumber
Ruby doesn't allow field access either, but Ruby's designer was fond of the expr.fieldName = expr' syntax found in languages that do permit direct field access. Ruby provides assignment methods, which are method names that begin with a name or operator and end in =:
class Complex
  def r=(aNumber)
    @r = aNumber
  end
  def i=(aNumber)
    @i = aNumber
  end
  def []=(anIndex, aNumber)
    if anIndex == 0 then
      @r = aNumber
    else
      @i = aNumber
    end
  end
end
a = Complex.new(1, 2)
a.r = 3
a[1] = 4
Assignment methods also work with Ruby's compound assignment operators. There is no separate r+= method; instead, Ruby expands a.r += 3 into a send of r followed by a send of r=, so the class must define both the reader r and the writer r=. So, for example:
a = Complex.new(1, 2)
a.r += 3   # equivalent to a.r = a.r + 3
Ruby has syntactic sugars for defining accessor methods:
class Complex
  attr_accessor :r, :i   # Automatically defines r, r=, i, i=
end
To define read-only or write-only attributes, use attr_reader or attr_writer instead:
class Odd
  def initialize(a, b)
    @a = a
    @b = b
  end
  attr_reader :a   # defines only method a
  attr_writer :b   # defines only method b=
end
anOdd = Odd.new('a', 'b')
anOdd.a = 'foo'     # illegal
x = anOdd.b         # illegal
anOdd.b = anOdd.a   # ok
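"Illegal" here means a runtime error, since Ruby is dynamically typed: sending a message the object does not understand raises NoMethodError. The sketch below demonstrates this; the attempt helper is a hypothetical wrapper that turns the error into a symbol so the script keeps running.

```ruby
class Odd
  attr_reader :a                 # defines only method a
  attr_writer :b                 # defines only method b=

  def initialize(a, b)
    @a = a
    @b = b
  end
end

# attempt is a made-up helper: :ok if the block runs, :no_method otherwise.
def attempt
  yield
  :ok
rescue NoMethodError
  :no_method
end

an_odd = Odd.new('a', 'b')
p attempt { an_odd.a = 'foo' }       # there is no a= method
p attempt { an_odd.b }               # there is no b method
p attempt { an_odd.b = an_odd.a }    # reader a and writer b= both exist
```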
All classes have a single superclass, but Ruby has "mixins", which allow you to add the methods of a module to the current class by using include. For example, the module Comparable is defined, in effect, as follows:
module Comparable
  def <(other)
    (self <=> other) == -1
  end
  def <=(other)
    (self <=> other) == -1 \
      or (self <=> other) == 0
  end
  ... # in the same vein, but does not include <=>
end
You can implement the entire Comparable interface simply by defining <=> and "mixing in" the interface of Comparable by using the include construct:
class Complex
  def <=>(other)
    mySize = Math.sqrt(self.r * self.r + self.i * self.i)
    otherSize = Math.sqrt(other.r * other.r + other.i * other.i)
    return mySize <=> otherSize
  end
  include Comparable
end
c1 = Complex.new(0, 1)
c2 = Complex.new(1, 2)
less = c1 < c2
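Here is a self-contained sketch of the same pattern on a hypothetical Version class (not part of these notes). Defining the single method <=> is enough for Comparable to supply <, <=, ==, >, >= and between?.

```ruby
# Version is a made-up class used only to illustrate the mixin.
class Version
  include Comparable

  attr_reader :major, :minor

  def initialize(major, minor)
    @major = major
    @minor = minor
  end

  # The one method Comparable needs; arrays compare element by element.
  def <=>(other)
    [major, minor] <=> [other.major, other.minor]
  end
end

old = Version.new(1, 2)
new = Version.new(1, 10)
puts old < new
puts old == Version.new(1, 2)
puts Version.new(1, 5).between?(old, new)
puts [new, old].min.minor
```

Because sort, min and max are also built on <=>, collections of such objects can be ordered without any further code.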
Expressions can be substituted into double-quoted strings by using the #{expr} syntax:
x = 'something'
y = "#{x} interpolated"   # evaluates to "something interpolated"
Notice that this is highly unusual by most programming language standards --- the value of a literal depends on the values of expressions in its context. One way to think about this is that a double-quoted string is syntactic sugar for a call to sprintf (Ruby's string formatting function, which borrows its syntax from C's sprintf):
# An alternative way of obtaining the same effect
y = sprintf('%s interpolated', x)
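A quick side-by-side sketch, with hypothetical method names, showing interpolation, the equivalent sprintf call, and the fact that single-quoted strings are left alone:

```ruby
def interpolated(x)
  "#{x} interpolated"              # the expression is evaluated and converted with to_s
end

def formatted(x)
  sprintf('%s interpolated', x)    # same result via a format string
end

def literal_text
  '#{x} interpolated'              # single quotes: no substitution takes place
end

puts interpolated('something')
puts formatted('something')
puts interpolated(1 + 2)
puts literal_text
```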
We mentioned regular expressions when discussing language grammar definitions. Many programming activities involve manipulating text, and regular expressions are useful for text processing applications in general, not just programming language lexers.
Regular expressions as a programming tool were probably pioneered by the Unix utilities grep, sed, and awk. Perl added many extensions to regular expressions, and syntactic sugar for writing regular expressions and matching strings against them. Ruby adopted many of Perl's syntactic conventions, including regular expression literals and a regular expression matching operator.
In Ruby, a regular expression literal is either written between two forward slashes, or enclosed in %r{ and }:
# Equivalent ways of constructing a regular expression object
r1 = /[a-z]+.*/
r2 = %r{[a-z]+.*}
Both of these are syntactic sugar for passing a string to the Regexp.new class method:
r3 = Regexp.new('[a-z]+.*')
To match a regular expression against a value, you call the match method, which returns an instance of the MatchData class if the match succeeds:
aMatch = /h.*,.*!/.match('I say, "hello, world!"')
By default, a regular expression searches for a match anywhere in the string, so the above match will succeed. Failed matches return nil.
Regular expressions in Ruby have many features beyond the automata-theoretic formulation of regular expressions. One important feature is parenthesized submatching: any part of the regular expression enclosed in parentheses will be "saved" for later extraction via the MatchData object. The ith parenthesized subexpression can be extracted from the object using the array subscript operator []. For example:
aMatch = /(h.*),(.*)!/.match('I say, "hello, world!"')
puts(aMatch[1])   # prints 'hello'
puts(aMatch[2])   # prints ' world' (with its leading space)
Index zero produces the whole match:
puts(aMatch[0])   # prints 'hello, world!'
MatchData objects understand various other messages, including:
begin(n): returns the index of the beginning of the nth matched substring.
pre_match: the portion of the string preceding the matched substring.
post_match: the portion of the string succeeding the matched substring.
By default, regular expressions match anywhere in a string. ^ indicates that an expression should match only at the beginning of a line, and $ means the expression should match only at the end of a line. The following expression matches any line with at least one equals sign; it saves two submatches: (1) everything from the beginning of a line up to the last equals sign, and (2) everything else:
^(.*)=(.*)$
For example:
m = /^(.*)=(.*)$/.match("FOOBAR='baz bif'")
m[1]   # equals "FOOBAR"
m[2]   # equals "'baz bif'"
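The pieces above can be combined into a small sketch that pulls a line of the form name=value apart. The method names parse_assignment and around are hypothetical; the MatchData messages used ([], begin, pre_match, post_match) are the ones just described.

```ruby
# parse_assignment is a made-up helper built on the ^(.*)=(.*)$ expression.
def parse_assignment(line)
  m = /^(.*)=(.*)$/.match(line)
  return nil if m.nil?                       # failed matches return nil
  { key: m[1], value: m[2], value_starts_at: m.begin(2) }
end

# around is a made-up helper showing pre_match and post_match.
def around(pattern, text)
  m = pattern.match(text)
  m && [m.pre_match, m[0], m.post_match]
end

p parse_assignment("FOOBAR='baz bif'")
p parse_assignment('a=b=c')                  # the first group runs to the last equals sign
p parse_assignment('no equals sign here')
p around(/world/, 'hello, world!')
```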
Recall that regular expressions often support character classes, like [A-Za-z], which match any one of the listed characters. There are some predefined character classes:
\d: digits (0-9)
\D: non-digit characters
\s: whitespace (spaces, tabs, etc.)
\S: non-whitespace characters
\w: word characters (letters, numbers, and underscore)
\W: non-word characters
For example:
# one or more word characters at the beginning of a line, followed
# by a whitespace character, followed by one or more digits, saving
# the word and digits
re1 = /^(\w+)\s(\d+)/
# skip any initial whitespace, then match non-whitespace characters
# up to the end of a line, saving them
re2 = /^\s*(\S*)$/
Regular expressions can also be used for string substitution and splitting:
# evaluates to "goodbye, world"
'hello, world'.sub(/^\w*/, 'goodbye')
'banana'.sub(/a/, 'o')    # evaluates to "bonana"
'banana'.gsub(/a/, 'o')   # evaluates to "bonono"
# evaluates to array ['apples', 'bananas', 'pears']
'apples,bananas,pears'.split(/,/)
The difference between sub and gsub is that the latter substitutes all matches, whereas the former substitutes only the first match.
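A few slightly more realistic uses, as a sketch; the method names and sample strings are invented for illustration, and the patterns rely on the predefined character classes introduced earlier.

```ruby
def squeeze_spaces(text)
  text.gsub(/\s+/, ' ')          # every run of whitespace becomes one space
end

def redact_first_number(text)
  text.sub(/\d+/, '#')           # only the first run of digits is replaced
end

def fields(line)
  line.split(/\s*,\s*/)          # commas, with any surrounding whitespace
end

puts squeeze_spaces("a  b \t c")
puts redact_first_number('room 12, floor 3')
p fields('apples , bananas,pears')
```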
=~
There is an alternative way to do regexp matching: string values support the infix message =~, which takes a regular expression object and attempts to match the regular expression against the string:
aString = 'I say, "hello, world!"'
matchPos = aString =~ /.*,.*!/
Unlike the match message, the =~ message returns an integer indicating the index at which the match begins (i.e., the equivalent of begin on the MatchData object). The primary way that =~ communicates with its caller is by implicitly setting a large number of variables. These are:
$& (yes, dollar-sign ampersand) receives the substring that matched the regular expression.
$` (dollar-sign backquote) receives the substring preceding the matched substring.
$' (dollar-sign regular-quote) receives the substring after the matched substring.
$~ (dollar-sign tilde) receives the MatchData object corresponding to the match.
$1 through $9 receive the first through ninth matched substrings captured using parentheses-matching --- i.e., what you'd get if you subscripted the MatchData object, $~[1] through $~[9].
In my opinion, this interface is clearly as ugly as sin. For one thing, the variable names are absurdly cryptic --- why not at least choose a name like $lastMatch instead of $~? For another, these variables get set whenever you do a regexp match. It is all too easy to accidentally lose these values by matching another regular expression. Third, even if you don't make a mistake, code that relies on implicit side effects to hidden variables is not "self-documenting" --- the flow of values is not immediately obvious to the programmer.
What is this interface doing in a nice OO language? The answer is that Ruby has another parent --- Perl --- and this is roughly the way Perl regular expressions work.
Avoid using the match operator whenever possible. I discuss it here mostly in order to note in passing that the entire Perl language is designed this way.
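To make the danger concrete, here is a sketch with two hypothetical methods that try to extract the same word. The first relies on the implicit variables and loses its result to a later, unrelated match; the second keeps the MatchData object in an ordinary local variable.

```ruby
# Relies on the implicit variables set by =~.
def greeting_with_globals(text)
  text =~ /(h\w*), (\w+)/       # sets $~, $1, $2, ...
  text =~ /say/                 # an unrelated match overwrites them all
  $1                            # nil: the last pattern had no groups
end

# Keeps the result of the match in a local variable instead.
def greeting_with_match_data(text)
  m = /(h\w*), (\w+)/.match(text)
  /say/.match(text)             # cannot disturb m
  m && m[1]
end

text = 'I say, "hello, world!"'
p greeting_with_globals(text)
p greeting_with_match_data(text)
```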
There isn't much that's truly novel in Ruby. All the major language design ideas were present in Smalltalk-80, decades earlier. In fact, Smalltalk is arguably an improvement upon its successor, to paraphrase what Tony Hoare said of Algol-60.
But Ruby is a good example of how someone with taste can succeed by combining great ideas of the past with several good engineering decisions:
The result was a language that succeeded in becoming popular and making programmers' lives better. It's extraordinarily difficult to design a really novel programming language, and even harder to get the world to adopt what you invent. But there's lots of room in the world for little languages like Ruby, which take great ideas and make useful tools out of them.
* We shouldn't fault the Smalltalk designers for not using a syntax that feels more familiar to late-1990's programmers. In the mid-1970's, when Smalltalk's syntax was designed, "C style" syntax (curly braces etc.) was not firmly entrenched. If there was any "standard" syntax back then, it was probably the "Algol style" used by Algol, Pascal, and Simula (and later Modula and Ada). Even if Smalltalk's designers had wanted to use the most popular style of syntax at the time, it would not have been C-like. This is an example of how the fates of languages depend heavily on the cultural context in which they appear.