================================================================
CSE 344 -- Spring 2011
Lecture 11:   XML
================================================================

Note: 
   -- we will cover only subsets of XML, XPath, XQuery.  
   -- this is sufficent to quite advanced use
   -- the details we leave out can be found in the optional readings.


================================================================

Show sample-xml.xml (from http://www.w3.org/TR/xquery-use-cases/)

Key concepts
   -- element
   -- tag:  begin-tag must match end-tag
   -- elements are "nested"
   -- must have a "root" element
   -- empty elements: <red> </red> = <red/>
   -- attributes

Elements v.s. attributes

   -- ordered v.s. unordered
   -- repeated v.s. unique
   -- nested v.s. flat

Text

   -- #PCDATA ("Parsed Character Data") = the text inside elements
   -- CDATA ("Character Data") = the text inside  attributes
   -- There is no #CDATA and no PCDATA

Well formed and Valid XML Documents

   -- "well formed" XML document -- when tags are matching
   -- "valid" XML document -- matches a given DTD (discussed later)

Use http://validator.w3.org/check to validate

ID's and IDREFS

   -- an attribute is called an ID attribute if it is unique
   -- an attribute is called IDREF if it references an ID
   -- the DTD defines which attribute(s) are ID/IDREFs

Data Section:

<example>
  <![CDATA[ any text goes here, including </false tags> ]]>
</example>


Entity references:

     -- &lt;  means <
     -- &gt;  means >
     -- that's all we need; the general case is very messy

================================================================

DISCUSSION

*** Question: how to the relational and the XML data model compare ?

*** Which data model would you choose ?

    -- University records: students, courses, grades, etc. 
       Relations or XML? Why?

    -- University Web site: news, academics, admissions, events, research, etc. 
       Relations or XML? Why?

    -- A genealogy database (family tree)
       Relations or XML? Why?


----------------

Relational data model = 

     -- rigid flat structure (tables)
     -- schema must be fixed in advanced
     -- binary representation: good for performance, bad for exchange
     -- query language based on Relational Calculus

Semistructured data model / XML

     -- flexible, nested structure (trees)
     -- does not require predefined schema ("self describing")
     -- text representation: good for exchange, bad for performance
     -- query language borrows from automata theory


================================================================

The semistructured data model = A tree !

(show tree for sample-data.xml in class)

*** In class: Mappings between relational data model an the
semistructured data model

      Student(sid, name)
      Takes(sid, cid, grade)
      Course(sid, title, instructor)

      Represent in XML in two ways (in class)


================================================================

Why do we call it "semistructured" ?

  -- missing attributes

         <person> <name> John</name>
                  <phone>1234</phone>
         </person>

          <person> <name>Joe</name>
          </person>

     can we do this in the relational data model ?


  -- multiple attributes

         <person> <name> Mary</name>
                  <phone>1234</phone>
                  <phone>5678</phone>
         </person>

     can we do this in the relational data model ?

  -- attributes have different types in different objects:

         <person> <name> <first> John</first> <last>Smith</last> </name>
                  <phone>1234</phone>
         </person>

================================================================

DTD = Document Type Definition
(show tree for sample-data-with-dtd.xml in class)

   -- goal: impose a structure on the XML document
   -- rather old and arcane; to be replaced by XML Schema, but that is
      TOOOOO complex

Complex = a regular expression over other elements


<!ELEMENT tag (content)>

where content is one of:

   -- Text-only = #PCDATA
   -- Empty = EMPTY
   -- Any = ANY
   -- Mixed content = (#PCDATA | A | B | C)*
   -- regular expression using , |, *, ?.  EXAMPLES IN CLASS