Review: Essence of XML

From: CR (chrisre@cs.washington.edu)
Date: Tue Apr 13 2004 - 20:06:21 PDT

  • Next message: Stavan Parikh: "Essence of XML"

    This paper describes a formalization of XML schema. They identify two
    key properties that a data exchange format should have self-description
    and round-tripping. They note that XML has neither. The contributions
    are to give a formal description and proof of when these properties
    hold. It is remarkable that the rule set is so small but it captures
    quite a bit of the actual schema. An untyped value is validated as a
    value if and only if Value matches type and erases to the same untyped
    value. Essentially, the formalization defines validation as
    accomplishing the round-tripping property.

    This paper is exceptionally clear. I really like the intuitive nature of
    the theorems and the restrictions they choose. It seems to be a clear
    problem with XML – and their arguments are appealing. If you want to
    understand your data and store it efficiently, you

    probably need to be able to translate out of the human readable form. It
    may not always be the case and perhaps you can push the binding to
    semantics even later. For example, you can store 007 as a string (even
    7) until you are forced to do an evaluation that requires the type.
    However, validation is essentially forcing this evaluation and there are
    many benefits to validating data.

    Also the method of type checking, the Closed world approach is
    interesting to me. The knee jerk reaction is that it requires a lot more
    work to do the validation but, I'm not sure that this is so. It seems
    that there are many shortcuts that are not expressed. If you want to
    have restriction and inheritance, it seems that you need to have this
    approach. I suppose this is more wondering about implementation.

    One thing that I think could be improved in their model is the absence
    of a normal form for scalars. This would also solve the allowance of
    ambiguous scalar types. I guess I am not sure of the solution but, I do
    think it is a real problem. It feels a little bit like having an
    inductive argument with a flawed base case. I suppose I need to think a
    bit more about why this is acceptable.

    Without the semantics of the data, it is really worth much less. I
    believe that as much of the semantic as can be efficiently expressed
    should be pushed into the representation. The benefits are not only a
    cleaner model but the possibility for optimization and reduced error.
    There is always a benefit to doing things in a clean formal manner –
    unfortunately most of the benefits are things you never anticipated
    which makes it hard to make a case for it.


  • Next message: Stavan Parikh: "Essence of XML"

    This archive was generated by hypermail 2.1.6 : Tue Apr 13 2004 - 20:06:22 PDT