From: CR (chrisre@cs.washington.edu)
Date: Tue Apr 13 2004 - 20:06:21 PDT
This paper describes a formalization of XML Schema. The authors identify two
key properties that a data exchange format should have: self-description
and round-tripping. They note that XML has neither. The contributions
are a formal description of validation and proofs of when these properties
hold. It is remarkable that the rule set is so small yet captures
quite a bit of the actual schema language. An untyped value validates to a
typed value if and only if the typed value matches the type and erases back
to the same untyped value. Essentially, the formalization defines validation
as accomplishing the round-tripping property.
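To make that concrete for myself, here is a rough sketch of the property in
Python, with toy validate and erase functions standing in for the paper's
formal rules (the names and the two scalar types are my assumptions, not
their notation):

    # A minimal sketch, assuming hypothetical validate/erase functions for
    # two scalar types; these helpers are mine, not the paper's.

    def validate(untyped: str, typ: str):
        # Map an untyped (string) value to a typed value, or fail.
        if typ == "xs:integer":
            return int(untyped)        # "42" -> 42
        if typ == "xs:string":
            return untyped
        raise ValueError("unknown type: " + typ)

    def erase(typed) -> str:
        # Forget the type: render the typed value back as untyped text.
        return str(typed)

    def round_trips(untyped: str, typ: str) -> bool:
        # The property as I read it: validation succeeds and erasing the
        # result gives back exactly the untyped value we started with.
        return erase(validate(untyped, typ)) == untyped

    print(round_trips("42", "xs:integer"))   # True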
This paper is exceptionally clear. I really like the intuitive nature of
the theorems and the restrictions they choose. It addresses what seems to
be a genuine problem with XML, and their arguments are appealing. If you
want to understand your data and store it efficiently, you probably need
to be able to translate out of the human-readable form. This may not always
be the case, and perhaps you can push the binding to semantics even later.
For example, you can store 007 as a string (or even as 7) until you are
forced to do an evaluation that requires the type. However, validation
essentially forces this evaluation, and there are many benefits to
validating data.
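The 007 case is where this bites. A tiny sketch, using Python's int parsing
as a stand-in for the integer type's lexical mapping (my assumption, not
something the paper prescribes):

    raw = "007"
    as_string = raw           # kept as a string: nothing is lost
    as_integer = int(raw)     # validated as an integer: leading zeros are gone

    print(as_string)          # 007
    print(str(as_integer))    # 7 -- erasing the typed value no longer
                              #      reproduces the original "007"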
The method of type checking, the closed-world approach, is also interesting
to me. The knee-jerk reaction is that it requires a lot more work to do the
validation, but I'm not sure that this is so. It seems that there are many
shortcuts that are not expressed. If you want to have restriction and
inheritance, it seems that you need this approach. I suppose this is more
a question about implementation.
One thing that I think could be improved in their model is the absence
of a normal form for scalars. This would also address the allowance of
ambiguous scalar types. I am not sure of the solution, but I do think it
is a real problem. It feels a little like having an inductive argument
with a flawed base case. I suppose I need to think a bit more about why
this is acceptable.
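To state my worry concretely: several lexical forms denote the same integer
value, so without a designated canonical form only one of them can satisfy
round-tripping. A small illustration, again using Python's int parsing as a
stand-in for the lexical mapping (my assumption, not the paper's machinery):

    lexical_forms = ["7", "07", "+7"]
    values = [int(s) for s in lexical_forms]
    print(values)              # [7, 7, 7] -- one value, many spellings

    canonical = [str(v) for v in values]
    print(canonical)           # ['7', '7', '7'] -- only "7" erases back
                               #   to the form it came from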
Without the semantics of the data, it is really worth much less. I believe
that as much of the semantics as can be efficiently expressed should be
pushed into the representation. The benefits are not only a cleaner model
but also the possibility of optimization and reduced error. There is always
a benefit to doing things in a clean, formal manner; unfortunately, most of
the benefits are things you never anticipated, which makes it hard to make
the case for it.