A definite-clause grammar representation of an XSD schema

A working paper prepared for the W3C XML Schema Working Group

C. M. Sperberg-McQueen

18 July 2004



This document describes a definite-clause grammar written to parse documents which conform to a particular XSD schema (specifically the purchase order schema described in [W3C 2001a]. This provides a simple example of using logic grammars to represent schemas.
This paper assumes a working knowledge of definite-clause grammar notation; see [Sperberg-McQueen 2004a] for the essential background and pointers to further reading.

1. Background

1.1. Using logic grammars in schema processing

Any schema defines a set of trees, and can thus be modeled more or less plausibly by a grammar. Schemas defined using XML Schema 1.0 impose some constraints which are not conveniently represented by pure context-free grammars, and the process of schema-validity-assessment defined by the XML Schema 1.0 specification requires implementations to produce information that goes well beyond a yes/no answer to the question “is this tree a member of the set?” For both of these reasons, it is convenient to use a form of attribute grammar to model a schema; logic grammars are a convenient choice.
In this paper, I introduce some basic ideas for using logic grammars as a way of animating the XML Schema specification / modeling XML Schema.
The first question that arises is:
  • How do we use logic grammars to model XML Schema processing?
To this, there are three answers:
  • Model the schema as a logic grammar; use it to parse document instances and generate the post-schema-validation infoset (PSVI).
  • Move up one level of abstraction: model the schema for schemas as a logic grammar; use it to parse schema documents and generate the appropriate schema components.
  • Do both. This could be done by writing a utility to take a set of components (in the form generated by the logic grammar for schema documents) and generate a corresponding logic grammar (in the form needed for parsing other document instances).
Some problems arise if we use logic grammars to model schemas and perform schema-validity assessment by parsing a document using the logic grammar. These problems are attacked here and in a companion paper in a series of small examples in which a logic grammar is developed which represents a simple schema:
  • In practice, the XML input document is likely to be presented to us in the form of a tree. How do we apply logic grammars to things that are already trees instead of sequences of tokens?
  • How do we check XML attributes?
  • How do we model type assignment, black-box validation, white-box validation?
  • How do we model the post-schema-validation information set?
Some problems arise if we use a logic grammar to model the rules for constructing schema components, and parse schema documents with that logic grammar as a way of modeling the process of constructing schema components.
  • How do we use parse-tree attributes to model information-item properties?
  • How do we use the output of this process to drive schema-validity assessement of document instances? Can we formulate the logic grammars used for schema-validity assessment as properties / attributes of the schema components?
In the remainder of this paper I outline a simplified representation of the purchase-order schema po1.xsd from the XML Schema tutorial [W3C 2001a], which shows how a DCG can be used to support DTD-style validation in a straightforward way This simple example shows how logic grammars can be applied as representations of schemas, but does not illustrate all aspects of schema validation.
In a companion paper ([Sperberg-McQueen 2004b]) I extend the purchase-order example, using definite-clause translation grammar (DCTG) notation to capture information-item properties in the post-schema-validation information set (PSVI) defined in [W3C 2001b] and [W3C 2001c]. In follow-on work to be reported separately I plan to work systematically through the validation rules of the XML Schema specification, providing a more systematic representation of schema-validity assessment in logic-grammar terms, and to develop a logic grammar which reads XML Schema documents and enforces constraints on their XML representation and on the components derived from them, together with routines for compiling schema components into the logic-grammar representation sketched here.

1.2. Representing XML in Prolog

The SWI-Prolog system includes an SGML/XML parser which conveniently makes SGML and XML documents accessible to Prolog manipulation [Wielemaker 2001].
The parser returns the XML document to the user as a list of items,[1] each item one of:
  • an atom (for character data)
  • a structure of the form element(GI,AttributeList,ContentList), where
    • GI is an atom representing the element's generic identifier
    • AttributeList is a list containing the element's attribute-value specifications in the form Attributename=Value, with both attribute names and attribute values represented as Prolog atoms. Like all Prolog lists, this one is comma-separated, so a list as a whole might look like [id=frobnitzer,rend=blort,type=farble]
    • ContentList is a list of items (i.e. atoms representing character data, element(Gi,A,C) structures, entity(N) structures, etc.
  • a structure of the form entity(Integer) representing an otherwise unrepresentable character
  • a structure of the form entity(Name) representing an otherwise unrepresentable character for which a named entity is defined
  • a structure of the form pi(Text) representing a processing instruction
For example, the sample purchase order po1.xml used as an example in [W3C 2001a] looks like this in the format described (I have added white space for legibility; the single quotation marks around some atoms are a way of ensuring that they can be read again by a Prolog implementation and recognized as atoms):
[element('http://www.example.com/PO1':purchaseOrder, 
  [xmlns:apo='http://www.example.com/PO1', 
   orderDate='1999-10-20'], 
  [element(shipTo, 
    [country='US'], 
    [element(name, [], ['Alice Smith']), 
     element(street, [], ['123 Maple Street']), 
     element(city, [], ['Mill Valley']), 
     element(state, [], ['CA']), 
     element(zip, [], ['90952'])
    ]), 
   element(billTo, 
    [country='US'], 
    [element(name, [], ['Robert Smith']), 
     element(street, [], ['8 Oak Avenue']), 
     element(city, [], ['Old Town']), 
     element(state, [], ['PA']), 
     element(zip, [], ['95819'])
    ]), 
   element('http://www.example.com/PO1':comment, 
    [], 
    ['Hurry, my lawn is going wild!']
   ), 
   element(items, [], 
    [element(item, 
      [partNum='872-AA'], 
      [element(productName, [], ['Lawnmower']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['148.95']), 
       element('http://www.example.com/PO1':comment, 
        [], 
        ['Confirm this is electric']
       )
      ]), 
     element(item, 
      [partNum='926-AA'], 
      [element(productName, [], ['Baby Monitor']), 
       element(quantity, [], ['1']), 
       element('USPrice', [], ['39.98']), 
       element(shipDate, [], ['1999-05-21'])
      ])
    ])
  ])
]
When elements and attributes are namespace-qualified, as the purchaseOrder and comment elements are in this example, the SWI-Prolog parser represents the name as a pair of two atoms, separated by a colon. To the right of the colon is the local name; depending on the option specified, the left-hand atom is either the namespace name (as shown here) or the namespace prefix used.
Other Prolog representations of XML are possible, of course, but we can use this one and it already exists.

1.3. A simple schema

Consider the simple schema defined for purchase orders in Part 0 of the XML Schema specification ([W3C 2001a]). Its complex types are relatively simple, involving only sequences of elements, some of which are optional, and some attributes. Some attributes and some elements are defined with named and some with anonymous simple types. This schema will serve admirably to illustrate the translation of schemas to logic grammars. For simplicity of exposition, the translation will be divided into several layers. This paper shows:
  1. translation into DCG form
  2. checking for duplicate and missing attributes, validating attribute types
while a companion paper ([Sperberg-McQueen 2004b]) shows:
  1. translation into DCTG form and supplying PSVI properties
  2. supplying properties related to validity, handling invalid input
Some features of XML Schema (handling xsi:type, supporting wildcards, supporting substitution groups) won't be shown here; the purchase-order schema doesn't have any wildcards or substitution groups, and it doesn't have much scope for the use of xsi:type in document instances. But it will be possible, after developing the example, to explain how these features can be handled in logic grammars.

2. Layer 1: Translation into definite-clause grammar

Each element type in the language is translated into three sets of DCG productions:
  • a top-level rule to match the tag and get the contents
  • a set of rules for checking the element's attributes
  • a set of rules defining the element's content; this is the part that looks most like a conventional DCG

2.1. Top-level rules for element types

For each element type in the schema, we create one rule in the grammar, which will serve to match the start-tag, get the attributes and contents of the element, and call routines to check the attributes and content against the complex type. Here there is exactly one such rule for each element type, but in later examples we will have several, in order to handle substitution groups. A simple version of these top-level rules would look like this:[2]
< 1 Rules for elements with complex types [File podcg3.pl] > ≡
/* Definite-clause grammar for the XML Schema 
 * represented in po.xsd.
 * 
 * This DCG was generated by a literate programming 
 * system; if maintenance is necessary, make changes 
 * to the source (podcg.xml), not to this output file.
 */
purchaseOrder --> [element('http://www.example.com/PO1':purchaseOrder,
    Attributes,Content)],
  {
    purchaseOrderType_atts(Attributes,[]),
    purchaseOrderType_cont(Content,[])
  }.
shipTo --> [element(shipTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
billTo --> [element(billTo,Attributes,Content)],
  {
    usAddress_atts(Attributes,[]),
    usAddress_cont(Content,[])
  }.
items --> [element(items,Attributes,Content)],
  {
    t_items_atts(Attributes,[]),
    t_items_cont(Content,[])
  }.
item --> [element(item,Attributes,Content)],
  {
    t_item_atts(Attributes,[]),
    t_item_cont(Content,[])
  }.



Because we get the XML document already in tree form, the top-level rules don't have much work to do. In particular, they do not have to identify the substructure of the element or find its end; that has already been done. Instead, we get the attributes and content already separated out. We launch a separate call on each to parse them according to the appropriate rule for their datatype.[3]
The comment element and others associated with simple types pose an interesting question. We could define them following the pattern above, like this:
comment --> [element('http://www.example.com/PO1':comment,
    Attributes,Content)],
  {
    xsd_string_atts(Attributes,[]),
    xsd_string_cont(Content,[])
  }.
But we know that the xsd:string type does not have any attributes declared, so Attributes must be the empty list. We can therefore define these elements this way instead:
< 2 Rules for elements with simple types [File podcg3.pl] > ≡
comment --> [element('http://www.example.com/PO1':comment,
    [],Content)],
  {
    xsd_string_cont(Content,[])
  }.

poname --> [element(name,[],Content)], 
           { xsd_string_cont(Content,[]) }.
street --> [element(street,[],Content)], 
           { xsd_string_cont(Content,[]) }.
city   --> [element(city,[],Content)], 
           { xsd_string_cont(Content,[]) }.
state  --> [element(state,[],Content)], 
           { xsd_string_cont(Content,[]) }.
zip    --> [element(zip,[],Content)], 
           { xsd_decimal_cont(Content,[]) }.

productName --> [element(productName,[],Content)], 
                { xsd_string_cont(Content,[]) }.
quantity    --> [element(quantity,[],Content)], 
                { xsd_integer_cont(Content,[]) }.
'USPrice'   --> [element('USPrice',[],Content)], 
                { xsd_decimal_cont(Content,[]) }.
shipDate    --> [element(shipDate,[],Content)], 
                { xsd_date_cont(Content,[]) }.




2.2. Rules for attributes of complex types

For each complex type in the schema, we generate two rules, one we will use to check the attributes, and one to check the contents. For each simple type, we generate one rule for checking the value.
The attributes for a given element can be enumerated; in this first example, we check to make sure each attribute was declared.
< 3 Rules for attributes defined for complex types (v1) > ≡
purchaseOrderType_atts --> [].
purchaseOrderType_atts --> 
   purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> [orderDate=Date].

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> no_such_attribute.
/* note that t_items_att is undefined, since there
 * are no attributes for the Items type 
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU]. 

This code is not used elsewhere.

For now, we do not worry about whether required attributes have been specified, or about checking the type of the attribute, or about supplying default values; we will do that later.

2.3. Rules for content of complex types

The most conventional-looking part of our grammar is the representation of the content models. A purchase order (being of type purchaseOrderType) has a ship-to address, a billing address, possibly a comment, and a list of items. An element with complex type USAddress has a name, street, city, state, and zip. We handle optional elements by writing a pair of recursive productions:
< 4 [File podcg3.pl] > ≡
purchaseOrderType_cont --> shipTo, billTo, opt_comment, items.
opt_comment --> [].
opt_comment --> comment, opt_comment.

usAddress_cont --> poname, street, city, state, zip.

t_items_cont --> star_item.
star_item    --> [].
star_item    --> item, star_item.

t_item_cont --> productName, quantity, 'USPrice', 
                opt_comment, opt_shipDate.

opt_shipDate --> [].
opt_shipDate --> shipDate.



It's worth noting that since the content models of XML all define regular languages, the right hand sides of the DCG rules for handling the content of elements are all rather simple. The only complexity is introduced by nested choices, occurrence indicators, and the like. In a way, this is the reflection of a fundamental fact about validating XML: the XML structure is given rather than needing to be discovered, and the rules for validation are correspondingly simpler.
A practical consequence of the fact that XML works with regular-right-part grammars while we are working with something more like standard Backus-Naur Form is that whenever we have recursive constructs like star_block, the parse tree described by the DCG does not correspond directly to the XML structure, and when we generate PSVI structures we are going to need to flatten the result to make it represent the XML structure properly.[4]
There is one other complication in practice: if an element has mixed content, we need to allow atoms representing character data between sub-elements; otherwise, we need to allow only atoms representing white space. For now, we step around this problem by instructing the upstream XML parser to strip whitespace from the beginning and ending of CDATA nodes; this does the job for this particular schema.[5] Later, we'll need to find a systematic way to hop over inter-element characters, and to restrict them to white space where appropriate.

2.4. Rules for checking values of simple types

We need some rules for checking simple types. The rules for built-in types of XML Schema, at least, can be written once for all and built in to a library. N.B. In order to get something working quickly, we cut some corners here.
In each case we write two rules, one to check the content of an element by checking to see whether the putative contents are a legal value, and the second to check the value itself. Within these latter rules, it is necessary to remember that the thing we are checking is a Prolog atom created from the lexical form for the putative value. In the case of strings, anything that's a legal lexical form (represented, following the conventions of the SWI-Prolog parser, as an atom) counts as a legal string.
< 5 Checking simple types [File podcg3.pl] > ≡
/* In our representation of XML, character data is represented
 * as atoms.  See note on handling of non-ASCII characters */
xsd_string_cont([H|T],T) :- xsd_string_value(H).
xsd_string_value(LF) :- atom(LF).



We check decimal numbers by converting the lexical form from an atom into a sequence of characters and then seeing whether Prolog can create a number out of that sequence of characters. This is actually a bit too broad, at least with the Prolog I am using, since it accepts numbers in scientific notation. But it will do for an illustration.
< 6 Checking numbers [File podcg3.pl] > ≡
/* We'll accept anything as a decimal number if we can convert
 * the atom to a list of character codes which can in turn be
 * converted to a number. 
 */
xsd_decimal_cont([H|T],T) :- xsd_decimal_value(H).
xsd_decimal_value(LF) :- atom_codes(LF,L), number_codes(N,L).
xsd_integer_cont([H|T],T) :- xsd_integer_value(H).
xsd_integer_value(LF) :- 
  atom_codes(LF,L), 
  number_codes(N,L),
  integer(N).




For dates, we really punt for now.
< 7 Checking dates [File podcg3.pl] > ≡
xsd_date_cont([H|T],T) :- xsd_date_value(H). 
xsd_date_value(LF) :- atom(LF). 
/* well, we really need to check this ... */



With this set of rules, we have a primitive but working program for checking whether a given XML document does or does not conform to the purchase order schema. There are a number of constraints it does not check, however; in the following sections of this document we will refine the code to make it a more persuasive representation of the schema.

3. Layer 2: checking attribute types

One obvious improvement would be to check to make sure that the values of attributes are legitimate representations of the simple types with which the attributes are declared. Here is a revision of the relevant rules:
< 8 Rules for attributes defined for complex types (v2) [File podcg3.pl] > ≡
purchaseOrderType_atts --> [].
purchaseOrderType_atts --> 
  purchaseOrderType_att, purchaseOrderType_atts.
purchaseOrderType_att --> 
  [orderDate=Date], { xsd_date_value(Date) }.

usAddress_atts --> [].
usAddress_atts --> usAddress_att, usAddress_atts.
usAddress_att --> [country='US']. /* note: fixed value! */

t_items_atts --> [].
t_items_atts --> t_items_att, t_items_atts.
t_items_att  --> [this_will_never_match].
/* Since it has no equal sign or value, the rule for
 * t_items_att will never match anything.
 * Alternatively we could leave t_items_att undefined, since 
 * there are no attributes for the Items type.
 */

t_item_atts --> [].
t_item_atts --> t_item_att, t_item_atts.
t_item_att  --> [partNum=SKU],
  { po_sku_value(SKU) }.



Hand-coding a rule to check SKU values is not too hard, and illustrates how we can check the lexical forms of simple types by writing mini-grammars for them. The SKU type is declared this way:
 <xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
   <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
 </xsd:simpleType>
The pattern can be translated readily into the following grammar operating on the character sequence of the lexical form:
< 9 Checking SKU values [File podcg3.pl] > ≡
po_sku_value(LF) :- 
  atom_chars(LF,Charseq),
  sku_value(Charseq,[]).

sku_value --> sku_decimal_part, hyphen, sku_alpha_part.
sku_decimal_part --> digit, digit, digit.
sku_alpha_part --> cap_a_z, cap_a_z.




The mini-grammar for SKU values relies on the SWI-Prolog character classification predicate char_type, though it would be possible to implement rules for digit checking and letter checking in any Prolog system.
< 10 Character classification rules [File podcg3.pl] > ≡
digit --> [Char], { char_type(Char,digit) }.
hyphen --> ['-'].
cap_a_z --> [Char], { char_type(Char,upper) }.



4. Layer 3: checking for duplicate, defaulted, and required attributes

Another obvious improvement would be to check to ensure that attributes are not specified more than once (strictly speaking, the upstream XML processor should be checking this, so this is probably not essential)[6] and that required attributes do have values specified. We have already seen how to check a fixed value: we simply specify the value as part of the grammatical rule for the attribute.
In our sample schema, the only required attribute on any type is the partNum attribute on the item element; it is also the only attribute defined for items. In a case like this one, we could simply write the relevant part of the grammar this way:
t_item_atts --> [partNum=SKU], { po_sku_value(SKU) }.
Or we could even put the constraint into the top-level rule for the item element:
item --> [element(item,[partNum=SKU],Content)],
  {
    po_sku_value(SKU),
    t_item_cont(Content,[])
  }.
In the general case, however, it is going to be simpler to put checking for required, forbidden, and defaulted attributes into a separate rule, which we will here call t_item_att_check:
< 11 Checking for required, forbidden attributes > ≡
t_item_att_check(LAVS) :-
  atts_present(LAVS,[partNum]).

This code is not used elsewhere.

In the general case, we may have to check for required, forbidden, and defaulted attributes, and to merge defaulted attributes into the list of attribute-value specifications. So we can extend the general form of the foo_att_check predicate to:
< 12 Checking for required, forbidden attributes > ≡
t_item_att_check(LAVS,AugmentedLAVS) :-
  atts_present(LAVS,[partNum]),
  atts_absent(LAVS,[]),
  atts_defaulted(LAVS,[],AugmentedLAVS).

This code is not used elsewhere.

We assume here the existence of three special-purpose items for checking the presence or absence of items in lists of attribute-value specifications.

4.1. Checking that required attributes are present

A list of attribute-value specifications contains all the attributes in a list of required attributes, if the list of required attributes is empty.
< 13 Utilities for checking required attributes > ≡
atts_present(LAVS,[]).
Continued in <Utilities for checking required attributes 14>, <Utilities for checking required attributes 15>, <Utilities for checking forbidden attributes 16>, <Utilities for checking forbidden attributes 17>, <Utilities for checking defaulted attributes 18>
This code is not used elsewhere.

A list of attribute-value specifications contains all the attributes in a list of required attributes, if it contains the first one, and also all the others.
< 14 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡
atts_present(LAVS,[HRA|RequiredTail]) :-
  att_present(LAVS,HRA),
  atts_present(LAVS,RequiredTail).



And finally, an attribute value specification is present for a given attribute name if it is either at the head of the list, or in the tail of the list.
< 15 Utilities for checking required attributes [continues 13 Utilities for checking required attributes] > ≡
att_present([Attname=Attval|Tail],Attname).
att_present([AVS|Tail],Attname) :-
  att_present(Tail,Attname).



Some readers will be looking, right about now, for a rule of the form
att_present([],Attname) :- ...
We don't have a rule for the empty list, though, because if we have run through the list of attribute-value specifications without finding one for the attribute name we are seeking, we want the predicate to fail. If necessary to prevent later maintenance programmers, or ourselves, from supplying the ‘missing’ induction basis by mistake, we could write:
att_present([],Attname) :- !, fail.

4.2. Checking that forbidden attributes are absent

Now for forbidden attributes. (These would be originally optional attributes whose maxOccurs has been set to 0 in a refinement step.) If there are no attribute names in the list of forbidden attributes, then we are done and all is well.
< 16 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡
atts_absent(LAVS,[]).



If the list of forbidden attribute names is not empty, we check first the head, then (recursively) the tail.
< 17 Utilities for checking forbidden attributes [continues 13 Utilities for checking required attributes] > ≡
atts_absent(LAVS,[HFA|ForbiddenTail]) :-
  not(att_present(LAVS,HFA)),
  atts_absent(LAVS,ForbiddenTail).



4.3. Supplying default values for attributes

Finally, for attributes with defaults we specify a list of attribute-value specifications; for each of these, we check to see whether a value is already specified for an attribute of that name. If it is, we skip the attribute in question; if not, we add it to the list of attribute-value specifications. Either way, we call the predicate recursively to take care of the rest of the defaulted attributes.
< 18 Utilities for checking defaulted attributes [continues 13 Utilities for checking required attributes] > ≡

atts_defaulted(LAVS,[],LAVS).
atts_defaulted(LAVS,[AN=AV|Tail],AugmentedLAVS) :-
  att_present(LAVS,AN),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).
atts_defaulted(LAVS,[AN=AV|Tail],[AN=AV|AugmentedLAVS]) :-
  not(att_present(LAVS,AN)),
  atts_defaulted(LAVS,Tail,AugmentedLAVS).



5. Evaluation

So far, we have shown that given a relatively simple schema, it is possible to create a definite-clause grammar which checks the salient constraints defined by the schema:
  • Each element is checked against the rules for its datatype.
  • Elements declared with a complex type are checked with regard to their attributes and their content.
    • Attributes are checked to see that they were declared and that their value is of the correct type.
    • Sets of attribute value specifications are checked to make sure that required attributes have been specified, and forbidden attributes have not. Default values can be supplied for attributes which have them.
    • The regular language defined in a content model is translated into a set of definite-clause grammar productions; any legal sequence of children will be recognized, and any illegal sequence of children will fail to be recognized.
    • Elements declared with a simple type are checked to ensure that their content is a legal value of the simple type.
Running a lightly modified version of this grammar over a set of test cases based on po1.xml, the grammar accepted a number of valid variations of the po1.xml example, with
  • additional attributes (namespace declarations, xsi attributes)
  • omission or inclusion of optional elements
  • omission or inclusion of optional attributes
Modification of the grammar was necessary because the grammar as shown does not handle namespace declarations or attributes in the xsi namespace; since all the test files do have namespace declarations, the following rules had to be added to the grammar (I haven't added them above for fear it would clog up the exposition):
purchaseOrderType_att --> magic_att.
usAddress_att --> magic_att.
t_items_att  --> magic_att.
t_item_att  --> magic_att.

magic_att --> [xmlns=NS].
magic_att --> [xmlns:Pre=NS].
magic_att --> ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value].
The grammar detected a number of errors:
  • misspelled generic identifiers and attribute names
  • omission of required elements
  • excess occurrences of elements
  • elements out of sequence
  • extraneous undeclared elements present
  • value other than the prescribed value for a fixed attribute
The following errors were not detected:
  • multiple po:comment elements (the schema allows at most one at the purchaseOrder level and at most one in each item)
        <apo:comment>Hurry, my lawn is going wild!</apo:comment>
        <apo:comment>I don't know how much longer I can hold out.</apo:comment>
    
    This is an error in the grammar above, which has been retained to illustrate the risks of hand-translation. Instead of
    opt_comment --> [].
    opt_comment --> comment, opt_comment.
    
    (which allows arbitrarily many comments), the rule for the optional comment element should read
    opt_comment --> [].
    opt_comment --> comment.
    
  • two orderDate attributes
    <apo:purchaseOrder xmlns:apo="http://www.example.com/PO1"
                       orderDate="1999-10-19"
                       orderDate="1999-10-20"> 
    
    (this can be blamed on the upstream processor, which should probably have rejected the document without producing an infoset, since single occurrence of attribute names is a well-formedness constraint)
  • quantity exceeding the maximum legal value
            <item partNum="926-AA">
                <productName>Baby Monitor</productName>
                <quantity>100</quantity>
                <USPrice>39.98</USPrice>
                <shipDate>1999-05-21</shipDate>
            </item>
    
  • omission of the required partNum attribute
            <item>
                <productName>Baby Monitor</productName>
                <quantity>1</quantity>
                <USPrice>39.98</USPrice>
                <shipDate>1999-05-21</shipDate>
            </item>
    
The following errors were detected, or at least the document was not accepted as valid, but validating the document caused a Prolog error because the number_chars predicate used to validate numbers is apparently not robust against non-numeric arguments.
  • invalid zip code
    <zip>OX2 6NN</zip>
  • wrong datatype for quantity
    <quantity>one</quantity>
  • bad value for USPrice
    <USPrice>USD 39.98</USPrice>
See A note on the test cases below for further details on the test documents.
The example we have been developing does not illustrate all forms of content models, occurrence indicators, or simple types, but I think it's obvious how to extend the treatment given here.
There are several gaps in coverage, which will be made good elsewhere:
  • handling wildcards
  • handling character data intermingled with mixed content models, handling whitespace intermingled with non-mixed content models
  • handling substitution groups
  • handling xsi:type specifications in the document instance
  • diagnosing errors (i.e. saying something more useful than “no” or “ERROR: number_chars/2: Syntax error: Illegal number” when an error is encountered) and recovering from them
  • providing relevant PSVI properties
It will be simplest to show how to address these gaps if we handle the last item first, and make our parser start returning more than a yes/no answer to the question “is this document/element/lexical form ok?” As shown by the examples used to introduce DCG notation, we could do that by adding parameters to the productions in the grammar; it will be more convenient, however, to translate our grammar from DCG form into a related form, that of definite-clause translation grammars. To that, the next paper ([Sperberg-McQueen 2004b]) is dedicated.

A. Works cited

Grune, Dick, and Ceriel J. H. Jacobs. 1990. Parsing techniques: a practical guide. New York, London: Ellis Horwood, 1990. Postscript of the book is available from the first author's Web site at http://www.cs.vu.nl/~dick/PTAPG.html

Sperberg-McQueen, C. M. 2004a. “A brief introduction to definite clause grammars and definite clause translation grammars”. Working paper prepared for the W3C XML Schema Working Group. 18 July 2004.

Sperberg-McQueen, C. M. 2004b. “Representing an XSD schema as a definite-clause translation grammar”. Working paper prepared for the W3C XML Schema Working Group. July 2004.

W3C (World Wide Web Consortium). 2001a. “XML Schema Part 0: Primer”, ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/.

W3C (World Wide Web Consortium). 2001b. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

W3C (World Wide Web Consortium). 2001c. XML Schema Part 2: Datatypes, ed. Biron, Paul V. and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/

Wielemaker, Jan. “SWI-Prolog SGML/XML parser: Version 1.0.14, March 2001”. http://www.swi-prolog.org/packages/sgml2pl.html

B. A note on the test cases

The test documents mentioned in the text were created manually to test the implementation of a schema corresponding to the schema document po1.xsd of the XML Schema tutorial [W3C 2001a].
A simple XSLT stylesheet read the schema and produced about 120 suggestions for tests, both valid and invalid documents. Because the stylesheet is primitive and simple-minded, some of the suggestions are not useful; in the end, eleven valid variations of the po1.xml example were produced:
  1. po1v10a.xsd, add additional attributes (namespace declarations, xsi attributes) to purchaseOrder
  2. po1v25.xml, omit optional comment element
  3. po1v33.xml, omit optional orderDate attribute from purchaseOrder
  4. po1v38.xml, add harmless attributes to USAddress elements
  5. po1v62d.xml, valid but unreasonable zip code (3.141592); the declaration is too weak to detect this problem
  6. po1v65.xml, omit country attribute on USAddress
  7. po1v79.xml, empty items element
  8. po1v80.xml, longer list of items (10)
  9. po1v100a.xml, various alternative values for quantity element (1, 43, 55, 99)
  10. po1v100b.xml, various alternative values for quantity element (1, 43, 55, 99)
  11. po1v121.xml, alternative values for partNum
Also, 58 test cases with errors were produced:
  1. po1e4.xml, misspell purchaseOrder
  2. po1e13.xml, omit main seq of PurchaseOrderType
  3. po1e14.xml, duplicate main sequence of PurchaseOrderType
  4. po1e15a.xml, reorder children of purchaseOrder
  5. po1e15b.xml, reorder children of purchaseOrder
  6. po1e16.xml, extraneous child of purchaseOrder
  7. po1e18.xml, omit required shipTo element
  8. po1e19.xml, duplicate shipTo
  9. po1e20.xml, misspell shipTo
  10. po1e27.xml, supply two comment elements (invalid)
  11. po1e28.xml, misspell 'comment'
  12. po1e30.xml, omit items element
  13. po1e31.xml, supply two items elements (invalid)
  14. po1e32.xml, misspell 'items'
  15. po1e35.xml, supply two orderDate attributes
  16. po1e36.xml, misspell 'orderDate' on purchaseOrder
  17. po1e41.xml, omit USAddress contents
  18. po1e42.xml, supply double sequence of address children
  19. po1e43.xml, reorder children of USAddress
  20. po1e44.xml, add extraneous child to USAddress
  21. po1e46.xml, omit name from USAddress
  22. po1e47.xml, provide name twice in USAddress
  23. po1e48.xml, misspell 'name' in USAddress
  24. po1e50.xml, omit street in USAddress
  25. po1e51.xml, double occurrence of street in USAddress
  26. po1e52.xml, misspell 'street'
  27. po1e55.xml, double occurrence of 'city'
  28. po1e56.xml, misspell 'city'
  29. po1e62.xml, omit zip
  30. po1e62b.xml, omit contents of zip element
  31. po1e62c.xml, invalid zip code
  32. po1e63.xml, double occurrence of zip code
  33. po1e64.xml, misspell 'zip' in USAddress
  34. po1e68.xml, misspell attribute name 'country' on USAddress
  35. po1e70.xml, provide value other than 'US' for fixed att 'country'
  36. po1e70b.xml, provide value other than 'US' for fixed att 'country'
  37. po1e78.xml, add extraneous child to sequence in type t_Items
  38. po1e81.xml, misspell 'item'
  39. po1e86.xml, omit contents of item element
  40. po1e87.xml, repeat sequence of children in item
  41. po1e88.xml, reorder children of item
  42. po1e89.xml, add extraneous child to item
  43. po1e91.xml, omit productName
  44. po1e92.xml, reduplicated productName
  45. po1e101a.xml, exceed the maximum legal value for quantity
  46. po1e101b.xml, violate min, lexical form rules for quantity
  47. po1e101c.xml, wrong datatype for quantity
  48. po1e101d.xml, bad value for quantity
  49. po1e105bisa.xml, bad value for USPrice
  50. po1e105bisb.xml, bad value for USPrice
  51. po1e106.xml, misspell USPrice
  52. po1e109.xml, two comments in item
  53. po1e113.xml, two shipDates (invalid)
  54. po1e114.xml, misspell 'shipDate'
  55. po1e116.xml, omit required partNum attribute
  56. po1e122a.xml, bad value for SKU (lowercase letters, char out of range)
  57. po1e122b.xml, bad value for SKU (excess chars)
  58. po1e122c.xml, bad value for SKU, missing hyphen
To run the tests conveniently, the following auxiliary Prolog file can be used. At the moment, it is tied to the manually-modified version of podcg3.pl, but that will change as work on this paper progresses.
< 19 Utility routines for testing Prolog implementations of po1.xsd [File potests.pl] > ≡
/* potest: simple one-item test routine for SWI Prolog 
 * 
 * This DCG was generated by a literate programming 
 * system; if maintenance is necessary, make changes 
 * to the source (podcg.xml), not to this output file.
 */
grammar_version(current,dctg,'d:/home/cmsmcq/2003/schema/dctg/po4.pl').
grammar_version(old,dctg,'d:/home/cmsmcq/2003/schema/dctg/podctg.pl').
grammar_version(dcg,dcg,'d:/home/cmsmcq/2003/schema/dctg/podcg3.pl').

potest(Grammar,dcg,XMLDocument) :-
  write('Loading grammar '), write(Grammar), write('...'), nl,
  consult(Grammar),
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (purchaseOrder(Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.
potest(Grammar,dctg,XMLDocument) :-
  write('Loading grammar '), write(Grammar), write('...'), nl,
  consult('d:/usr/lib/prolog/msmdctg.pl'),
  dctg_reconsult(Grammar),
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (e_purchaseOrder(_Structure,Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.

potest2(XMLDocument,dcg) :-
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (purchaseOrder(Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.
potest2(XMLDocument,dctg) :-
  write('Loading document '), write(XMLDocument), write('...'), nl,
  load_structure(XMLDocument,Infoset,[dialect(xmlns),space(remove)]),
  (e_purchaseOrder(_Structure,Infoset,[]) -> writeq('yes') ; writeq('no')),
  nl.

one(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml').

two(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v10a.xml',Syn).

/* Each of the following should be valid */
good(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v10a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v25.xml',Syn),    
  potest2('d:/usr/lib/xmlschema/po/tests/po1v33.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v38.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v62d.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v65.xml',Syn),   
  potest2('d:/usr/lib/xmlschema/po/tests/po1v79.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v80.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1v100a.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v100b.xml',Syn),  
  potest2('d:/usr/lib/xmlschema/po/tests/po1v121.xml',Syn).

/* Each of the following should raise an error */
bad(V) :-
  grammar_version(V,Syn,File),
  potest(File,Syn,'d:/usr/lib/xmlschema/po/tests/po1.xml'),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e04.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e13.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e14.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e15c.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e16.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e16b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e18.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e19.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e20.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e27.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e28.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e28b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e30.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e31.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e32.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e35.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e36.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e41.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e42.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e43.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e44.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e46.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e47.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e48.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e50.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e51.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e52.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e55.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e56.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62b.xml',Syn),
  /*
  potest2('d:/usr/lib/xmlschema/po/tests/po1e62c.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e63.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e64.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e68.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e70.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e70b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e78.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e81.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e86.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e87.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e88.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e89.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e91.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e92.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101b.xml',Syn),
  /*
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101c.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e101d.xml',Syn),
  /* 
  potest2('d:/usr/lib/xmlschema/po/tests/po1e105bisa.xml',Syn),
  */
  potest2('d:/usr/lib/xmlschema/po/tests/po1e105bisb.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e106.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e109.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e113.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e114.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e116.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122a.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122b.xml',Syn),
  potest2('d:/usr/lib/xmlschema/po/tests/po1e122c.xml',Syn).
/* 62c, 101c, 105bisa are commented out because they blow up 
 * number_chars/2.  Need a more robust approach to checking
 * numbers, apparently. */

ugly(V) :- good(V), bad(V).



The tests can be run from the SWI Prolog command line in bash thus:[7]
cd ~/2003/schema/dctg
$ /cygdrive/d/usr/src/pl/bin/plcon.exe -f potests.pl -g 'one(current)' -t halt
or
$ /cygdrive/d/usr/src/pl/bin/plcon.exe \
>  -f d:/home/cmsmcq/2002/Prolog/potest.pl \
>  -g 'good(current)' -t halt
The command line might be simpler if Prolog has a good search path, but I haven't gotten around to that yet.
The -g option specifies which set of tests to run: one, two (these check whether the tests work at all), good (the valid documents; these should each produce “yes”), bad (the invalid documents), or ugly (good followed by bad).
For what it's worth, on 6 August 2003, the 71 or so tests of ugly for podctg.pl executed in 0.36 seconds of clock time on my IBM Thinkpad with a 550 MHz Celeron processor.

C. SWI Prolog handling of characters

In the interests of simplicity, the code presented above blissfully ignores non-ASCII characters. A more reliable technique will need to deal properly with them as they are represented in the output of the SWI parser.
The version of SWI Prolog used here (5.0.10) uses atoms to represent characters which are representable in Prolog, and uses entity(Name) and entity(Num) structures otherwise. There may be some rough spots in the Unicode support. Here is a quick summary of what I have tested and the results.
  • named general entity reference in document for single character (defined by using numeric character reference):
    • ntilde (defined as &#241;): represented directly (code 241)
    • lsquo and rsquo (defined as &#x2018; and &#x2019;): represented as entity(lsquo) and entity(rsquo)
  • numeric character reference in document:
    • 241 (ntilde): represented directly (code 241)
    • U+2018, U+2019 (lsquo and rsquo): represented as entity(8216) and entity(8217)
  • larger general entities with non-ASCII characters in the replacement text:
    • using named entities: top-level entity is expanded; within the expansion, ntilde and the sinqle quotes treated as described above (named entities)
    • using numeric character references: top-level entity is expanded; within the expansion, ntilde and the sinqle quotes treated as described above (numeric references)
  • UTF-8 (entire document translated to UTF-8, no entity or character references involved):
    • ntilde shows up as character 241
    • lsquo and rsquo show up as characters 24 and 25 (n.b. 24 = 8216 mod 256)
Until this behavior changes, it would appear best to use numeric character references for non-ASCII characters.

Notes

[1] For small documents, the parser returns the entire document in the list structure described; for large documents, alternative calls are provided to avoid having to have the entire document in memory at a time. I will here use only the simpler all-at-once form of the parser.
[2] Various complications will lead to something slightly different in a production version of this idea, but I start with the simplest idea that I have been able to make work.
[3] The DCG representation of a schema thus looks rather like a set of separate grammars for regular languages, linked together solely by the nested calls. That makes the DCG we are writing unusual as a DCG, but it will work very nicely for us later when we begin to model skip and lax processing.
[4] See [Grune/Jacobs 1990] sec. 2.3.2.3 (pp. 35-37) for discussion of this point.
[5] To be specific, when loading the document instance, we specify load_structure('d:/usr/lib/xmlschema/po/po.xml', P01, [dialect(xmlns),space(remove)]) instead of just load_structure('d:/usr/lib/xmlschema/po/po.xml', P01, [dialect(xmlns)]).
[6] And yet: testing shows that the upstream processor doesn't actually check for duplicate attributes.
[7] The $ and > characters are provided by bash, the \ immediately followed by a newline is just a way to continue a single command onto a new line.