Logic grammars and XML Schema

This document describes some possible applications of logic grammars to schema processing as described in the XML Schema specification. The term logic grammar is used to denote grammars written in logic-programming systems; the best known logic grammars are probably definite-clause grammars (DCGs), which are a built-in part of most Prolog systems. This paper works with definite-clause translation grammars (DCTGs), which employ a similar formalism but which more closely resemble attribute grammars as described by [Knuth 1968] and later writers; it is a bit easier to handle complex specifications with DCTGs than with DCGs. Both DCGs and DCTGs can be regarded as syntactic sugar for straight Prolog; before execution, both notations are translated into Prolog clauses in the usual notation.

1 Introduction

1.1 Background

It is often observed that the needs of precision and completeness in technical specifications are sometimes at odds with the goal of making the specifications easy to read and understand. It is sometimes suggested (e.g. by [Wadler 2000]) that specifications would be less ambiguous, more compact, and easier to understand if they were written not in natural-language prose but in a formal notation, for example in Prolog or in Z or in the Gentzen calculus or in some formalism developed for the purpose. Some propose not to eliminate the prose but to augment it with a formal expression of the same rules. This is a special case of a practice sometimes seriously recommended for developing national or international standards: draft the specification in two languages at once (e.g. English and French) as a way of exposing ambiguity, vagueness, and misunderstandings early and forcing the working group to achieve greater clarity and precision in their work.

If the second language used is a formal language, rather than a second natural language, this practice can have the same advantages, as well as the the added benefit that the resulting specification can be read and understood not only by human readers but also by machines. If the notation used for the specification can be directly interpreted by a machine, then the specification can be tested much more readily for completeness, consistency, and accurate reflection of the developers' design decisions; when a specification can be directly interpreted by machine, the specification itself becomes its own reference implementation.

The work described here may be regarded, in part, as a test of these propositions. I show how the definitions and declarations of a schema expressed in XML Schema can be represented using a formal notation developed for writing attribute grammars in Prolog; in follow-on work I expect to provide Prolog equivalents for all of the validation rules, info-set contributions, and constraints on schemas given in (rather formal) English prose by [W3C 2001b] and [W3C 2001c]. If the arguments for formalizing technical specifications are correct, the Prolog representations ought (if they are properly done) to be less ambiguous, more compact, and easier to understand that the formal English of the specification. The lack of ambiguity is guaranteed, as long as the Prolog is syntactically correct. The compactness can easily be measured; for the schema used here as an example, the Prolog is in fact not much more compact than the XML representation of the schema. Whether the Prolog representation is clearer or not is something which individual readers must decide for themselves.¹

1.2 Direct interpretation (animation) of specifications

The idea of direct interpretation of formal specifications is not new; many textbooks on formal methods or on specific formal methods (Z, the Vienna Development Method VDM, etc.) discuss it at least as an idea. [Stepney 1993] provides a simple Web-accessible example of the idea: she illustrates the creation of a high-integrity compiler by

specifying the semantics (both dynamic and static) of a procedural programming language formally, in Z
specifying the semantics of a target machine language (or: assembly language) formally, also in Z
specifying the translation from source language to machine language formally
specifying the syntax of the programming language in a definite-clause translation grammar
translating the static semantics, dynamic semantics, and the source-to-machine language translation as grammatical attributes² in the DCTG. A Prolog system asked to supply the dynamic semantics of a program fragment serves as an interpreter for the source language; a Prolog system asked to supply the translation semantics acts as a compiler for the source language.

Animation of a specification is probably easiest if the spec is in a formal language. As Stepney's book illustrates, it is still possible otherwise, but it involves both the animation of a formal specification and an interpretation of the spec (in the form of a translation into a more formal language); the latter requires careful checking.

This paper attempts to make plausible the claim that a similar approach can be used with the XML Schema specification, in order to provide a runnable XML Schema processor with a very close tie to the wording of the XML Schema specification. Separate papers will report on an attempt to make good on the claim by building an XML Schema processor using this approach; this paper will focus on the rationale and basic ideas, omitting many details.

1.3 Using logic grammars in schema processing

Any schema defines a set of trees, and can thus be modeled more or less plausibly by a grammar. Schemas defined using XML Schema 1.0 impose some constraints which are not conveniently represented by pure context-free grammars, and the process of schema-validity-assessment defined by the XML Schema 1.0 specification requires implementations to produce information that goes well beyond a yes/no answer to the question “is this tree a member of the set?” For both of these reasons, it is convenient to use a form of attribute grammar to model a schema; logic grammars are a convenient choice.

In the remainder of this paper, I introduce some basic ideas for using logic grammars as a way of animating the XML Schema specification / modeling XML Schema.

The first question that arises is: How can logic grammars be used to model XML Schema processing? To this, there are three possible answers:

Model the schema as a logic grammar; use it to parse document instances and generate the post-schema-validation infoset (PSVI).
Move up one level of abstraction: model the schema for schemas as a logic grammar; use it to parse schema documents and generate the appropriate schema components.
Do both. This could be done by writing a utility to take a set of components (in the form generated by the logic grammar for schema documents) and generate a corresponding logic grammar (in the form needed for parsing other document instances).

It is the first of these answers which is illustrated here; the approach shown here provides a useful basis for the work on the meta-level.

Some problems arise if we use logic grammars to model schemas and perform schema-validity assessment by parsing a document using the logic grammar. Solutions to these problems are illustrated below with a DCTG which represents a simple schema:

In practice, the XML input document is likely to be presented to us in the form of a tree. How do we apply logic grammars to things that are already trees instead of sequences of tokens?
How do we check XML attributes?
How do we model type assignment, black-box validation, white-box validation?
How do we model the post-schema-validation information set?

In the next part of this paper (section 2) I introduce the definite-clause grammar notation, with some simple examples. A simplified representation of the purchase-order schema po1.xsd from the XML Schema tutorial [W3C 2001a] shows how a DCG can be used to support DTD-style validation in a straightforward way In section 3, I describe definite-clause translation grammars (DCTGs), which provide a slightly more convenient notation for complex attribute grammars than do DCGs. The patterns illustrated by the purchase-order schema are elaborated to handle various non-DTD constructs: post-schema-validation properties, error handling, substitution groups, xsi:type, and wildcards. This simple example shows how logic grammars can be applied as representations of schemas, but does not illustrate all aspects of schema validation.

A fuller account of the work described here, including source code for the entire DCTG described here, is given in [Sperberg-McQueen 2003] In follow-on work reported in a separate paper, I plan to work systematically through the validation rules of [W3C 2001b] and [W3C 2001c], providing a more systematic representation of schema-validity assessment in logic-grammar terms. In a third paper, I will develop a grammar which reads XML Schema documents and enforces constraints on their XML representation and on the components derived from them, together with routines for compiling schema components into the logic-grammar representation sketched here.

2 Simple DTD-style document checking

We start with a simple case: considering only a subset of the kind of schema information captured in a document-type definition (DTD) in XML 1.0 DTD syntax, we write a logic grammar to represent a schema or DTD and use it to validate a document. The same technique, of course, can be used to capture and use part of the information in a schema expressed in XML Schema 1.0.

Before examining the schema, we need some brief digressions on DCTG notation, the representation of XML in Prolog, and the schema to be used as an example.

2.1 Introduction to DCTG notation

The basic idea of logic grammars is simple. Consider the following grammar for a trivially small fragment of English. It is taken from example 8.1 in [König/Seiffert 1989], an introduction to Prolog for linguists:

%%% Trivial context-free grammar for a fragment of English %%% in DCTG notation. s ::= np, vp. np ::= det, n. np ::= n. vp ::= v, np. vp ::= v. n ::= [mary]. n ::= [john]. n ::= [woman]. n ::= [man]. n ::= [apple]. det ::= [the]. v ::= [loves]. v ::= [eats]. v ::= [sings].

Each rule here consists of a left hand side and a right-hand side. The left-hand side is a Prolog atom or structure (these are all atoms), and the right-hand side is a comma-delimited list of atoms and structures; the operator ::= is used to separate them.

The rule s ::= np, vp means that a sentence (s) can consist of a noun phrase (np) and a verb phrase (vp), in that order. The rule n ::= [mary] means that a noun (n) can consist of a sequence of tokens consisting exclusively of the word “mary” (for technical reasons I won't go into here, illustrations of Prolog grammars often use all lower-case letters). Conventional formal language theory treats these rules as rules governing a symbol-rewriting system, allowing s to be rewritten as np, vp and n as the terminal symbol [mary].

The basic idea of logic grammars is that these rules can also be viewed as instructions for a proof. If we wish to prove that a sequence Q is an s, one way to go about it is to prove that Q has two subsequences, such that Qa is an np, Qb is a vp, and Q is equal to the concatenation of Qa and Qb. This is consistent with the rewriting-system account of grammars, but sheds an unexpected light on parsing: it can be done by a theorem-proving system.³

Definite clause translation grammar (DCTG) notation was introduced to make it easy to work with attribute grammars in Prolog [Abramson 1984]. Attribute grammars, first described by [Knuth 1968], are context-free grammars in which each node in the parse tree has a set of attributes. These attributes (which I will call grammatical attributes to distinguish them from the attributes associated with elements in XML) may either be inherited from the context or synthesized out of attribute values associated with children of the parse node. In other words, the information in grammatical attributes can travel top-down or bottom-up; Knuth provided an algorithm for detecting unwanted circular dependencies, and many attribute grammar systems require that attributes be declared as inherited or synthetic. Prolog DCTGs typically use Prolog's built-in unification mechanisms for attribute propagation, and it's not necessary to declare attributes as top-down or bottom-up in advance.

We can illustrate DCTG notation by extending the English grammar example with a number attribute, to allow the grammar to enforce subject-verb agreement. A sentence can consist of an np and a vp, on condition that they agree in number:

s ::= np^^S, vp^^P, { S^^number(Num), P^^number(Num) }.

Here, the expression enclosed in braces on the right-hand side of the rule expresses the subject-verb agreement rule, a constraint on the grammatical construction which is not itself a grammatical constituent. Such constraints are sometimes called ‘guards’ (e.g. by [Brown/Blair 1990]); they correspond directly to the ‘semantic conditions’ which are a constituent part of attribute grammars as described by [Alblas 1991]. The rule also shows how non-terminal symbols in the right-hand side can be suffixed with the operator ^^ and a variable name (here, S and P), to allow their attributes to be referred to elsewhere. The expression S^^number(Num) effectively means that the constituent S has the value Num for the grammatical attribute number. The same variable Num appears as the value of number attribute of P; the condition succeeds only if the two values can be unified.

The rules for np illustrate three ways to construct noun phrases. A noun phrase can be made up of a determiner and a noun; they must agree in number. This covers phrases like “the apple”, “the apples”, “one apple”, “some apples”. The agreement rule excludes the phrases “one apples” and “some apple”.⁴ The number attribute of the np gets the same value as the same attribute on the determiner and noun.

np ::= det^^D, n^^N, <:> number(Num) ::- D^^number(Num), N^^number(Num).

The list of grammatical attributes is separated by the operator <:> from the right-hand side of the rule. Each attribute is identified by a Prolog structure, e.g. value(V), whose functor is the name of the attribute and whose arguments are its values. The attribute structure may be followed by the operator ::- (designed to look a lot like the standard Prolog operator :- or ‘neck’) and a series of goals which will be satisfied when the attribute value is to be instantiated. In practice, these goals help to calculate the attribute values. Values of attributes attached to the items on the right-hand side may be referred to by the variable name associated with the item and the name of the attribute, joined by the operator ^^, as for example in D ^^ number(Num).

A noun phrase can also be without a determiner, if it consists of a plural noun or of a singular proper noun:

np ::= n^^N, { N^^number(pl) } <:> number(pl). np ::= pn^^N, { N^^number(sg) } <:> number(sg).

The remaining syntactic rules introduce no new syntax. Note that the pre-terminal classes simply copy the value of the number attribute from the lexicon.

vp ::= v^^V, np^^O <:> number(Num) ::- V^^number(Num). vp ::= v^^V <:> number(Num) ::- V^^number(Num). n ::= [L], { lex(L,n,Num) } <:> number(Num). pn ::= [L], { lex(L,pn,Num) } <:> number(Num). v ::= [L], { lex(L,v,Num) } <:> number(Num). det ::= [L], { lex(L,det,Num) } <:> number(Num).

The lexicon encodes basic part of speech information and number. The rule for the uses the anonymous variable _ (underscore) to show that it can unify with either sg or pl.

lex(mary,pn,sg). lex(john,pn,sg). lex(woman,n,sg). lex(women,n,pl). lex(man,n,sg). lex(men,n,pl). lex(apple,n,sg). lex(apples,n,pl). lex(the,det,_). lex(some,det,pl). lex(one,det,sg). lex(loves,v,sg). lex(love,v,pl). lex(eats,v,sg). lex(eat,v,pl). lex(sings,v,sg). lex(sing,v,pl).

A partial EBNF grammar for DCTG notation⁵ is:

grammar ::= rule* rule ::= lhs '::=' rhs ('<:>' att-spec ('&&' att-spec)*)? lhs ::= term rhs ::= term (',' term)* attspec ::= compound-term ('::-' goal (',' goal)*)? compound-term ::= ATOM '(' term (',' term)* ')'

2.2 The Prolog translation of DCTGs

As noted above, DCTG rules can be translated mechanically into standard Prolog. During translation, each grammar rule turns into a Prolog clause, and three arguments are added to each non-terminal. The first additional argument is a Prolog structure with the functor node, described further below. The second and third are difference lists, used to represent the sequence of items being parsed.

The node structure has three arguments:

the non-terminal of the grammar rule (here s, np, vp, etc.)
a list of the node structures associated with the items on the right-hand side of the grammar rule
a list of grammatical attributes (in this grammar, this list will be empty)

The translation of the np rules for both our grammars illustrates the translation process nicely.

/* trivial grammar */ np(node(np, [A, B], []), C, D) :- det(A, C, E), n(B, E, D). np(node(np, [A], []), B, C) :- n(A, B, C). /* with attributes */ np(node(np, [A, B], [ (number(C)::-A^^number(C), B^^number(C))]), D, E) :- det(A, D, F), n(B, F, E). np(node(np, [A], [number(pl)]), B, C) :- n(A, B, C), A^^number(pl). np(node(np, [A], [number(sg)]), B, C) :- pn(A, B, C), A^^number(sg).

The predicate dctg_reconsult(File) is used to translate a DCTG grammar into Prolog clauses and load them; it is provided by [Abramson/Dahl/Paine 1990] and is available from a variety of sources on the net.

2.3 Representing XML in Prolog

The SWI-Prolog system includes an SGML/XML parser which conveniently makes SGML and XML documents accessible to Prolog manipulation [Wielemaker 2001].

The parser returns the XML document to the user as a list of items,⁶ each item one of:

an atom (for character data)
a structure of the form element(GI,AttributeList,ContentList), where
- GI is an atom representing the element's generic identifier
- AttributeList is a list containing the element's attribute-value specifications in the form Attributename=Value, with both attribute names and attribute values represented as Prolog atoms. Like all Prolog lists, this one is comma-separated, so a list as a whole might look like [id=frobnitzer,rend=blort,type=farble]
- ContentList is a list of items (i.e. atoms representing character data, element(Gi,A,C) structures, entity(N) structures, etc.
a structure of the form entity(Integer) representing an otherwise unrepresentable character
a structure of the form entity(Name) representing an otherwise unrepresentable character for which a named entity is defined
a structure of the form pi(Text) representing a processing instruction

For example, the sample purchase order po1.xml used as an example in [W3C 2001a] looks like this in the format described (I have added white space for legibility; the single quotation marks around some atoms are a way of ensuring that they can be read again by a Prolog implementation and recognized as atoms):

[element('http://www.example.com/PO1':purchaseOrder, [xmlns:apo='http://www.example.com/PO1', orderDate='1999-10-20'], [element(shipTo, [country='US'], [element(name, [], ['Alice Smith']), element(street, [], ['123 Maple Street']), element(city, [], ['Mill Valley']), element(state, [], ['CA']), element(zip, [], ['90952']) ]), element(billTo, [country='US'], [element(name, [], ['Robert Smith']), element(street, [], ['8 Oak Avenue']), element(city, [], ['Old Town']), element(state, [], ['PA']), element(zip, [], ['95819']) ]), element('http://www.example.com/PO1':comment, [], ['Hurry, my lawn is going wild!'] ), element(items, [], [element(item, [partNum='872-AA'], [element(productName, [], ['Lawnmower']), element(quantity, [], ['1']), element('USPrice', [], ['148.95']), element('http://www.example.com/PO1':comment, [], ['Confirm this is electric'] ) ]), element(item, [partNum='926-AA'], [element(productName, [], ['Baby Monitor']), element(quantity, [], ['1']), element('USPrice', [], ['39.98']), element(shipDate, [], ['1999-05-21']) ]) ]) ]) ]

When elements and attributes are namespace-qualified, as the purchaseOrder and comment elements are in this example, the SWI-Prolog parser represents the name as a pair of two atoms, separated by a colon. To the right of the colon is the local name; depending on the option specified, the left-hand atom is either the namespace name (as shown here) or the namespace prefix used.

Other Prolog representations of XML are possible, of course, but we can use this one and it already exists.

2.4 A simple schema

Consider the simple schema defined for purchase orders in Part 0 of the XML Schema specification ([W3C 2001a]). Its complex types are relatively simple, involving only sequences of elements, some of which are optional, and some attributes. Some attributes and some elements are defined with named and some with anonymous simple types. This schema will serve admirably to illustrate the translation of schemas to logic grammars. For simplicity of exposition, the translation will be divided into several layers:

translation into DCTG form
checking for duplicate and missing attributes, validating attribute types
supplying simple PSVI properties

Some features of XML Schema (handling xsi:type, supporting wildcards, supporting substitution groups) won't be shown here; the purchase-order schema doesn't have any wildcards or substitution groups, and it doesn't have much scope for the use of xsi:type in document instances. But it will be possible, after developing the example, to explain how these features can be handled in logic grammars.

2.5 Layer 1: Basic content-model checking

A level of validation similar to that of DTDs can be obtained by translating each content model in the schema into one or more DCTG rules in the obvious way. This will involve four kinds of rules:

Each content model or complex type is translated into a set of equivalent DCTG rules.
In a content model, the child element types are treated as terminal symbols; for each element type, then, there will be a rule which matches the element as a whole and recursively checks its attributes and content.
Each complex type defines a set of legal attibutes; these can also be checked by a DCTG.
Some miscellaneous utility predicates will need to be defined; these are not schema-dependent.

2.5.1 Rules for complex content models

Each content model in a schema defines a regular language; the schema as a whole can be regarded as a set of such languages, one per complex type. The purchase order schema specifies that any element with complex type purchaseOrderType) has a ship-to address, a billing address, possibly a comment, and a list of items. An element with complex type USAddress has a name, street, city, state, and zip. And the complex type Items takes a sequence of zero or more item children, each in turn containing product name, quantity, US price, an optional comment, and an optional shipping date. The translation into DCTG notation is straightforward:

t_PurchaseOrderType_cont ::- shipTo, billTo, opt_comment, items. opt_comment ::- []. opt_comment ::- comment. t_USAddress_cont ::- poname, street, city, state, zip. t_items_cont ::- star_item. star_item ::- []. star_item ::- item, star_item. t_item_cont ::- productName, quantity, 'USPrice', opt_comment, opt_shipDate. opt_shipDate ::- []. opt_shipDate ::- shipDate.

A practical consequence of the fact that XML Schema works with regular-right-part grammars while we are working with something more like standard Backus-Naur Form is that whenever we have recursive constructs like star_item, the parse tree described by the DCG does not correspond directly to the XML structure, and when we generate PSVI structures we are going to need to flatten the result to make it represent the XML structure properly.⁷

There is one other complication in practice: if an element has mixed content, we need to allow atoms representing character data between sub-elements; otherwise, we need to allow only atoms representing white space. For now, we step around this problem by instructing the upstream XML parser to strip whitespace from the beginning and ending of CDATA nodes; this does the job for this particular schema, and allows the exposition to be simpler. To handle mixed content and non-significant whitespace properly, a logic-grammar representation of schemas can either (a) allow optional whitespace- or text-nodes in its grammar rules, or (b) provide a specialized grammar interpreter which understands the rules and enforces them without any change to the structure of the grammar.

2.5.2 Terminal rules for elements

The content-model rules given above treat individual elements as terminal symbols in the language defined by their parent element's content model. For each element type in the schema, we create a grammar rule to match a single occurrence of the element, and recursively check its attributes and content. In the purchase-order schema there will be exactly one such rule for each element type, but in schemas which have substitution groups, there might be several (one for each member of an element's substitution group).

The terminal rules for purchase orders and the shipping and billing addresses are shown here; note that the two addresses use the same recursive checking, since they have the same complex type. Similar rules are provided for all element types with complex content:

purchaseOrder ::- [element('http://www.example.com/PO1':purchaseOrder, Attributes,Content)], { t_PurchaseOrderType_atts(_A,Attributes,[]), t_PurchaseOrderType_cont(_C,Content,[]) }. shipTo ::- [element(shipTo,Attributes,Content)], { t_USAddress_atts(_A,Attributes,[]), t_USAddress_cont(_C,Content,[]) }. billTo ::- [element(billTo,Attributes,Content)], { t_USAddress_atts(_A,Attributes,[]), t_USAddress_cont(_C,Content,[]) }.

The comment element and others associated with simple types have similar rules. Since simple types have no attributes other than namespace declarations and attributes in the xsi namespace, the same predicate for attribute checking can be used for all of them.

comment ::= [element('http://www.example.com/PO1':comment, Attributes,Content)], { sT_atts(_A,Attributes,[]), xsd_string_cont(_C,Content,[]) }. quantity ::= [element(quantity,Attributes,Content)], { sT_atts(A,Attributes,[]), xsd_integer_cont(C,Content,[]) }. /* etc. */

The comment element has a namespace qualifier and the quantity element does not, because the purchase-order schema declares local elements like quantity as unqualified.

2.5.3 Rules for simple content

The terminal rules for elements with simple content call the predicates xsd_string_cont, xsd_integer_cont, etc.

From the point of view of content-model grammars, simple values have no internal structure. The rules are correspondingly trivial: they take one atom off the list and call a predicate which is responsible for checking to see that the value is legal:

xsd_string_cont(_,[H|T],T) :- xsd_string_value(H). xsd_decimal_cont(_,[H|T],T) :- xsd_decimal_value(H). /* etc. */

A grammar constructed in this way will successfully check the document against the content models of all complex types. The second layer of our work will involve checking attributes and simple types.

2.6 Layer 2: Checking attributes and simple types

2.6.1 Checking attributes

The terminal rules for each element type call a type-specific predicate for checking attributes. The simplest form of attribute checking is just to make sure each attribute was declared, and to check its value against the appropriate simple type We can do this with grammars like the following for the purchase order, US address, and simple types:

t_PurchaseOrderType_atts ::- []. t_PurchaseOrderType_atts ::- t_PurchaseOrderType_att, t_PurchaseOrderType_atts. t_PurchaseOrderType_att ::- [orderDate=Date], { xsd_date_value(Date) }. t_PurchaseOrderType_att ::- magic_att. t_USAddress_atts ::- []. t_USAddress_atts ::- t_USAddress_att, t_USAddress_atts. t_USAddress_att ::- [country='US']. /* note: fixed value! */ t_USAddress_att ::- magic_att. t_items_atts ::- sT_atts. t_item_atts ::- []. t_item_atts ::- t_item_att, t_item_atts. t_item_att ::- [partNum=SKU], { po_sku_value(SKU) }. sT_atts ::- []. sT_atts ::- magic_att, sT_atts. magic_att ::- [xmlns=NS]. magic_att ::- [xmlns:Pre=NS]. magic_att ::- ['http://www.w3.org/2001/XMLSchema-instance':Localname=Value].

2.6.2 Checking simple types

Checking attribute values and element content against simple types is not hard: we can check lexical forms by writing mini-grammars for them, and perform range checking using Prolog's built-in numeric comparators.

The rules for part numbers (SKU) illustrate the method. The SKU type is declared this way:

<xsd:simpleType name="SKU"> <xsd:restriction base="xsd:string"> <xsd:pattern value="\d{3}-[A-Z]{2}"/> </xsd:restriction> </xsd:simpleType>

The pattern can be translated readily into the following grammar operating on the character sequence of the lexical form:

po_sku_value(LF) :- atom_chars(LF,Charseq), sku_value(Charseq,[]). sku_value ::- sku_decimal_part, hyphen, sku_alpha_part. sku_decimal_part ::- digit, digit, digit. sku_alpha_part ::- cap_a_z, cap_a_z.

The mini-grammar for SKU values relies on the SWI-Prolog character classification predicate char_type, though it would be possible to implement rules for digit checking and letter checking in any Prolog system.

Similar rules can be constructed for any simple type; details are shown in [Sperberg-McQueen 2003].

2.6.3 Checking for duplicate, defaulted, and required attributes

The grammar given above for attributes does not ensure that attributes are not specified more than once (strictly speaking, the upstream XML processor should be checking this, but it's a safety measure) nor that required attributes appear. We could rewrite the grammar so that it embodies these constraints, but doing so would complicate the grammar considerably, so we define an ad hoc predicate t_item_att_check instead, which checks for required and forbidden attributes, and merges defaulted attributes into an augmented list:

t_item_att_check(LAVS,AugmentedLAVS) :- atts_present(LAVS,[partNum]), atts_absent(LAVS,[]), atts_defaulted(LAVS,[],AugmentedLAVS).

The definition of the utility predicates atts_present, atts_absent, and atts_defaulted is straightforward and will not be described here.

2.7 Evaluation

So far, we have shown that given a relatively simple schema, it is possible to create a definite-clause grammar which checks the salient constraints defined by the schema:

Each element is checked against the rules for its datatype.
Elements declared with a complex type are checked with regard to their attributes and their content.
- Attributes are checked to see that they were declared and that their value is of the correct type.
- Sets of attribute value specifications are checked to make sure that required attributes have been specified, and forbidden attributes have not. Default values can be supplied for attributes which have them.
- The regular language defined in a content model is translated into a set of definite-clause grammar productions; any legal sequence of children will be recognized, and any illegal sequence of children will fail to be recognized.
- Elements declared with a simple type are checked to ensure that their content is a legal value of the simple type.

Running this grammar over a set of test cases based on po1.xml, the grammar accepted a number of valid variations of the po1.xml example, with

additional attributes (namespace declarations, xsi attributes)
omission or inclusion of optional elements
omission or inclusion of optional attributes

The grammar successfully detected several kinds of errors in the test cases:

misspelled generic identifiers and attribute names
omission of required elements
excess occurrences of elements
elements out of sequence
extraneous undeclared elements present
value other than the prescribed value for a fixed attribute

Some errors were not detected:

multiple po:comment elements (the schema allows at most one at the purchaseOrder level and at most one in each item). This was an error in the DCTG. Instead of the correct rule given above, the original hand-translated version of the DCTG had
opt_comment ::- []. opt_comment ::- comment, opt_comment.
(which allows arbitrarily many comments). Hand translation is error prone.
two orderDate attributes. This can be blamed on the upstream processor, which should probably have rejected the document without producing an infoset, since single occurrence of attribute names is a well-formedness constraint.
quantity exceeding the maximum legal value. This was due to carelessness writing the rules for checking quantities.

The following errors were detected, or at least the document was not accepted as valid, but validating the document caused a Prolog error because the number_chars predicate used to validate numbers is apparently not robust against non-numeric arguments.

invalid zip code (<zip>OX2 6NN</zip>)
wrong datatype for quantity (<quantity>one</quantity>)
bad value for USPrice (<USPrice>USD 39.98</USPrice>)

The example we have been developing does not illustrate all forms of content models, occurrence indicators, or simple types, but I think it's obvious how to extend the treatment given here to handle them. There are, however, several gaps in the coverage, which are worth further discussion:

handling wildcards
handling substitution groups
handling xsi:type specifications in the document instance
diagnosing errors (i.e. saying something more useful than “no” or “ERROR: number_chars/2: Syntax error: Illegal number” when an error is encountered) and recovering from them
providing relevant PSVI properties

It will be simplest to show how to address these gaps if we handle the last item first, and make our parser start returning more than a yes/no answer to the question “is this document/element/lexical form ok?” The simplest way to do so is to add some grammatical attributes to the grammar, as shown in the next section.

3 Layer 3: Providing PSVI properties

Having produced a working parser for the purchase-order schema, it is simple to add grammatical attributes corresponding to some basic information-set properties which are required to be in the input infoset:

for Attribute Information Items:
- [local name]
- [namespace name]
- [normalized value]
for Element Information Items:
- [local name]
- [namespace name]
- [children]
- [attributes]
- [in-scope namespaces] or [namespace attributes]
for Namespace Information Items:
- [prefix]
- [namespace name]

These properties will be represented as grammatical attributes attached to the non-terminals in our grammars which correspond to elements, attributes, and namespace declarations. (Character information items won't be parsed separately, and we won't attempt to provide any grammatical attribute for the character code property.)

Additionally, we will add some more interesting properties of the PSVI:

type definition name, namespace, anonymous, and type
schema specified (schema or infoset)
validation attempted (always full)
validity (always valid, for now, because when the document is not valid, the parse simply fails)

3.1 Rules for content of complex types

The base context-free grammar in the layer-3 parser is substantially the same as before; the only real difference is that we give each non-terminal a children property; the children property for the left-hand side of any grammar rule contains a flat list of the child elements matched by the rule. (Since repetitions, optional elements, and so on involve other grammar rules, the actual parse tree is not flat; the children property is our way of flattening the tree.)

The calculation of children is perhaps simplest for the USAddress type. It's just a list of the children: since no child is optional, there is no variation.

t_USAddress_cont ::= e_name^^N, e_street^^S, e_city^^C, e_state^^ST, e_zip^^Z <:> children([N,S,C,ST,Z]).

More complex, because the comment element is optional, is the children attribute of the t_PurchaseOrderType non-terminal. When a comment is present, we want it listed among the children; when it is not present, however, we don't want any dummy node. The opt_e_comment non-terminal, that is, should have a children attribute which is either the empty list or a list containing the comment node; we can then use the Prolog predicate flatten to insert either the Comm node, or nothing at all, into a list containing the other children of the purchase order element.

t_PurchaseOrderType_cont ::= e_shipTo^^S, e_billTo^^B, opt_e_comment^^C, e_items^^I <:> children(Lpe) ::- C^^children(CC), flatten([S,B,CC,I],Lpe). opt_e_comment ::= [] <:> children([]). opt_e_comment ::= e_comment^^Comm <:> children([Comm]).

A similar method is used for the item element, which also has optional children.

t_item_cont ::= e_productName^^PN, e_quantity^^Q, e_USPrice^^USP, opt_e_comment^^C, opt_e_shipDate^^S <:> children(Lpe) ::- C^^children(CC), S^^children(SC), flatten([PN,Q,USP,CC,SC],Lpe). opt_e_shipDate ::= [] <:> children([]). opt_e_shipDate ::= e_shipDate^^S <:> children([S]).

The items element is just a simple list; its children property can be done using the same methods we used to generate a flat list of attributes, above.

t_Items_cont ::= star_e_item^^L <:> children(List) ::- L^^children(List). star_e_item ::= [] <:> children([]). star_e_item ::= e_item^^I, star_e_item^^L <:> children([I|T]) ::- L^^children(T).

3.2 Terminal rules for element types

In layer 3 of our parser, the terminal rules will differ from layer 2 primarily in adding a large number of grammatical attributes to the node, by means of which we will model both the input and the post-schema-validation properties of the element in the infoset. Also, on a mundane level, the predicate for attribute checking is different: instead of calling the DCTG rule for attributes directly, we call a wrapper predicate which performs some other work before and after calling the grammatical rule.

The basic pattern of these rules is straightforward. For any element in namespace n with local name gi and complex type ct, we will construct an appropriate non-terminal symbol nt, and the element rule will look like this:

nt ::= [element(n:gi,Attributes,Content)], { ct_atts(A,NA,Attributes), ct_cont(C,Content,[]) } <:> attributes(A) && namespaceAttributes(NA) && children(Ch) ::- C^^children(Ch) && localName(gi) && namespacename(n) && type_definition_anonymous(Boolean) && type_definition_namespace(URI) && type_definition_name(NCName) && type_definition_type(complex) && validation_attempted(full) && validity(valid) .

In this layer, we also become more systematic about naming conventions. To eliminate the risk of name collisions, we will not use names from the schema directly as non-terminal symbols, but will instead attach prefixes to them: e_ for elements, t_ for types, etc. We will avoid those prefixes otherwise.

The purchase order schema po.xsd defines the following fifteen element types: the list gives the simple names which will be used to refer to them in the grammar below, as well as their schema-component designator as defined in [Holstege/Vedamuthu 2003]. Since their local names are all unique, the grammar below simply uses e_ plus their local names to refer to them. In other schemas, it will be necessary to mangle the names, or generate arbitrary identifiers, in order to distinguish different element types which have the same local names.

e_purchaseOrder = /element(purchaseOrder)
e_comment = /element(comment)
e_shipTo = /complexType(PurchaseOrderType)/sequence()/element(shipTo)
e_billTo = /complexType(PurchaseOrderType)/sequence()/element(billTo)
e_items = /complexType(PurchaseOrderType)/sequence()/element(items)
e_name = /complexType(USAddress)/sequence()/element(name)
e_street = /complexType(USAddress)/sequence()/element(street)
e_city = /complexType(USAddress)/sequence()/element(city)
e_state = /complexType(USAddress)/sequence()/element(state)
e_zip = /complexType(USAddress)/sequence()/element(zip)
e_item = /complexType(Items)/sequence()/element(item)
e_productName = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(productName)
e_quantity = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)
e_USPrice = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(USPrice)
e_shipDate = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(shipDate)

The simple purchase-order schema defines four complex types; one is anonymous; we'll use the local name of its host element after the t_ prefix:

t_PurchaseOrderType = /complexType(PurchaseOrderType)
t_USAddress = /complexType(USAddress)
t_Items = /complexType(Items)
t_item = /complexType(Items)/sequence()/element(item)/complexType()

The elements with complex types get these rules:

e_purchaseOrder ::= [element('http://www.example.com/PO1':purchaseOrder, Attributes,Content)], { t_PurchaseOrderType_atts(A,NA,Attributes), t_PurchaseOrderType_cont(C,Content,[]) } <:> localname(purchaseOrder) && type_definition_anonymous('false') && type_definition_namespace('http://www.example.com/PO1') && type_definition_name('PurchaseOrderType') && type_definition_type(complex) && attributes(A) && namespaceattributes(NA) && children(Ch) ::- C^^children(Ch) && namespacename('http://www.example.com/PO1') && validationattempted(full) && validity(valid). e_shipTo ::= [element(shipTo,Attributes,Content)], { t_USAddress_atts(A,NA,Attributes), t_USAddress_cont(C,Content,[]) } <:> localname(shipTo) && type_definition_anonymous('false') && type_definition_namespace('http://www.example.com/PO1') && type_definition_name('USAddress') && type_definition_type(complex) && attributes(A) && namespaceattributes(NA) && children(Ch) ::- C^^children(Ch) && namespacename('') && validationattempted(full) && validity(valid). /* etc. */

Note that the shipTo element, being local and unqualified, has an empty namespacename property. The type_definition_name property for the item element provides the generated name we use for the type. That this name is not assigned by the schema is clarified by type_definition_anonymous('true').

The rules for elements with simple types are slightly simpler than those for elements with complex types, but follow the same basic pattern.

The schema po.xsd defines two simple types: SKU and the anonymous simple type used for quantities:

t_quantityType = /complexType(Items)/sequence()/element(item)/complexType()/sequence()/element(quantity)/simpleType()
t_SKU = /simpleType(SKU)

In addition, several built-in simple types are used:

t_string = xsd:string
t_integer = xsd:integer
t_decimal = xsd:decimal
t_date = xsd:date

In a full implementation, we'll need to do some more serious name mangling to ensure uniqueness of relatively short, handy names for all types. For now, we just choose the names manually.

The rules for simple types are illustrated by those for comment and quantity.

e_comment ::= [element('http://www.example.com/PO1':comment,Attributes,Content)], { sT_atts(A,NA,Attributes), xsd_string_cont(C,Content,[]) } <:> localname(comment) && type_definition_anonymous('false') && type_definition_namespace('http://www.w3.org/2001/XMLSchema') && type_definition_name('string') && type_definition_type(simple) && attributes(A) && namespaceattributes(NA) && children(C) && namespacename('http://www.example.com/PO1') && validationattempted(full) && validity(valid). e_quantity ::= [element(quantity, Attributes,Content)], { sT_atts(A,Attributes,[]), t_quantity_cont(C,Content,[]) } <:> localname(quantity) && type_definition_anonymous('true') && type_definition_namespace('http://www.example.com/PO1') && type_definition_name('t_quantityType') && type_definition_type(simple) && attributes(A) && namespaceattributes(NA) && children(C) && namespacename('') && validationattempted(full) && validity(valid). /* etc. */

3.3 Rules for attributes

For each complex type, the grammars does several things to validate all the attributes on occurrences of that type and to provide appropriate nodes and infoset properties:

The input structure has namespace attributes and other attributes in the same list, while the infoset assigns them to two different properties. So the parser must partition the list of attributes in the input.
Each non-namespace attribute must be validated against its declared type.
The parser must check that attributes required by the complex type are present and attributes forbidden are not present. Default values must be supplied where needed.

The parser also provides basic infoset properties for the XML attributes.

3.3.1 Basic pattern

For each complex or simple type dt, the basic pattern of the attribute-checking rule will be:

dt_atts(Lpa,Lpna,Lavs) :- /* parse against grammar of attributes */ lavs_dt(LpaAll,Lavs,[]), /* partition the result */ partition(LpaAll,LpaPresent,Lpna), /* check min, max occurrence rules */ attocc_dt(LpaPresent,Lpa).

The logical variables have the following meanings:

Lpa
List of parsed attributes (i.e. of node() structures of the kind returned by any DCTG rule) for this complex type, including defaulted attributes

Lpna
List of parsed namespace attributes

Lavs
The list of attribute-value specifications provided by the input structure returned by the SWI Prolog parser.

LpaAll
Combined list of parsed-attribute node() structures for all attributes, both namespace attributes and others

LpaPresent
List of parsed-attribute nodes for attributes explicitly assigned values in the document instance (without defaulted attributes)

For each type, a grammar defining the legal attributes will be constructed; if type dt has attributes an1 and an2, of types st1 and st2 respectively, then the core context-free grammar will have a form like this:

lavs_dt ::= []. lavs_dt ::= avs_dt, lavs_dt. /* declared attributes */ lavs_dt ::= avs_nsd, lavs_dt. /* namespace declarations */ lavs_dt ::= avs_xsi, lavs_dt. /* XSI attributes */ avs_dt ::= [an1=Av], { st1_value(Av) }. avs_dt ::= [an2=Av], { st2_value(Av) }.

Simple types will, of course, have no declared attributes.

3.3.2 Rules for the Purchase-order type

The PurchaseOrderType illustrates the pattern. It defines only one attribute, orderDate, of type xsd:date. No attributes here are required, forbidden, or defaulted, so we don't need any calls to atts_present, atts_absent, or atts_defaulted. We do need an attributes property, the value of which is a flattened list of the attributes parsed. Following the patterns described above, this gives us the following definitions for the relevant predicates:

/* t_TYPENAME_atts(Lpa,Lpna,Lavs): true if Lavs contains an input-form list of attribute specifications which is legal for complex type TYPENAME, and which corresponds to the list of parsed attributes Lpa plus the list of parsed namespace attributes Lpna. */ t_PurchaseOrderType_atts(Lpa,Lpna,Lavs) :- lavs_t_PurchaseOrderType(LpaAll,Lavs,[]), partition(LpaAll,LpaPres,Lpna), attocc_t_PurchaseOrderType(LpaPres,Lpa). lavs_t_PurchaseOrderType ::= [] <:> attributes([]). lavs_t_PurchaseOrderType ::= avs_t_PurchaseOrderType^^Pa, lavs_t_PurchaseOrderType^^Lpa <:> attributes([Pa|L]) ::- Lpa^^attributes(L). lavs_t_PurchaseOrderType ::= avs_nsd^^Pa, lavs_t_PurchaseOrderType^^Lpa <:> attributes([Pa|L]) ::- Lpa^^attributes(L). lavs_t_PurchaseOrderType ::= avs_xsi^^Pa, lavs_t_PurchaseOrderType^^Lpa <:> attributes([Pa|L]) ::- Lpa^^attributes(L). avs_t_PurchaseOrderType ::= [orderDate=Value], { ws_normalize(collapse,Value,SNValue), xsd_date_value(SNValue) } <:> localname('orderDate') && namespacename('') && normalizedvalue(SNValue) && type_definition_anonymous('false') && type_definition_namespace('http://www.w3.org/2001/XMLSchema') && type_definition_name('date') && type_definition_type(simple) && schemaspecified(schema) && validationattempted(full) && validity(valid). attocc_t_PurchaseOrderType(L,L).

3.3.3 Other issues surrounding attribute parsing

Some other issues relating to attribute handling should be mentioned; full solutions are not described here for space reasons.

A single set of rules for namespace attributes and XSI attributes can be constructed; since these rules are not schema-dependent, the same rules can be used for all schemas.

Each complex type will also have a rule for occurrence-checking, which will take something like the following form (assuming that Lreq, Ldft, and Lnot are lists of required, defaulted, and forbidden attributes):

attocc_dt(LpaPres,LpaAll) :- atts_present(LpaPres,Lreq), atts_absent(LpaPres,Lnot), atts_defaulted(LpaPres,Ldft,LpaAll).

The predicates atts_present, etc., are straightforward to implement (they recur on both the list of attributes found in the input and the list of attributes on the forbidden/required/default list) and require no comment.

Before attribute values (or element content) are validated against a simple type, the appropriate kind of whitespace normalization must be performed to obtain the lexical form of the simple type. This can be done by specifying a predicate ws_normalize(+kw,+Atom,-Atom), which takes three arguments: a keyword to say what kind of normalization to perform, an atom representing the character string to be normalized, and an atom representing the same string after normalization. (The arguments marked + are expected to be used as input, i.e. the arguments will be instantiated at the time the relation is called; the argument marked - will normally be uninstantiated when the predicate is called and will be bound to an appropriate value. Readers used to other programming languages may think of it, without too much distortion, as a VAR parameter called by reference and used to return the result of a computation.)

There are three values for the keyword, described in the XML Schema 1.0 specification as follows:

preserve No normalization is done, the value is not changed (this is the behavior required by [XML 1.0 (Second Edition)] for element content)
replace All occurrences of #x9 (tab), #xA (line feed) and #xD (carriage return) are replaced with #x20 (space)
collapse After the processing implied by replace, contiguous sequences of #x20's are collapsed to a single #x20, and leading and trailing #x20's are removed.

The normalization algorithm is readily implemented by converting the atom to a list of characters or character codes using the atom_codes predicate, processing the list, and converting the result back into an atom. Details are left as an exercise for the reader.

As noted above, the list of parsed attribute nodes returned from the DCTG parse must be partitioned into a list of namespace attributes and a list of all other attributes; this is a simple exercise.

3.4 Rules for checking values of simple types

In this layer of the program, no new kinds of rules for checking simple types are needed; the rules in layer 3 differ from those in layer 2 only in providing better checking of the lexical form.

A simple example is given by the rule for checking the values of the quantity element:

t_quantity_value(LF) :- atom_chars(LF,Lchars), integer(_,Lchars,[]), number_chars(Num,Lchars), Num < 100.

More elaborate predicates and DCTG rules are needed to check decimal numbers, integers, and dates, but the basic principle is the same: a DCTG checks the lexical form, and ad hoc predicates check for out-of-range.

3.5 Evaluation

When run against a small battery of test cases, the grammar as shown above accepted all the valid test cases and rejected all the invalid tests, with the exception of instances which supply duplicate attributes; in theory, the upstream XML parser should be catching that well-formedness error, and so I have thus far chosen not to modify the grammar to catch it.

The grammar as shown illustrates well the utility of logic grammars, and in particular DCTG notation, for representing XML Schema 1.0 schemas. The translation from schema document to DCTG notation is straightforward and seems unlikely to present any serious difficulties for a mechanical process.

There are some interesting features of XML Schema 1.0 which are not illustrated by the purchase order schema which has been used here as an example; the next section of this paper outlines briefly how I think those features can be implemented on the basis of the skeleton described above.

4 Notes on other features and further work

4.1 Supporting substitution groups

Substitution groups can be handed by defining multiple element rules with the same left-hand side. If a is in the substitution group of b, then we have both

b ::- [element(b,Attributes,Content)], { b_atts(Attributes,[]), b_cont(Content,[]) }. a ::- [element(a,Attributes,Content)], { a_atts(Attributes,[]), a_cont(Content,[]) }.

and also

b ::- [element(a,Attributes,Content)], { a_atts(Attributes,[]), a_cont(Content,[]) }.

4.2 Handling invalid data

The grammar described above does what grammars traditionally do: it distinguishes members of a set (valid documents) from non-members (invalid documents, incompletely validate documents, and so on). A schema processor assessing the validity of a document instance is supposed to do more. In particular, a schema processor is expected not to fail in the presence of invalid data, merely to mark it invalid and return the appropriate post-schema-validation infoset.

In order to achieve this behavior, the basic structure shown above needs to be augmented in two ways:

Each predicate which performs schema validity assessment must succeed for all input, rather than succeeding only for valid input.
When the input is not valid, the parser should return error diagnostics. We can establish the convention that the parser returns with an empty list of error codes if and only if the input was valid.

We achieve this by building fallback behavior into the appropriate rules. The rule for checking lexical form against a simple type, for example, might take the form sva_st_lf(Type,LF,Lerr), where Type is the type name, LF is the lexical form to be checked, and Lerr is the list of errors raised. The rule for checking simple element content, then, would change. Instead of consisting solely of the first rule below, it would also include the second:

/* if lexical form has no errors, then element has none */ sva_content_st_TN(Gi,Lin,Lpn,[]) :- sva_st_lf_TN(TN,Lin,[]). /* if lexical form has errors, then so does element */ sva_content_st_TN(Gi,Lin,Lpn,[ElemError|Errors]) :- sva_st_lf_TN(TN,Lin,Errors), not(Errors = []), ElemError = error(cvc-type.3.1.3).

When validation against a content model does not work, the parser should fall back on lax validation; the entire sequence of children should be revalidated as if it matched a wildcard with processContents="lax".

4.3 Reducing redundancy

One of the most prominent features of the parser for purchase orders outlined above is the relatively high degree of redundancy. Every complex type is translated into two DCTG grammars (one for content, one for attributes) and several additional type-specific predicates. Every element type has a separate predicate for its appearance as a terminal symbol in its parents' grammars. Every one of the terminal-symbol rules has an identical structure; they differ only in the name of the predicate and the type-specific predicates they call to check attributes and content.

The regularities of schema processing would be more obvious, and much of this redundancy could be eliminated, if we represented simple types, complex types, attributes, and elements not as sets of Prolog rules but as structured data, which is passed to parsing rules as a parameter.

A complex type, for example, might be represented by a simple Prolog fact like the following:

complextype(t_PurchaseOrderType, '/complexType(po:PurchaseOrderType)', 'PurchaseOrderType', 'http://www.example.com/PO1', global, [ (id(t_PurchaseOrderType)), (scd('/complexType(po:PurchaseOrderType)')), (anonymous(false)), (name('PurchaseOrderType')), (target_namespace('http://www.example.com/PO1')), (base_type_definition(t_urtype)), (derivation_method(restriction)), (final([])), (abstract(false)), (attribute_uses([ au(required(false), attdecl(a_purchaseOrder_orderDate), value_constraint(keyword(absent))) ])), (attribute_wildcard(keyword(absent))), (content_type(c(t_PurchaseOrderType_cont,element-only))), (prohibited_substitutions([])), (annotations(keyword(absent))) ]).

The type identifier t_PurchaseOrderType can then be passed as a parameter to predicates which perform schema-validity assessment.

This move to a higher level of abstraction can make the logic of the representation harder to follow, so it was not used in the example worked through above. But the reification of schema components has a number of advantages, some of which are mentioned below, and is a key feature of further work on schemas and logic grammars.

4.4 Handling xsi:type

When an xsi:type attribute appears on an element in a document instance, a schema processor should (a) check to make sure that the type it names is derived from the type specified in the schema for that element, and then (b) validate the element with the derived type, rather than with the schema-specified type.

The grammar shown above makes no provision for this usage, in part because the purchase-order schema provides relatively little scope for interesting applications of the idea. To handle xsi:type, the terminal rule for the element would need to check for xsi:type before invoking the predicates to check attributes and content. In the grammar as shown, this is possible but would make the terminal rules verbose and opaque.

Once schema components are reified, however, it is straightforward to handle xsi:type. The pattern for terminal rules can change to something like the following:

purchaseOrder ::- [element('http://www.example.com/PO1':purchaseOrder, Attributes,Content)], { XSI = 'http://www.w3.org/2001/XMLSchema-instance', ( member(XSI:type=Local,Attributes) -> Type = Local ; Type = t_PurchaseOrderType ) sva_atts(Type,A,NA,Attributes), sva_cont(Type,C,Content) }.

The (Condition -> Thenclause ; Elseclause) construct serves to initialize the variable Type to the value of xsi:type, if specified, and to t_PurchaseOrderType otherwise.

4.5 Mixed content

The discussion above ignores the problem of parsing mixed content, as the purchase-order schema does not have any mixed-content types.

The problem is simple: between any two non-terminals in the DCTG for a complex type, there may be atoms representing character data; if the complex type is element-only, those atoms should represent white space, while if the complex type has mixed content, any XML character should be legal.

Logic grammars can deal with mixed content in a variety of ways. The character-data atoms can be filtered out of the content list before it is parsed using the DCTG. The character-data atoms can be written into the grammar; the process is similar to that of adding elements to a content model in the process of eliminating SGML inclusion exceptions.

The most plausible approach, however, is to re-organize the parser so that instead of expressing the grammar in native Prolog rules like the t_PurchaseOrderType_cont rules shown above, it represents them using some Prolog data structure (a combination of lists and structured terms is an obvious choice). The parser can then be a simple grammar interpreter which reads those data structures and does the parsing.⁸

This approach is tempting, because it goes hand in hand with the project of reifying the schema components and reducing redundancy in the code.

4.6 Supporting wildcards with skip and lax processing

Wildcards are easily supported if the basic validation predicates accept types as parameters: a sequence of children is lax-validated by reading each in turn, looking it up in the schema, and validating with the element declaration found, if any, or with an appropriate type definition if xsi:type is specified.

Skip wildcards (black-box processing) are even easier to implement.

4.7 Further work

The processor outlined above seems promising enough to make further work seem useful. Three items of further work seem worth calling out specifically. First, we need to show that a DCTG-based representation of schemas like that described here conforms to the schema component constraints of the spec, and that a parser running the DCTG grammar performs validation according to the validation rules of the spec. Second, it would be useful to represent the schema for schemas as shown, so that we can validate XML-encoded schema documents. Third, it is a natural further step to combine the two levels of processing to make a schema processor which reads schema documents and from them builds DCTGs (or equivalent data structures) with which it validates document instances.

According to section 2.4 of [W3C 2001b], to be minimally conforming a translation of an XML Schema into DCTG notation must

implement the schema component constraints (see Appendix C.4 and subsection 6 of each component) I take this to mean that the schemas used by the processor must obey them; the processor may or may not independently enforce them. Proof obligation: prove that the style of DCTG outlined above does satisfy these constraints.
implement the validation rules (see Appendix C.1 and subsection 4 of each component). I take this to mean the processor must work correctly from its component representation to the values of the relevant PSVI instance properties. Proof obligation: prove that the style of DCTG outlined above does correctly implement the validation rules.
implement (i.e. make) the schema information set contributions (see Appendix C.2 and subsection 5 of each component). Proof obligation: make the parser decorate the input with appropriate grammatical attributes representing the relevant properties.

This work will be done in separate papers.

5 Appendix: Further reading

For more information on the representation of SGML and XML documents as Prolog structures, see the SWI add-ons documentation [Wielemaker 2001]. Other representations are possible; I have used this one because it's convenient and because representing sibling sequences in list form makes it easier to use DCGs and DCTGs.

Definite-clause grammars (DCGs) are introduced in almost any good Prolog textbook: e.g. [Clocksin/Mellish 1984], [Sterling/Shapiro 1994], or [Bratko 1990]. They are discussed at somewhat greater length in treatments of Prolog for natural-language processing, including [König/Seiffert 1989], [Gazdar/Mellish 1989], and [Gal et al. 1991]. Most extended discussions show how to use additional arguments to record syntactic structure or handle the semantics of the material.

Definite-clause translation grammars were introduced as a way of making it easier to handle semantics; they provide explicit names for attributes (in the sense of attribute grammars [Knuth 1968]).

Notes

Bibliography

[Abramson 1984] Abramson, Harvey. 1984. “Definite Clause Translation Grammars”. Proceedings of the 1984 International Symposium on Logic Programming, Atlantic City, New Jersey, February 6-9, 1984, pp. 233-240. (IEEE-CS 1984, ISBN 0-8186-0522-7)

[Abramson/Dahl 1989] Abramson, Harvey, and Veronica Dahl. 1989. Logic Grammars. Symbolic Computation AI Series. Springer-Verlag, 1989.

[Alblas 1991] Alblas, Henk. 1991. “Introduction to attribute grammars”. Attribute grammars, applications and systems: International Summer School SAGA, Prague, Czechoslovakia, June 4-13, 1991, Proceedings, pp. 1-15. Berlin: Springer, 1991. Lecture Notes in Computer Science, 545.

[Bratko 1990] Bratko, Ivan. 1990. Prolog programming for artificial intelligence. Second edition. Wokingham: Addison-Wesley. xxi, 597 pp.

[Brown/Blair 1990] Brown, Allen L., Jr., and Howard A. Blair. 1990. “A logic grammar foundation for document representation and layout”. In EP90: Proceedings of the International Conference on Electronic Publishing, Document Manipulation and Typography, ed. Richard Furuta. Cambridge: Cambridge University Press, 1990, pp. 47-64.

[Brown, Fuchs et al. 1991] Brown, Allen, Matthew Fuchs, Jonathan Robie, and Philip Wadler. 1991. “XML Schema: Formal Description”. W3C Working Draft, 25 September 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/xmlschema-formal/

[Brown/Wakayama/Blair 1992] Brown, Allen L., Jr., Toshiro Wakayama, and Howard A. Blair. 1992. “A reconstruction of context-dependent document processing in SGML”. In EP92: Proceedings of Electronic Publishing, 1992, ed. C. Vanoirbeek and G. Coray. Cambridge: Cambridge University Press, 1992. Pages 1-25.

[Brüggemann-Klein 1993] Brüggemann-Klein, Anne. 1993. Formal models in document processing. Habilitationsschrift, Freiburg i.Br., 1993. 110 pp. Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps (Cover pages archival copy also at http://www.oasis-open.org/cover/bruggDissert-ps.gz). Brüggemann-Klein provides a formal definition of 1-unambiguity, which corresponds to the notion of unambiguity in ISO 8879 and determinism in XML 1.0. Her definition of 1-unambiguity can be used to check XML Schema's Unique Particle Attribution constraint by changing every minOccurs and maxOccurs value greater than 1 to 1, if the two are equal, and otherwise changing minOccurs to 1 maxOccurs greater than 1 to unbounded.

[Clocksin/Mellish 1984] Clocksin, W. F., and C. S. Mellish. Programming in Prolog. Second edition. Berlin: Springer, 1984.

[Gal et al. 1991] Gal, Annie, Guy Lapalme, Patrick Saint-Dizier, and Harold Somers. 1991. Prolog for natural language processing. Chichester: Wiley, 1991. xiii, 306 pp.

[Gazdar/Mellish 1989] Gazdar, Gerald, and Chris Mellish. 1989. Natural language processing in PROLOG: An introduction to computational linguistics. Wokingham: Addison-Wesley, 1989. xv, 504 pp.

[Grune/Jacobs 1990] Grune, Dick, and Ceriel J. H. Jacobs. 1990. Parsing techniques: a practical guide. New York, London: Ellis Horwood, 1990. Postscript of the book is available from the first author's Web site at http://www.cs.vu.nl/~dick/PTAPG.html

[Holstege/Vedamuthu 2003] Holstege, Mary, and Asir S. Vedamuthu, ed. 2003. XML Schema: Component Designators. W3C Working Draft 09 January 2003. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/xmlschema-ref/

[Knuth 1968] Knuth, D. E. 1968. “Semantics of context-free languages”. Mathematical Systems Theory 2: 127-145.

[König/Seiffert 1989] König, Esther, and Roland Seiffert. Grundkurs PROLOG für Linguisten. Tübingen: Francke, 1989. [= Uni-Taschenbücher 1525]

[Pereira/Shieber 1987] Pereira, Fernando C. N., and Stuart M. Shieber, Prolog and natural-language analysis. CSLI lecture notes 10. Stanford: Center for the study of language and information, 1987.

[Sterling/Shapiro 1994] Sterling, Leon, and Ehud Shapiro. 1994. The Art of Prolog: Advanced Programming Techniques. Cambridge, Mass.: MIT Press.

[W3C 2001a] W3C (World Wide Web Consortium). 2001a. “XML Schema Part 0: Primer”, ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/.

[W3C 2001b] W3C (World Wide Web Consortium). 2001b. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/

[W3C 2001c] W3C (World Wide Web Consortium). 2001c. XML Schema Part 2: Datatypes, ed. Biron, Paul V. and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/

[Wadler 2000] Wadler, Philip. “A formal semantics of patterns in XSLT and XPath.” Markup Languages: Theory & Practice 2.2 (2000): 183-202.

Lpa	List of parsed attributes (i.e. of node() structures of the kind returned by any DCTG rule) for this complex type, including defaulted attributes
Lpna	List of parsed namespace attributes
Lavs	The list of attribute-value specifications provided by the input structure returned by the SWI Prolog parser.
LpaAll	Combined list of parsed-attribute node() structures for all attributes, both namespace attributes and others
LpaPresent	List of parsed-attribute nodes for attributes explicitly assigned values in the document instance (without defaulted attributes)

1.	The author thanks Allen Brown for suggesting the idea of which this paper is an elaboration, namely, that it would be interesting to translate the formal inference rules of XML Schema: Formal Description [Brown, Fuchs et al. 1991] mechanically into a definite-clause translation grammar. This would allow those inference rules to be interpreted directly, so that they could be applied to test cases and the rules themselves would be, in some sense, a reference implementation of XML Schema. The paper was undertaken initially in order to help the author (and possibly others) understand how this idea might work; it has served its purpose for the author, and the author hopes it serves some purpose for others as well. The paper has grown a bit in the working out, and parts of the original conception are now to be written up in companion papers, — in particular, the inference rules of [Brown, Fuchs et al. 1991] do not make any appearance, since it seems simpler to translate the rules of [W3C 2001b] directly into Prolog.
2.	Here I am obliged to use the term attribute in the sense defined by [Knuth 1968]. In order to distinguish the attributes of elements in an XML document from attributes in this sense, I will where necessary refer to the former as XML attributes and to the latter either as grammatical attributes or (adopting the terminology of XML information sets) as properties.
3.	The careful reader may now be objecting that the rules as given don't say anything explicit about the sequences Q, Qa, and Qb. That's true, and so before rules like those shown can be used by a Prolog system, they must be translated into an equivalent but less convenient form in which the bookkeeping necessary for keeping track of the lists and their concatenation is handled by additional arguments. The process is purely mechanical and is illustrated below.
4.	In a more adequate grammar of English, of course, some would be recognized as compatible with singular nouns, too: “Every man loves some woman.”
5.	This EBNF ignores some Prolog constructs like curly braces; it is therefore not exact or complete.
6.	For small documents, the parser returns the entire document in the list structure described; for large documents, alternative calls are provided to avoid having to have the entire document in memory at a time. I will here use only the simpler all-at-once form of the parser.
7.	See [Grune/Jacobs 1990] sec. 2.3.2.3 (pp. 35-37) for discussion of this point.
8.	[Pereira/Shieber 1987] provides some examples of the kind of embedded interpreter described here, including a left-corner parser which does not suffer from infinite loops on left-recursive grammar rules.