Introduction to Parsing with GXPARSE

1 Introduction to Parsing with GXPARSE

GXPARSE comprises three subsystems:

GXPARSE was used to translate this document from docbook XML to HTML.

The software is under the Lesser GNU General Public License.

The software requires Java 1.5 and uses Java generic types.

1.1 Parsing XML

The parser presents every element to a single element method. When the method is invoked, the start tag has been read and is at the top of a stack of start tags, representing the most recently encountered element and all of its containing elements. The method can do some work related to the beginning of the element, before giving control back to the parser by invoking a parseContent method. The parser parses the entire content of the element, passing each part to the appropriate handler, before returning control to the element method. The element method can do som work related to the end of the element, prior to exiting. When the elment method exits, the start tag will be removed from the stack.

1.1.1 Element Handlers

As a simple example, consider a requirement to convert itemized lists from docbook XML to unordered lists in HTML. Two element handlers are needed to place the appropriate HTML tags around the content of the entire list, and around the content of each item in the list:

// <itemizedlist> handler produces a HTML <ul>
public class Element_itemizedlist
        implements ElementListener<Parser<Docbook>>
{
    public void doElement(Parser<Docbook> parser,
                          CurrentElement element)
            throws ListenerException, IOException
    {
        Docbook docbook = parser.getGlobal();
        docbook.writerStack.write(docbook.lineSeparator);
        docbook.writerStack.write("<ul>");
        element.parseContent();
        docbook.writerStack.write(docbook.lineSeparator);
        docbook.writerStack.write("</ul>");
    }
}

// <listitem> handler produces a HTML <li>
public class Element_listitem
        implements ElementListener<Parser<Docbook>>
{
    public void doElement(Parser<Docbook> parser,
                          CurrentElement element)
            throws ListenerException, IOException
    {
        Docbook docbook = parser.getGlobal();
        docbook.writerStack.write(docbook.lineSeparator);
        docbook.writerStack.write("<li>");
        element.parseContent();
        docbook.writerStack.write("</li>");
    }
}

Every element requires a handler, but many elements can go to a generic handler that simply passes content straight through to the output:

// generic handler passes content without change
public class DefaultHandler
        implements ElementListener<Parser<Docbook>>
{
    public void doElement(Parser<Docbook> parser,
                          CurrentElement element)
            throws ListenerException, IOException
    {
        element.parseContent();
    }
}

1.1.2 Character Handlers

When the content of an element is character data, the content will be sent to a character handler. The simplest form of character handler simply copies the character data to the output:

    public void characters(Parser<Global> parser, CharSequence charSequence)
            throws ListenerException, IOException {
        // copy input to the output
        writerStack.append(charSequence);
    }

Some applications will require more elaberate character handlers. For example, when the output is HTML, ceratin characters must be replaced by character entities:

    public void characters(final Parser<Docbook> parser,
                           final CharSequence charSequence)
            throws ListenerException, IOException
    {
        Docbook docbook = parser.getGlobal();
        // for HTML, convert '&', '<', '>' to entities, write
        // other characters as themselves
        for (int i = 0 ; i < charSequence.length() ; ++i) {
            char chr = charSequence.charAt(i);
            switch (chr) {
                case '>':
                    docbook.writerStack.write("&gt;");
                    break;

                case '<':
                    docbook.writerStack.write("&lt;");
                    break;

                case '&':
                    docbook.writerStack.write("&amp;");
                    break;

                default:
                    docbook.writerStack.write(chr);
                    break;
            }
        }
    }

Whether character processing is simple or complex, design is made easier by the fact that the characters presented to the character handler are no more and no less than the content that needs to be processed by the handler.

1.1.3 Listeners and Element Mappers

The parser invokes methods in a listener. The listener has a single element method for all elements. That element handler invokes an element method that sends each elements to the appropriate element handler. There are a number of other methods in the listener, but some applications will need only an element method and a character-handling method:

// Listener receives content from parser and sends the
// content to handlers
public class DocbookListener
        extends AbstractErrorListener<Parser<Docbook>>
{
    private ElementMapper elementMapper;

    public DocbookListener(Writer writer)
        throws IOException
    {
        super(writer);
        // more code to set up the element mapper
    }

    // send elements to element handlers
    public void doElement(Parser<Docbook> parser,
                          CurrentElement element)
            throws ListenerException, IOException {
        elementMapper.doElement(parser, element);
    }

    // handle character data
    public void characters(final Parser<Docbook> parser,
                           final CharSequence charSequence)
            throws ListenerException, IOException
    {
        Docbook docbook = parser.getGlobal();
        // for HTML, convert '&', '<', '>' to entities, write
        // other characters as themselves
        for (int i = 0 ; i < charSequence.length() ; ++i) {
            char chr = charSequence.charAt(i);
            switch (chr) {
                case '>':
                    docbook.writerStack.write("&gt;");
                    break;

                case '<':
                    docbook.writerStack.write("&lt;");
                    break;

                case '&':
                    docbook.writerStack.write("&amp;");
                    break;

                default:
                    docbook.writerStack.write(chr);
                    break;
            }
        }
    }
}

1.1.4 The XML Parser Interface

The parser receives a data stream specified by an instance of Input and invokes methods in a Listener. An invocation of a Parser might look like this:

    public class AppData extends Global
    {
        // persistent data for the application
    }

    AppData appData;
    Input input;
    ParserFactory<AppData> parserFactory;
    Listener<AppData> myListener;

        try {
            Parser<appData>parser = parserFactory.newParser();
            parser.setListener(myListener)
            parser.parse(input, appData);
        } catch (ListenerException x) {
            // handle an exception from application
        } catch (ParseException x) {
            // handle an exception from API
        } catch (IOException x) {
            // handle an input or output exception from API or application
        }

1.2 Character Stream Output Tools

The processing required for character data depends on the element currently being processed. All of of element content and all of the information about an element is discarded after the element handler exits back to the parser. The dependency and the discarding of data both create problems when processing data sequentially. Some of those problems are solved by the WriterStack and the Resequencer.

1.2.1 The WriterStack

When a document has many different elements, a an element-dependent character handler would be quite complex. Program design would be simpler if the character handler could handle characters in the same way for every element. Identical hendling in all elements is possible if each element handler has the opportunity to capture the output whenever that element handler requires special handling for the character content. The WriterStack makes it possible for an element handler to capture the output of the character handler by temporarily redirecting the character handler to a different Writer.

Consider an application in which titles require special handling. The character content must go to the output, as it would for many element. In addition, the titles must be entered in a table of contents. The title element handler can push a CharArrayWriter on the WriterStack to capture the output of the character handler, and then give control back tot he parser to parse the element content. After the content has been parsed, the CharArrayWriter holds the title text, and is popped from the WriterStack, and converted to a String. The String is written to the WriterStack, which is now directed back to the normal destination, and to the table of contents:

public class Element_title
        implements ElementListener<Parser<Docbook>>
{
    public void doElement(Parser<Docbook> parser, CurrentElement element)
            throws ListenerException, IOException
    {
        Docbook docbook = parser.getGlobal();
        CharArrayWriter titleWriter = new CharArrayWriter();
        docbook.writerStack.push(titleWriter);
        element.parseContent();
        docbook.writerStack.pop();
        String title = titleWriter.toString();
        // send character data to the normal output
        docbook.writerStack.write("<h1>" + title "</h1>")
        // more code to save title for table of contents ...
    }
}

The WriterStack makes Unix-like redirection available from within a Java program. Element handlers can control character processing by redirecting the output of a simple character handler that does only generic processing (like changing ampersand and angle brackets to entities to produce HTML character content). The simpler character handler is likely to be much faster than one that does element-dependent processing.

1.2.2 The Resequencer

A frequent problem in sequential processing is the need to output at from input that is not available at the time when the output is required. For example, an IDREF attribute usually identifies a place where output is needed, but the necessary data is identified by an ID attribute somewhere else in the document. The Resequencer solves this problem by allowing a sequential output to be treated almost like a random access output device.

1.2.2.1 The ResequencingWriter

A ResequencingWriter is a Writer that implements the Resequencer interface.

1.2.2.2 The ResequencingWriterStack

A ResequencingWriterStack is a WriterStack that implements the Resequencer interface. It can be used both to write the text of a title out when the title is encountered, and to write the same text whenever a reference to the title is encountered.

public class Element_title
        implements ElementListener
{
    public void doElement(Parser parser,
                          CurrentElement element)
            throws ListenerException, IOException {

        // capture title in String so that it can be used twice
        CharArrayWriter titleWriter = new CharArrayWriter();
        resequencingWriterStack.push(titleWriter);
        element.parseContent();
        resequencingWriterStack.pop();
        String title = titleWriter.toString();

        // output title to document
        resequencingWriterStack.write(
                blankLine + "=== " + title + " ===");

        // if parent has an ID attribute, associate the title
        // with the ID value
        Attribute id = element.getParent().getAttribute(ATT_ID);
        if (id != null) {
            // the mark is created if it does not already exist
            Resequencer.Mark mark
                  = resequencingWriterStack.mark(id.getValue())
            resequencingWriterStack.mark(id.getValue()).write(title);
        }
    }
}

A ref element indicates a point at which the title is to be printed (along with other information.

public class Element_ref
        implements ElementListener<Parser<Docbook>>
{
    public void doElement(Parser<Global> parser,
                          CurrentElement element)
            throws ListenerException, IOException {
        Attribute idref = element.getAttribute(ATT_IDREF);
        if (idref != null) {
            // insert a mark for the IDREF, data for the mark
            // will be supplied at some other time (before or
            // after this event!)
            resequencingWriterStack.write(lineSeparator + "See \"");
            // a mark is created if it does not already exist
            resequencingWriterStack.writeMark(idref.getValue());
            resequencingWriterStack.write("\"");
        }
        element.parseContent();
    }
}

A Resequencer gives most of the advantages of random access to the output, because the program can issue a command to "write" the data (by writing a mark) whenever the need is indicated by the input, and can accept the actual data at any time before or after the data was needed. The Resequencer simple stores all of the data (ordinary data, marks, and data for the marks) until it is closed. At that time, the Resequencer puts everything together in the right sequence, and sends it all to the final destination.

1.3 Pattern Matching

The Java package java.util.regex supplies potent tools for matching regular expressions in strings and character sequences. The package ca.gorman.util.scan extends the Java package to matching sequences of characters as they pass through subclasses of Reader or Writer. A pattern (either a regular expression, or one derived from regular expressions) can used to identify a character sequence in the input or output stream and, optionally, to hold that character sequence (or parts of the sequence) for further processing. The patterns are sufficiently flexible to describe sequences (such as nested sets of parentheses) that cannot be described by regular expressions.

The patterns can be specified, with an associated action that is relevant to the matched text, in a subclass of ScanRule:

    public interface ScanMatch
    {
        MatchedText match(ScanBuffer scanBuffer)
                throws IOException;
    }

    public interface ScanAction
    {
        void action(ScanState scanState, MatchedText matchedText)
                throws IOException;
    }

    public interface ScanRule extends ScanRule, ScanAction {}

The scanning engine invokes the match method to request the ScanRule to test the input buffer for a match. The ScanRule either returns the matched text (possibly of zero length) or returns null to indicate no match. If matched text is returned, the scanning engine invokes the action method on the matched text.

Testing for a match is done by the ScanRule, rather than the scanning engine. It is therefore possible to create a ScanRule to recognize complex text structures by using other ScanRules to recognize smaller (and presumably simpler) parts of those structures.

The action method receives a character sequence that has been matched and removed from the match buffer, and also has access to the scanner. The action method can send the character sequence to the default scanner output, do something else with the data, or do nothing at all (which effectively discards the data).

1.3.1 Simple Pattern Matching

As an illustration of the programing style for the pattern matching package, consider a requirement to extract words and numbers from the string "word 12345 Word m1xed" and list them, one to a line:

    word
    12345
    Word

Two rules are needed, one ScanRule to recognize and print text and one ScanRule to recognize and discard text:

    /** Factory-generated regex match. */
    private static class RegexMatch implements ScanMatch
    {
        private final ScanMatch regexMatch;

        public RegexMatch(String regex)
        {   regexMatch = regexFactory.newRegexMatch(regex); }

        public MatchedText match(ScanBuffer scanBuffer)
                throws IOException
        {   return regexMatch.match(scanBuffer); }
    }

    /** Match and print text */
    private static class PrintRule
            extends RegexMatch implements ScanRule
    {
        public PrintRule(String regex)
        {   super(regex);   }

        public void action(ScanState scanState,
                           MatchedText matchedText)
                throws IOException
        {
            // output matched character sequence and a newline
            scanState.out().append(matchedText);
            scanState.out().append('\n');
        }
    }

    /** Match and discard text */
    private static class SkipRule
            extends RegexMatch implements ScanRule
    {
        public PrintRule(String regex)
        {   super(regex);   }

        public void action(ScanState scanState,
                           MatchedText matchedText)
                throws IOException
        {
            // no output, discard the text
        }
    }

Instances of PrintRule and SkipRule are constructed from regular expressions and used to construct a RuleList:

    String wordPattern = "\\b\\p{Alpha}+\\b";
    String numberPattern = "\\b\\p{Digit}+\\b";
    String nonblankPattern = "[^\\s]+";
    String whitespacePattern = "\\s+";

    RuleList ruleList = AbstractRuleList.newInstance(
                                new SkipRule(whitespacePattern),
                                new PrintRule(wordPattern),
                                new PrintRule(numberPattern),
                                new SkipRule(nonblankPattern));

One form of the scanner can read from any CharSequence (like a String) or Reader and write to any Appendable (like a Writer or StringBuffer):

    ScannerFactory scannerFactory = ScannerFactory.newInstance();
    Scanner scanner = scannerFactory.newScanner();

    String string = "word 12345 Word m1xed";
    StringReader reader = new StringReader(string);
    StringBuffer stringBuffer = new StringBuffer;
    Writer writer = new FileWriter("filename");
    
    scanner.scan(string, writer, ruleList);
    scanner.scan(reader, stringBuffer, ruleList);

A second form of the scanner is a Reader:

            Writer output;

            // create the scanner as a Reader
            Reader scanningReader = new ScanningReader(
                                            reader, ruleList);

            // scan
            for ( int chr = scanningReader.read()
                ; chr >= 0
                ; chr = scanningReader.read()
                ) {
                output.write(chr);
            }

A third form of the scanner is a Writer:

            Reader input;

            // create the scanner as a Writer
            Writer scanningWriter = new ScanningWriter(
                                            writer, ruleList);

            // scan
            for ( int chr = input.read()
                ; chr >= 0
                ; chr = input.read()
                ) {
                scanningWriter.write(chr);
            }

1.3.2 Complex Pattern Matching

A ScanRule that recognizes complex structures can be based on several ScanMatches that recognize simpler structures. For example, a single ScanRule to recognize both number strings and letter strings can be built up from two ScanMatches, one to recognize number strings and one to recognize letter strings:

/** Match and print text */
public class PrintRule implements ScanRule
{
    private final Modifier modifier = Modifier.newInstance();
    private final String wordPattern = "\\b\\p{Alpha}+\\b";
    private final String numberPattern = "\\b\\p{Digit}+\\b";
    private final ScanMatch wordMatch
            = regexFactory.newRegexMatch(wordPattern);
    private final ScanMatch numberMatch
            = regexFactory.newRegexMatch(numberPattern);
    private final ScanMatch printMatch
            = modifier.anyOf(wordMatch, numberMatch);
    private final ScanMatch scanMatch
            = modifier.oneOrMore(printMatch);

    public MatchedText match(ScanBuffer scanBuffer)
            throws IOException
    {
        return scanMatch.match(scanBuffer);;
    }

    public void action(ScanState scanState,
                       MatchedText matchedText)
            throws IOException
    {
        scanState.out().write(matchedText);
        scanState.out().write('\n');
    }
}

The scanner can be created with a single ScanRule for recognizing and printing text:

            StringReader input = new StringReader(
                    "word 12345 Word m1xed");
            StringWriter output = new StringWriter();
    
            // create the scanner as a Reader
            ScanningReader scanningReader = new ScanningReader(
                                input,
                                new SkipRule(whitespacePattern),
                                new PrintRule(),
                                new SkipRule(nonblankPattern));

            // scan
            for ( int chr = scanningReader.read()
                ; chr >= 0
                ; chr = scanningReader.read()
                ) {
                output.write(chr);
            }