XML Parser

DESCRIPTION:

Parses an XML or HTML file or string, and generates an S-Plus structure representing the XML/HTML tree. Use htmlTreeParse when the content is known to be (potentially malformed) HTML.

USAGE:

xmlTreeParse(file, ignoreBlanks = TRUE, handlers = NULL,
             replaceEntities = FALSE, asText = FALSE, trim = TRUE,
             validate = FALSE, getDTD = TRUE, isURL = FALSE,
             asTree = FALSE, addAttributeNamespaces = FALSE,
             useInternalNodes = FALSE, isSchema = FALSE)
htmlTreeParse(file, ignoreBlanks = TRUE, handlers = NULL,
              replaceEntities = FALSE, asText = FALSE, trim = TRUE,
              isURL = FALSE, asTree = FALSE,
              useInternalNodes = FALSE)

ARGUMENTS:

file
The name of the file containing the XML contents. This can contain ~ which is expanded to the user's home directory. It can also be a URL. See isURL. Additionally, the file can be compressed (gzip) and is read directly without the user having to de-compress (gunzip) it.
ignoreBlanks
logical value indicating whether text elements made up entirely of white space should be included in the resulting `tree'.
handlers
Optional collection of functions used to map the different XML nodes to S-Plus objects. This is a named list of functions, and a frame can be used to provide local data. This provides a way of filtering the tree as it is being created, adding or removing nodes, and generally processing them as they are constructed in the C code. The standard functions are startElement(), endElement(), comment(), externalEntity(), entityDeclaration(), processingInstruction(), cdata(), text().
replaceEntities
logical value indicating whether to substitute entity references with their text directly. This should be left as FALSE. The text still appears as the value of the node, but there is more information about its source, allowing the parse to be reversed with full reference information.
asText
logical value indicating that the first argument, `file', should be treated as the XML text to parse, not the name of a file. This allows the contents of documents to be retrieved from different sources (e.g. HTTP servers, XML-RPC, etc.) and still use this parser.
trim
logical value indicating whether to strip white space from the beginning and end of text strings.
validate
logical value indicating whether to use a validating parser, i.e. whether to check the contents against the DTD specification. If this is TRUE, warning messages will be displayed about errors in the DTD and/or document, but the parsing will proceed unless a fatal error is encountered.
getDTD
logical flag indicating whether the DTD (both internal and external) should be returned along with the document nodes. This changes the return type.
isURL
logical value indicating whether the file argument refers to a URL (accessible via ftp or http) or a regular file on the system. If asText is TRUE, this should not be specified. The function attempts to determine whether the data source is a URL by checking for http or ftp at the start of the string. The libxml parser handles the connection to servers, not the S-Plus facilities.
asTree
logical value that only applies when one passes a value for the handlers argument. It then determines whether the DOM tree or the handlers object is returned (if FALSE, the handlers object is returned).
addAttributeNamespaces
a logical value indicating whether to return the namespace in the names of the attributes within a node or to omit them. If this is TRUE, an attribute such as xsi:type="xsd:string" is reported with the name xsi:type. If it is FALSE, the name of the attribute is type.
useInternalNodes
a logical value indicating whether to call the converter functions with objects of class XMLInternalNode rather than XMLNode. This should make things faster as we do not convert the contents of the internal nodes to explicit S-Plus objects. Also, it allows one to access the parent and ancestor nodes. However, since the objects refer to volatile C-level objects, one cannot store these nodes for use in further computations within S-Plus -- they disappear after processing of the XML document is completed. If this argument is TRUE and no handlers are provided, the return value is a reference to the internal C-level document pointer. This can be used to do post-processing via XPath expressions using the getNodeSet function. Note, however, that the internal C-level document pointers are not currently implemented in S-Plus.
isSchema
a logical value indicating whether the document is an XML schema. Passing TRUE is currently not supported in S-Plus, but it is in the argument list for compatibility.
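
As a rough illustration of the asText, ignoreBlanks and trim arguments described above, the following sketch parses a short in-memory string. It assumes the R version of the XML package is attached; the document text and the variable names are made up for this example.

```r
library(XML)

# Hypothetical XML text, made up for this illustration.
txt <- "<doc>\n  <a>  one  </a>\n  <b/>\n</doc>"

# asText = TRUE: treat the first argument as XML content, not a file name.
doc <- xmlTreeParse(txt, asText = TRUE, ignoreBlanks = TRUE, trim = TRUE)
root <- xmlRoot(doc)

print(xmlName(root))          # name of the top-level element
print(names(root))            # names of its child nodes
print(xmlValue(root[["a"]]))  # text of <a>, after any trimming

# Same text parsed with trim = FALSE, to compare the text of <a>.
raw <- xmlTreeParse(txt, asText = TRUE, trim = FALSE)
print(xmlValue(xmlRoot(raw)[["a"]]))
```

Comparing the two printed values of <a> shows what trim changes for a given document.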

DETAILS:

The handlers argument is a named collection of functions. When an XML tag (element) is processed, we look for a function in this collection with the same name as the tag's name. If this is not found, we look for one named startElement. If this is not found, we use the default built-in converter. The same works for comments, entity references, cdata, processing instructions, etc. The default entries should be named comment, startElement, externalEntity, processingInstruction, text, cdata and namespace. All but the last should take the XMLNode as their first argument. In the future, other information may be passed via ..., for example, the depth in the tree. Specifically, the second argument will be the parent node into which the node is being added, but this is not currently implemented, so it should have a default value (NULL).

The namespace function is called with a single argument which is an object of class XMLNameSpace. This contains
id
the namespace identifier as used to qualify tag names.
uri
the value of the namespace identifier, i.e. the URI identifying the namespace.
local
a logical value indicating whether the definition is local to the document being parsed.

Note that the namespace handler is called before the node in which the namespace definition occurs, and before that node's children are processed. This differs from the other handlers, which are called after the child nodes have been processed.

Each of these functions can return arbitrary values that are then entered into the tree in place of the default node passed to the function as the first argument. This allows the caller to generate the nodes of the resulting document tree exactly as they wish. If the function returns NULL, the node is dropped from the resulting tree. This is a convenient way to discard nodes having processed their contents.
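
The lookup and replacement rules above can be sketched as follows. This is only an illustration, assuming the R version of the XML package; the document text, the tag names item and skip, and the variable seen are hypothetical.

```r
library(XML)

# Hypothetical document, made up for this illustration.
txt <- paste("<doc><!-- a note -->",
             "<item id='1'>first</item>",
             "<skip>ignored</skip>",
             "<item id='2'>second</item></doc>", sep = "")

seen <- character()

handlers <- list(
  # Matched by tag name: called for every <item> element.
  item = function(node, ...) {
    seen <<- c(seen, xmlValue(node))  # record the text, keep the node
    node
  },
  # Returning NULL drops the node from the resulting tree.
  skip = function(node, ...) NULL,
  comment = function(node, ...) NULL
)

# asTree = TRUE returns the tree rather than the handlers object.
doc <- xmlTreeParse(txt, asText = TRUE, handlers = handlers, asTree = TRUE)
root <- xmlRoot(doc)

print(names(root))  # children that survived the handlers
print(seen)         # text collected from the <item> elements
```

Collecting results in a variable outside the handlers, as seen does here, is one way to carry information out of the parse when the tree itself is not needed.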

VALUE:

If handlers is provided, and asTree == FALSE, the handlers object is returned (it is assumed to be carrying its return information in a frame, or to have saved it elsewhere). By default, an object of class XML doc is returned, which contains fields/slots named file, version and children.
file
The (expanded) name of the file containing the XML.
version
A string identifying the version of XML used by the document.
children
A list of the XML nodes at the top of the document. Each of these is of class XMLNode. These are made up of 4 fields.
name
The name of the element.
attributes
For regular elements, a named list of the XML attributes given in the element's tag.
children
List of sub-nodes.
value
Used only for text entries. Some node specializations of XMLNode, such as XMLComment, XMLProcessingInstruction and XMLEntityRef, are also used.

If the value of the argument getDTD is TRUE, the return value is a list of length 2. The first element is the document as described above. The second element is a list containing the external and internal DTDs. Each of these contains 2 lists: one for elements and another for entities.
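
A minimal sketch of walking the returned structure, assuming the R version of the XML package and its accessor functions xmlRoot, xmlName, xmlAttrs, xmlChildren and xmlValue. The document text is hypothetical.

```r
library(XML)

# Hypothetical document, made up for this illustration.
txt <- "<inventory site='north'><part sku='A1'>bolt</part><part sku='B2'>nut</part></inventory>"

doc <- xmlTreeParse(txt, asText = TRUE)
root <- xmlRoot(doc)               # top-level element of the document

print(xmlName(root))               # the name field
print(xmlAttrs(root))              # the attributes, as a named vector

# The children field is a list of sub-nodes; visit each in turn.
for (kid in xmlChildren(root)) {
  cat(xmlName(kid), xmlAttrs(kid)[["sku"]], xmlValue(kid), "\n")
}
```

Using the accessor functions rather than reaching into the fields directly keeps the code independent of how the document and DTD are packaged in the return value.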

NOTE:

Make sure that the necessary 3rd party libraries are available.

AUTHOR(S):

Duncan Temple Lang

REFERENCES:

http://xmlsoft.org, http://www.w3.org/xml

SEE ALSO:

EXAMPLES:

 fileName <- system.file("exampleData", "test.xml", package="XML")
   # parse the document and return it in its standard format.
 xmlTreeParse(fileName)

   # parse the document, discarding comments.
  
 xmlTreeParse(fileName, handlers=list("comment"=function(x,...){NULL}), asTree = TRUE)

   # print the element names
 invisible(xmlTreeParse(fileName,
            handlers = list(startElement = function(x, ...) {
                              cat("In element", x$name, x$value, "\n")
                              x
                            }),
            asTree = TRUE))

 # Parse some XML text.
 # Read the text from the file
 xmlText <- paste(readLines(fileName, -1), "\n", collapse="")
 xmlTreeParse(xmlText, asText=TRUE)

 # Read a MathML document and convert each node
 # so that the primary class is
 #   <name of tag>MathML
 # so that we can use method dispatching when processing
 # it rather than conditional statements on the tag name.
 # See plotMathML() in examples/.
 fileName <- system.file("exampleData", "mathml.xml", package="XML")
 m <- xmlTreeParse(fileName,
                   handlers = list(
                     startElement = function(node, ...) {
                       # prepend the tag-specific class, e.g. "mathMathML"
                       cname <- paste(xmlName(node), "MathML", sep="", collapse="")
                       class(node) <- c(cname, class(node))
                       node
                     }),
                   # return the converted tree, not the handlers object
                   asTree = TRUE)



  # This should raise an error.
  try(xmlTreeParse(
            system.file("exampleData", "TestInvalid.xml", package="XML"),
            validate=TRUE))


 # Parse an XML document directly from a URL.
 # Requires Internet access.
 xmlTreeParse("http://www.omegahat.org/Scripts/Data/mtcars.xml")