htmlTreeParse when the content is known to be (potentially malformed) HTML.
xmlTreeParse(file, ignoreBlanks = TRUE, handlers = NULL,
             replaceEntities = FALSE, asText = FALSE, trim = TRUE,
             validate = FALSE, getDTD = TRUE, isURL = FALSE,
             asTree = FALSE, addAttributeNamespaces = FALSE,
             useInternalNodes = FALSE, isSchema = FALSE)

htmlTreeParse(file, ignoreBlanks = TRUE, handlers = NULL,
              replaceEntities = FALSE, asText = FALSE, trim = TRUE,
              isURL = FALSE, asTree = FALSE, useInternalNodes = FALSE)
isURL.
Additionally, the file can be compressed (gzip) and is read directly without the user having to decompress (gunzip) it.
startElement(), endElement(), comment(), externalEntity(), entityDeclaration(), processingInstruction(), cdata(), text().
file argument refers to a URL (accessible via ftp or http) or a regular file on the system.
If asText is TRUE, this should not be specified.
The function attempts to determine whether the
data source is a URL by using
to look for http or ftp at the start of the string.
The libxml parser handles the connection to servers,
not the S-Plus facilities (e.g.
).
handlers argument and is then used to determine whether the DOM tree or the handlers object should be returned (if FALSE, the handlers object is returned).
TRUE, an attribute such as xsi:type="xsd:string" is reported with the name xsi:type.
If it is FALSE, the name of the attribute is type.
TRUE and no handlers are provided, the return value is a reference to the internal C-level document pointer.
This can be used to do post-processing via XPath expressions using the getNodeSet function.
Note, however, that the internal C-level document pointers are not currently implemented in S-Plus.
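As a minimal sketch of the internal-node workflow described above, the following parses a small inline document and queries it with XPath. The XML text and the variable names are made up for illustration; only xmlTreeParse, getNodeSet, xmlGetAttr and xmlValue come from the XML package.

```r
library(XML)

# Hypothetical inline document, used only for illustration.
txt <- "<doc><item id='a'>1</item><item id='b'>2</item></doc>"

# With useInternalNodes = TRUE and no handlers, the result is a reference
# to the C-level document rather than a tree of R-level nodes.
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Post-process the internal document with an XPath expression.
items <- getNodeSet(doc, "//item")

# Extract the id attribute and the text content of each matched node.
ids  <- sapply(items, xmlGetAttr, "id")
vals <- sapply(items, xmlValue)
```

Because the nodes stay at the C level, nothing is converted to R objects until a function such as xmlValue is applied to the matched nodes.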
The handlers argument is used similarly
to those specified in
.
When an XML tag (element) is processed, we look for a function in this collection with the same name as the tag.
If none is found, we look for one named startElement.
If that is not found either, we use the default built-in converter.
The same lookup applies to comments, entity references, cdata, processing instructions, etc.
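The lookup order described above can be sketched as follows. The inline document, the tag names a and b, and the seen vector are assumptions made for this illustration; the handler for b should be picked for the b element, while the other elements fall back to startElement.

```r
library(XML)

# Hypothetical inline document, used only for illustration.
txt <- "<doc><a>one</a><b>two</b></doc>"

# Records which handler was invoked for which element.
seen <- character()

res <- xmlTreeParse(txt, asText = TRUE, asTree = TRUE,
  handlers = list(
    # Tag-specific handler: used for elements named "b".
    b = function(node, ...) {
      seen <<- c(seen, "b handler")
      node
    },
    # Generic fallback for elements without a tag-specific handler.
    startElement = function(node, ...) {
      seen <<- c(seen, paste("startElement:", xmlName(node)))
      node
    }
  ))

print(seen)
```

Each handler returns the node unchanged, so the resulting tree is the same as the default one; only the side effect on seen differs.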
The default entries should be named comment, startElement, externalEntity, processingInstruction, text, cdata and namespace.
All but the last should take the XMLNode as their first argument.
In the future, other information may be passed via ..., for example, the depth in the tree.
Specifically, the second argument will be the parent node into which the node is being added. This is not currently implemented, so the argument should have a default value (NULL).
The namespace function is called with a single argument which is an object of class XMLNameSpace. This contains

  id: the namespace identifier as used to qualify tag names;
  uri: the value of the namespace identifier, i.e. the URI identifying the namespace;
  local: a logical value indicating whether the definition is local to the document being parsed.

Note that the namespace handler is called before the node in which the namespace definition occurs and its children are processed. This differs from the other handlers, which are called after the child nodes have been processed.
Each of these functions can return arbitrary values that are then
entered into the tree in place of the default node passed to the
function as the first argument. This allows the caller to generate
the nodes of the resulting document tree exactly as they wish. If the
function returns
NULL
, the node is dropped from the resulting
tree. This is a convenient way to discard nodes having processed their
contents.
handlers is provided, and asTree == FALSE, the handlers object is returned (it is assumed to be carrying its return information in a frame, or to have saved it elsewhere).
By default, an object of class XML doc is returned, which contains fields/slots named file, version and children.
XMLNode.
These are made up of 4 fields.
XMLNode, such as XMLComment, XMLProcessingInstruction, XMLEntityRef are used.
If the value of the argument getDTD is TRUE, the return value is a list of length 2. The first element is the document as described above. The second element is a list containing the external and internal DTDs. Each of these contains 2 lists: one for elements and another for entities. See .
Make sure that the necessary 3rd party libraries are available.
Duncan Temple Lang
http://xmlsoft.org, http://www.w3.org/xml
fileName <- system.file("exampleData", "test.xml", package="XML")

# parse the document and return it in its standard format.
xmlTreeParse(fileName)

# parse the document, discarding comments.
xmlTreeParse(fileName, handlers=list("comment"=function(x,...){NULL}), asTree = TRUE)

# print the element names
invisible(xmlTreeParse(fileName,
            handlers=list(startElement=function(x, ...) {
                            cat("In element", x$name, x$value, "\n")
                            x
                          }),
            asTree = TRUE))

# Parse some XML text.
# Read the text from the file
xmlText <- paste(readLines(fileName, -1), "\n", collapse="")
xmlTreeParse(xmlText, asText=TRUE)

# Read a MathML document and convert each node
# so that the primary class is
#   <name of tag>MathML
# so that we can use method dispatching when processing
# it rather than conditional statements on the tag name.
# See plotMathML() in examples/.
fileName <- system.file("exampleData", "mathml.xml", package="XML")
m <- xmlTreeParse(fileName,
                  handlers=list(
                    startElement = function(node){
                      cname <- paste(xmlName(node), "MathML", sep="", collapse="")
                      class(node) <- c(cname, class(node))
                      node
                    }))

# This should raise an error.
try(xmlTreeParse(system.file("exampleData", "TestInvalid.xml", package="XML"),
                 validate=TRUE))

# Parse an XML document directly from a URL.
# Requires Internet access.
xmlTreeParse("http://www.omegahat.org/Scripts/Data/mtcars.xml")