library(RCurl)   # getCurlMultiHandle(), getCurlHandle(), push(), complete()
library(XML)     # xmlEventParse()

uri = "http://www.omegahat.net/index.html"
# The second assignment overrides the first; this is the document that is parsed.
uri = "http://www.omegahat.net/RCurl/philosophy.xml"

multiHandle = getCurlMultiHandle()
streams = HTTPReaderXMLParser(multiHandle, save = TRUE)
curl = getCurlHandle(URL = uri, writefunction = streams$getHTTPResponse)
multiHandle = push(multiHandle, curl)

links = downloadLinks(multiHandle, "http://www.omegahat.net", "ulink", "url", verbose = TRUE)
xmlEventParse(streams$supplyXMLContent, handlers = links, saxVersion = 2)

complete(multiHandle)

links$contents()
TRUE for the save argument.
The definition of the XML event handlers is reasonably straightforward
at this point. We need a handler function for the link element that
adds an HTTP request for the link document to the multi curl handle.
And we need a way to get the resulting text back when the request is
completed. We maintain a list of text gatherer objects in the
variable docs. These are indexed by the names of the
documents being downloaded.
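To make the role of docs concrete, here is a minimal sketch in base R of a text gatherer and of a list of gatherers keyed by URL. The function makeGatherer is a hypothetical, simplified stand-in for RCurl's basicTextGatherer: it only mimics the update and value functions used in this section. The example.org URLs and the text chunks are invented for illustration.

```r
# Hypothetical stand-in for a text gatherer: a closure that accumulates
# the chunks passed to update() and returns them joined together by value().
makeGatherer = function() {
    chunks = character()
    update = function(txt) {
        chunks <<- c(chunks, txt)
        # Report how many bytes were consumed from this chunk.
        nchar(txt, type = "bytes")
    }
    value = function() paste(chunks, collapse = "")
    list(update = update, value = value)
}

# One gatherer per document, indexed by the document's URL.
docs = list()
docs[["http://www.example.org/a.html"]] = makeGatherer()
docs[["http://www.example.org/b.html"]] = makeGatherer()

# Simulate the chunks of two responses arriving interleaved.
docs[["http://www.example.org/a.html"]]$update("<html>A, part 1; ")
docs[["http://www.example.org/b.html"]]$update("<html>B</html>")
docs[["http://www.example.org/a.html"]]$update("part 2</html>")

# Collect the text of each document, in the same way as contents().
contents = sapply(docs, function(x) x$value())
```

Because each gatherer keeps its own state in a closure, interleaved chunks from concurrent requests end up in the right document, and the names of docs double as the record of which URLs have already been requested.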
The function that processes a link element in the XML document first
checks whether the document is already being downloaded, to avoid
duplicating the work. If it is not, the function pushes a new request for
that document onto the multi curl handle and returns. This is the function
op().
There are details about dealing with relative links. We have ignored
them here and only dealt with links that have an explicit
http: prefix.
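One way to handle relative links would be to resolve them against a base URL before testing for the http: prefix. The following is a rough sketch in base R only. The helper resolveLink is hypothetical and is not part of RCurl or XML. It assumes that base is an absolute URL, and it covers only the simple cases (links that are already absolute, root-relative paths, and paths relative to the directory of the base document), not the full rules for resolving URIs such as ".." segments or fragments.

```r
# Hypothetical helper: resolve the link u against the absolute URL base.
resolveLink = function(u, base) {
    # Already absolute (has a scheme such as http: or ftp:): leave unchanged.
    if(grepl("^[A-Za-z][A-Za-z0-9+.-]*:", u))
        return(u)

    # Scheme and host of the base, e.g. "http://www.omegahat.net".
    root = sub("^([A-Za-z][A-Za-z0-9+.-]*://[^/]+).*$", "\\1", base)

    # Root-relative link: append it to the scheme and host.
    if(substring(u, 1, 1) == "/")
        return(paste0(root, u))

    # Otherwise, resolve relative to the directory of the base document.
    path = substring(base, nchar(root) + 1)
    dir = sub("[^/]*$", "", path)
    if(dir == "")
        dir = "/"
    paste0(root, dir, u)
}

base = "http://www.omegahat.net/RCurl/philosophy.xml"
absolute = resolveLink("http://www.example.org/x.html", base)
fromRoot = resolveLink("/index.html", base)
sibling = resolveLink("FAQ.html", base)
```

With such a helper, the handler could call resolveLink on the attribute value using the base argument of downloadLinks, which the version shown here accepts but does not use.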
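downloadLinks =
function(curlm, base, elementName = "a", attr = "href", verbose = FALSE)
{
    # Text gatherers for the linked documents, indexed by URL.
    docs = list()

    # Text of each linked document, available once the requests have completed.
    contents = function() {
        sapply(docs, function(x) x$value())
    }

    ans = list(docs = function() docs,
               contents = contents)

    # Handler for elements named elementName.
    op = function(name, attrs, ns, namespaces) {
        if(attr %in% names(attrs)) {
            u = attrs[[attr]]

            # Relative links are ignored; only explicit http: URLs are followed.
            if(length(grep("^http:", u)) == 0)
                return(FALSE)

            # Request each document only once.
            if(!(u %in% names(docs))) {
                if(verbose)
                    cat("Adding", u, "to document list\n")
                write = basicTextGatherer()
                curl = getCurlHandle(URL = u, writefunction = write$update)
                curlm <<- push(curlm, curl)
                docs[[u]] <<- write
            }
        }
        TRUE
    }

    # [[ ]] stores the function as a list element under the element name.
    ans[[elementName]] = op
    ans
}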
[1] The creation of the regular curl handle and pushing it onto the multiHandle stack is equivalent to
handle = getURLAsynchronous(uri,
                            write = streams$getHTTPResponse,
                            multiHandle = multiHandle, perform = FALSE)