This meetup included an extensive Text Mining in R session with an Introduction to tm by Ingo Feinerer and a talk about Text Mining with Hadoop by Stefan Theussl.
After a creative break last month, Ingo and Stefan gave great talks covering tm in greater detail, following the brief introduction in February.
Ingo Feinerer: Introduction to tm
Ingo started right away with a nice bottom-up introduction covering tm’s building blocks such as Sources, Readers and Corpora. The creation of document-term matrices (DocumentTermMatrix in tm) was also motivated with a small clustering example for 3 documents.
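To make these building blocks concrete, here is a minimal sketch of the same kind of pipeline. It is not Ingo's example from the talk: the three toy documents, the variable names and the choice of hierarchical clustering on Euclidean distances are assumptions made purely for illustration.

```r
library(tm)

# Three toy documents (invented for this sketch, not from the talk)
texts <- c("text mining with R",
           "mining text corpora with tm",
           "hadoop scales to big data")

# Source -> Corpus: VectorSource treats each vector element as one document
corp <- VCorpus(VectorSource(texts))

# Document-term matrix: one row per document, one column per term
dtm <- DocumentTermMatrix(corp)
m <- as.matrix(dtm)

# Hierarchical clustering on distances between the term-count rows
hc <- hclust(dist(m))

# Cut the tree into two groups of documents
groups <- cutree(hc, k = 2)
print(groups)
```

With these toy texts the first two documents share most of their terms, so they should end up in the same group while the Hadoop sentence stands alone. Note that DocumentTermMatrix drops terms shorter than three characters by default, so "R", "tm" and "to" do not appear as columns.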
CRAN Package Link Package Vignette
The word cloud shown above was created from the tm package vignette as follows:
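library(tm)
library(wordcloud)
# Locate the tm vignette PDF that ships with the package
uri <- sprintf("file://%s", system.file(file.path("doc", "tm.pdf"), package = "tm"))
# Check that the pdfinfo and pdftotext command-line tools are available
stopifnot(all(file.exists(Sys.which(c("pdfinfo", "pdftotext")))))
# Read the PDF into a one-document corpus
corp <- Corpus(URISource(uri), readerControl = list(reader = readPDF))
# Collapse the document's lines into a single string
tmvignette <- paste(content(corp[[1]]), collapse = "\n")
# Remove numbers, punctuation, extra whitespace and stopwords
vigclean <- stripWhitespace(removePunctuation(removeNumbers(tmvignette)))
vigclean <- removeWords(vigclean, stopwords())
wordcloud(vigclean)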
Stefan Theussl: Text Mining with Hadoop
Stefan presented a solution for when things (i.e. text corpora) get big, using a set of Hadoop R packages he created in collaboration with Ingo.
CRAN packages:
Documentation:
Best,
-ViennaR