# Text Mining on DerStandard Comments

I did some text mining on DerStandard.at before, back then primarily interested in the comment count per article. What has been a simple HTTP GET and passing the response to BeautifulSoup requires a more sophisticated approach today. Things change, webcontent is created dynamically and we have to resort to other tools these days.

Headless browsers provide a JavaScript API, useful for retrieving the desired data after loading the page. My choice fell on phantomjs, available on pacman:

pacman -Qi phantomjs | head -n3

Name                     : phantomjs
Version                  : 2.1.1-3
Beschreibung             : Headless WebKit with JavaScript API


Since my JavaScript skills were close to non-existent, writing the script was the hard part. After some copypasta and trial-and-error coding, inevitably running into scoping and async issues, this clobbed together piece of code actually works! It reads the URL as its first and the pagenumber in the comment section as its second argument. The innerHTML from a .postinglist element is written to a content.html in the directory the parse-postinglists.js has been invoked from. Actually the data is appended to the file, that’s why I can loop over the available comment pages in R:

for (i in 1:43) {
system(paste("phantomjs parse-postinglists.js",
"http://derstandard.at/2000043417670/Das-Vokabular-der-Asylkritiker",
i, sep=" "))
}


The article of interest is itself a quantitative analysis of word frequencies in recent forum debates on asylum. Presentation and insights are somewhat underwhelming though. After all, there is a lot of information collected and stored. Some of which, found in the attributes, is perfectly appropriate for the metadata in my Corpus object (created with the help of the tm package1), as can be seen from the first document in it:

meta(corp[[1]])

author       : Darth Invadeher
datetimestamp: 16-09-06 14:33
pos          : 104
neg          : 13

pid is the unique posting ID, a parentpid value is applied when this particular posting refers to another posting, i.e. is a followup posting. This opens up the possibility to relate authors to each other and probably a lot more. badges doesn’t fit too well as an attribute name, it actually denotes the follower count in that forum. pos and neg show the positive resp. negative rating count on that particular posting. At the time of this analysis there were 1064 documents (i.e. postings) in the corpus.
R