Text Mining on DerStandard Comments

I did some text mining on DerStandard.at before, back then primarily interested in the comment count per article. What used to be a simple HTTP GET, with the response passed to BeautifulSoup, requires a more sophisticated approach today. Things change: web content is now generated dynamically in the browser, so the raw HTML response no longer contains the comments and we have to resort to other tools.

Headless browsers load a page, execute its scripts, and expose a JavaScript API for retrieving the desired data afterwards. I chose phantomjs, which is available via pacman:

pacman -Qi phantomjs | head -n3
Name                     : phantomjs
Version                  : 2.1.1-3
Description              : Headless WebKit with JavaScript API

Since my JavaScript skills were close to non-existent, writing the script was the hard part. After some copypasta and trial-and-error coding, inevitably running into scoping and async issues, this cobbled-together piece of code actually works! It takes the URL as its first argument and the page number within the comment section as its second. The innerHTML of a .postinglist element is written to content.html in the directory from which parse-postinglists.js was invoked. More precisely, the data is appended to that file, which is why I can loop over the available comment pages in R:

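# article_url is a placeholder for the article URL (not given here);
# the script expects it as its first argument, the page number as its second.
for (i in 1:43) {
  system(paste("phantomjs parse-postinglists.js", article_url, i, sep = " "))
}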
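
The script itself is not reproduced here. As a rough illustration of the steps just described, a script of this kind could look like the following minimal sketch, which relies on PhantomJS's webpage, system and fs modules. How the page number ends up in the request is an assumption: the hypothetical helper buildPageUrl simply appends it as a "page" query parameter, which may well differ from what DerStandard.at actually expects.

```javascript
// Sketch only: buildPageUrl and the "page" query parameter are assumptions,
// not taken from the original parse-postinglists.js.
function buildPageUrl(url, pageNumber) {
  var separator = url.indexOf('?') === -1 ? '?' : '&';
  return url + separator + 'page=' + encodeURIComponent(pageNumber);
}

// The PhantomJS-specific part only runs when executed by phantomjs.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var fs = require('fs');
  var page = require('webpage').create();

  // system.args[0] is the script name; the user arguments follow.
  var target = buildPageUrl(system.args[1], system.args[2]);

  page.open(target, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + target);
      phantom.exit(1);
      return;
    }
    // This function runs inside the page; only serializable values come back.
    var html = page.evaluate(function () {
      var list = document.querySelector('.postinglist');
      return list ? list.innerHTML : '';
    });
    // Mode 'a' appends, so repeated calls accumulate in content.html.
    fs.write('content.html', html, 'a');
    phantom.exit(0);
  });
}
```

If the comments are injected some time after the load event, a sketch like this would probably need to wait or poll before calling page.evaluate, which is where async issues tend to show up.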

The article of interest is itself a quantitative analysis of word frequencies in recent forum debates on asylum. Its presentation and insights are somewhat underwhelming, though, given how much information is collected and stored with each posting. Some of it, found in the HTML attributes, is well suited as metadata in my Corpus object (created with the help of the tm package), as can be seen from the first document in it:

author       : Darth Invadeher
datetimestamp: 16-09-06 14:33
pos          : 104
neg          : 13
badges       : 0
pid          : 1014788751
parentpid    : NA
id           : 1
language     : de

pid is the unique posting ID; parentpid is set when the posting refers to another posting, i.e. is a follow-up posting. This opens up the possibility of relating authors to each other, and probably a lot more. badges doesn’t fit too well as an attribute name, since it actually denotes the author's follower count in that forum. pos and neg show the number of positive and negative ratings, respectively, on that particular posting. At the time of this analysis there were 1064 documents (i.e. postings) in the corpus.
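
To make the idea of relating authors via pid and parentpid concrete, here is a minimal sketch that turns posting metadata into "who replied to whom" pairs. The field names follow the metadata listing above; the function name replyEdges and the sample postings are made up for illustration.

```javascript
// Sketch: derive reply relations between authors from pid / parentpid.
// replyEdges and the sample data are hypothetical, not part of the analysis.
function replyEdges(postings) {
  var authorByPid = Object.create(null);
  postings.forEach(function (p) {
    authorByPid[p.pid] = p.author;
  });

  var edges = [];
  postings.forEach(function (p) {
    if (p.parentpid === null || p.parentpid === undefined) {
      return; // top-level posting, no parent
    }
    var parentAuthor = authorByPid[p.parentpid];
    if (parentAuthor === undefined) {
      return; // parent posting is not part of the corpus
    }
    edges.push({ from: p.author, to: parentAuthor });
  });
  return edges;
}

var samplePostings = [
  { pid: 1, parentpid: null, author: 'A' },
  { pid: 2, parentpid: 1, author: 'B' },
  { pid: 3, parentpid: 2, author: 'A' },
  { pid: 4, parentpid: 99, author: 'C' }
];
console.log(replyEdges(samplePostings));
```

Counting identical pairs in such a list would give a weighted author-to-author graph, one possible starting point for the "lot more" hinted at above.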

The average lifespan of an online article is rather short. Interestingly, the likelihood of getting a lot of votes diminishes even faster. That’s probably because a few posters carry on their debate long after the average voter is done with the article. So don’t be late to the party!

[Figure: SVG plot, not rendered]

The bigger part of the stored data relates to the posting content. For now, I’m interested in extracting the keywords that define the discussions. A TF-IDF weighting function scores each term in two ways: by how frequent it is within the particular posting and by how rare it is across the whole corpus. Evidently, what started as a rather one-sided reflection on terms used by asylum critics was followed by a debate about terminology. You can tell that from the dominance of “Begriff” (term), “Wort” (word), “Bezeichnung” (designation), “Ausdruck” (expression) etc:

[Figure: SVG plot, not rendered]
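
For readers unfamiliar with TF-IDF, the weighting mentioned above can be sketched in a few lines. This is a generic illustration, not the tm package's implementation: it assumes term frequency is the term count divided by the document length and inverse document frequency is log2(N / df). The function name tfIdf and the toy tokens are hypothetical.

```javascript
// Sketch of TF-IDF on tokenized documents (arrays of strings).
// Assumed definitions: tf = count / document length, idf = log2(N / df).
function tfIdf(docs) {
  var n = docs.length;
  var docFreq = Object.create(null);

  // Count terms per document and in how many documents each term occurs.
  var counts = docs.map(function (tokens) {
    var c = Object.create(null);
    tokens.forEach(function (t) {
      c[t] = (c[t] || 0) + 1;
    });
    Object.keys(c).forEach(function (t) {
      docFreq[t] = (docFreq[t] || 0) + 1;
    });
    return c;
  });

  // Combine both factors into one weight per term and document.
  return counts.map(function (c, i) {
    var weights = Object.create(null);
    Object.keys(c).forEach(function (t) {
      var tf = c[t] / docs[i].length;
      var idf = Math.log2(n / docFreq[t]);
      weights[t] = tf * idf;
    });
    return weights;
  });
}

var toyDocs = [
  ['begriff', 'wort', 'asyl'],
  ['asyl', 'debatte'],
  ['wort', 'wort', 'asyl', 'begriff']
];
console.log(tfIdf(toyDocs));
```

In the toy corpus, "asyl" occurs in every document and therefore gets a weight of zero, whereas a term confined to few postings is weighted up. That is the mechanism by which words specific to one strand of a discussion can stand out.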