# Text Mining on DerStandard Comments

I did some text mining on DerStandard.at before, back then primarily interested in the comment count per article. What has been a simple HTTP GET and passing the response to BeautifulSoup requires a more sophisticated approach today. Things change, webcontent is created dynamically and we have to resort to other tools these days.

Headless browsers provide a JavaScript API, useful for retrieving the desired data after loading the page. My choice fell on phantomjs, available on pacman:

pacman -Qi phantomjs | head -n3

Name                     : phantomjs
Version                  : 2.1.1-3
Beschreibung             : Headless WebKit with JavaScript API


Since my JavaScript skills were close to non-existent, writing the script was the hard part. After some copypasta and trial-and-error coding, inevitably running into scoping and async issues, this clobbed together piece of code actually works! It reads the URL as its first and the pagenumber in the comment section as its second argument. The innerHTML from a .postinglist element is written to a content.html in the directory the parse-postinglists.js has been invoked from. Actually the data is appended to the file, that’s why I can loop over the available comment pages in R:

for (i in 1:43) {
system(paste("phantomjs parse-postinglists.js",
"http://derstandard.at/2000043417670/Das-Vokabular-der-Asylkritiker",
i, sep=" "))
}


The article of interest is itself a quantitative analysis of word frequencies in recent forum debates on asylum. Presentation and insights are somewhat underwhelming though. After all, there is a lot of information collected and stored. Some of which, found in the attributes, is perfectly appropriate for the metadata in my Corpus object (created with the help of the tm package1), as can be seen from the first document in it:

meta(corp[[1]])

author       : Darth Invadeher
datetimestamp: 16-09-06 14:33
pos          : 104
neg          : 13
pid          : 1014788751
parentpid    : NA
id           : 1
language     : de


pid is the unique posting ID, a parentpid value is applied when this particular posting refers to another posting, i.e. is a followup posting. This opens up the possibility to relate authors to each other and probably a lot more. badges doesn’t fit too well as an attribute name, it actually denotes the follower count in that forum. pos and neg show the positive resp. negative rating count on that particular posting. At the time of this analysis there were 1064 documents (i.e. postings) in the corpus.

The average lifespan of an online article is rather short. Interestingly, the likelihood to get a lot of votes diminishes even faster. That’s probably because a few posters take their debate further long-since the average voter is done with this article. So don’t be late to the party!

The bigger part of the data stored relates to the posting content. For now, I’m interested in extracting keywords that define the discussions. Their importance for the particular posting and the whole corpus is defined by a two-fold normalization by use of a TF-IDF weighting function. Obviously, what has been a rather one-sided reflection on terms used by asylum critics was followed by a nomenclatura debate. You can tell that from the dominance of “Begriff”, “Wort”, “Bezeichnung”, “Ausdruck” etc:

# Decay Score Functions for Gnus

For a few weeks now, I’ve been using Gnus as my MUA. The main reason for leaving Mutt is the intriguing adaptive scoring Gnus provides, which seems to be especially tailored towards people like me that read mailing lists a lot. Gnus decays non-permanent scores once a day according to modest rules applied with the gnus-decay-score function. By default, every score entry up to 60 is reduced by a constant (of 3), values greater than 60 are downscaled by 5%. This results in a (almost) linear function that deals gently with the literal scores. While there is nothing wrong with a linear function here, I’d like to restrict adaptive scoring to a certain range.

The first alternative builds on the default decay and multiplies the literal score with a factor derived from the exponentiation of the score with a negative constant. Thus the adapted score is still (almost) a linear function but shrinks output values quite a bit:

(defun gnus-decay-score-1 (score)
(floor
(pcase score
(0 0)
((guard (< score 0)) (* score (expt (abs score) -0.2)))
(_ (* score (expt score -0.2))))))


For comparison, the following function applies an increasing exponential decay to literal score values. This results in values that increase rapidly first, level off thereafter and essentially max out close to a defined upper limit:

(defun gnus-decay-score-2 (score)
(let ((ulim 100)
(exponent -0.12))
(floor
(pcase score
(0 0)
((guard (< score 0)) (* -1 ulim (- 1 (expt (abs score) exponent))))
(_ (* ulim (- 1 (expt score exponent))))))))


Plotting adapted scores from various decay functions for a sequence from -50 to 250, a range that encloses almost all my current score values, will reveal the differences:

reset
set terminal svg size 720,320 enhanced font 'Lato,10'

set auto y
set xtics 25
set tics out scale 0.4 nomirror
set style line 101 lc rgb '#8c8c8c' lt 1 lw 1
set border 3 front ls 101
set xlabel 'Literal Score'
set key left nobox
set termoption dash

plot dat using 1:4 with lines lc rgb '#cd00cd' lt 5 lw 1 title 'gnus-decay-score-2',\
dat using 1:3 with lines lc 3 lt 5 lw 1 title 'gnus-decay-score-1',\
dat using 1:2 with lines lc rgb '#228b22' lt 3 lw 1 title 'gnus-decay-score',\
dat using 1:1 with lines lc 1 lt 1 lw 1 title 'literal score'


# NBA’s Free Agent’s Relative Salary Increase

While NBA’s salary cap continues to rise – 2016 every team has roughly $24 million more to spend, the projected cap for 2017 is something in-between$102 and $108 million – free agents (FA) take their opportunity to sign staggering contracts. The agreement of Mozgov and the Lakers for example attracted a lot of attention. But is he that bad of a deal for the LAL? I decided to not only trust in my own judgement, but do some simple dataset exploration. It’s amazing how sometimes you can perform the whole exploratory data analysis life cycle inside Emacs – from the data retrieval to the publishing (as is done in this post). Initially I considered mentioning homoiconicity, since all my text is data – to be exact, though, the internal representation of an org-table is still a list and not what I look at in the buffer. Nonetheless, the excitement is quite similar. The process of getting messy data into shape took me a few minutes: And these are the outcomes: Whiteside and Drummond are absolutely worth the dough, Conley is more of a surprise. Actually, there are a lot of big men on top of that list, will the small-ball dominance be a blip? Table 1: NBA’s Free Agent’s Salary Comparison (Top 20) Team Player Pos Age Contract (in y) Salary Total (in MM) Salary 2015-16 Salary 2016-17 Increase (in MM) MIA Hassan Whiteside C 27 4 98 981348 24500000 +23.52 DET Andre Drummond C 23 5 130 3272091 26000000 +22.73 MEM Mike Conley G 29 5 153 9588426 30600000 +21.01 DAL Harrison Barnes F 24 4 95 3873398 23750000 +19.88 TOR DeMar DeRozan G 27 5 145 9500000 29000000 +19.50 WAS Bradley Beal G 23 5 120 5694674 24000000 +18.31 POR Allen Crabbe F 24 4 75 947276 18750000 +17.80 BOS Al Horford F/C 30 4 113 12000000 28250000 +16.25 ATL Kent Bazemore G/F 27 4 70 2000000 17500000 +15.50 ORL Bismack Biyombo C 24 4 72 3000000 18000000 +15.00 ORL Evan Fournier G/F 24 5 85 2288205 17000000 +14.71 POR Evan Turner G/F 28 4 70 3425510 17500000 +14.07 WAS Ian Mahinmi C 30 4 64 4000000 16000000 +12.00 CHA Nicolas Batum G/F 28 5 120 12235750 24000000 +11.76 DAL Dirk Nowitzki F 38 2 40 8333334 20000000 +11.67 MIA Tyler Johnson G 24 4 50 845059 12500000 +11.65 LAL Jordan Clarkson G 24 4 50 845059 12500000 +11.65 NOP Solomon Hill F 25 4 52 1358880 13000000 +11.64 HOU Ryan Anderson F 28 4 80 8500000 20000000 +11.50 LAL Timofey Mozgov C 30 4 64 4950000 16000000 +11.05 These are some teams I’m interested in that decided to offer their new signings a relative salary increase (with one exception). The table pretty much reflects the pecuniary space of these teams before the free agency: Table 2: Average Salary Increase per signed FA Team N FAs Average Increase BOS 1 +16.25 MEM 4 +8.42 LAL 5 +7.12 MIA 7 +5.92 CHI 3 +2.72 SAS 3 +2.26 NYK 5 +0.92 GSW 5 +0.52 CLE 2 -2.00 ## Tag Cloud in Elisp # Tag Cloud in Elisp While I went with manual links so far, I rediscovered tags in Org-mode to associate and filter headings in a non-hierarchical manner. While manually linking still seems important to me, finding (all) information on a given topic by tags is faster obviously. Several note-taking systems have incorporated tagging, a lot of them provide tag clouds. Something that’s missing in Org-mode. So I played around with the "special" display text-property which allows to change the text’s appearance. Some even implemented animations in Emacs for fun using this property. First, I came up with a way to retrieve and sort all the available tags in my file and store them in an alist: (let ((tbl (with-current-buffer (find-file-noselect "~/slips/slips.org") (org-with-wide-buffer (goto-char (point-min)) (let (tags) (while (re-search-forward (concat org-outline-regexp-bol "$?:.*?[ \t]$?:$[[:alnum:]_@#%:]+$:[ \t]*$")
nil t)
(dolist (tag (org-split-string
(match-string-no-properties 1)
":"))
(push tag tags)))
(mapcar (lambda (x) (cons (car x) (length x)))
(seq-group-by 'identity tags)))))))
(sort tbl (lambda (x y) (> (cdr x) (cdr y)))))


The tag cloud itself is a work in progress, the code may appear on Github in the near future though:

# Clock into Recently Clocked Tasks with Ido

This will be short (and sweet). org-clock-in offers a list of recently clocked tasks to select from when called with one C-u (known as the universal-argument). Unfortunately, the task selection doesn’t use any completion engine. Therefore I prefer the following snippet, using ido for completion:

(defun bp/org-clock-in-select ()
"Select a task to clock into from a list of recently clocked items."
(interactive)
(let (res)
(dolist (i org-clock-history)
(with-current-buffer
(org-base-buffer (marker-buffer i))
(org-with-wide-buffer
(ignore-errors
(goto-char (marker-position i))
(push (,(org-get-heading 'notags) . ,i) res)))))
(let* ((l (reverse (mapcar 'car res)))
(with-current-buffer
(org-with-wide-buffer
(org-clock-in)))))))


# Discover Destructuring Assignment in Elisp

LISt Processing in Emacs Lisp obviously involves a lot of juggling with lists and their elements. What else would be more convenient than generalizing the access and binding of list elements? Not only does the concept of destructuring assignment come along with code that is easier to write but also easier to read (terse, patterns that visually cue what elements are supposed to be assigned to variables). Alas, some features of Elisp have to be discovered. While pcase got its node in the Elisp Manual, neither is there an explanation what QPATTERN and UPATTERN mean nor are the related macros ever even mentioned. As if this wasn’t enough, the docstrings of pcase-let and its starred equivalent will leave the average Emacs user puzzled, pcase-dolist doesn’t even have one. This will hopefully change in subsequent versions of Emacs. For now, get ready to embark on a journey of discovery!

pcase is by far the most frequently used macro from pcase.el. What it does is pattern matching, a concept that goes beyond the scope of a blogpost. If you’re familiar with the Fibonacci Sequence, the following example is self-explanatory:

(defun fib (n)
(pcase n
(0 1)
(1 1)
(n (+ (fib (- n 1)) (fib (- n 2))))))
(mapcar 'fib (number-sequence 0 6))

(1 1 2 3 5 8 13)


Generally, pcase is used as a powerful conditional programming construct. Several examples can be found on this EmacsWiki page. Especially suited to the beforementioned destructuring is pcase-let:

(pcase-let
(((,spec ,month ,day ,name) (nth 3 holiday-general-holidays)))
(princ (format "%s is on 2016-%d-%d" name month day))

"Valentine's Day is on 2016-2-14"


The practical advantage will become patently obvious when trying to do the same with let:

(let* ((l (nth 3 holiday-general-holidays))
(spec (car l))
(princ (format "%s is on 2016-%d-%d" name month day))))


pcase-let in its simplest form resembles Python’s poor man’s destructuring-bind, called tuple and list unpacking:

([a, b, c], d, e) = ([1, 1, 2], 3, 5)
print a + b if a < d else d


Probably even more interesting is pcase-dolist that iterates over the lists of a list:

(let ((l '()))
(pcase-dolist ((,spec ,month ,day . ,rest) holiday-general-holidays)
(push (cons month day
(if (stringp (car rest)) rest (cdr rest)))
l))
(nreverse l))

Table 1: Holidays common throughout the United States
Month Holiday
1 New Year's Day
1 Martin Luther King Day
2 Groundhog Day
2 Valentine's Day
2 President's Day
3 St. Patrick's Day
4 April Fools' Day
5 Mother's Day
5 Memorial Day
6 Flag Day
6 Father's Day
7 Independence Day
9 Labor Day
10 Columbus Day
10 Halloween
11 Veteran's Day
11 Thanksgiving

Digging even further into the library, you’ll discover a pcase-lambda. Yet, I’m still not sure what it does besides accepting pcase patterns. But I won’t worry for now, there is exactly ONE appearance of pcase-lambda in the Emacs sources.

# Adding Mail-Abbrev Expansion to Org-MIME

As has been already mentioned, I overcome the temptation to use one of Emacs’ mail-clients. Yet still, I frequently use message-send-mail from within Emacs with a message-body composed with org-mime-subtree. You will find org-mime in the /contrib directory of your org-mode installation. Since manually filling the mail-header is redundant especially for buffers frequently (re)used, I set related properties to the current subtree that org-mime-send-subtree will parse before creating the email buffer. Unfortunately there is no built-in completion for addresses, which is why I use the following code. Apparently it’s wise to set alias_file in Mutt and mail-personal-alias-file in Emacs to the same file. It’s probably worth pointing out that I remove the commas introduced by define-mail-alias that separate the address from the definition in an alias.

(defvar org-mime-properties '("MAIL_TO" "MAIL_CC" "MAIL_BCC")
"Properties org-mime-send-subtree' parses.")

(setq mail-aliases-only
;; first build mail-aliases', then create a string for org-set-property'.
(mapconcat 'car (build-mail-aliases) " "))

;; hook to replace the alias in the property field with name and email
(defun bp/mail-property-changed (property value)
(let ((el (cdr (assoc value mail-aliases))))
(when (and el (member property org-mime-properties))
(org-set-property
property
(replace-regexp-in-string "," "" el)))))

(nconc org-default-properties org-mime-properties)
(nconc org-global-properties
(("MAIL_TO_ALL"         . ,mail-aliases-only)
("MAIL_BCC_ALL"        . ,mail-aliases-only)
("MAIL_CC_ALL"         . ,mail-aliases-only)))
`