EDA with Org Tables

I’ve enjoyed the occasional wrangling with tabular data in Org tables before. After all, you still have everything your text editor is capable of at hand, and most of the time exploratory data analysis (EDA) starts out simple, by looking at tables. Only recently I learned about orgtbl-ascii-draw, which draws an ASCII bar plot with values from a given column. Passing some UTF-8 block elements as the characters argument even produces this decent plot:

| Key | Value |             |
|-----+-------+-------------|
| c   |  0.05 | ▎           |
| a   |   1.1 | ██▋         |
| a   |   1.2 | ██▉         |
| b   |     2 | ████▉       |
| c   |   4.3 | ██████████▍ |
| d   |   3.1 | ███████▌    |
#+tblfm: $3='(orgtbl-ascii-draw $2 0 5 12 (apply 'string (number-sequence 9615 9608 -1)))
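
To make the formula less magical, here is a rough Python sketch of how I understand orgtbl-ascii-draw to build a bar: the value is scaled to a number of sub-steps, full cells use the last character, and the remainder picks one of the partial blocks. The function name `ascii_draw` and its exact rounding are my assumptions for illustration, not the Org source.

```python
# Characters from U+258F (thinnest block) down to U+2588 (full block),
# the same sequence as (number-sequence 9615 9608 -1) in the table formula.
BLOCKS = "".join(chr(c) for c in range(9615, 9607, -1))


def ascii_draw(value, lo, hi, width, characters):
    """Sketch of a bar renderer in the spirit of orgtbl-ascii-draw."""
    n = len(characters) - 1              # sub-steps per cell
    relative = (float(value) - lo) / (hi - lo)
    steps = round(relative * width * n)  # total sub-steps to draw
    if steps < 0:
        return "too small"
    if steps > width * n:
        return "too large"
    full, remainder = divmod(steps, n)
    # Full cells use the last character, the tail uses a partial one.
    return characters[-1] * full + characters[remainder]


if __name__ == "__main__":
    for v in (0.05, 1.1, 1.2, 2, 4.3, 3.1):
        print(f"{v:>5} {ascii_draw(v, 0, 5, 12, BLOCKS)}")
```

With the values from the table above, this sketch yields bars of the same length as the third column.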

For larger collections a stemplot may be more appropriate. There is a stem function in R, but that should be doable in Elisp as well. I’ll use Org’s “Library of Babel” facilities to later call this implementation from other code blocks:

#+name: stemplot
#+begin_src elisp :results table :lexical t :var data='() stemlen='()
(let* ((dat (sort (mapcar 'car (copy-sequence data)) '<))
       (slen (string-width (number-to-string (truncate (apply 'max dat)))))
       (stemlen (or stemlen 1))
       (dat (mapcar (lambda (x) (floor x (expt 10 (- slen stemlen 1)))) dat))
       fac acc) ; bind fac here so the setq below does not create a global
  (setq slen (or stemlen slen))
  (setq fac (expt 10 (- slen (1- stemlen))))
  (dotimes (i (1+ (floor (car (last dat)) fac)) acc)
    (push (list i (mapconcat
                   (lambda (x)
                     (if (and (>= x (* fac i))
                              (< x (* fac (1+ i))))
                         (format "%s" (% x fac))))
                   dat ""))
          acc)))
#+end_src
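
For readers less fluent in Elisp, the idea of a stem-and-leaf plot can be sketched in a few lines of Python. This is not a port of the block above: the function name `stemplot`, the `leaf_unit` parameter and the use of rounding (instead of `floor`) are my own assumptions, and it only handles non-empty, non-negative input.

```python
def stemplot(values, leaf_unit=0.1):
    """Return (stem, leaves) rows, highest stem first.

    Each value is scaled to an integer number of leaf units; the last
    digit becomes the leaf, the remaining digits the stem.
    """
    scaled = sorted(round(v / leaf_unit) for v in values)
    rows = {}
    for s in scaled:
        stem, leaf = divmod(s, 10)
        rows[stem] = rows.get(stem, "") + str(leaf)
    top = max(rows)
    # Include empty stems so gaps in the distribution stay visible.
    return [(stem, rows.get(stem, "")) for stem in range(top, -1, -1)]


if __name__ == "__main__":
    for stem, leaves in stemplot([1.3, 1.0, 0.2, 3.5]):
        print(f"{stem:>2} | {leaves}")
```

Keeping the empty stems is what makes an outlier stand out as a lone row far away from the rest.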

Now that the stemplot function is defined in a named source code block, we need to add it to the library:

(org-babel-lob-ingest (buffer-file-name))

To celebrate the NBA Playoffs fever that absolutely hit me (again), let’s have a glance at the field goal attempts per game that have been a source of controversy during the first round:

#+begin_src shell :results table :cache yes :post stemplot(data=*this*, stemlen=2)
curl -X GET "https://www.basketball-reference.com/playoffs/NBA_2017_per_game.html" | \
    sed -nr '1,$ s/.*data-stat="fga_per_g" >([0-9]+\.[0-9]+)<\/td>.*/\1/ p'
#+end_src
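Table 1: FGA/G NBA Playoffs 2017
| Stem | Leaves                          |
|------+---------------------------------|
|   30 | 3                               |
|   29 |                                 |
|   28 |                                 |
|   27 |                                 |
|   26 |                                 |
|   25 |                                 |
|   24 |                                 |
|   23 |                                 |
|   22 | 05                              |
|   21 | 36                              |
|   20 | 06                              |
|   19 | 388                             |
|   18 | 0336                            |
|   17 | 011358                          |
|   16 | 58                              |
|   15 | 01                              |
|   14 | 3                               |
|   13 | 00036889                        |
|   12 | 00013                           |
|   11 | 0001359                         |
|   10 | 56                              |
|    9 | 001135589                       |
|    8 | 033445566666888                 |
|    7 | 0000002345                      |
|    6 | 00000022257899                  |
|    5 | 000002224555555557              |
|    4 | 00222244555577799               |
|    3 | 00000000000225555566677779      |
|    2 | 00000011255556677               |
|    1 | 0000001333333333555557778888888 |
|    0 | 00022255566                     |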

The outlier is, who would have thought, Russell Westbrook, the black hole from OKC.

Text Mining on DerStandard Comments

I did some text mining on DerStandard.at before, back then with a simple HTTP GET and passing the response to BeautifulSoup. Things change: web content is now created dynamically, and we have to resort to other tools these days.

Headless browsers provide a JavaScript API, useful for retrieving the desired data after loading the page. My choice fell on phantomjs, available via pacman:

pacman -Qi phantomjs | head -n3
Name                     : phantomjs
Version                  : 2.1.1-3
Description              : Headless WebKit with JavaScript API

Since my JavaScript skills were close to non-existent, writing the script was the hard part. After some copy-pasting and trial-and-error coding, inevitably running into scoping and async issues, this cobbled-together piece of code actually works! It takes the URL as its first argument and the page number within the comment section as its second. The innerHTML of a .postinglist element is written to content.html in the directory from which parse-postinglists.js was invoked. The data is appended to the file, which is why I can loop over the available comment pages in R:

for (i in 1:43) {
  system(paste("phantomjs parse-postinglists.js",
               "http://derstandard.at/2000043417670/Das-Vokabular-der-Asylkritiker",
               i, sep=" "))
}

The article of interest is itself a quantitative analysis of word frequencies in recent forum debates on asylum. Its presentation and insights are somewhat underwhelming, though. After all, a lot of information is collected and stored, some of which, found in the attributes, is perfectly suited as metadata for my Corpus object (created with the help of the tm package), as can be seen from the first document in it:

meta(corp[[1]])
author       : Darth Invadeher
datetimestamp: 16-09-06 14:33
pos          : 104
neg          : 13
badges       : 0
pid          : 1014788751
parentpid    : NA
id           : 1
language     : de

pid is the unique posting ID. parentpid is set when a posting refers to another posting, i.e. is a follow-up posting. This opens up the possibility of relating authors to each other, and probably a lot more. badges doesn’t fit too well as an attribute name, as it actually denotes the follower count in that forum. pos and neg show the number of positive and negative ratings on that particular posting, respectively. At the time of this analysis there were 1064 documents (i.e. postings) in the corpus.
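
To illustrate how pid and parentpid could relate authors to each other, here is a small Python sketch that counts who replies to whom. The function name `reply_edges` and the dictionary layout of a posting are assumptions for illustration, and the sample postings other than the first author name are made up.

```python
from collections import Counter


def reply_edges(postings):
    """Count (replier, original author) pairs from pid/parentpid links."""
    author_of = {p["pid"]: p["author"] for p in postings}
    edges = Counter()
    for p in postings:
        parent = p.get("parentpid")
        # Top-level postings have no parent; unknown parents are skipped.
        if parent is not None and parent in author_of:
            edges[(p["author"], author_of[parent])] += 1
    return edges


if __name__ == "__main__":
    # Hypothetical sample data, shaped like the metadata shown above.
    sample = [
        {"pid": 1014788751, "parentpid": None, "author": "Darth Invadeher"},
        {"pid": 2, "parentpid": 1014788751, "author": "poster_b"},
        {"pid": 3, "parentpid": 2, "author": "Darth Invadeher"},
        {"pid": 4, "parentpid": 1014788751, "author": "poster_b"},
    ]
    for (replier, target), n in reply_edges(sample).items():
        print(f"{replier} -> {target}: {n}")
```

The resulting edge counts are the raw material for a reply graph between authors.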

The average lifespan of an online article is rather short. Interestingly, the likelihood of getting a lot of votes diminishes even faster. That’s probably because a few posters carry on their debate long after the average voter is done with the article. So don’t be late to the party!

[Figure: SVG plot]

The bigger part of the stored data relates to the posting content. For now, I’m interested in extracting keywords that define the discussions. Their importance for a particular posting and for the whole corpus is determined by a two-fold normalization using a TF-IDF weighting function. Evidently, what had been a rather one-sided reflection on terms used by asylum critics was followed by a debate over nomenclature. You can tell from the dominance of “Begriff” (term), “Wort” (word), “Bezeichnung” (designation), “Ausdruck” (expression), etc.:

[Figure: SVG plot]
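
As a reminder of what a TF-IDF weight does, here is a minimal Python sketch. It uses one common variant (term frequency normalized by document length, times log2 of the inverse document frequency); whether this matches the exact weighting used for the plot is an assumption, and the function name `tf_idf` and the toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log2(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    toy = [["begriff", "wort", "asyl"], ["wort", "asyl"], ["asyl"]]
    for doc_weights in tf_idf(toy):
        print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

A term that occurs in every document gets a weight of zero, which is why corpus-wide filler drops out and discussion-specific words rise to the top.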

NBA Free Agents' Relative Salary Increase

While the NBA’s salary cap continues to rise (in 2016 every team has roughly $24 million more to spend, and the projected cap for 2017 is somewhere between $102 and $108 million), free agents (FA) are seizing the opportunity to sign staggering contracts. The agreement between Mozgov and the Lakers, for example, attracted a lot of attention. But is he really that bad a deal for LAL? I decided not to trust my own judgement alone, but to do some simple dataset exploration.

It’s amazing how sometimes you can perform the whole exploratory data analysis life cycle inside Emacs – from the data retrieval to the publishing (as is done in this post). Initially I considered mentioning homoiconicity, since all my text is data – to be exact, though, the internal representation of an org-table is still a list and not what I look at in the buffer. Nonetheless, the excitement is quite similar. The process of getting messy data into shape took me a few minutes.

And these are the outcomes: Whiteside and Drummond are absolutely worth the dough, while Conley is more of a surprise. There are also a lot of big men at the top of that list. Will small-ball dominance turn out to be a blip?

Table 1: NBA Free Agents’ Salary Comparison (Top 20)
| Team | Player           | Pos | Age | Contract (in y) | Total (in MM) | Salary 2015-16 | Salary 2016-17 | Increase (in MM) |
|------+------------------+-----+-----+-----------------+---------------+----------------+----------------+------------------|
| MIA  | Hassan Whiteside | C   |  27 |               4 |            98 |         981348 |       24500000 |           +23.52 |
| DET  | Andre Drummond   | C   |  23 |               5 |           130 |        3272091 |       26000000 |           +22.73 |
| MEM  | Mike Conley      | G   |  29 |               5 |           153 |        9588426 |       30600000 |           +21.01 |
| DAL  | Harrison Barnes  | F   |  24 |               4 |            95 |        3873398 |       23750000 |           +19.88 |
| TOR  | DeMar DeRozan    | G   |  27 |               5 |           145 |        9500000 |       29000000 |           +19.50 |
| WAS  | Bradley Beal     | G   |  23 |               5 |           120 |        5694674 |       24000000 |           +18.31 |
| POR  | Allen Crabbe     | F   |  24 |               4 |            75 |         947276 |       18750000 |           +17.80 |
| BOS  | Al Horford       | F/C |  30 |               4 |           113 |       12000000 |       28250000 |           +16.25 |
| ATL  | Kent Bazemore    | G/F |  27 |               4 |            70 |        2000000 |       17500000 |           +15.50 |
| ORL  | Bismack Biyombo  | C   |  24 |               4 |            72 |        3000000 |       18000000 |           +15.00 |
| ORL  | Evan Fournier    | G/F |  24 |               5 |            85 |        2288205 |       17000000 |           +14.71 |
| POR  | Evan Turner      | G/F |  28 |               4 |            70 |        3425510 |       17500000 |           +14.07 |
| WAS  | Ian Mahinmi      | C   |  30 |               4 |            64 |        4000000 |       16000000 |           +12.00 |
| CHA  | Nicolas Batum    | G/F |  28 |               5 |           120 |       12235750 |       24000000 |           +11.76 |
| DAL  | Dirk Nowitzki    | F   |  38 |               2 |            40 |        8333334 |       20000000 |           +11.67 |
| MIA  | Tyler Johnson    | G   |  24 |               4 |            50 |         845059 |       12500000 |           +11.65 |
| LAL  | Jordan Clarkson  | G   |  24 |               4 |            50 |         845059 |       12500000 |           +11.65 |
| NOP  | Solomon Hill     | F   |  25 |               4 |            52 |        1358880 |       13000000 |           +11.64 |
| HOU  | Ryan Anderson    | F   |  28 |               4 |            80 |        8500000 |       20000000 |           +11.50 |
| LAL  | Timofey Mozgov   | C   |  30 |               4 |            64 |        4950000 |       16000000 |           +11.05 |
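
Judging from the numbers in the table, the 2016-17 column is the contract total divided by its length, and the increase is that average minus the 2015-16 salary. A short Python sketch of this arithmetic, with `increase_mm` as a made-up helper name and the formula being my reading of the table rather than a documented method:

```python
def increase_mm(total_mm, years, previous_salary):
    """Average annual salary of the new contract minus last season's
    salary, in millions of dollars, rounded to two decimals."""
    average_salary = total_mm * 1_000_000 / years
    return round((average_salary - previous_salary) / 1_000_000, 2)


if __name__ == "__main__":
    # (player, total in MM, contract years, 2015-16 salary) from Table 1
    rows = [
        ("Hassan Whiteside", 98, 4, 981348),
        ("Mike Conley", 153, 5, 9588426),
        ("Timofey Mozgov", 64, 4, 4950000),
    ]
    for player, total, years, previous in rows:
        print(f"{player:<18} {increase_mm(total, years, previous):+.2f}")
```

Note that this compares an average over the whole contract with a single past season, so back- or front-loaded deals are flattened.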

These are some teams I’m interested in; all but one of them gave their new signings a salary increase on average. The table pretty much reflects how much financial room these teams had before free agency:

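Table 2: Average Salary Increase per signed FA
| Team | N FAs | Average Increase |
|------+-------+------------------|
| BOS  |     1 |           +16.25 |
| MEM  |     4 |            +8.42 |
| LAL  |     5 |            +7.12 |
| MIA  |     7 |            +5.92 |
| CHI  |     3 |            +2.72 |
| SAS  |     3 |            +2.26 |
| NYK  |     5 |            +0.92 |
| GSW  |     5 |            +0.52 |
| CLE  |     2 |            -2.00 |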

Discover Destructuring Assignment in Elisp

LISt Processing in Emacs Lisp obviously involves a lot of juggling with lists and their elements. What could be more convenient than generalizing the access and binding of list elements? Destructuring assignment makes code not only easier to write but also easier to read: it is terse, and the patterns visually cue which elements are assigned to which variables. Alas, some features of Elisp have to be discovered. While pcase got its own node in the Elisp Manual, there is neither an explanation of what QPATTERN and UPATTERN mean, nor are the related macros mentioned at all. As if this wasn’t enough, the docstrings of pcase-let and its starred equivalent will leave the average Emacs user puzzled, and pcase-dolist doesn’t even have one. This will hopefully change in subsequent versions of Emacs. For now, get ready to embark on a journey of discovery!

pcase is by far the most frequently used macro from pcase.el. What it does is pattern matching, a concept that goes beyond the scope of a blogpost. If you’re familiar with the Fibonacci Sequence, the following example is self-explanatory:

(defun fib (n)
  (pcase n
    (`0 1)
    (`1 1)
    (n (+ (fib (- n 1)) (fib (- n 2))))))
(mapcar 'fib (number-sequence 0 6))
(1 1 2 3 5 8 13)

Generally, pcase is used as a powerful conditional programming construct. Several examples can be found on this EmacsWiki page. Especially suited to the aforementioned destructuring is pcase-let:

(pcase-let
    ((`(,spec ,month ,day ,name) (nth 3 holiday-general-holidays)))
  (princ (format "%s is on 2016-%d-%d" name month day)))
"Valentine's Day is on 2016-2-14"

The practical advantage will become patently obvious when trying to do the same with let:

(let* ((l (nth 3 holiday-general-holidays))
       (spec (car l))
       (month (cadr l))
       (day (caddr l))
       (name (cadddr l)))
  (princ (format "%s is on 2016-%d-%d" name month day)))

pcase-let in its simplest form resembles Python’s poor man’s destructuring-bind, called tuple and list unpacking:

([a, b, c], d, e) = ([1, 1, 2], 3, 5)
print(a + d)
4

Probably even more interesting is pcase-dolist that iterates over the lists of a list:

(let ((l '()))
  (pcase-dolist (`(,spec ,month ,day . ,rest) holiday-general-holidays)
    (push (cons month (if (stringp (car rest)) rest (cdr rest))) l))
  (nreverse l))
((1 "New Year's Day")
 (1 "Martin Luther King Day")
 (2 "Groundhog Day")
 (2 "Valentine's Day")
 (2 "President's Day")
 (3 "St. Patrick's Day")
 (4 "April Fools' Day")
 (5 "Mother's Day")
 (5 "Memorial Day")
 (6 "Flag Day")
 (6 "Father's Day")
 (7 "Independence Day")
 (9 "Labor Day")
 (10 "Columbus Day")
 (10 "Halloween")
 (11 "Veteran's Day")
 (11 "Thanksgiving"))
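
Sticking with the Python comparison from above, the dotted `. ,rest` pattern roughly corresponds to star-unpacking in a `for` loop. The sketch below uses a few hand-written tuples shaped like entries of holiday-general-holidays; the tuples and the function name `month_and_name` are illustrative assumptions, not data read from Emacs.

```python
def month_and_name(holidays):
    """Collect (month, name) pairs, skipping the extra numeric field
    that floating holidays carry before their name."""
    result = []
    for _spec, month, _day, *rest in holidays:
        names = rest if isinstance(rest[0], str) else rest[1:]
        result.append((month, *names))
    return result


if __name__ == "__main__":
    # Hypothetical entries mirroring the shapes used in the Elisp list.
    sample = [
        ("holiday-fixed", 1, 1, "New Year's Day"),
        ("holiday-float", 1, 1, 3, "Martin Luther King Day"),
        ("holiday-fixed", 2, 14, "Valentine's Day"),
    ]
    for entry in month_and_name(sample):
        print(entry)
```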

Digging even further into the library, you’ll discover pcase-lambda. I’m still not sure what it does besides accepting pcase patterns, but I won’t worry for now: there is exactly ONE appearance of pcase-lambda in the Emacs sources.

Schönbrunn Palace Park in Daylight

A few weeks ago I decided to complement ball sports with low-intensity running. Intended as a change from my otherwise rather one-sided athletic routine, running turns out to be, above all, a welcome opportunity to switch off for half to three quarters of an hour. The ambience of my chosen route through the Schönbrunn Palace Park (Schönbrunner Schlosspark) certainly helps. As daylight naturally shortens in autumn and the park's opening hours are gradually reduced, working people in particular face the challenge of not missing the favourable time window. Here I try to work out how tight this daily time frame is:

daylight.png

The opening hours evidently capture the daylight fairly well. In May, August and September some potential is wasted, just under 11 hours of daylight in total. Note also that admission times have to be observed. So as long as the park already closes at 17:30, you have to set off early. That applies to no fewer than 121 days of the year:

closing.png
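
The "wasted" daylight can be thought of as the part of the sunrise-to-sunset interval that falls outside the opening hours. Here is a small Python sketch of that interval arithmetic; the function names and the example times are hypothetical and not taken from the actual analysis code.

```python
from datetime import datetime, timedelta


def at(clock):
    """Parse an HH:MM string into a datetime on an arbitrary fixed day."""
    return datetime.strptime(clock, "%H:%M")


def lost_daylight(sunrise, sunset, opens, closes):
    """Return the daylight that falls outside the opening hours."""
    daylight = sunset - sunrise
    usable_start = max(sunrise, opens)
    usable_end = min(sunset, closes)
    # No overlap at all means no usable daylight.
    usable = max(usable_end - usable_start, timedelta(0))
    return daylight - usable


if __name__ == "__main__":
    # Made-up summer day: the park opens after sunrise and closes before sunset.
    print(lost_daylight(at("05:30"), at("20:30"), at("06:30"), at("20:00")))
```

Summing this value over all days of a year would give a total like the one quoted above.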

For the rest of the time, which is still roughly two thirds of the year, the palace park with all its amenities will probably serve as my running route. The code for the analysis can be found here.
