EDA with Org Tables

EDA with Org Tables

I’ve enjoyed the occasional wrangling with tabular data in Org tables before1, after all you still come heavy with whatever your texteditor is capable of and most of the time exploratory data analysis (EDA) starts out simple, by looking at tables. Only recently I learned about orgtbl-ascii-draw, which draws an ASCII barplot with values from a given column. Actually, passing some UTF-8 block elements to the characters argument produces this decent plot:

| Key | Value |             |
| c   |  0.05 | ▎           |
| a   |   1.1 | ██▋         |
| a   |   1.2 | ██▉         |
| b   |     2 | ████▉       |
| c   |   4.3 | ██████████▍ |
| d   |   3.1 | ███████▌    |
#+tblfm: $3='(orgtbl-ascii-draw $2 0 5 12 (apply 'string (number-sequence 9615 9608 -1)))

For larger collections a stemplot may be more appropriate. There is a stem function in R, but that should be doable in Elisp as well. I’ll use Org’s “Library of Babel” facilities to later call this implementation from other code blocks:

#+name: stemplot
#+begin_src elisp :results table :lexical t :var data='() stemlen='()
(let* ((dat (sort (mapcar 'car (copy-sequence data)) '<))
       (slen (string-width (number-to-string (truncate (apply 'max dat)))))
       (stemlen (or stemlen 1))
       (dat (mapcar (lambda (x) (floor x (expt 10 (- slen stemlen 1)))) dat))
  (setq slen (or stemlen slen))
  (setq fac (expt 10 (- slen (1- stemlen))))
  ;; upper limit
  (dotimes (i (1+ (floor (car (last dat)) fac)) acc)
    (push (list i (mapconcat
                   (lambda (x)
                     (if (and (>= x (* fac i))
                              (< x (* fac (1+ i))))
                         ;; now that x are integers get the remainder
                         (format "%s" (% x fac))))
                   dat ""))

Now that the stemplot function is defined in a named source code block, we need to add it to the library:

(org-babel-lob-ingest (buffer-file-name))

To celebrate the NBA Playoffs fever that absolutely hit me (again), let’s have a glance at the field goal attempts per game that have been a source of controversy during the first round:

#+begin_src shell :results table :cache yes :post stemplot(data=*this*, stemlen=2)
curl -X GET "https://www.basketball-reference.com/playoffs/NBA_2017_per_game.html" | \
    sed -nr '1,$ s/.*data-stat="fga_per_g" >([0-9]+\.[0-9]+)<\/td>.*/\1/ p'
Table 1: FGA/G NBA Playoffs 2017
Stem Leaves
30 3
22 05
21 36
20 06
19 388
18 0336
17 011358
16 58
15 01
14 3
13 00036889
12 00013
11 0001359
10 56
9 001135589
8 033445566666888
7 0000002345
6 00000022257899
5 000002224555555557
4 00222244555577799
3 00000000000225555566677779
2 00000011255556677
1 0000001333333333555557778888888
0 00022255566

The outlier is, who would have thought, Russell Westbrook, the black hole from OKC.

Levenshtein Sorting for Ido

Levenshtein Sorting for Ido

I’ve been unimpressed so far by third-party completion frontends for Emacs – although there is a plethora of them, all of them in some way reflecting the author’s preferences that may or may not be compatible with yours. That’s why I stayed with the built-in ido-mode and various extensions to it, namely smex, ido-vertical, flx-ido and ido-hacks. Generally, they either hook into the minibuffer setup or advise the ido-set-matches resp. ido-set-matches-1, the workhorse of ido, that returns the final matches. At some point, though, I decided to overcome the shortcomings of ido myself and, en passant, remove a lot of redundant code.

One of the more important aspects to me is providing the list of completion candidates in some reasonable order, such that you do not have to scroll down a lot. Obviously, almost exact matches to the input string should find themselves on top of the list – a shortcoming that really annoys me with for example Ivy’s default regexp matching.

The first step therefore is to hijack the ido-set-matches function as it is inflexible and doesn’t provide the hooks to alter the final ido-matches. I took the opportunity to change the “flex” matching ido provides – which builds a rather naive and inefficient greedy regexp and only kicks in if the previous string-match on the input string as a whole fails:

;; -*- lexical-binding: t -*-

(fset 'orig-ido-set-matches 'ido-set-matches)
(defun ido-set-matches ()
  (when ido-rescan
    (setq ido-rotate nil
          ido-matches (if (and (memq ido-cur-item '(list buffer))
                               (> (string-width ido-text) 1))
                          (let (matches re)
                            (setq re (concat
                                      (regexp-quote (string (aref ido-text 0)))
                                      (mapconcat (lambda (c)
                                                   (concat "[-\.]*" (string c)))
                                                 (substring ido-text 1)
                            (if ido-enable-prefix
                                (setq re (concat "\\`" re)))
                            (setq matches '())
                            (mapc (lambda (s) (when (string-match re s)
                                           (setq matches (cons s matches))))
                        (ido-set-matches-1 ido-cur-list (not ido-rotate))))
    (setq ido-matches (ido-levenshtein-sort-matches))
    ;; show match count like ido-vertical-mode
    (setcar ido-decorations (propertize (format "\n%-5d  " (length ido-matches))
                                        'face '((:weight bold))))
    (run-hooks 'ido-alter-matches-hook)))

I kicked out a lot of computationally expensive stuff, namely a reverse in the end and then some. In return I added that badly missing hook to allow further processing on the matches returned. While the limiting of completion choices immediately kicks in when the input string consists of more than one character and is more efficient than this function’s defaults, we still need the previously mentioned clever sorting.

The Levenshtein algorithm may not be the right choice for long strings like filenames – that’s why I restricted the new ido-set-matches to lists and buffers – but does a decent job for sorting symbols and buffernames. Unfortunately, the Elisp implementations of Levenshtein (levenshtein.el and org-babel-edit-distance in ob-core.el) build the full matrix to hold the distances, which imposes a lot of unnecessary space and time complexity. A more efficient variant would reuse two vectors in each iteration:

(defun ido-levenshtein-distance (s1 s2)
  (let* ((m (length s1))
         (n (length s2)))
    ;; trivial case
    (if (string-equal s1 s2) 0
      (let ((v1 (make-vector (1+ n) nil))
            (v0 (make-vector (1+ n) nil)))
        (dotimes (i (1+ n))
          (setf (aref v0 i) i))
        (dotimes (i m)
          (setf (aref v1 0) (1+ i))
          ;; fill the remaining cells of v1
          (dotimes (j n)
            (setf (aref v1 (1+ j))
                  (min (1+ (aref v1 j))      ;deletion
                       (1+ (aref v0 (1+ j))) ;insertion
                       (+ (aref v0 j)
                          (if (= (aref s1 i) (aref s2 j))
                            1))         ;substitution
          ;; v0 is used to store the results from the previous
          ;; iteration and is reused to fill the appropriate cell in
          ;; later iterations. given 2 strings "aabb" and "aaab",
          ;; these are the results:
          ;; v0: [1 0 1 2 3] v1: [0 1 2 3 4]
          ;; v0: [2 1 0 1 2] v1: [1 0 1 2 3]
          ;; v0: [3 2 1 1 1] v1: [2 1 0 1 2]
          ;; v0: [4 3 2 2 1] v1: [3 2 1 1 1]
          ;; the `setf' below essentially is what `cl-rotatef' expands
          ;; to - rotate v0 and v1
          (setf v1 (prog1 v0 (setf v0 v1))))
        ;; (message "v0: %s v1: %s" v0 v1)
        (aref v0 n)))))

(byte-compile 'ido-levenshtein-distance)

Save your sanity and byte-compile the function. Let’s compare it to the beforementioned org-babel-edit-distance:

(byte-compile 'org-babel-edit-distance)

 (lambda (f)
   (cons f (benchmark-run 1000
             (funcall f "der alte mann und das meer" "alte männer im meer"))))
 '(ido-levenshtein-distance org-babel-edit-distance))
Function t total GC t GC
ido-levenshtein-distance 0.144975254 0 0.0
org-babel-edit-distance 0.6108843020000001 2 0.166570176999997

Definitely worth it! What’s missing? The function that does the actual sorting for ido-set-matches:

(defvar ido-levenshtein-threshold nil
  "Threshold for Levenshtein sorting to kick in.")

(defun ido-levenshtein-sort-matches ()
  (if (< (length ido-matches) (or ido-levenshtein-threshold 2500))
      (let (ilen)
        (setq ilen (string-width ido-text))
        (sort ido-matches (lambda (a b)
                            (when (< (- (string-width a) ilen) 14)
                              (< (ido-levenshtein-distance a ido-text)
                                 (ido-levenshtein-distance b ido-text))))))

The sorting only takes effect when the number of remaining matches is less than a given threshold. In addition to that it leaves out strings with a minimum possible edit distance of 14. Those are appended in no specific order.

So, how does that affect my completion frontend?

(let ((ido-text "list")
      (ido-cur-item 'list)
      (ido-cur-list (all-completions "" obarray 'fboundp)))
  (cl-subseq (mapcar 'substring-no-properties ido-matches) 0 8))
("list" "listp" "-list" "list*" "nlistp" "dolist" "maplist" "up-list")

The perfect match is on top of the list, ready to be completed. My ido-set-matches already adds a ido-vertical-mode like count for the current ido-matches and I use some other functionality in my personal setup, among others fontification for substring matches. These are beyond the scope of this post, but you may get inspiration from various already mentioned extensions to ido.

Text Mining on DerStandard Comments

Text Mining on DerStandard Comments

I did some text mining on DerStandard.at before, back then primarily interested in the comment count per article. What has been a simple HTTP GET and passing the response to BeautifulSoup requires a more sophisticated approach today. Things change, webcontent is created dynamically and we have to resort to other tools these days.

Headless browsers provide a JavaScript API, useful for retrieving the desired data after loading the page. My choice fell on phantomjs, available on pacman:

pacman -Qi phantomjs | head -n3
Name                     : phantomjs
Version                  : 2.1.1-3
Beschreibung             : Headless WebKit with JavaScript API

Since my JavaScript skills were close to non-existent, writing the script was the hard part. After some copypasta and trial-and-error coding, inevitably running into scoping and async issues, this clobbed together piece of code actually works! It reads the URL as its first and the pagenumber in the comment section as its second argument. The innerHTML from a .postinglist element is written to a content.html in the directory the parse-postinglists.js has been invoked from. Actually the data is appended to the file, that’s why I can loop over the available comment pages in R:

for (i in 1:43) {
  system(paste("phantomjs parse-postinglists.js",
               i, sep=" "))

The article of interest is itself a quantitative analysis of word frequencies in recent forum debates on asylum. Presentation and insights are somewhat underwhelming though. After all, there is a lot of information collected and stored. Some of which, found in the attributes, is perfectly appropriate for the metadata in my Corpus object (created with the help of the tm package1), as can be seen from the first document in it:

author       : Darth Invadeher
datetimestamp: 16-09-06 14:33
pos          : 104
neg          : 13
badges       : 0
pid          : 1014788751
parentpid    : NA
id           : 1
language     : de

pid is the unique posting ID, a parentpid value is applied when this particular posting refers to another posting, i.e. is a followup posting. This opens up the possibility to relate authors to each other and probably a lot more. badges doesn’t fit too well as an attribute name, it actually denotes the follower count in that forum. pos and neg show the positive resp. negative rating count on that particular posting. At the time of this analysis there were 1064 documents (i.e. postings) in the corpus.

The average lifespan of an online article is rather short. Interestingly, the likelihood to get a lot of votes diminishes even faster. That’s probably because a few posters take their debate further long-since the average voter is done with this article. So don’t be late to the party!

Sorry, your browser does not support SVG.

The bigger part of the data stored relates to the posting content. For now, I’m interested in extracting keywords that define the discussions. Their importance for the particular posting and the whole corpus is defined by a two-fold normalization by use of a TF-IDF weighting function. Obviously, what has been a rather one-sided reflection on terms used by asylum critics was followed by a nomenclatura debate. You can tell that from the dominance of “Begriff”, “Wort”, “Bezeichnung”, “Ausdruck” etc:

Sorry, your browser does not support SVG.

Decay Score Functions for Gnus

Decay Score Functions for Gnus

For a few weeks now, I’ve been using Gnus as my MUA. The main reason for leaving Mutt is the intriguing adaptive scoring Gnus provides, which seems to be especially tailored towards people like me that read mailing lists a lot. Gnus decays non-permanent scores once a day according to modest rules applied with the gnus-decay-score function. By default, every score entry up to 60 is reduced by a constant (of 3), values greater than 60 are downscaled by 5%. This results in a (almost) linear function that deals gently with the literal scores. While there is nothing wrong with a linear function here, I’d like to restrict adaptive scoring to a certain range.

The first alternative builds on the default decay and multiplies the literal score with a factor derived from the exponentiation of the score with a negative constant. Thus the adapted score is still (almost) a linear function but shrinks output values quite a bit:

(defun gnus-decay-score-1 (score)
   (pcase score
     (`0 0)
     ((and sc (pred (> 0))) (* sc (expt (abs sc) -0.2)))
     (_ (* score (expt score -0.2))))))

For comparison, the following function applies an increasing exponential decay to literal score values. This results in values that increase rapidly first, level off thereafter and essentially max out close to a defined upper limit:

(defun gnus-decay-score-2 (score)
  (let ((ulim 100)
        (exponent -0.12))
     (pcase score
       (`0 0)
       ((and sc (pred (> 0))) (* -1 ulim (- 1 (expt (abs sc) exponent))))
       (_ (* ulim (- 1 (expt score exponent))))))))

Plotting adapted scores from various decay functions for a sequence from -50 to 250, a range that encloses almost all my current score values, will reveal the differences:

(require 'gnus-score)

(mapc 'byte-compile '(gnus-decay-score gnus-decay-score-1 gnus-decay-score-2))
(cl-loop for i in (number-sequence -50 250 1)
   collect (list i (gnus-decay-score i) (gnus-decay-score-1 i)
                 (gnus-decay-score-2 i)))
set terminal svg dashed size 720,320 enhanced font 'PT Sans,12'

set auto y
set xtics 25
set tics out scale 0.4 nomirror
set style line 101 lc rgb '#8c8c8c' lt 1 lw 1
set border 3 front ls 101
set xlabel 'Literal Score'
set ylabel 'Adapted Score'
set key left nobox

set style line 1 lt 1 lc 1
set style line 2 lt 2 lc rgb '#228b22'
set style line 3 lt 3 lc 3
set style line 4 lt 4 lc rgb '#cd00cd'

plot dat using 1:1 with line ls 1 title 'literal score',\
     dat using 1:2 with line ls 2 title 'gnus-decay-score',\
     dat using 1:3 with line ls 3 title 'gnus-decay-score-1',\
     dat using 1:4 with line ls 4 title 'gnus-decay-score-2'
Sorry, your browser does not support SVG.
Figure 1: Gnus Decay Score Function Comparison

NBA’s Free Agent’s Relative Salary Increase

NBA’s Free Agent’s Relative Salary Increase


While NBA’s salary cap continues to rise – 2016 every team has roughly $24 million more to spend, the projected cap for 2017 is something in-between $102 and $108 million – free agents (FA) take their opportunity to sign staggering contracts. The agreement of Mozgov and the Lakers for example attracted a lot of attention. But is he that bad of a deal for the LAL? I decided to not only trust in my own judgement, but do some simple dataset exploration.

It’s amazing how sometimes you can perform the whole exploratory data analysis life cycle inside Emacs – from the data retrieval to the publishing (as is done in this post). Initially I considered mentioning homoiconicity, since all my text is data – to be exact, though, the internal representation of an org-table is still a list and not what I look at in the buffer. Nonetheless, the excitement is quite similar. The process of getting messy data into shape took me a few minutes.

And these are the outcomes: Whiteside and Drummond are absolutely worth the dough, Conley is more of a surprise. Actually, there are a lot of big men on top of that list, will the small-ball dominance be a blip?

Table 1: NBA’s Free Agent’s Salary Comparison (Top 20)
Team Player Pos Age Contract (in y) Salary Total (in MM) Salary 2015-16 Salary 2016-17 Increase (in MM)
MIA Hassan Whiteside C 27 4 98 981348 24500000 +23.52
DET Andre Drummond C 23 5 130 3272091 26000000 +22.73
MEM Mike Conley G 29 5 153 9588426 30600000 +21.01
DAL Harrison Barnes F 24 4 95 3873398 23750000 +19.88
TOR DeMar DeRozan G 27 5 145 9500000 29000000 +19.50
WAS Bradley Beal G 23 5 120 5694674 24000000 +18.31
POR Allen Crabbe F 24 4 75 947276 18750000 +17.80
BOS Al Horford F/C 30 4 113 12000000 28250000 +16.25
ATL Kent Bazemore G/F 27 4 70 2000000 17500000 +15.50
ORL Bismack Biyombo C 24 4 72 3000000 18000000 +15.00
ORL Evan Fournier G/F 24 5 85 2288205 17000000 +14.71
POR Evan Turner G/F 28 4 70 3425510 17500000 +14.07
WAS Ian Mahinmi C 30 4 64 4000000 16000000 +12.00
CHA Nicolas Batum G/F 28 5 120 12235750 24000000 +11.76
DAL Dirk Nowitzki F 38 2 40 8333334 20000000 +11.67
MIA Tyler Johnson G 24 4 50 845059 12500000 +11.65
LAL Jordan Clarkson G 24 4 50 845059 12500000 +11.65
NOP Solomon Hill F 25 4 52 1358880 13000000 +11.64
HOU Ryan Anderson F 28 4 80 8500000 20000000 +11.50
LAL Timofey Mozgov C 30 4 64 4950000 16000000 +11.05

These are some teams I’m interested in that decided to offer their new signings a relative salary increase (with one exception). The table pretty much reflects the pecuniary space of these teams before the free agency:

Table 2: Average Salary Increase per signed FA
Team N FAs Average Increase
BOS 1 +16.25
MEM 4 +8.42
LAL 5 +7.12
MIA 7 +5.92
CHI 3 +2.72
SAS 3 +2.26
NYK 5 +0.92
GSW 5 +0.52
CLE 2 -2.00

Discover Destructuring Assignment in Elisp

Discover Destructuring Assignment in Elisp

LISt Processing in Emacs Lisp obviously involves a lot of juggling with lists and their elements. What else would be more convenient than generalizing the access and binding of list elements? Not only does the concept of destructuring assignment come along with code that is easier to write but also easier to read (terse, patterns that visually cue what elements are supposed to be assigned to variables). Alas, some features of Elisp have to be discovered. While pcase got its node in the Elisp Manual, neither is there an explanation what QPATTERN and UPATTERN mean nor are the related macros ever even mentioned. As if this wasn’t enough, the docstrings of pcase-let and its starred equivalent will leave the average Emacs user puzzled, pcase-dolist doesn’t even have one. This will hopefully change in subsequent versions of Emacs. For now, get ready to embark on a journey of discovery!

pcase is by far the most frequently used macro from pcase.el. What it does is pattern matching, a concept that goes beyond the scope of a blogpost. If you’re familiar with the Fibonacci Sequence, the following example is self-explanatory:

(defun fib (n)
  (pcase n
    (`0 1)
    (`1 1)
    (n (+ (fib (- n 1)) (fib (- n 2))))))
(mapcar 'fib (number-sequence 0 6))
(1 1 2 3 5 8 13)

Generally, pcase is used as a powerful conditional programming construct. Several examples can be found on this EmacsWiki page. Especially suited to the beforementioned destructuring is pcase-let:

    ((`(,spec ,month ,day ,name) (nth 3 holiday-general-holidays)))
  (princ (format "%s is on 2016-%d-%d" name month day))
"Valentine's Day is on 2016-2-14"

The practical advantage will become patently obvious when trying to do the same with let:

(let* ((l (nth 3 holiday-general-holidays))
       (spec (car l))
       (month (cadr l))
       (day (caddr l))
       (name (cadddr l)))
  (princ (format "%s is on 2016-%d-%d" name month day))))

pcase-let in its simplest form resembles Python’s poor man’s destructuring-bind, called tuple and list unpacking:

([a, b, c], d, e) = ([1, 1, 2], 3, 5)
print(a + d)

Probably even more interesting is pcase-dolist that iterates over the lists of a list:

(let ((l '()))
  (pcase-dolist (`(,spec ,month ,day . ,rest) holiday-general-holidays)
    (push (cons month (if (stringp (car rest)) rest (cdr rest))) l))
  (nreverse l))
((1 "New Year's Day")
 (1 "Martin Luther King Day")
 (2 "Groundhog Day")
 (2 "Valentine's Day")
 (2 "President's Day")
 (3 "St. Patrick's Day")
 (4 "April Fools' Day")
 (5 "Mother's Day")
 (5 "Memorial Day")
 (6 "Flag Day")
 (6 "Father's Day")
 (7 "Independence Day")
 (9 "Labor Day")
 (10 "Columbus Day")
 (10 "Halloween")
 (11 "Veteran's Day")
 (11 "Thanksgiving"))

Digging even further into the library, you’ll discover a pcase-lambda. Yet, I’m still not sure what it does besides accepting pcase patterns. But I won’t worry for now, there is exactly ONE appearance of pcase-lambda in the Emacs sources.

Adding Mail-Abbrev Expansion to Org-MIME

Adding Mail-Abbrev Expansion to Org-MIME

As has been already mentioned, I overcome the temptation to use one of Emacs’ mail-clients. Yet still, I frequently use message-send-mail from within Emacs with a message-body composed with org-mime-subtree. You will find org-mime in the /contrib directory of your org-mode installation. Since manually filling the mail-header is redundant especially for buffers frequently (re)used, I set related properties to the current subtree that org-mime-send-subtree will parse before creating the email buffer. Unfortunately there is no built-in completion for addresses, which is why I use the following code. Apparently it’s wise to set alias_file in Mutt and mail-personal-alias-file in Emacs to the same file. It’s probably worth pointing out that I remove the commas introduced by define-mail-alias that separate the address from the definition in an alias.

(defvar org-mime-properties '("MAIL_TO" "MAIL_CC" "MAIL_BCC")
  "Properties `org-mime-send-subtree' parses.")

(setq mail-aliases-only
      ;; first build `mail-aliases', then create a string for `org-set-property'.
      (mapconcat 'car (build-mail-aliases) " "))

;; hook to replace the alias in the property field with name and email
(defun bp/mail-property-changed (property value)
  (let ((el (cdr (assoc value mail-aliases))))
    (when (and el (member property org-mime-properties))
       (replace-regexp-in-string "," "" el)))))

(nconc org-default-properties org-mime-properties)
(nconc org-global-properties
       `(("MAIL_TO_ALL"         . ,mail-aliases-only)
         ("MAIL_BCC_ALL"        . ,mail-aliases-only)
         ("MAIL_CC_ALL"         . ,mail-aliases-only)))
(add-to-list 'org-property-changed-functions

← Newer  1/5  Older →