Levenshtein Sorting for Ido

Levenshtein Sorting for Ido

I’ve been unimpressed so far by third-party completion frontends for Emacs – although there is a plethora of them, all of them in some way reflecting the author’s preferences that may or may not be compatible with yours. That’s why I stayed with the built-in ido-mode and various extensions to it, namely smex, ido-vertical, flx-ido and ido-hacks. Generally, they either hook into the minibuffer setup or advise the ido-set-matches resp. ido-set-matches-1, the workhorse of ido, that returns the final matches. At some point, though, I decided to overcome the shortcomings of ido myself and, en passant, remove a lot of redundant code.

One of the more important aspects to me is providing the list of completion candidates in some reasonable order, such that you do not have to scroll down a lot. Obviously, almost exact matches to the input string should find themselves on top of the list – a shortcoming that really annoys me with for example Ivy’s default regexp matching.

The first step therefore is to hijack the ido-set-matches function as it is inflexible and doesn’t provide the hooks to alter the final ido-matches. I took the opportunity to change the “flex” matching ido provides – which builds a rather naive and inefficient greedy regexp and only kicks in if the previous string-match on the input string as a whole fails:

;; -*- lexical-binding: t -*-

(fset 'orig-ido-set-matches 'ido-set-matches)
(defun ido-set-matches ()
  (when ido-rescan
     ido-rotate nil
     (if (and (memq ido-cur-item '(list buffer))
              (> (string-width ido-text) 1))
         (let (matches re)
             (regexp-quote (string (aref ido-text 0)))
             (mapconcat (lambda (c)
                          (concat "[-\.]*" (string c)))
                        (substring ido-text 1)
           (if ido-enable-prefix
               (setq re (concat "\\`" re)))
           (setq matches '())
            (lambda (s)
              (if (string-match re s)
                  (setq matches (cons s matches))))
       ;; original `ido-set-matches' uses a `reverse' in the next line
       ;; that produces a lot of overhead
       (ido-set-matches-1 ido-cur-list (not ido-rotate))))
    (setq ido-matches (ido-levenshtein-sort-matches))
    ;; show match count like ido-vertical-mode
    (setcar ido-decorations
             (format "\n%-5d  " (length ido-matches))
             'face '((:weight bold))))
    (run-hooks 'ido-alter-matches-hook)))

I kicked out a lot of computationally expensive stuff, namely a reverse in the end and then some. In return I added that badly missing hook to allow further processing on the matches returned. While the limiting of completion choices immediately kicks in when the input string consists of more than one character and is more efficient than this function’s defaults, we still need the previously mentioned clever sorting.

The Levenshtein algorithm may not be the right choice for long strings like filenames – that’s why I restricted the new ido-set-matches to lists and buffers – but does a decent job for sorting symbols and buffernames. Unfortunately, the Elisp implementations of Levenshtein (levenshtein.el and org-babel-edit-distance in ob-core.el) build the full matrix to hold the distances, which imposes a lot of unnecessary space and time complexity. A more efficient variant would reuse two vectors in each iteration:

(defun ido-levenshtein-distance (s1 s2)
  (let* ((m (string-width s1))
         (n (string-width s2)))
    ;; trivial cases
    (if (string-equal s1 s2) 0
      ;; (cond ((= m 0) n)
      ;;       ((= n 0) m))
      (let ((v1 (make-vector (1+ n) nil))
            (v0 (make-vector (1+ n) nil)))
        (dotimes (i (1+ n))
          (setf (aref v0 i) i))
        (dotimes (i m)
          (setf (aref v1 0) (1+ i))
          ;; fill the remaining cells of v1
          (dotimes (j n)
            (setf (aref v1 (1+ j))
                  (min (1+ (aref v1 j))      ;deletion
                       (1+ (aref v0 (1+ j))) ;insertion
                       (+ (aref v0 j)
                          (if (= (aref s1 i) (aref s2 j))
                            1))         ;substitution
          ;; v0 is used to store the results from the previous
          ;; iteration and is reused to fill the appropriate cell in
          ;; later iterations. given 2 strings "aabb" and "aaab",
          ;; these are the results:
          ;; v0: [1 0 1 2 3] v1: [0 1 2 3 4]
          ;; v0: [2 1 0 1 2] v1: [1 0 1 2 3]
          ;; v0: [3 2 1 1 1] v1: [2 1 0 1 2]
          ;; v0: [4 3 2 2 1] v1: [3 2 1 1 1]
          ;; the `setf' below essentially is what `cl-rotatef' expands
          ;; to - rotate v0 and v1
          (setf v1 (prog1 v0 (setf v0 v1))))
        ;; (message "v0: %s v1: %s" v0 v1)
        (aref v0 n)))))

(byte-compile 'ido-levenshtein-distance)

Save your sanity and byte-compile the function. Let’s compare it to the beforementioned org-babel-edit-distance:

(byte-compile 'org-babel-edit-distance)

 (lambda (f)
    f (benchmark-run 1000
        (funcall f "der alte mann und das meer" "alte männer im meer"))))
 '(ido-levenshtein-distance org-babel-edit-distance))
Function t total GC t GC
ido-levenshtein-distance 0.14028309400000002 0 0.0
org-babel-edit-distance 0.70165323 3 0.25587406500000043

Definitely worth it! What’s missing? The function that does the actual sorting for ido-set-matches:

(defvar ido-levenshtein-threshold nil
  "Threshold for Levenshtein sorting to kick in.")

(defun ido-levenshtein-sort-matches ()
  (if (< (length ido-matches) (or ido-levenshtein-threshold 2500))
      (let (ilen)
        (setq ilen (string-width ido-text))
        (sort ido-matches
              (lambda (a b)
                (when (< (- (string-width a) ilen) 14)
                  (< (ido-levenshtein-distance a ido-text)
                     (ido-levenshtein-distance b ido-text))))))

The sorting only takes effect when the number of remaining matches is less than a given threshold. In addition to that it leaves out strings with a minimum possible edit distance of 14. Those are appended in no specific order.

So, how does that affect my completion frontend?

(let ((ido-text "list")
      (ido-cur-item 'list)
      (ido-cur-list (all-completions "" obarray 'fboundp)))
  (cl-subseq (mapcar 'substring-no-properties ido-matches) 0 8))
("list" "listp" "-list" "list*" "nlistp" "dolist" "maplist" "up-list")

The perfect match is on top of the list, ready to be completed. My ido-set-matches already adds a ido-vertical-mode like count for the current ido-matches and I use some other functionality in my personal setup, among others fontification for substring matches. These are beyond the scope of this post, but you may get inspiration from various already mentioned extensions to ido.