Emacs Lisp Performance Oddity
I wrote my 2025 Advent of Code solutions in Emacs Lisp. My Emacs Lisp style was not very lispy or functional. I'll discuss this more, and why, in a future post.
In this post, we will look at a performance mistake I consistently made in my solutions, illustrated with two examples.
argmax
I wrote this bit of code to find the location of the max element in a vector:
```emacs-lisp
(cl-loop for i from 0 below (length vec)
         with maxidx = nil
         do (when (or (not maxidx)
                      (> (aref vec i) (aref vec maxidx)))
              (setq maxidx i))
         finally return maxidx)
```
Claude provided this example which uses a fold/reduce, which is "more idiomatic":
```emacs-lisp
(car (seq-reduce
      (lambda (acc i)
        (if (or (not (car acc))
                (> (aref vec i) (aref vec (car acc))))
            (cons i (cdr acc))
          acc))
      (number-sequence 0 (1- (length vec)))
      (cons nil nil)))
```
But, after producing this example, Claude said "the fold is arguably worse." I certainly think so, for a few reasons:
- The accumulation term `(cons nil nil)` is awkward because the purpose of each element is never specified. You have to read through all the other indexing gobbledygook to figure out that it's `(cons max-idx max-val)`.
- There's a temporary `(number-sequence ...)` list being allocated/deleted just to iterate over it.
- `(aref vec (car acc))`: you can't tell me this is readable!
The correctly-idiomatic answer turns out to be much simpler (though behavior for ties is slightly different):
```emacs-lisp
(seq-position vec (seq-max vec))
```
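As a quick sanity check on a small, made-up test vector:

```emacs-lisp
;; The max of this (hypothetical) vector is 9, at index 2.
(seq-position [3 1 9 1 5] (seq-max [3 1 9 1 5]))
;; => 2
```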
This looks like it scans the vector twice. Both of the earlier versions make only one pass (though the reduce version also allocates a potentially huge index list and then walks both).
Let's take a closer look.
seq-position looks as you'd expect:
```emacs-lisp
(cl-defgeneric seq-position (sequence elt &optional testfn)
  (let ((index 0))
    (catch 'seq--break
      (seq-doseq (e sequence)
        (when (funcall (or testfn #'equal) e elt)
          (throw 'seq--break index))
        (setq index (1+ index)))
      nil)))
```
seq-max converts the sequence to a list first (copying it), then calls the C builtin `max`:
```emacs-lisp
(cl-defgeneric seq-max (sequence)
  (apply #'max (seq-into sequence 'list)))
```
So, the seq solution actually does three passes, which should obviously be much worse than the cl-loop version, right?
Claude said yes, my intuition is probably correct: "the cl-loop is probably faster."
Let's check!
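(The `timeit*` and `timeit-report` helpers live in the org file linked at the end of the post. As a rough, self-contained stand-in, Emacs's built-in `benchmark-run` can time repeated calls; the vector size below is a guess, not necessarily the post's actual setup.)

```emacs-lisp
(require 'cl-lib)

;; The argmax-loop definition from above.
(defun argmax-loop (vec)
  (cl-loop for i from 0 below (length vec)
           with maxidx = nil
           do (when (or (not maxidx)
                        (> (aref vec i) (aref vec maxidx)))
                (setq maxidx i))
           finally return maxidx))

;; benchmark-run returns (elapsed-seconds gc-runs gc-elapsed-seconds).
(let ((vec (vconcat (cl-loop repeat 100000 collect (random 1000)))))
  (benchmark-run 100 (argmax-loop vec)))
```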
```emacs-lisp
<<timeit-setup-argmax>>
(timeit-report
 (list
  (timeit* "cl-loop" 100 (argmax-loop bench-vec-random))
  (timeit* "seq-*" 100 (argmax-builtin bench-vec-random))))
```
| Method | Time | Relative |
|---|---|---|
| cl-loop | 0.9196s | 4.2x |
| seq-* | 0.2169s | 1.0x |
Well, hmm, I guess the fast C builtins make a big difference?
If we native compile, that should eliminate a bunch of the overhead, right?
```emacs-lisp
<<timeit-setup-argmax>>
(native-compile 'argmax-loop-native)
(native-compile 'argmax-builtin-native)
(timeit-report
 (list
  (timeit* "cl-loop" 100 (argmax-loop-native bench-vec-random))
  (timeit* "seq-*" 100 (argmax-builtin-native bench-vec-random))))
```
| Method | Time | Relative |
|---|---|---|
| cl-loop | 0.4801s | 2.3x |
| seq-* | 0.2114s | 1.0x |
Closer, but the cl-loop is still a lot slower.
Fascinating.
This is not at all what I was expecting!
So far, the best explanation I have is that running the loop and the comparisons in C helps a lot, but Claude and ChatGPT can't make up their minds about whether the problem is the loop overhead, boxing/unboxing, or frequent switching between the C runtime and Lisp.
This warrants more investigation, or some expert feedback.
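One avenue for that investigation: Emacs ships a sampling profiler, which can at least show where the time goes. A sketch, reusing the `argmax-loop` and `bench-vec-random` names from the benchmark setup above:

```emacs-lisp
;; Sample CPU time while running the benchmarked function.
(profiler-start 'cpu)
(dotimes (_ 100)
  (argmax-loop bench-vec-random))
(profiler-stop)
(profiler-report)  ; interactive tree of CPU samples per function
```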
sum
I also apparently wrote this:
```emacs-lisp
(cl-loop for c across vec sum c)
```
Instead of any of these:
```emacs-lisp
(seq-reduce #'+ vec 0)
(cl-reduce #'+ vec)
(apply #'+ (append vec nil))
```
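All three agree on a small, made-up example:

```emacs-lisp
(require 'seq)
(require 'cl-lib)

(seq-reduce #'+ [1 2 3 4] 0)        ;; => 10
(cl-reduce #'+ [1 2 3 4])           ;; => 10
;; (append vec nil) copies the vector into a list for apply.
(apply #'+ (append [1 2 3 4] nil))  ;; => 10
```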
seq-reduce and cl-reduce seem like they should do roughly the same thing, so I'm just testing seq-reduce.
I believe there are argument count limits for apply, but it seems to be working.
If what we learned last time is correct, then the apply solution will probably run fastest, despite having to convert the vector to a list first.
Let's check!
```emacs-lisp
<<timeit-setup-sum>>
(timeit-report
 (list
  (timeit* "cl-loop" 100 (sum-cl-loop bench-vec-random))
  (timeit* "seq-reduce" 100 (sum-seq-reduce bench-vec-random))
  (timeit* "apply" 100 (sum-apply bench-vec-random))))
```
| Method | Time | Relative |
|---|---|---|
| cl-loop | 5.6799s | 43.1x |
| seq-reduce | 0.1793s | 1.4x |
| apply | 0.1317s | 1.0x |
So, maybe our rule works?
Though, the seq-reduce version isn't that far off from apply.
What if we native compile?
```emacs-lisp
<<timeit-setup-sum>>
(native-compile 'sum-cl-loop-native)
(native-compile 'sum-seq-reduce-native)
(timeit-report
 (list
  (timeit* "cl-loop" 100 (sum-cl-loop-native bench-vec-random))
  (timeit* "seq-reduce" 100 (sum-seq-reduce-native bench-vec-random))
  (timeit* "apply" 100 (sum-apply bench-vec-random))))
```
| Method | Time | Relative |
|---|---|---|
| cl-loop | 0.5304s | 4.1x |
| seq-reduce | 0.1775s | 1.4x |
| apply | 0.1291s | 1.0x |
Wow!
So in this case, the result is a bit different.
Native compilation seems to make a big difference, but the C loop still seems to win.
Note that I've had some variability in running this specific test, but after a few runs the result is pretty consistent.
The conclusion seems to be that calling builtins makes a huge difference.
To inspect the source for timeit and other support code, take a look at the raw org file on GitHub: https://github.com/dpzmick/dpzmick.com/blob/master/posts/2026-aoc-seq-lisp-perf.org