The log factor isn't free: a Dijkstra rabbit hole

2026-04-06•byAritro Roy

I went into this thinking I'd found a clever way to shave the log n off Dijkstra. I came out of it with a priority queue that's asymptotically worse than the one I started with and runs more than twice as fast, a benchmark that lied to me twice, and a much healthier respect for the gap between big-O and a cache line. Here's the whole thing.

The full implementation and stress tests are on GitHub.

The idea that started it

Standard Dijkstra with a binary heap is O((n + m) log n). The log n lives entirely in the priority queue: every pop and every decrease-key sifts through the heap. 0-1 BFS removes it by replacing the heap with a deque (push front for weight-0 edges, push back for weight-1). So the natural question: what if weights aren't just 0 and 1? Could I keep one bucket per distance value, hold a doubly linked list of nodes in each, and pop from the smallest non-empty bucket? No heap, no log.

I was briefly very pleased with myself. Then I looked it up. This is Dial's algorithm, and it's from 1969.

That's the first thing worth saying out loud: the log factor doesn't disappear. It gets traded. Dial's runs in O(m + nC) where C is the max edge weight, because finding the next non-empty bucket means walking a cursor through the distance axis, and with large weights most of those buckets are empty. You swapped log n for C. Great when weights are tiny, catastrophic when they're 10^9.

The fix for that is the radix heap, which is the structure this post is really about.

Radix heaps in one paragraph

Instead of one bucket per distance, use buckets whose widths grow as powers of two, and only ever have about log(maxDist) of them. A key k goes into the bucket given by the most significant bit at which k differs from last, where last is the most recently extracted minimum:

fn idx(&self, k: i64) -> usize {
    if k == self.last { return 0; }
    64 - ((k ^ self.last) as u64).leading_zeros() as usize
}

k ^ last zeroes the shared high bits; leading_zeros finds where they first diverge. Equal keys XOR to zero and fall into bucket 0. On extract-min you find the lowest non-empty bucket, take its minimum as the new last, and re-bucket only that bucket's contents against it. Every element shares more high bits with the new last, so each one moves strictly downward, and any element is redistributed at most log C times in its life. The monotonicity that makes this legal is exactly what Dijkstra hands you: when you pop u, dist[u] = last, and every relaxation produces dist[u] + w >= last because w >= 0.

The punchline for anyone doing competitive programming: when weights fit in a machine word, log C <= 64, which is a constant. The radix heap turns Dijkstra into effectively O(m + n). The only requirement is integer weights, same as Dial's.

First implementation, and where it actually hurt

I wrote the first version in Rust with intrusive doubly linked lists: four parallel arrays (next, prev, bucket_of, key), O(1) decrease-key by splicing a node out of its bucket and into a lower one. Clean in theory.

I also wanted to benchmark it against a normal binary-heap Dijkstra without duplicating the relaxation loop, so I put both queues behind a tiny trait and wrote the loop once:

trait MonotonePq {
    fn mp_push(&mut self, node: usize, key: i64);
    fn mp_pop(&mut self) -> Option<usize>;
    fn mp_decrease(&mut self, node: usize, key: i64);
}

The binary heap implements mp_decrease by pushing a duplicate and skipping stale pops (lazy deletion); the radix heap does a real remove-and-reinsert. Same loop drives both, monomorphized, zero overhead. The one trick that lets a single loop serve both is having pop return only the node id and reading dist[u] from the array, so the binary heap's lazy staleness is handled by the same visited check the radix heap uses.

Then I ran it on my laptop (release build), source distances over random sparse graphs, weights up to 10^6:

graph	binary heap	radix (intrusive)	radix / binary
sparse, 50K nodes	8.0 ms	4.4 ms	0.54x
sparse, 200K	47.9 ms	31.6 ms	0.66x
sparse, 1M	302.7 ms	295.0 ms	0.97x
grid, 490K	26.2 ms	35.1 ms	1.34x
dense, 3K (deg 1000)	3.7 ms	3.3 ms	0.89x

Look at what happens as the sparse graphs grow. At 50K the radix heap is nearly twice as fast. At 1M it's a dead heat. And on the grid it actually loses. The win evaporates exactly as n grows, at fixed density.

That's not noise. The four arrays are about 32 bytes per node, so the working set goes from 1.6 MB at 50K to 32 MB at 1M. Somewhere in there it stops fitting in cache, and the redistribution step starts chasing next[] pointers through main memory instead of streaming. The grid is the worst case for the structure: degree 4, so almost no decrease-key work to amortize against, plus a working set right at the edge of cache. The binary heap, meanwhile, is one contiguous array that the prefetcher loves.

So the honest version of "radix heaps are faster" is: faster when the queue work is a big fraction of runtime and the structure still fits in cache. Outside that, the constant factors eat the asymptotics alive.

The benchmark lied to me. Twice.

Before I trusted any of this I almost published the opposite conclusion, because of two things that have nothing to do with the algorithm.

Debug vs release. I ran the same code through my IDE and through the terminal and got numbers that were 5x apart and disagreed on the winner. The IDE was running a debug build. In debug, every operation is fat: bounds checks live, nothing inlined, the binary heap's comparisons never optimized. That inflates the per-operation constant uniformly, which inflates the queue-bound fraction of runtime, which makes the radix heap look better than it is. Release stripped all that away and flipped sparse-large and grid against the radix heap. If I'd benchmarked in debug I'd have written "radix wins everywhere," and I'd have been wrong.

The machine. I ran the identical binary on two boxes. Same graphs (the checksums matched byte for byte), same build. On one, the radix heap lost the 1M sparse case at 1.06x. On the other it won at 0.59x. The difference is cache size. On the box with more last-level cache, the 32 MB of arrays that thrashed the small machine fit fine, and the algorithmic edge showed up.

The lesson I'd put on a sticky note: for pointer-chasing data structures, the build profile and the CPU's cache are bigger variables than the algorithm. If you report Dijkstra benchmarks without saying "release, codegen-units=1, here's my L3," your numbers don't mean much, and a reader who reruns them on a different box will reasonably conclude you regressed.

The redesign that won

The intrusive linked list was the problem. Scattered nodes, random-access redistribution, exactly the access pattern caches punish. So I threw it out and made each bucket a flat Vec of (key, node) pairs, with lazy deletion instead of eager decrease-key:

pub struct RadixHeap {
    last: i64,
    sz: usize,
    buckets: [Vec<(i64, u32)>; 65],
    key: Vec<i64>,
}

decrease doesn't hunt down the old entry and remove it. It just pushes a new (k, v) and updates key[v]. Stale entries (where the stored key no longer matches key[v]) get filtered out lazily during redistribution and at pop time. This drops three whole arrays plus the visited array in the driver, and turns redistribution into a sequential scan over contiguous memory.

I reran the whole suite on the same laptop, lazy radix against the same binary-heap baseline as before:

graph	binary heap	radix (lazy)	radix / binary
sparse, 50K	8.7 ms	3.4 ms	0.39x
sparse, 200K	46.2 ms	25.7 ms	0.56x
sparse, 1M	297.0 ms	146.4 ms	0.49x
sparse, 5M	2600.3 ms	986.3 ms	0.38x
grid, 490K	26.2 ms	35.7 ms	1.36x
dense, 3K	3.8 ms	3.3 ms	0.87x
dense, 2K	3.9 ms	3.7 ms	0.94x

Now compare the sparse column to the intrusive table above, same binary baseline, same machine. The intrusive heap's win evaporated as n grew, sliding from 0.54x at 50K to a 0.97x tie at 1M. The lazy heap does the opposite. It holds around 0.4 to 0.5x across three orders of magnitude and then pulls further ahead at 5M, where it's 0.38x, the biggest win in the whole suite. The redesign didn't just flatten the size-gradient, it reversed it: bigger graphs now favor the radix heap more, because the binary heap pays an ever-growing log n and its own memory pressure while the radix heap's per-operation cost stays effectively constant. At 5M nodes the binary heap is spending about 520 ns per node settled and the radix heap about 197 ns. The grid still loses at 1.36x, but that's structural and I'll come back to it.

And here's the part I find genuinely funny. The lazy version is asymptotically worse. Counting redistributions, the eager intrusive heap moves O(n log C) nodes; the lazy one lets every relaxation spawn a duplicate, so it moves O(m log C) entries. That's a strictly larger bound. It still wins on the clock, because at these sizes you are not paying for operations, you are paying for memory latency, and the lazy heap pays it in cheap sequential bursts. Big-O told me the eager version was better. The cache disagreed, and the cache was right.

(One sharp edge if you copy this: pop does buckets[0].pop().unwrap() in a loop, which assumes bucket 0 always holds a live entry when reached. That holds under monotone inserts, which Dijkstra guarantees and 3000 random-graph stress tests confirmed. Lift it into a non-monotone context and that unwrap is a landmine.)

Two micro-optimizations, and why one of them is a trap

Once the heap was good I tried two more things, expecting both to help. Only one did, and not where I expected. (These two I only measured on the constrained VM, so read the percentages as directions, not laptop numbers.)

CSR adjacency instead of Vec<Vec<...>>. I assumed this was a universal win. It isn't. It sped the grid up by 21% and did nothing (or slightly hurt) everywhere else. The grid has half a million tiny adjacency lists, each its own heap allocation in a random place, so packing them contiguously helps a lot. Dense graphs have a few enormous lists that are already contiguous, so CSR's split to/weight arrays just add an indirection. CSR is the right call for road-network-shaped graphs and pointless for dense ones.

Packing each bucket entry into 8 bytes instead of 16. (i64, u32) pads to 16 bytes; stuffing the distance into the high bits of a u64 and the node id into the low bits halves the bucket memory traffic. This helped the heap-bound cases (13% on 200K sparse, 6% on 1M) and was a wash on the grid, which isn't heap-bound. It costs you a range constraint: with 24 bits for the node and 40 for the distance, you have to actually check n < 2^24 and maxDist < 2^40 or it silently corrupts. Worth it only when the bucket scan is your bottleneck.

The complexity, stated honestly

Putting it together for integer weights with max value C:

priority queue	Dijkstra
binary heap (lazy)	`O(m log n)`
Fibonacci heap	`O(m + n log n)`
Dial's (one bucket per value)	`O(m + nC)`
eager radix heap	`O(m + n log C)`
lazy radix heap (this post)	`O(m log C)`

The lazy heap is a log factor worse than the eager one on paper and faster in practice. When weights fit a word, log C is a constant and both radix variants are linear in everything but the constant the benchmark actually measures.

For arbitrary real weights none of this applies, since the bit trick needs integers. There O(m + n log n) stood as the bound for decades, until Duan, Mao, Mao, Shu, and Yin broke it in 2025 with a deterministic O(m log^(2/3) n) algorithm (STOC best paper). Their trick is roughly that Dijkstra over-sorts: it produces a full ordering when it only needs a partial one. Different rabbit hole, also worth a read.

What I'd actually use

If your weights are integers and you want speed: lazy bucket radix heap. It's less code than the intrusive version, it has no decrease-key bookkeeping, and it's the fastest thing I measured on large sparse graphs. Add CSR only if your graph is low-degree, add entry packing only if profiling says the bucket scan is hot.

If your weights are real, or you just want something boring and correct: a binary heap with lazy deletion. It's a few lines, it's cache-friendly, and it's almost never the bottleneck.

The thing I'll carry out of this isn't a data structure. It's that I twice nearly published a wrong conclusion from a benchmark that ran fine and gave clean numbers, and the only reason I didn't is that the numbers disagreed with each other across a build flag and a machine. The final result is real and it's good: a priority queue that's a log factor worse on paper, twice as fast at a million nodes, measured on one machine with one build. But I'd never have trusted that number if I hadn't watched the earlier ones lie. Measure, then measure again somewhere else, and write down the conditions. The algorithm was the easy part.