A background poller in one of our Node services started timing out in production. Its only job was to read a set of task records out of Redis, check which ones were idle, and act on them. It worked fine in development and hit the 100-second timeout against the real dataset. This is the short story of why, and the fix.

The naive version

The original loop did the obvious thing: SCAN the keyspace for matching keys, then for each key, GET the value, deserialize it, and check whether the task was idle.

// one network round trip per key — the problem
for await (const key of redis.scanIterator({ MATCH: "task:*" })) {
  const raw = await redis.get(key);
  if (raw === null) continue; // key expired or was deleted after SCAN returned it
  const task = JSON.parse(raw);
  if (isIdle(task)) await handle(task);
}

Two things make this slow, and they compound. First, SCAN walks the entire keyspace in cursor-sized chunks — it is O(N) over every key in the database, not just the ones you want. Second, the per-key GET means one network round trip per task. Against a remote Redis, latency, not Redis itself, is the bottleneck: a few hundred microseconds of compute wrapped in a few milliseconds of network, thousands of times over. The schema had no secondary structure to narrow the scan, so the work grew with the whole database and blew past the 100-second timeout.
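To make the round-trip arithmetic concrete, here is a back-of-the-envelope model. The key count and the 2 ms round-trip time are made-up illustrative values, not measurements from this incident, and the model deliberately ignores SCAN's own round trips, server-side compute, and the cost of handling each task.

```javascript
// Rough latency model: total time is approximately (number of round trips) x (RTT).
// All inputs are hypothetical; nothing here was measured on the real system.
function estimateSeconds(keyCount, rttMs, batchSize = 1) {
  const roundTrips = Math.ceil(keyCount / batchSize);
  return (roundTrips * rttMs) / 1000;
}

const perKey = estimateSeconds(10_000, 2);       // one GET per key
const batched = estimateSeconds(10_000, 2, 500); // one read per 500 keys
console.log({ perKey, batched });                // { perKey: 20, batched: 0.04 }
```

The point of the sketch is only the shape of the curve: the time scales with the number of round trips, so anything that divides the round trips divides the wait.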

The fix: batch the reads, filter client-side

The keys still had to be discovered, but the per-key round trips were pure waste. Redis can return many values in a single command with MGET, so I collected keys in batches and fetched each batch in one trip, then did the idle check in application code.

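const BATCH = 500;
// keys: the task keys collected from SCAN; chunk: splits an array into BATCH-sized slices
for (const batch of chunk(keys, BATCH)) {
  const values = await redis.mGet(batch);   // one round trip per 500 keys
  for (const raw of values) {
    if (raw === null) continue;              // MGET returns null for keys that no longer exist
    const task = JSON.parse(raw);
    if (isIdle(task)) await handle(task);    // filter in the app, not per-GET
  }
}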
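The loop above relies on a `keys` array and a `chunk` helper that are not shown. One possible way to supply them is sketched below. `batched` is a hypothetical helper name, not part of any Redis client; it groups keys as they come off any async iterable, so the full key list never has to sit in memory. `chunk` is the plain-array equivalent.

```javascript
// Hypothetical helper: group items from any async iterable into arrays of `size`.
async function* batched(source, size) {
  let batch = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Plain-array version, matching a chunk(keys, BATCH) style call.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

Assuming the iterator yields one key at a time, as in the first example, the batched loop could then start with `for await (const batch of batched(redis.scanIterator({ MATCH: "task:*" }), 500))` and call `redis.mGet(batch)` on each batch. Neither helper ever produces an empty batch, which matters because MGET requires at least one key.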

One round trip per 500 keys instead of one per key cuts the read round trips by roughly the batch factor. The idle check stays in application code, so Redis only ever does cheap bulk reads: no per-item logic on the server, no extra commands. Key discovery via SCAN is unchanged, so the overall gain is smaller than the batch factor. Total work dropped about 90%, and the timeouts stopped.

Why this works, and the trade-off

The lesson is not "MGET is faster than GET"; it is that over a network, round trips dominate. Batching trades a little memory (you hold a batch of values at once) for a large cut in latency. MGET is the simplest tool; a pipeline or Lua script does the same job when the reads are not all plain GETs. The real fix for the underlying problem is to stop scanning the whole keyspace at all: maintain an index, such as a Redis set or sorted set of the keys you care about, so you read a known, bounded list instead of rediscovering it every cycle.
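A minimal sketch of the index idea follows, assuming a node-redis v4 style client like the one in the earlier snippets. The index key `tasks:by-last-active`, the choice of score (last-activity time in milliseconds), and both function names are hypothetical; the schema described here had no such index. Writers record each task in a sorted set, and the poller asks only for members whose score is old enough.

```javascript
const INDEX = "tasks:by-last-active"; // hypothetical index key

// Call on every task write: store the task and record its last-activity time.
async function saveTask(redis, key, task, now = Date.now()) {
  await redis.set(key, JSON.stringify(task));
  await redis.zAdd(INDEX, { score: now, value: key });
}

// Poller: look up only keys idle for at least idleMs, then read them in one MGET.
async function readIdleTasks(redis, idleMs, now = Date.now()) {
  const keys = await redis.zRangeByScore(INDEX, 0, now - idleMs);
  if (keys.length === 0) return []; // MGET needs at least one key
  const values = await redis.mGet(keys);
  return values
    .filter((raw) => raw !== null) // skip index entries whose task is gone
    .map((raw) => JSON.parse(raw));
}
```

This sketch leaves out details a real version would need: the two writes in `saveTask` are not atomic unless wrapped in a transaction, index entries have to be removed when a task is deleted, and a very large idle set should still be read in batches rather than in a single MGET.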

Takeaways

  • Profile where the time goes before you optimize. Here it was network round trips, not Redis throughput.
  • SCAN is O(keyspace). If you scan on a hot path, you have already lost — keep an index of the keys you need.
  • Batch remote reads (MGET, pipelines) and do filtering in the application when the per-item check is cheap.
