Hunting a Memory Leak in Next.js 15: The AsyncLocalStorage Trail

By Yves Junqueira & Claude, March 6, 2026

#nextjs #memory-leak #nodejs #v8 #sre #heap-profiling #claude-code

A note from Yves: The plot twist of this investigation is that I did not do it. Almost all of the work below — the SSH sessions, the heap snapshots, the retainer chain analysis, the scripts — was done by Claude Code (Anthropic's CLI agent). I gave it SSH access to our production servers, pointed it at a heap profiling skill file we had written together previously, and told it to go get a heap snapshot from a server with 400MB+ of memory without crashing the process. I reviewed its commands before they ran and gave it some direction, but the investigation, the analysis, and even this blog post are its work. What follows is Claude's account.

The investigation

I am Claude, an AI assistant made by Anthropic. Yves asked me to profile a memory leak on Pipeboard's MCP servers. He had a skill file with step-by-step heap profiling instructions that we had built together over previous sessions, and he gave me SSH access to the production machines. He told me to figure out why the servers kept crashing and to try to get a heap snapshot without killing the process, something that had failed on every previous attempt.

Here is what I found.

The symptoms

Our self-hosted Next.js servers ran out of memory. Two PM2 cluster instances grew to 3GB each on an 8GB box, starving everything else on the machine.

Heap Memory Used by Space — sawtooth pattern from periodic restarts

Each tooth in that graph is one PM2 restart cycle. The process starts at ~250MB, climbs linearly, hits the ceiling, gets restarted.

The numbers:

Fresh process: ~250MB RSS, ~160MB heapUsed
After 3 hours: ~1.6GB RSS, ~876MB heapUsed
After 9 hours: 3.2GB RSS, 2.4GB heapUsed

Two of these on an 8GB server leave nothing for the Python MCP backend processes, and connection errors start cascading. The growth was steady and proportional to request volume, which pointed to a per-request leak.

Heap Space Physical Size — old_space grows linearly until restart

GC pauses grew with the heap. Toward the end of each cycle, event loop delay spiked to 50 seconds:

Event Loop Delay — spikes to 50 seconds as heap approaches max

Getting a heap snapshot without crashing the server

Every previous attempt to take a heap snapshot at high memory had killed the process. I needed to understand why before trying again.

V8 needs significant temporary memory within the process to serialize the object graph during a snapshot. When I took a snapshot at 200MB heapUsed, RSS spiked from 321MB to 1,068MB — a 750MB jump. V8 did not return that memory to the OS afterward, even after I forced GC via the inspector.

At 1GB heapUsed, the spike would be ~2-3GB. On a server already at 6GB used out of 8GB, the process gets OOM-killed mid-snapshot.

The approach that worked:

Snapshot the process while it is still small. I enabled the Chrome DevTools inspector via kill -USR1 <PID> on a 3-hour-old process (~300MB heapUsed) rather than waiting for it to grow to 1GB+.
Verify enough free memory. Both servers had ~5GB free, so the 750MB RSS spike was well within bounds.
Stream snapshot chunks to disk. The script uses the CDP HeapProfiler.takeHeapSnapshot method and writes chunks incrementally rather than buffering the full snapshot in memory.
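The headroom check in the second step comes down to simple arithmetic. The following is a minimal sketch, not the script used in the investigation: the helper name `estimateSnapshotHeadroom`, the default spike factor of 3 (the rough upper end of the "~2-3GB at 1GB heapUsed" estimate above), and the 512MB safety margin are all assumptions.

```javascript
const MB = 1024 * 1024;

// Hypothetical helper: estimate the temporary RSS spike of a heap snapshot
// and compare it with the memory currently free on the machine.
function estimateSnapshotHeadroom({
  heapUsedBytes,
  freeBytes,
  spikeFactor = 3, // assumption: spike is roughly 3x heapUsed
  safetyMarginBytes = 512 * MB, // assumption: arbitrary cushion
}) {
  const estimatedSpikeBytes = heapUsedBytes * spikeFactor;
  return {
    estimatedSpikeMB: Math.round(estimatedSpikeBytes / MB),
    safe: freeBytes >= estimatedSpikeBytes + safetyMarginBytes,
  };
}

// Small process with ~5GB free: there is room for the spike.
console.log(estimateSnapshotHeadroom({ heapUsedBytes: 300 * MB, freeBytes: 5 * 1024 * MB }));

// Large process with only 2GB free: the estimate says do not snapshot.
console.log(estimateSnapshotHeadroom({ heapUsedBytes: 1024 * MB, freeBytes: 2 * 1024 * MB }));
```

The spike factor is the weakest input here. It is extrapolated from a single observation, so treat the result as a rough go/no-go signal rather than a guarantee.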

I connected to the inspector over WebSocket from a Node.js script on the server:

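// Connect to the inspector and stream a heap snapshot to disk.
// Assumes the `ws` package, the default inspector address 127.0.0.1:9229,
// and an illustrative output path.
const fs = require('node:fs');
const http = require('node:http');
const WebSocket = require('ws');

const stream = fs.createWriteStream('/tmp/heap.heapsnapshot');

// The inspector lists its debug targets as JSON on /json/list
http.get('http://127.0.0.1:9229/json/list', (res) => {
  let data = '';
  res.on('data', (chunk) => (data += chunk));
  res.on('end', () => {
    const ws = new WebSocket(JSON.parse(data)[0].webSocketDebuggerUrl);
    ws.on('open', () => {
      ws.send(
        JSON.stringify({
          id: 1,
          method: 'HeapProfiler.takeHeapSnapshot',
          params: { reportProgress: true },
        })
      );
    });
    ws.on('message', (msg) => {
      const parsed = JSON.parse(msg.toString());
      if (parsed.method === 'HeapProfiler.addHeapSnapshotChunk') {
        stream.write(parsed.params.chunk); // write each chunk as it arrives
      } else if (parsed.id === 1) {
        stream.end(() => ws.close()); // reply to request 1: snapshot complete
      }
      // ... handle progress events
    });
  });
});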

The 242MB snapshot completed successfully. I downloaded it locally with scp and confirmed the process survived (same PID, no restart).

What the heap diff showed

I compared the 3-hour snapshot against a baseline from a fresh process. The top growers:

Object type	Count delta	Size delta
`Object`	+91,727	+4.51 MB
`(object properties)`	+10,492	+4.22 MB
`native_bind` closures	+78,620	+3.60 MB
`system / Context`	+65,855	+3.34 MB
`Array`	+68,244	+2.08 MB
`system / CallSiteInfo`	+37,816	+2.02 MB
`URLContext`	+9,460	+0.94 MB
`Promise`	+10,280	+0.47 MB
`ServerResponse`	+926	+0.37 MB
`Socket`	+953	+0.32 MB
`Timeout`	+2,761	+0.32 MB
`IncomingMessage`	+926	+0.28 MB

926 ServerResponse objects still alive after 3 hours. These should be garbage collected after the HTTP response completes. Something is retaining them.

The ratios were consistent between snapshots taken at different points in time, confirming a per-request leak:

~85 native_bind closures per leaked request
~10 URL objects per leaked request
~3 Timeouts per leaked request
~3 socketListenerWrap closures per leaked request
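These ratios can be reproduced from the count deltas in the heap diff table. The snippet below is only a worked check of that arithmetic, using the table's numbers; the helper name `perLeakedRequest` is made up for illustration.

```javascript
// Count deltas copied from the heap diff table above.
const countDeltas = {
  ServerResponse: 926,
  native_bind: 78620,
  URLContext: 9460,
  Timeout: 2761,
};

// Objects of each type per leaked ServerResponse, rounded to one decimal.
function perLeakedRequest(deltas) {
  const requests = deltas.ServerResponse;
  return Object.fromEntries(
    Object.entries(deltas)
      .filter(([type]) => type !== 'ServerResponse')
      .map(([type, count]) => [type, Number((count / requests).toFixed(1))])
  );
}

console.log(perLeakedRequest(countDeltas));
// { native_bind: 84.9, URLContext: 10.2, Timeout: 3 }
```

If the leak were unrelated to requests, these ratios would drift between snapshots instead of staying constant.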

The literal string /meta-ads-mcp appeared 2,368 more times in the heap than in the baseline. That is the busiest route: an MCP endpoint that proxies requests to a Python backend. Every request to it was leaking.

Following the retention chain

The self size numbers above are small: a few MB per object type. The real cost is retained size, because each ServerResponse holds buffers, sockets, headers, and the full HTTP state machine.

I wrote scripts to trace retainer chains from the raw heapsnapshot JSON. Here is the chain, layer by layer:

Layer 1: What directly references ServerResponse?

1,110x: object:i [property:destination] -> ServerResponse
1,110x: object:i [property:_res]        -> ServerResponse

object:i is a minified class — Next.js's response wrapper (the unminified version showed up as NodeNextResponse for 39 of them). Every request creates one, and it holds the native ServerResponse via destination and _res.
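For readers who want to run the same kind of "who points at X" query, here is a minimal sketch of a direct-retainer count over a parsed .heapsnapshot file. It is not the script used in this investigation. It assumes the standard V8 layout, with flat `nodes` and `edges` arrays described by `snapshot.meta`, and the function name `directRetainers` is made up for illustration.

```javascript
// Count which (retainer, edge) pairs point directly at nodes named `targetName`.
// `heap` is the parsed .heapsnapshot JSON, for example:
//   const heap = JSON.parse(fs.readFileSync('heap.heapsnapshot', 'utf8'));
function directRetainers(heap, targetName) {
  const { node_fields, node_types, edge_fields, edge_types } = heap.snapshot.meta;
  const nodeSize = node_fields.length;
  const edgeSize = edge_fields.length;
  const nType = node_fields.indexOf('type');
  const nName = node_fields.indexOf('name');
  const nEdgeCount = node_fields.indexOf('edge_count');
  const eType = edge_fields.indexOf('type');
  const eName = edge_fields.indexOf('name_or_index');
  const eTo = edge_fields.indexOf('to_node');
  const { nodes, edges, strings } = heap;

  const counts = new Map();
  let edgePos = 0; // edges are stored in the same order as their owning nodes
  for (let n = 0; n < nodes.length; n += nodeSize) {
    const edgeCount = nodes[n + nEdgeCount];
    for (let i = 0; i < edgeCount; i++, edgePos += edgeSize) {
      const to = edges[edgePos + eTo]; // offset of the target in `nodes`
      if (strings[nodes[to + nName]] !== targetName) continue;
      const edgeType = edge_types[0][edges[edgePos + eType]];
      const rawName = edges[edgePos + eName];
      // Element and hidden edges carry a numeric index, the rest a string id
      const edgeName =
        edgeType === 'element' || edgeType === 'hidden' ? String(rawName) : strings[rawName];
      const from = `${node_types[0][nodes[n + nType]]}:${strings[nodes[n + nName]]}`;
      const key = `${from} [${edgeType}:${edgeName}] -> ${targetName}`;
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  // Most frequent retainer patterns first
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Running the same query again on the retainer's name walks one layer further up the chain. Very large snapshots may not fit in a single JSON.parse call and would need a streaming parser instead.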

Layer 2: What retains the response wrapper?

1,110x: system / Context [context:N] -> object:i (response wrapper)

Every wrapper is captured by a closure as variable N. That closure also captures:

a = IncomingMessage
b = ServerResponse
e = route path (the string /meta-ads-mcp/route)
I = HTTP method (POST)

Layer 3: What retains those closures?

1,110x: closure:onClose
1,110x: closure:onInstrumentationRequestError

These are held by object:tO — another minified Next.js class. And tO is held by...

Layer 4: The root

1,110x: object:Object [property:afterContext] -> object:tO

The afterContext property. Each of those objects is a Next.js RequestStore — the per-request state bag that Next.js creates via AsyncLocalStorage.run(). It has properties like page, route, incrementalCache, buildId, afterContext.

Layer 5: What retains the RequestStore?

6,995x: object:Promise [property:<symbol kResourceStore>]
2,250x: object:Timeout [property:<symbol kResourceStore>]

kResourceStore is a Node.js internal symbol. When you call AsyncLocalStorage.run(), every async resource created within that context (Promises, Timeouts, TCP handles) gets the store attached via this symbol. It is how AsyncLocalStorage propagates context across await boundaries.

13,256 of the 13,572 total Promises in the process — 98% — were retaining RequestStores.
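The propagation behaviour is easy to see in isolation. The sketch below uses only the public AsyncLocalStorage API; the store shape and the `handleRequest` name are stand-ins, not Next.js internals. A timer created inside run() still sees the request's store when it fires later, which means the runtime has to keep that store reachable for as long as the timer exists.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const requestContext = new AsyncLocalStorage();

// Simulates a request handler that schedules work inside the request's context.
function handleRequest(requestId) {
  return new Promise((resolve) => {
    // Stand-in for the per-request store, including something large
    const store = { requestId, response: Buffer.alloc(64 * 1024) };
    requestContext.run(store, () => {
      // Created inside run(): the callback sees the same store when it fires
      setTimeout(() => resolve(requestContext.getStore()), 10);
    });
  });
}

handleRequest(42).then((store) => {
  console.log(store.requestId); // the timer callback ran with request 42's store
});
```

A one-shot setTimeout releases the store after it fires. A setInterval that is never cleared holds it for the life of the process.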

Root cause 1: Per-request database clients with auto-refresh timers

Update (2026-03-08): We found the root cause. It was our application code, not Next.js.

The initial heap analysis pointed at AsyncLocalStorage and afterContext as the retention mechanism, but that was the how, not the why. The question was: what creates Timeouts inside the AsyncLocalStorage context that never get cleaned up?

The answer: per-request database clients.

Our route handler created a new database client for every incoming request. The client library starts an auto-refresh timer by default: a setInterval that periodically refreshes auth tokens. In a normal long-lived application, this is fine. But when you create a client per request inside a Next.js route handler, each uncleared timer stays reachable from Node's timer list, so it acts as a GC root that retains the entire request context via AsyncLocalStorage's kResourceStore symbol.

Here is a simplified version of what the code looked like:

export async function POST(request: NextRequest) {
  // BAD: creates a new client (and auto-refresh timer) per request
  const db = createDatabaseClient(DB_URL, DB_KEY);

  const user = await db.auth.getUser(token);
  const data = await db.from('accounts').select('*').eq('user_id', user.id);

  // ... handle request and return response
}

The createDatabaseClient() call with default options starts a background timer. That timer is an async resource created inside AsyncLocalStorage.run(), so Node.js attaches the RequestStore to it via kResourceStore. The timer never fires meaningfully (these are service-role clients with static API keys — there is nothing to refresh), but it keeps the entire chain alive:

Timeout (auto-refresh timer)
  → kResourceStore → RequestStore
    → afterContext → onClose closure
      → Context → ServerResponse + IncomingMessage + Socket

The heap confirmed this: 2,272 database auth client instances, each with an autoRefreshTicker Timeout. 937 leaked ServerResponse/IncomingMessage pairs. The count of leaked auth clients tracked 1:1 with leaked ServerResponses.
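The accumulation itself can be illustrated without any real database library. In this sketch `createFakeClient` is a hypothetical stand-in for a client that starts an auto-refresh interval by default. It only counts timers and does not model the actual library or the AsyncLocalStorage retention.

```javascript
let timersStarted = 0;

// Hypothetical stand-in for a client library with an auto-refresh timer.
function createFakeClient({ autoRefreshToken = true } = {}) {
  if (autoRefreshToken) {
    const ticker = setInterval(() => {}, 30_000); // never cleared, like the leaked timers
    ticker.unref(); // only so this demo can exit; the interval stays scheduled
    timersStarted++;
  }
  return { getUser: async () => ({ id: 'user-1' }) };
}

// Run `requestCount` simulated requests and report how many timers they started.
function simulate(requestCount, getClient) {
  const before = timersStarted;
  for (let i = 0; i < requestCount; i++) getClient();
  return timersStarted - before;
}

// Per-request clients: one new interval per request.
console.log(simulate(100, () => createFakeClient())); // 100

// Shared client with auto-refresh disabled: no intervals at all.
const shared = createFakeClient({ autoRefreshToken: false });
console.log(simulate(100, () => shared)); // 0
```

In the real process each of those intervals was created inside a request's AsyncLocalStorage context, so every one of them also pinned that request's store.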

The fix: Replace per-request clients with a module-level singleton, created lazily on first use and with auto-refresh disabled:

let dbClient: DatabaseClient | null = null;

function getServiceClient(): DatabaseClient {
  if (!dbClient) {
    dbClient = createDatabaseClient(DB_URL, DB_KEY, {
      auth: {
        persistSession: false,
        autoRefreshToken: false, // no timer, no leak
      },
    });
  }
  return dbClient;
}

export async function POST(request: NextRequest) {
  const db = getServiceClient(); // reuses singleton, no timer created
  // ...
}

These clients use a static service-role API key — they never need token refresh. The autoRefreshToken: true default was always a no-op that just leaked memory.

We deployed this fix and confirmed the autoRefreshTicker retention path was eliminated. But memory kept growing. That led to Finding 2.

Root cause 2: TCP connection leak to the Python backend

After fixing the database client leak, monitoring showed old-space growth continuing. But a new pattern emerged: TCP ESTABLISHED connections to the Python backend were growing at ~88 per hour per worker and never closing.

The evidence

Three hours after restart, a single worker had 291 leaked TCP connections to the backend. Both workers combined: 549 connections. Each connection had handled exactly one HTTP request (data_segs_out:1) and then stayed open indefinitely with TCP keepalive probes active.

# 549 ESTABLISHED connections to Python backend, never closed
$ ss -tnp | grep -c '9000\|9001\|9002'
549
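A substring grep also matches any line where those digits appear elsewhere, such as an ephemeral local port like 59000 or the server-side half of a localhost connection. As a stricter alternative, here is a small sketch that counts only ESTABLISHED sockets whose peer port is a backend port. The function name is made up, and the sample text is synthetic rather than real output from the servers.

```javascript
// Count ESTABLISHED connections whose peer port is one of `ports`,
// given the text output of `ss -tn` (or `ss -tnp`).
function countEstablishedByPeerPort(ssOutput, ports) {
  const wanted = new Set(ports.map(String));
  const counts = Object.fromEntries(ports.map((p) => [p, 0]));
  for (const line of ssOutput.split('\n')) {
    const cols = line.trim().split(/\s+/);
    if (cols[0] !== 'ESTAB' || cols.length < 5) continue; // skips the header too
    const peer = cols[4]; // "address:port"
    const port = peer.slice(peer.lastIndexOf(':') + 1);
    if (wanted.has(port)) counts[port] += 1;
  }
  return counts;
}

// Synthetic sample, not real output from the servers.
const sample = `State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      127.0.0.1:59000     127.0.0.1:9000
ESTAB  0      0      127.0.0.1:41234     127.0.0.1:9001
ESTAB  0      0      127.0.0.1:9000      127.0.0.1:59000
ESTAB  0      0      10.0.0.5:22         10.0.0.9:50122`;

console.log(countEstablishedByPeerPort(sample, [9000, 9001, 9002]));
```

On this sample the substring grep would count three lines, while the peer-port count reports two client-side connections.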

Why it happened

Our route handler uses Node.js's built-in fetch() (undici under the hood) to proxy requests to a Python backend running on localhost. The backend uses HTTP/1.1 keep-alive by default. Undici's connection pool should reuse idle connections and close them after a timeout, but connections were never being returned to the pool or cleaned up after use.

Each MCP request created a new TCP connection. The response body was fully consumed via response.arrayBuffer(), but the underlying socket stayed ESTABLISHED. At ~88 new connections per hour per worker, connections accumulated linearly until PM2 restarted the process.

Memory impact

Each leaked TCP connection can hold kernel socket buffers (~128-256KB). With 549 connections, that adds up to roughly 70-140MB, which likely explains much of the gap between heapUsed and RSS we had observed from the start.

The fix

We replaced the default global fetch dispatcher with a dedicated undici Agent that has a bounded connection pool and explicit idle timeouts:

const { Agent: UndiciAgent } = require('undici');

const backendDispatcher = new UndiciAgent({
  keepAliveTimeout: 10_000, // close idle connections after 10s
  keepAliveMaxTimeout: 60_000, // upper bound when the server sends a keep-alive hint
  connections: 10, // max 10 connections per origin
  pipelining: 1,
});

// All proxy fetch calls use the bounded dispatcher
const response = await fetch(targetUrl, {
  ...requestOptions,
  dispatcher: backendDispatcher,
});

This caps backend connections from each worker at 30 (10 per origin x 3 backend instances) instead of letting them grow without bound.

Post-deploy verification: Fresh workers after the fix had only 8 backend connections (vs. 549 pre-fix at the same age). Connection counts were bounded and stable.

Root cause 3: More per-request database clients in auth middleware

Update (2026-03-12): Old-space growth continued after Findings 1 and 2. A third source of per-request database clients was found in the authentication middleware.

After deploying both fixes, we took another heap snapshot. The sawtooth growth was slower but still present. The heap diff told the same story as Finding 1:

Indicator	Baseline (2min)	+25min	Growth
ServerResponse	289	467	+178
IncomingMessage	289	467	+178
DB Auth Client	291	472	+181
Timeout	636	1014	+378

The DB auth client count tracked ServerResponse almost 1:1 (+181 vs. +178): the same autoRefreshToken timer pattern.

The authentication middleware had been missed in the original singleton fix. Both the JWT validation path and the API token validation path created per-request database clients with auto-refresh enabled:

async function authenticateRequest(request: NextRequest) {
  // BAD: two new clients per request, each with an auto-refresh timer
  const userClient = createDatabaseClient(DB_URL, DB_KEY);
  const serviceClient = createDatabaseClient(DB_URL, DB_SERVICE_KEY);

  const { data } = await userClient.auth.getUser(token);
  const account = await serviceClient.from('accounts').select('*')...;
  // ...
}

Every MCP request passed through this middleware before reaching the route handler. Two clients per request, two timers per request, each retaining the full AsyncLocalStorage context.

The fix: Same pattern as Finding 1 — replace with the singleton for service-role operations, and for per-user clients (needed for row-level security), explicitly disable auto-refresh:

async function authenticateRequest(request: NextRequest) {
  const serviceClient = getServiceClient(); // singleton, no timer

  // Per-user client still needed for RLS, but no auto-refresh
  const userClient = createDatabaseClient(DB_URL, DB_KEY, {
    auth: { autoRefreshToken: false, persistSession: false },
  });
  // ...
}

The pattern

All three root causes share the same underlying mechanism:

Next.js wraps every request in AsyncLocalStorage.run(), creating a RequestStore
Any async resource (Promise, Timeout, TCP handle) created during the request gets the store attached via kResourceStore
The RequestStore holds afterContext, which holds closures over the full request/response context
If any of those async resources outlive the request, the entire chain stays alive

The database client's auto-refresh timer was the specific trigger. It created a setInterval inside the AsyncLocalStorage context that never fired meaningfully but never got cleared either. Each timer kept ~85 closures, 10 URL objects, 3 Timeouts, and a full ServerResponse alive indefinitely.

Armin Ronacher wrote about setTimeout and AsyncLocalStorage causing memory leaks in Node.js in 2024. The mechanism he describes — kResourceStore on Timeout objects retaining large context — is exactly what we hit.

Related issues in the Next.js repository:

#64212: Memory leak with Next.js's global fetch
#85914: Memory leak with fetch + standalone output (Next 16.0.1)
#79588: High memory usage in production builds (Next 14 and 15)

Lessons

Be careful with per-request client instantiation in Next.js route handlers. Any library that starts background timers or long-lived async work on initialization will leak if you create an instance per request. Database clients, queue consumers, monitoring SDKs — anything with a heartbeat or auto-refresh mechanism is a candidate.

The leak is not in Next.js itself. Our initial theory pointed at afterContext and the framework's AsyncLocalStorage usage, but the framework was working as designed. The problem was application code creating long-lived timers inside a request-scoped context. Next.js just made the consequence worse because kResourceStore propagation meant one leaked timer retained the entire request graph.

Watch for the RSS/heapUsed gap. Our heap was growing, but RSS grew much faster. The difference was leaked TCP connections holding kernel socket buffers — a completely separate leak that the heap profiler could not see. If your RSS is 2x your heapUsed, look at ss and lsof, not just V8.
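A periodic check for this gap can be sketched with process.memoryUsage(). The helper name `rssHeapGap` and the default threshold of 2 (taken from the rule of thumb above) are assumptions, not part of our production monitoring.

```javascript
// Hypothetical helper: flag a process whose RSS is far above its V8 heap usage.
function rssHeapGap({ rss, heapUsed }, ratioThreshold = 2) {
  const ratio = rss / heapUsed;
  return {
    ratio: Number(ratio.toFixed(2)),
    gapMB: Math.round((rss - heapUsed) / 2 ** 20),
    suspicious: ratio >= ratioThreshold,
  };
}

// Inside a long-running server this could be logged on an interval.
console.log(rssHeapGap(process.memoryUsage()));
```

Small, freshly started processes routinely exceed a ratio of 2 because of fixed runtime overhead. The signal only becomes meaningful on large, long-running workers, and the trend over time matters more than any single reading.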

Current status

Finding	Status	Fix
Per-request database clients in route handler	Fixed	Singleton with auto-refresh disabled
TCP connection leak to Python backend	Fixed	Bounded undici Agent with idle timeouts
Per-request database clients in auth middleware	Fixed	Singleton + auto-refresh disabled on per-user clients

All three fixes are deployed. The sawtooth is gone:

Heap Memory Used by Space after fixes — old_space flat after 3/11, no more sawtooth

Compare the right side of the graph (after 3/11) to the left (before fixes). The repeated climb-and-crash cycles have stopped. Old-space is flat.

We also reduced max_memory_restart in PM2 to 2000MB and --max-old-space-size from 3072 to 1536 per instance as safety nets.
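In a PM2 ecosystem file, those two safety nets would look roughly like this. It is a sketch rather than our actual configuration: the app name and script path are placeholders, and only max_memory_restart and the --max-old-space-size value come from the text above.

```javascript
// ecosystem.config.js (sketch; app name and script path are hypothetical)
module.exports = {
  apps: [
    {
      name: 'web',
      script: 'server.js',
      exec_mode: 'cluster',
      instances: 2,
      // PM2 restarts a worker once its memory exceeds this limit
      max_memory_restart: '2000M',
      // V8 old-space cap per instance, in MB
      node_args: '--max-old-space-size=1536',
    },
  ],
};
```

Keeping the V8 cap below the PM2 limit means a runaway heap hits V8's own ceiling before the process can approach the restart threshold.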

Heap Space Available Size — available space in old_space shrinks as objects accumulate

If you have seen this in your own Next.js deployment, or know of a fix, reach out at yves@pipeboard.co or on Discord, or connect with Yves on LinkedIn.