Batch AI Processing: Why Multithreading is the Wrong Instinct
When developers first encounter a large-scale AI classification job — say, two million records that each need to be sent to an LLM for analysis — the instinct is immediately familiar: spin up threads, parallelise the work, saturate the API. It’s the same pattern that works for database processing, file I/O, HTTP scraping. More threads, more throughput.
With LLM APIs, that instinct leads you straight into a wall. And the wall has a name: TPM.
The Problem with Multithreading LLM Calls
Most LLM APIs — OpenAI included — impose a Tokens Per Minute (TPM) limit. This is a rolling window, not a per-request limit. Every token you send in a prompt, and every token the model returns, counts against it.
The naive multithreaded approach burns through this budget in a way that’s both wasteful and hard to control:
The system prompt repeats on every request. If your prompt is 700 tokens and you’re running 20 threads each firing one request per second, you’re spending 14,000 tokens per second on prompt overhead alone — before the model has classified a single record. At that rate, a 200,000 TPM budget is exhausted in roughly 14 seconds.
Burst behaviour triggers rate limits unpredictably. The TPM limit is a rolling window. Twenty threads firing simultaneously create a spike that can exceed the per-minute budget in seconds, even if your average rate would be well within limits. The API returns 429 errors, your retry logic kicks in, those retries themselves consume tokens, and the situation compounds.
Thread count is a blunt instrument. Dialling concurrency up and down doesn’t map cleanly onto token consumption, because request latency varies: a request that completes in 500ms and one that takes 1,500ms can consume very different token counts, yet each occupies a thread slot for its full duration.
The Better Model: Semantic Batching
The insight that changes everything is this: the system prompt is a fixed overhead, and you should amortise it across as many classifications as possible per API call.
Instead of:
```
Thread 1: [system prompt 700 tokens] + [address 1: 15 tokens] → [result: 15 tokens]
Thread 2: [system prompt 700 tokens] + [address 2: 15 tokens] → [result: 15 tokens]
...
× 20 threads

Total: 14,000 tokens for 20 classifications
```
You send:
```
[system prompt 700 tokens] + [addresses 1-20: 300 tokens] → [results 1-20: 100 tokens]

Total: 1,100 tokens for 20 classifications
```
That’s a 12× reduction in token consumption for the same work. Suddenly your 200,000 TPM budget — which could only sustain ~270 single-record requests per minute — supports ~3,600 classifications per minute. No extra threads needed.
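The batched request above can be sketched as a small builder that packs many records behind one system prompt. The `Item` record and the message format are illustrative assumptions, not a specific SDK’s API:

```csharp
// Sketch: amortise one system prompt across a batch of records.
// The Item type and "id=...: text" line format are assumptions for illustration.
using System.Collections.Generic;
using System.Text;

public record Item(long Id, string Text);

public static class BatchBuilder
{
    // Builds a single user message carrying every record in the batch,
    // so the 700-token system prompt is sent once per batch, not once per record.
    public static string BuildUserMessage(IEnumerable<Item> batch)
    {
        var sb = new StringBuilder();
        foreach (var item in batch)
            sb.AppendLine($"id={item.Id}: {item.Text}");
        return sb.ToString();
    }
}
```

The resulting string becomes the user message of one chat completion call, alongside the fixed system prompt.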
Key Implementation Details
1. Include an ID in Both Request and Response
The most important correctness detail in batch processing is this: never rely on positional alignment.
If you send 20 addresses and ask the model to return 20 results, it might return 19. Now you don’t know which one it dropped. If you’re matching by position and, say, item 7 was the one dropped, every record from item 7 onwards is silently misclassified.
The fix is to include a unique identifier in both directions:
User message:

```
id=548033: product X
id=548034: product Y
...
```

System prompt format instruction:

```
Reply ONLY with a JSON array. Format: [{"id":548033,"c":"E"}, ...]
```
Now you build a dictionary from the response keyed on id, and match each input item explicitly. A missing id means that specific record gets skipped and retried on the next run. Everything else classifies correctly regardless of what the model dropped.
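A minimal sketch of that id-keyed matching, assuming the JSON shape from the prompt above and `System.Text.Json` as the parser (an assumption about the host application’s stack):

```csharp
// Sketch: match batch results by id, never by position.
// Assumes responses shaped like [{"id":548033,"c":"E"}, ...].
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class ResultMatcher
{
    // Returns the results keyed on id. Any input id absent from this map
    // is simply left NULL in the database and retried on the next run.
    public static Dictionary<long, string> Parse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.EnumerateArray()
            .ToDictionary(
                e => e.GetProperty("id").GetInt64(),
                e => e.GetProperty("c").GetString()!);
    }
}
```

If the model returns 19 results for 20 inputs, 19 records classify correctly and exactly one stays NULL — no silent positional drift.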
2. Resolve Labels Locally
The model doesn’t need to return the full label text. A label like "Prime Product" costs several tokens on every response item; a single letter costs one.
Keep a static dictionary in your code:
```csharp
private static readonly Dictionary<string, string> Labels = new()
{
    { "A", "Prime Product" },
    { "B", "Budget Product" },
    // ...
};
```
The model returns "c":"A", you look up the label locally. This also eliminates a class of hallucination errors where the model invents a label name slightly different from your taxonomy.
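The lookup can also reject unknown codes outright, which is how that class of hallucination gets caught. A sketch, with the resolver name being an assumption:

```csharp
// Sketch: resolve the model's single-letter code locally, and refuse
// anything outside the taxonomy instead of persisting an invented label.
using System.Collections.Generic;

public static class LabelResolver
{
    private static readonly Dictionary<string, string> Labels = new()
    {
        { "A", "Prime Product" },
        { "B", "Budget Product" },
        // ...
    };

    // Returns null for unknown codes so the record is skipped and retried
    // next run, rather than a hallucinated category being written to the DB.
    public static string? Resolve(string code) =>
        Labels.TryGetValue(code, out var label) ? label : null;
}
```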
Note: even "category" vs "c" matters at scale. If "category" costs three tokens where "c" costs one, then across 100,000 batch calls that’s 200,000 tokens saved — a small saving, but a free one.
3. Track TPM with a Rolling Window, Not Concurrency
Rather than trying to infer safe concurrency from trial and error, measure what you’re actually consuming and throttle directly on that signal.
```csharp
// Rolling window of (timestamp, tokens) covering the last 60 seconds
var tokenWindow = new Queue<(DateTime At, int Tokens)>();

// On each successful response, record tokens used with a timestamp
tokenWindow.Enqueue((DateTime.UtcNow, inputTokens + outputTokens));

// Before each request, prune entries older than 60 seconds and sum the rest
var cutoff = DateTime.UtcNow.AddSeconds(-60);
while (tokenWindow.Count > 0 && tokenWindow.Peek().At < cutoff)
    tokenWindow.Dequeue();
long tpmUsed = tokenWindow.Sum(x => (long)x.Tokens);

// Throttle graduated to usage
if (tpmUsed > tpmLimit * 0.98) Thread.Sleep(2000);
else if (tpmUsed > tpmLimit * 0.95) Thread.Sleep(800);
else if (tpmUsed > tpmLimit * 0.85) Thread.Sleep(300);
```
This gives you automatic, self-correcting throttling that responds to real consumption rather than guessing from thread counts. If a batch of records happens to have longer addresses, the window fills faster and the delay kicks in sooner. No manual tuning required.
4. Resumability via Cursor Pagination
For a job that takes hours or days, stopping and restarting must be safe and cheap. The key is two things working together:
Write results immediately after each batch, not at the end of a page. If you crash mid-page, you’ve lost one batch (20 records), not a thousand.
Use a NULL-check filter combined with cursor pagination. The query for unclassified records looks like:
```sql
WHERE segment_category IS NULL AND id > {lastId}
ORDER BY id
LIMIT 1000
```
On restart, lastId resets to 0, but the IS NULL filter automatically skips everything already classified. The cursor (id > lastId) keeps the query fast on large tables — OFFSET pagination slows to a crawl at millions of rows because the database still has to scan all preceding rows to find the offset position.
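The two pieces combine into a resumable outer loop. This sketch simulates the table as an in-memory list; the `Row` type, the `classify` callback (a stand-in for the real batched API call), and the page/batch sizes are all illustrative assumptions:

```csharp
// Sketch: cursor pagination + NULL filter + write-per-batch, over an
// in-memory stand-in for the real table. Names are assumptions.
using System;
using System.Collections.Generic;
using System.Linq;

public record Row(long Id, string Text) { public string? Category { get; set; } }

public static class Classifier
{
    // Simulates: WHERE Category IS NULL AND Id > lastId ORDER BY Id LIMIT n
    public static List<Row> LoadPage(List<Row> table, long lastId, int limit) =>
        table.Where(r => r.Category == null && r.Id > lastId)
             .OrderBy(r => r.Id)
             .Take(limit)
             .ToList();

    public static void Run(List<Row> table, Func<Row[], Dictionary<long, string>> classify)
    {
        long lastId = 0; // safe to reset on restart: the NULL filter skips finished rows
        while (true)
        {
            var page = LoadPage(table, lastId, 1000);
            if (page.Count == 0) break;

            foreach (var batch in page.Chunk(20))
            {
                var results = classify(batch);       // one API call per 20 records
                foreach (var row in batch)
                    if (results.TryGetValue(row.Id, out var c))
                        row.Category = c;            // write immediately, not per page
            }

            lastId = page[^1].Id; // cursor advance: no OFFSET scan on restart or next page
        }
    }
}
```

Records the model drops stay NULL and are picked up by the next `Run`, exactly because the cursor never depends on every row having been classified.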
5. Handle Partial Batches Gracefully with Skip vs Error
Not all failures are equal. Distinguish between:
- Error: something went wrong that warrants logging (HTTP 500, persistent 429 after retries, DB connection failure). These need attention.
- Skip: the record wasn’t returned in this batch response. Leave it NULL in the database; it will be picked up automatically on the next run. No log noise needed.
This distinction keeps your error output meaningful. If every missing batch item logs as an error, a run with 0.1% skip rate produces thousands of error lines that mask real problems.
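The distinction can be made explicit in code. A minimal sketch — the enum and the rule function are illustrative names, not from any framework:

```csharp
// Sketch: per-record outcome classification so logs stay meaningful.
public enum Outcome { Classified, Skipped, Errored }

public static class OutcomeRules
{
    // A record missing from an otherwise-successful batch response is a
    // Skip: it stays NULL and is retried next run, with no log line.
    // Transport or persistence failures are Errors and should be logged.
    public static Outcome ForRecord(bool inResponse, bool transportFailed) =>
        transportFailed ? Outcome.Errored
        : inResponse    ? Outcome.Classified
        :                 Outcome.Skipped;
}
```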
The Result
What started as a job estimated at 16–67 days with a naive multithreaded approach settled to around 7 hours using semantic batching — processing two million records through a rate-limited API without a single configuration change to the API account.
The throughput improvement didn’t come from more concurrency. It came from being smarter about what gets sent in each request.
The general principle applies beyond LLM classification: whenever you have a fixed overhead per API call (authentication, context, schema), the correct optimisation is to amortise that overhead across as much work as possible per call, not to fire more calls in parallel.
Summary of Patterns
| Pattern | Naive approach | Better approach |
|---|---|---|
| Throughput | More threads | Larger batches |
| Rate limiting | Catch 429, retry | Track TPM rolling window, throttle proactively |
| Result matching | Positional array index | ID-keyed dictionary |
| Label resolution | Ask model for full text | Return code, resolve locally |
| Resumability | Track page offset | NULL-check filter + cursor pagination |
| Failure handling | All failures are errors | Skip vs Error distinction |
| DB resilience | Crash on connection drop | Exponential backoff retry |
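The table’s last row refers to retrying dropped database connections with exponential backoff. A minimal sketch — attempt count and delays are illustrative defaults:

```csharp
// Sketch: exponential-backoff retry for transient failures (e.g. a dropped
// DB connection). Delay doubles per failed attempt: 1×, 2×, 4×, 8× base.
using System;
using System.Threading;

public static class Retry
{
    public static T WithBackoff<T>(Func<T> action, int maxAttempts = 5, int baseDelayMs = 1000)
    {
        for (int attempt = 1; ; attempt++)
        {
            try { return action(); }
            catch (Exception) when (attempt < maxAttempts)
            {
                Thread.Sleep(baseDelayMs * (1 << (attempt - 1)));
            }
        }
    }
}
```

The exception filter lets the final attempt’s failure propagate, so a persistent outage still surfaces as an Error rather than looping forever.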
The instinct to parallelise is correct in principle — you want to keep the API busy. But with token-limited LLM APIs, the right parallelism is within a single request, not across many simultaneous ones.