Batch AI Processing: Why Multithreading is the Wrong Instinct
When developers first encounter a large-scale AI classification job — say, two million records that each need to be sent to an LLM for analysis — the instinct is immediately familiar: spin up threads, parallelise the work, saturate the API. It’s the same pattern that works for database processing, file I/O, HTTP scraping. More threads, more throughput.
With LLM APIs, that instinct leads you straight into a wall. And the wall has a name: TPM.
The Problem with Multithreading LLM Calls
Most LLM APIs — OpenAI included — impose a Tokens Per Minute (TPM) limit. This is a rolling window, not a per-request limit. Every token you send in a prompt, and every token the model returns, counts against it.
The naive multithreaded approach burns through this budget in a way that’s both wasteful and hard to control:
The system prompt repeats on every request. If your prompt is 700 tokens and you’re running 20 threads each firing one request per second, you’re spending 14,000 tokens per second on prompt overhead alone — before the model has classified a single record. At that rate, a 200,000 TPM budget is exhausted in roughly 14 seconds.
Burst behaviour triggers rate limits unpredictably. The TPM limit is a rolling window. Twenty threads firing simultaneously create a spike that can exceed the per-minute budget in seconds, even if your average rate would be well within limits. The API returns 429 errors, your retry logic kicks in, those retries themselves consume tokens, and the situation compounds.
Thread count is a blunt instrument. Dialling concurrency up and down doesn’t map cleanly onto token consumption, because request latency varies: a request that completes in 500ms and one that takes 1,500ms can consume very different token counts, yet each occupies a thread slot for its full duration.
The Better Model: Semantic Batching
The insight that changes everything is this: the system prompt is a fixed overhead, and you should amortise it across as many classifications as possible per API call.
Instead of:
```
Thread 1: [system prompt 700 tokens] + [address 1: 15 tokens] → [result: 15 tokens]
Thread 2: [system prompt 700 tokens] + [address 2: 15 tokens] → [result: 15 tokens]
...
× 20 threads

Total: 14,000 tokens for 20 classifications
```
You send:
```
[system prompt 700 tokens] + [addresses 1-20: 300 tokens] → [results 1-20: 100 tokens]

Total: 1,100 tokens for 20 classifications
```
That’s a 12× reduction in token consumption for the same work. Suddenly your 200,000 TPM budget — which could only sustain ~270 single-record requests per minute — supports ~3,600 classifications per minute. No extra threads needed.
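The batched request above can be sketched as a small builder that packs many records behind one system prompt. The `Item` record and the message format are illustrative assumptions, not a specific SDK’s API:

```csharp
// Sketch: amortise one system prompt across a batch of records.
// The Item type and "id=...: text" line format are assumptions for illustration.
using System.Collections.Generic;
using System.Text;

public record Item(long Id, string Text);

public static class BatchBuilder
{
    // Builds a single user message carrying every record in the batch,
    // so the 700-token system prompt is sent once per batch, not once per record.
    public static string BuildUserMessage(IEnumerable<Item> batch)
    {
        var sb = new StringBuilder();
        foreach (var item in batch)
            sb.AppendLine($"id={item.Id}: {item.Text}");
        return sb.ToString();
    }
}
```

The resulting string becomes the user message of one chat completion call, alongside the fixed system prompt.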
Key Implementation Details
1. Include an ID in Both Request and Response
The most important correctness detail in batch processing is this: never rely on positional alignment.
If you send 20 addresses and ask the model to return 20 results, it might return 19. Now you don’t know which one it dropped. If you’re matching by position and, say, item 7 was the one dropped, every record from item 7 onwards is silently misclassified.
The fix is to include a unique identifier in both directions:
User message:

```
id=548033: product X
id=548034: product Y
...
```

System prompt format instruction:

```
Reply ONLY with a JSON array. Format: [{"id":548033,"c":"E"}, ...]
```
Now you build a dictionary from the response keyed on id, and match each input item explicitly. A missing id means that specific record gets skipped and retried on the next run. Everything else classifies correctly regardless of what the model dropped.
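A minimal sketch of that id-keyed matching, assuming the JSON shape from the prompt above and `System.Text.Json` as the parser (an assumption about the host application’s stack):

```csharp
// Sketch: match batch results by id, never by position.
// Assumes responses shaped like [{"id":548033,"c":"E"}, ...].
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class ResultMatcher
{
    // Returns the results keyed on id. Any input id absent from this map
    // is simply left NULL in the database and retried on the next run.
    public static Dictionary<long, string> Parse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.EnumerateArray()
            .ToDictionary(
                e => e.GetProperty("id").GetInt64(),
                e => e.GetProperty("c").GetString()!);
    }
}
```

If the model returns 19 results for 20 inputs, 19 records classify correctly and exactly one stays NULL — no silent positional drift.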
2. Resolve Labels Locally
The model doesn’t need to return the full label text. A label like "Prime Product" costs several tokens on every response item; a single letter costs one.
Keep a static dictionary in your code:
```csharp
private static readonly Dictionary<string, string> Labels = new()
{
    { "A", "Prime Product" },
    { "B", "Budget Product" },
    // ...
};
```
The model returns "c":"A", you look up the label locally. This also eliminates a class of hallucination errors where the model invents a label name slightly different from your taxonomy.
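The lookup can also reject unknown codes outright, which is how that class of hallucination gets caught. A sketch, with the resolver name being an assumption:

```csharp
// Sketch: resolve the model's single-letter code locally, and refuse
// anything outside the taxonomy instead of persisting an invented label.
using System.Collections.Generic;

public static class LabelResolver
{
    private static readonly Dictionary<string, string> Labels = new()
    {
        { "A", "Prime Product" },
        { "B", "Budget Product" },
        // ...
    };

    // Returns null for unknown codes so the record is skipped and retried
    // next run, rather than a hallucinated category being written to the DB.
    public static string? Resolve(string code) =>
        Labels.TryGetValue(code, out var label) ? label : null;
}
```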
Note: even "category" vs "c" matters at scale. If "category" costs three tokens where "c" costs one, then across 100,000 batch calls that’s 200,000 tokens saved — a small saving, but a free one.
3. Track TPM with a Rolling Window, Not Concurrency
Rather than trying to infer safe concurrency from trial and error, measure what you’re actually consuming and throttle directly on that signal.
```csharp
// Rolling window of (timestamp, tokens) covering the last 60 seconds
var tokenWindow = new Queue<(DateTime At, int Tokens)>();

// On each successful response, record tokens used with a timestamp
tokenWindow.Enqueue((DateTime.UtcNow, inputTokens + outputTokens));

// Before each request, prune entries older than 60 seconds and sum the rest
var cutoff = DateTime.UtcNow.AddSeconds(-60);
while (tokenWindow.Count > 0 && tokenWindow.Peek().At < cutoff)
    tokenWindow.Dequeue();
long tpmUsed = tokenWindow.Sum(x => (long)x.Tokens);

// Throttle graduated to usage
if (tpmUsed > tpmLimit * 0.98) Thread.Sleep(2000);
else if (tpmUsed > tpmLimit * 0.95) Thread.Sleep(800);
else if (tpmUsed > tpmLimit * 0.85) Thread.Sleep(300);
```
This gives you automatic, self-correcting throttling that responds to real consumption rather than guessing from thread counts. If a batch of records happens to have longer addresses, the window fills faster and the delay kicks in sooner. No manual tuning required.
4. Resumability via Cursor Pagination
For a job that takes hours or days, stopping and restarting must be safe and cheap. The key is two things working together:
Write results immediately after each batch, not at the end of a page. If you crash mid-page, you’ve lost one batch (20 records), not a thousand.
Use a NULL-check filter combined with cursor pagination. The query for unclassified records looks like:
```sql
WHERE segment_category IS NULL AND id > {lastId}
ORDER BY id
LIMIT 1000
```
On restart, lastId resets to 0, but the IS NULL filter automatically skips everything already classified. The cursor (id > lastId) keeps the query fast on large tables — OFFSET pagination slows to a crawl at millions of rows because the database still has to scan all preceding rows to find the offset position.
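The two pieces combine into a resumable outer loop. This sketch simulates the table as an in-memory list; the `Row` type, the `classify` callback (a stand-in for the real batched API call), and the page/batch sizes are all illustrative assumptions:

```csharp
// Sketch: cursor pagination + NULL filter + write-per-batch, over an
// in-memory stand-in for the real table. Names are assumptions.
using System;
using System.Collections.Generic;
using System.Linq;

public record Row(long Id, string Text) { public string? Category { get; set; } }

public static class Classifier
{
    // Simulates: WHERE Category IS NULL AND Id > lastId ORDER BY Id LIMIT n
    public static List<Row> LoadPage(List<Row> table, long lastId, int limit) =>
        table.Where(r => r.Category == null && r.Id > lastId)
             .OrderBy(r => r.Id)
             .Take(limit)
             .ToList();

    public static void Run(List<Row> table, Func<Row[], Dictionary<long, string>> classify)
    {
        long lastId = 0; // safe to reset on restart: the NULL filter skips finished rows
        while (true)
        {
            var page = LoadPage(table, lastId, 1000);
            if (page.Count == 0) break;

            foreach (var batch in page.Chunk(20))
            {
                var results = classify(batch);       // one API call per 20 records
                foreach (var row in batch)
                    if (results.TryGetValue(row.Id, out var c))
                        row.Category = c;            // write immediately, not per page
            }

            lastId = page[^1].Id; // cursor advance: no OFFSET scan on restart or next page
        }
    }
}
```

Records the model drops stay NULL and are picked up by the next `Run`, exactly because the cursor never depends on every row having been classified.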
5. Handle Partial Batches Gracefully with Skip vs Error
Not all failures are equal. Distinguish between:
- Error: something went wrong that warrants logging (HTTP 500, persistent 429 after retries, DB connection failure). These need attention.
- Skip: the record wasn’t returned in this batch response. Leave it NULL in the database; it will be picked up automatically on the next run. No log noise needed.
This distinction keeps your error output meaningful. If every missing batch item logs as an error, a run with 0.1% skip rate produces thousands of error lines that mask real problems.
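The distinction can be made explicit in code. A minimal sketch — the enum and the rule function are illustrative names, not from any framework:

```csharp
// Sketch: per-record outcome classification so logs stay meaningful.
public enum Outcome { Classified, Skipped, Errored }

public static class OutcomeRules
{
    // A record missing from an otherwise-successful batch response is a
    // Skip: it stays NULL and is retried next run, with no log line.
    // Transport or persistence failures are Errors and should be logged.
    public static Outcome ForRecord(bool inResponse, bool transportFailed) =>
        transportFailed ? Outcome.Errored
        : inResponse    ? Outcome.Classified
        :                 Outcome.Skipped;
}
```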
The Result
What started as a job estimated at 16–67 days with a naive multithreaded approach settled to around 7 hours using semantic batching — processing two million records through a rate-limited API without a single configuration change to the API account.
The throughput improvement didn’t come from more concurrency. It came from being smarter about what gets sent in each request.
The general principle applies beyond LLM classification: whenever you have a fixed overhead per API call (authentication, context, schema), the correct optimisation is to amortise that overhead across as much work as possible per call, not to fire more calls in parallel.
Summary of Patterns
| Pattern | Naive approach | Better approach |
|---|---|---|
| Throughput | More threads | Larger batches |
| Rate limiting | Catch 429, retry | Track TPM rolling window, throttle proactively |
| Result matching | Positional array index | ID-keyed dictionary |
| Label resolution | Ask model for full text | Return code, resolve locally |
| Resumability | Track page offset | NULL-check filter + cursor pagination |
| Failure handling | All failures are errors | Skip vs Error distinction |
| DB resilience | Crash on connection drop | Exponential backoff retry |
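The table’s last row refers to retrying dropped database connections with exponential backoff. A minimal sketch — attempt count and delays are illustrative defaults:

```csharp
// Sketch: exponential-backoff retry for transient failures (e.g. a dropped
// DB connection). Delay doubles per failed attempt: 1×, 2×, 4×, 8× base.
using System;
using System.Threading;

public static class Retry
{
    public static T WithBackoff<T>(Func<T> action, int maxAttempts = 5, int baseDelayMs = 1000)
    {
        for (int attempt = 1; ; attempt++)
        {
            try { return action(); }
            catch (Exception) when (attempt < maxAttempts)
            {
                Thread.Sleep(baseDelayMs * (1 << (attempt - 1)));
            }
        }
    }
}
```

The exception filter lets the final attempt’s failure propagate, so a persistent outage still surfaces as an Error rather than looping forever.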
The instinct to parallelise is correct in principle — you want to keep the API busy. But with token-limited LLM APIs, the right parallelism is within a single request, not across many simultaneous ones.