Archive for April, 2026

Taiwan motorcycle plate lookup via #API

Our vehicle API network expands into Taiwan with full motorcycle registration lookups — make, age, engine size, and emissions test history, all in one call.


Taiwan has one of the highest motorcycle densities in the world, with over 14 million registered two-wheelers on its roads. Today, we’re pleased to announce that the /CheckTaiwan endpoint is live — bringing motorcycle registration data from Taiwan’s national vehicle database into our global API network.

What the endpoint returns

A single call to /CheckTaiwan with a Taiwanese plate number returns structured vehicle data covering identity, registration history, and emissions compliance records:

  • Make — Manufacturer name (Chinese & romanised)
  • Age / Registration year — Year of manufacture and license issue date
  • Engine size — Displacement in cc and engine cycle type
  • Inspection records — Full emissions test history with HC, CO, and CO₂ readings

Sample lookup: MWN-0076

Here’s what a real response looks like for a 2018 Kymco (光陽) motorcycle:

{
  "Description": "光陽",
  "RegistrationYear": "2018",
  "CarMake": { "CurrentTextValue": "光陽" },
  "EngineSize": "149",
  "ManufactureDate": "01/08/2018",
  "LicenseIssueDate": "05/09/2018",
  "EngineCycle": "四行程",
  "TestRecords": [
    {
      "LicensePlate": "MWN-0076",
      "InspectionType": "定期檢驗",
      "HC_ppm": "102",
      "CO_pct": "0.1",
      "CO2_pct": "14.8",
      "Result": "合格",
      "TestDate": "20240822"
    },
    {
      "LicensePlate": "MWN-0076",
      "InspectionType": "定期檢驗",
      "HC_ppm": "315",
      "CO_pct": "0.1",
      "CO2_pct": "14.9",
      "Result": "合格",
      "TestDate": "20240718"
    }
  ]
}

The TestRecords array is particularly valuable — it provides a complete emissions test history, with pass/fail status (合格 = pass), hydrocarbon and carbon monoxide readings, and the date and type of each inspection. This supports fleet compliance monitoring, insurance underwriting, and second-hand vehicle verification use cases.
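In C#, the response can be mapped onto plain types with System.Text.Json. A minimal sketch covering a few of the fields — the class shapes here are inferred from the sample above, not taken from official documentation:

```csharp
using System.Text.Json;

// Shapes inferred from the sample response; fields not needed here are
// omitted, and the deserializer ignores extra JSON properties by default.
public class TestRecord
{
    public string LicensePlate { get; set; }
    public string InspectionType { get; set; }
    public string Result { get; set; }
    public string TestDate { get; set; }
}

public class TaiwanVehicle
{
    public string Description { get; set; }
    public string RegistrationYear { get; set; }
    public string EngineSize { get; set; }
    public TestRecord[] TestRecords { get; set; }
}

public static class TaiwanLookup
{
    public static TaiwanVehicle Parse(string json) =>
        JsonSerializer.Deserialize<TaiwanVehicle>(json);
}
```

Property names match the JSON keys exactly, so no attribute annotations are needed.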

API endpoint

The endpoint is live now at:

https://www.chepaiapi.tw/api/reg.asmx?op=CheckTaiwan

Full documentation and interactive testing are available at chepaiapi.tw.

Expanding Chinese-language coverage

This launch also deepens our Chinese-language vehicle data coverage. Alongside Taiwan, our mainland China vehicle lookup service at chepaiapi.cn continues to serve customers requiring PRC plate data — together forming a comprehensive Chinese-language API offering across both sides of the strait.

Use cases

The Taiwan endpoint is well-suited to:

  • Insurers pricing two-wheeler policies
  • Logistics platforms operating scooter fleets
  • Used vehicle marketplaces verifying provenance
  • KYC and compliance workflows touching Taiwanese vehicle assets

The inclusion of emissions test records is a differentiator that goes beyond simple registration confirmation — providing genuine due diligence depth for any platform that needs it.

Taiwan joins our network of 55+ country vehicle lookup APIs. We’ll continue expanding coverage across Asia Pacific throughout 2026.

Visit chepaiapi.tw to get started →


Batch AI Processing: Why Multithreading is the Wrong Instinct

When developers first encounter a large-scale AI classification job — say, two million records that each need to be sent to an LLM for analysis — the instinct is immediately familiar: spin up threads, parallelise the work, saturate the API. It’s the same pattern that works for database processing, file I/O, HTTP scraping. More threads, more throughput.

With LLM APIs, that instinct leads you straight into a wall. And the wall has a name: TPM.


The Problem with Multithreading LLM Calls

Most LLM APIs — OpenAI included — impose a Tokens Per Minute (TPM) limit. This is a rolling window, not a per-request limit. Every token you send in a prompt, and every token the model returns, counts against it.

The naive multithreaded approach burns through this budget in a way that’s both wasteful and hard to control:

The system prompt repeats on every request. If your prompt is 700 tokens and you’re running 20 threads firing one request each, you’re spending 14,000 tokens per second just on prompt overhead — before the model has classified a single record. With a 200,000 TPM limit, that pace exhausts a full minute’s budget in roughly 14 seconds.

Burst behaviour triggers rate limits unpredictably. The TPM limit is a rolling window. Twenty threads firing simultaneously create a spike that can exceed the per-minute budget in seconds, even if your average rate would be well within limits. The API returns 429 errors, your retry logic kicks in, those retries themselves consume tokens, and the situation compounds.

Thread count is a blunt instrument. Dialling concurrency up and down doesn’t map cleanly to token consumption because request latency varies. A batch that takes 500ms doesn’t consume the same tokens as one that takes 1,500ms, but both hold a thread slot for their duration.


The Better Model: Semantic Batching

The insight that changes everything is this: the system prompt is a fixed overhead, and you should amortise it across as many classifications as possible per API call.

Instead of:

Thread 1: [system prompt 700 tokens] + [address 1: 15 tokens] → [result: 15 tokens]
Thread 2: [system prompt 700 tokens] + [address 2: 15 tokens] → [result: 15 tokens]
...× 20 threads
Total: 14,000 tokens for 20 classifications

You send:

[system prompt 700 tokens] + [addresses 1-20: 300 tokens] → [results 1-20: 100 tokens]
Total: 1,100 tokens for 20 classifications

That’s a 12× reduction in token consumption for the same work. Suddenly your 200,000 TPM budget — which could only sustain ~270 single-record requests per minute — supports ~3,600 classifications per minute. No extra threads needed.
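The packing step is trivial to implement. A sketch in the spirit of the article's C# examples (the type and method names are illustrative):

```csharp
using System.Collections.Generic;
using System.Text;

// Pack many records into one user message so the fixed system-prompt
// overhead is paid once per API call instead of once per record.
public static class BatchBuilder
{
    public static string BuildUserMessage(IReadOnlyList<(long Id, string Text)> records)
    {
        var sb = new StringBuilder();
        foreach (var (id, text) in records)
            sb.Append("id=").Append(id).Append(": ").AppendLine(text);
        return sb.ToString();
    }
}
```

The `id=` prefix on each line sets up the ID-keyed matching described next.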


Key Implementation Details

1. Include an ID in Both Request and Response

The most important correctness rule in batch processing is this: never rely on positional alignment.

If you send 20 addresses and ask the model to return 20 results, it might return 19. Now you don’t know which one it dropped. If you’re matching by position, records from item 7 onwards get silently misclassified.

The fix is to include a unique identifier in both directions:

User message:
id=548033: product X
id=548034: product Y
...
System prompt format instruction:
Reply ONLY with a JSON array. Format: [{"id":548033,"c":"E"}, ...]

Now you build a dictionary from the response keyed on id, and match each input item explicitly. A missing id means that specific record gets skipped and retried on the next run. Everything else classifies correctly regardless of what the model dropped.
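That matching step can be sketched in C# as follows (type and method names are illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public class BatchItem
{
    public long id { get; set; }   // matches the short JSON keys in the format above
    public string c { get; set; }
}

public static class ResultMatcher
{
    // Key the model's reply on id; anything it dropped goes into Missing,
    // so that specific record can be retried on the next run.
    public static (Dictionary<long, string> Matched, List<long> Missing)
        Match(string jsonArray, IEnumerable<long> sentIds)
    {
        var matched = JsonSerializer.Deserialize<List<BatchItem>>(jsonArray)
                                    .ToDictionary(r => r.id, r => r.c);
        var missing = sentIds.Where(i => !matched.ContainsKey(i)).ToList();
        return (matched, missing);
    }
}
```

A dropped item surfaces as an entry in `Missing` rather than as a silent off-by-one across the rest of the batch.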

2. Resolve Labels Locally

The model doesn’t need to return the full label text. "Prime City Professionals" costs tokens on every response item. A single letter costs one token.

Keep a static dictionary in your code:

csharp

private static readonly Dictionary<string, string> Labels = new()
{
    { "A", "Prime Product" },
    { "B", "Budget Product" },
    // ...
};

The model returns "c":"A", you look up the label locally. This also eliminates a class of hallucination errors where the model invents a label name slightly different from your taxonomy.

Note: even "category" vs "c" matters at scale. In the OpenAI tokenizer, "category" costs more tokens than "c", and the key appears in every response item. Saving two tokens on each of 100,000 occurrences is 200,000 tokens — small, but free.

3. Track TPM with a Rolling Window, Not Concurrency

Rather than trying to infer safe concurrency from trial and error, measure what you’re actually consuming and throttle directly on that signal.

csharp

// On each successful response, record tokens used with a timestamp
tokenWindow.Enqueue((t: DateTime.UtcNow, tok: inputTokens + outputTokens));

// Before each request, prune entries older than 60 seconds and sum the rest
var cutoff = DateTime.UtcNow.AddSeconds(-60);
while (tokenWindow.Count > 0 && tokenWindow.Peek().t < cutoff)
    tokenWindow.Dequeue();
long tpmUsed = tokenWindow.Sum(x => x.tok);

// Throttle in graduated steps as usage approaches the limit
if (tpmUsed > tpmLimit * 0.98) Thread.Sleep(2000);
else if (tpmUsed > tpmLimit * 0.95) Thread.Sleep(800);
else if (tpmUsed > tpmLimit * 0.85) Thread.Sleep(300);

This gives you automatic, self-correcting throttling that responds to real consumption rather than guessing from thread counts. If a batch of records happens to have longer addresses, the window fills faster and the delay kicks in sooner. No manual tuning required.

4. Resumability via Cursor Pagination

For a job that takes hours or days, stopping and restarting must be safe and cheap. The key is two things working together:

Write results immediately after each batch, not at the end of a page. If you crash mid-page, you’ve lost one batch (20 records), not a thousand.

Use a NULL-check filter combined with cursor pagination. The query for unclassified records looks like:

sql

WHERE segment_category IS NULL AND id > {lastId} ORDER BY id LIMIT 1000

On restart, lastId resets to 0, but the IS NULL filter automatically skips everything already classified. The cursor (id > lastId) keeps the query fast on large tables — OFFSET pagination slows to a crawl at millions of rows because the database still has to scan all preceding rows to find the offset position.
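The restart behaviour can be sketched end to end. Here the SQL query is stood in for by an in-memory filter so the control flow is visible; the names are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Row { public long Id; public string Category; }

public static class ResumableJob
{
    // Mirrors the SQL above: NULL-check filter plus id cursor, with results
    // written back immediately after each page rather than at the end.
    public static void Run(List<Row> table, int pageSize, Func<Row, string> classify)
    {
        long lastId = 0; // safe to reset on restart: classified rows are filtered out
        while (true)
        {
            var page = table
                .Where(r => r.Category == null && r.Id > lastId) // IS NULL + cursor
                .OrderBy(r => r.Id)
                .Take(pageSize)
                .ToList();
            if (page.Count == 0) break;      // nothing left unclassified
            foreach (var r in page)
                r.Category = classify(r);    // persist immediately in the real job
            lastId = page[^1].Id;            // advance the cursor
        }
    }
}
```

Killing the process at any point and calling `Run` again picks up exactly where it left off, because already-classified rows never match the filter.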

5. Handle Partial Batches Gracefully with Skip vs Error

Not all failures are equal. Distinguish between:

  • Error: something went wrong that warrants logging (HTTP 500, persistent 429 after retries, DB connection failure). These need attention.
  • Skip: the record wasn’t returned in this batch response. Leave it NULL in the database, it will be picked up automatically on the next run. No log noise needed.

This distinction keeps your error output meaningful. If every missing batch item logs as an error, a run with 0.1% skip rate produces thousands of error lines that mask real problems.
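One compact way to encode that split (the enum and names are illustrative, not from the original implementation):

```csharp
using System;
using System.Collections.Generic;

public enum Outcome { Saved, Skipped, Errored }

public static class FailureHandling
{
    // Skips are silent and retried on the next run; only genuine failures are logged.
    public static Outcome Handle(long id, IReadOnlyDictionary<long, string> matched,
                                 Action<long, string> save, Action<string> logError)
    {
        if (!matched.TryGetValue(id, out var code))
            return Outcome.Skipped;          // dropped by the model: leave NULL
        try
        {
            save(id, code);
            return Outcome.Saved;
        }
        catch (Exception ex)                 // e.g. DB failure: warrants attention
        {
            logError($"id={id}: {ex.Message}");
            return Outcome.Errored;
        }
    }
}
```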


The Result

What started as a job estimated at 16–67 days with a naive multithreaded approach settled to around 7 hours using semantic batching — processing two million records through a rate-limited API without a single configuration change to the API account.

The throughput improvement didn’t come from more concurrency. It came from being smarter about what gets sent in each request.

The general principle applies beyond LLM classification: whenever you have a fixed overhead per API call (authentication, context, schema), the correct optimisation is to amortise that overhead across as much work as possible per call, not to fire more calls in parallel.


Summary of Patterns

| Pattern | Naive approach | Better approach |
| --- | --- | --- |
| Throughput | More threads | Larger batches |
| Rate limiting | Catch 429, retry | Track TPM rolling window, throttle proactively |
| Result matching | Positional array index | ID-keyed dictionary |
| Label resolution | Ask model for full text | Return code, resolve locally |
| Resumability | Track page offset | NULL-check filter + cursor pagination |
| Failure handling | All failures are errors | Skip vs Error distinction |
| DB resilience | Crash on connection drop | Exponential backoff retry |

The instinct to parallelise is correct in principle — you want to keep the API busy. But with token-limited LLM APIs, the right parallelism is within a single request, not across many simultaneous ones.