When Running the Model Was the Easy Part

Client: Kyros Insights

Python, Distributed Computing, Assembly, Actuarial, Pipeline Architecture

In June 2024, the team at Kyros Insights had a predictive model they trusted. Actuarial-grade, well-validated, accurate on the data they fed it. The problem wasn't the model. The problem was everything around it.

On production-scale datasets — the 20TB workloads their actuaries actually needed to run — the model simply wouldn't fit in memory. Standard machines threw OOM errors. Larger instances ran, eventually, but a single end-to-end job took 22 hours. And because the pipeline was a monolith, there was no way to stop at a stage, inspect intermediate state, or benchmark one piece against another. When something failed at hour 18, you started over.

That was the shape of it when they brought us in.

Listening before building

Before we wrote any code, we sat with the actuaries and watched them work. This turned out to matter more than we expected. The original API treated every run as a fresh training cycle, but the actuaries weren't retraining — they were applying a single, well-tuned model to new batches of data, over and over. That's a very different optimization problem. Once we saw it, a lot of decisions got easier.

Three fronts, in order

We attacked the problem in three phases, in a fixed order, because the bottlenecks were stacked. Fixing a later one before the earlier ones would have moved nothing.

Memory came first

The model was being broadcast once per CPU core rather than once per machine, which meant a 32-core box was holding 32 copies of the same weights. Collapsing that to one instance per machine freed up enormous headroom. We then found places where intermediate columns were being materialized across the full dataset when they were only needed in specific stages, and cut those. Finally, we added disk-backed checkpointing so long runs could spill to disk gracefully instead of crashing. After this pass, the 20TB dataset ran on standard machines. That was the unblock.
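As an illustration only (this is not the production code, and the framework Kyros used is not named here), one common way to hold a single copy of model weights per machine is to memory-map a read-only weights file. Every worker process that maps the same file shares the same physical pages through the operating system's page cache. The names `write_weights` and `SharedWeights`, the flat float64 file layout, and the toy dot-product "model" are all assumptions made for this sketch.

```python
import mmap
import os
import tempfile
from array import array


def write_weights(path, weights):
    """Store weights as raw native-endian float64 values (hypothetical layout)."""
    with open(path, "wb") as f:
        array("d", weights).tofile(f)


class SharedWeights:
    """Read-only, memory-mapped view of a weights file.

    Processes on the same machine that map this file share its pages via
    the OS page cache, so resident memory does not grow with core count.
    """

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._raw = memoryview(self._map)
        self._view = self._raw.cast("d")  # reinterpret bytes as float64

    def __len__(self):
        return len(self._view)

    def predict(self, features):
        """Toy linear model: dot product of weights and features."""
        return sum(w * x for w, x in zip(self._view, features))

    def close(self):
        # Release buffer exports before closing the map.
        self._view.release()
        self._raw.release()
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        write_weights(path, [0.5, 2.0, -1.0])
        model = SharedWeights(path)
        print(model.predict([2.0, 1.0, 3.0]))
        model.close()
```

The design choice is to let the operating system do the sharing: each worker opens the file independently, and nothing has to be copied or pickled between processes.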

Then the architecture

We split the monolith into four stages that mirrored the actuaries' mental model: data cleaning, data preparation, model training, and prediction. Each stage runs independently and can be benchmarked in isolation. The feedback loop went from "run the whole thing and see what happens" to "run the stage you're debugging."
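A minimal sketch of what a staged pipeline with disk-backed checkpoints can look like, assuming each stage's output is small enough to serialize as JSON (a real 20TB workload would need a columnar or chunked format). The function `run_pipeline`, the checkpoint file naming, and the four toy stage functions are hypothetical and only mirror the stage names above.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(stages, initial, checkpoint_dir):
    """Run (name, function) stages in order, checkpointing each output.

    A stage whose checkpoint file already exists is skipped, so a rerun
    after a failure resumes at the first stage without a checkpoint.
    Returns the final output and the names of the stages that actually ran.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    data = initial
    latest = None  # most recent checkpoint seen but not yet loaded
    executed = []
    for name, fn in stages:
        path = checkpoint_dir / f"{name}.json"
        if path.exists():
            latest = path
            continue
        if latest is not None:
            data = json.loads(latest.read_text())
            latest = None
        data = fn(data)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(data))
        tmp.replace(path)  # rename so a crash never leaves a partial checkpoint
        executed.append(name)
    if latest is not None:
        data = json.loads(latest.read_text())
    return data, executed


# Toy stand-ins for the four stages.
def clean(rows):
    return [r for r in rows if r is not None]


def prepare(rows):
    return [float(r) for r in rows]


def train(rows):
    return {"rows": rows, "mean": sum(rows) / len(rows)}


def predict(state):
    return [r - state["mean"] for r in state["rows"]]


STAGES = [("clean", clean), ("prepare", prepare),
          ("train", train), ("predict", predict)]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(STAGES, [1, None, 2, 3], tmp))
        print(run_pipeline(STAGES, [1, None, 2, 3], tmp))  # nothing reruns
```

Deleting one stage's checkpoint file and calling `run_pipeline` again reruns only that stage and anything after it that also lacks a checkpoint, which is the "run the stage you're debugging" loop in miniature.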

Then speed

With memory and architecture sorted, we were out of obvious wins: the remaining bottleneck was just how fast Python executes numerical code. So we compiled the model down to Assembly. That cut end-to-end prediction time from 22 hours to 11.

What we didn't see coming

Compiling a model and running it on one machine is straightforward. Compiling it and running it across a distributed cluster is not. CPUs varied from node to node, and the wrong compile flags produced binaries that crashed on half the fleet. Worse, some of the more complex model configurations would send the compiler into what looked like infinite loops.
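The two failure modes above suggest two simple guards, sketched here under assumptions: the compiler and flag set actually used on the project are not stated, so the GCC/Clang-style flags and the `/proc/cpuinfo`-style feature names below are illustrative, and `FEATURE_FLAGS`, `common_flags`, and `run_with_timeout` are hypothetical helpers. The first guard only enables instruction-set extensions that every node in the fleet reports. The second puts a hard time limit on a compile step so a runaway compilation fails fast instead of hanging.

```python
import subprocess
import sys

# Illustrative mapping from CPU feature names (as listed in /proc/cpuinfo)
# to GCC/Clang instruction-set flags. Extend or replace for a real fleet.
FEATURE_FLAGS = {
    "sse4_2": "-msse4.2",
    "avx2": "-mavx2",
    "fma": "-mfma",
    "avx512f": "-mavx512f",
}


def common_flags(fleet):
    """Return flags for features present on every node.

    `fleet` maps a node name to the set of CPU features it reports.
    Targeting the intersection avoids binaries that use instructions
    some nodes do not support.
    """
    if not fleet:
        return []
    shared = set.intersection(*(set(f) for f in fleet.values()))
    return sorted(FEATURE_FLAGS[f] for f in shared if f in FEATURE_FLAGS)


def run_with_timeout(cmd, seconds):
    """Run a command; return False if it fails or exceeds the time limit."""
    try:
        subprocess.run(cmd, check=True, timeout=seconds)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False
    return True


if __name__ == "__main__":
    fleet = {
        "node-a": {"sse4_2", "avx2", "fma", "avx512f"},
        "node-b": {"sse4_2", "avx2", "fma"},
        "node-c": {"sse4_2", "avx2"},
    }
    print(common_flags(fleet))
    # Stand-in for a compile step that never finishes.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1))
```

Compiling for the lowest common denominator trades some peak speed on newer nodes for a single binary that runs everywhere. Building one binary per CPU class is the alternative when that trade is too costly.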

We leaned on AI-assisted research to work through the flag matrix and identify parameter thresholds that kept compilation deterministic. Rough estimate: it saved us at least a week of trial and error. Without it, we'd still have gotten there — just slower.

Where Kyros landed

                         Before                      After
20TB dataset             OOM on standard machines    Runs on standard machines
End-to-end prediction    22 hours                    11 hours
Debugging a failed run   Restart from scratch        Inspect the failing stage

"The team at Qwertee has helped us tackle highly specific and complex problems with impressive technical depth, the kind most teams wouldn't know where to start with. They've proven themselves as dependable partners we're glad to bring our hardest problems to."
— Rob Chin, CTO, Kyros Insights

If any of this sounds familiar

If your team has a model that works beautifully on a laptop and falls apart at scale, or a pipeline where every failure costs half a day, we'd be interested in hearing how it's shaped. No pitch — just a conversation about the architecture.
