Zero-Downtime Databricks & MLFlow Upgrade

Client

Kyros Insights

Databricks, MLFlow, PyTorch DDP, Databricks Asset Bundles, Python, Azure

A look at how Qwertee Technology took a production ML platform from DBR 14.3 LTS + MLFlow 1.x to DBR 16.4 LTS + MLFlow 2.x — without losing a pipeline run or a single experiment record.

The setup

Kyros Insights runs a production ML platform on Databricks. By mid-2025, the team had outgrown its runtime. They wanted PyTorch DDP for distributed training, modern deployment tooling (Databricks Asset Bundles, private pip index URLs), and the MLFlow 2.x feature surface. None of it was available on DBR 14.3 LTS.

The catch: their existing pipelines couldn't go down — and the upgrade wasn't one upgrade. It was two, tangled together.

The problem nobody wants to inherit

The constraints were the kind that turn quick wins into multi-quarter projects:

  • Stuck on Horovod. DBR 14.3 LTS shipped library versions incompatible with PyTorch DDP, blocking the team's distributed-training roadmap.

  • No modern deployment tooling. Databricks Asset Bundles and DATABRICKS_PIP_INDEX_URL weren't options on the old runtime. Packages were being moved by hand.

  • MLFlow split-brain. The production MLFlow ran outside Databricks on 1.x — fine in isolation, but it had to be migrated to 2.x in lockstep with DBR 16.4 LTS, since the new runtime depended on the new server.

  • Zero downtime, zero data loss. Both upgrades had to ship together, in production, without dropping a run or losing experiment history.

Kyros's ask wasn't just "get us upgraded." It was "get us upgraded, and leave us with a process we can reuse the next time this comes up."

How we approached it

We kicked off in June 2025 with a strategy session, then ran the work in four phases.

Phase 1 — Research and planning

We started with a side-by-side of 14.3 LTS and 16.4 LTS to surface the breaking changes and frame the discussion fast. Then we walked the Databricks release notes between runtimes with the Kyros team, locking in which features they actually needed. For MLFlow, we anchored on the version bundled with 16.4 LTS and mapped each existing pipeline against backward-compatibility requirements. Because the runtime upgrade depended on MLFlow being on 2.x first, we sequenced the two tracks to run in parallel — with MLFlow ahead by design.
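A side-by-side like this can be partly automated. The sketch below is one possible approach, not the tooling used on the project: it diffs two pinned-package lists and buckets the differences, so major-version jumps stand out first. The package names and versions are illustrative placeholders, not the real contents of either runtime.

```python
"""Sketch: diff two pinned-package lists to flag major-version jumps."""


def parse_pins(freeze_text: str) -> dict[str, str]:
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in freeze_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def major(version: str) -> str:
    """Return the leading component of a dotted version string."""
    return version.split(".", 1)[0]


def diff_pins(old: dict[str, str], new: dict[str, str]) -> dict[str, list]:
    """Bucket every difference between two pin sets."""
    report = {"removed": [], "added": [], "major_bump": [], "minor_change": []}
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            report["removed"].append(name)
        elif name not in old:
            report["added"].append(name)
        elif old[name] != new[name]:
            changed_major = major(old[name]) != major(new[name])
            bucket = "major_bump" if changed_major else "minor_change"
            report[bucket].append((name, old[name], new[name]))
    return report


# Hypothetical pins for illustration only.
OLD_RUNTIME = """
examplelib==1.4.0
otherlib==2.1.0
legacylib==0.9.1
"""
NEW_RUNTIME = """
examplelib==2.0.3
otherlib==2.3.0
newlib==1.0.0
"""

report = diff_pins(parse_pins(OLD_RUNTIME), parse_pins(NEW_RUNTIME))
for bucket, entries in report.items():
    print(bucket, entries)
```

In practice the two inputs could come from `pip freeze` run on a cluster of each runtime. Major bumps and removals are the entries worth reading release notes for.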

Phase 2 — Implementation

MLFlow went first, in a Dockerized local dry run. We rehearsed the library upgrade and the database migration end-to-end before touching anything in production, building in safeguards to prevent accidental auto-upgrades from backward-compatible pipelines, plus a manual import/export path in case the old and new servers ever drifted out of sync. For DBR, we ran a local PoC across the full dependency stack — Ubuntu, Spark, Python — until syntax and library conflicts were resolved. From there we cut an Alpha (simple pipelines plus the new MLFlow), then a Beta (full CI/CD integration), and handed Beta to the Kyros team for hands-on testing.
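To make the "drifted out of sync" case concrete, here is a minimal sketch of a drift check between two tracking servers. It is an assumption about how such a check might look, not the project's actual safeguard. It works on plain snapshots keyed by run ID. In a real setup those snapshots could be built per server with `MlflowClient.search_runs`, and that step is left out here. It also assumes run IDs survive the migration, which holds for an in-place schema migration but may not hold for export/import copies, where a stable tag would be a safer key.

```python
"""Sketch: detect drift between snapshots of two MLFlow tracking servers."""


def find_drift(
    old_runs: dict[str, dict], new_runs: dict[str, dict]
) -> dict[str, list[str]]:
    """Compare two {run_id: summary} snapshots and report differences."""
    shared = old_runs.keys() & new_runs.keys()
    return {
        # Runs that exist on the old server but never reached the new one.
        "missing_on_new": sorted(old_runs.keys() - new_runs.keys()),
        # Runs that exist only on the new server.
        "only_on_new": sorted(new_runs.keys() - old_runs.keys()),
        # Runs present on both sides whose recorded summary differs.
        "mismatched": sorted(r for r in shared if old_runs[r] != new_runs[r]),
    }


# Hypothetical snapshots; run IDs and metrics are made up for illustration.
old_server = {
    "run-a": {"status": "FINISHED", "metrics": {"loss": 0.31}},
    "run-b": {"status": "FINISHED", "metrics": {"loss": 0.27}},
    "run-c": {"status": "FAILED", "metrics": {}},
}
new_server = {
    "run-a": {"status": "FINISHED", "metrics": {"loss": 0.31}},
    "run-b": {"status": "FINISHED", "metrics": {"loss": 0.29}},
    "run-d": {"status": "RUNNING", "metrics": {}},
}

drift = find_drift(old_server, new_server)
for kind, run_ids in drift.items():
    print(kind, run_ids)
```

An empty result in all three buckets would mean the servers agree on everything the snapshot covers. Any non-empty bucket is the signal to reach for the manual import/export path.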

Phase 3 — Stabilization and benchmarking

During the Release Candidate phase we worked through bugs the Kyros team surfaced in real use, and re-ran historical pipelines to check for performance regressions.
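A regression check of this kind can be as simple as comparing median run times per pipeline. The sketch below is illustrative only: the pipeline names, timings, and the 10% tolerance are all assumptions, and a real benchmark would also need to control for cluster size and data volume.

```python
"""Sketch: flag pipelines whose median duration regressed past a tolerance."""

from statistics import median


def find_regressions(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return {pipeline: slowdown_ratio} for pipelines slower than allowed."""
    regressions = {}
    for name, base_times in baseline.items():
        new_times = candidate.get(name)
        # Skip pipelines without timings on both runtimes.
        if not base_times or not new_times:
            continue
        ratio = median(new_times) / median(base_times)
        if ratio > 1 + tolerance:
            regressions[name] = round(ratio, 2)
    return regressions


# Hypothetical durations in seconds; not measurements from the project.
old_runtime = {
    "feature_build": [100, 104, 98],
    "train_model": [600, 620, 610],
}
new_runtime = {
    "feature_build": [101, 99, 103],
    "train_model": [700, 720, 690],
}

regressions = find_regressions(old_runtime, new_runtime)
print(regressions)
```

Using the median rather than the mean keeps a single slow run, for example one hit by a cold cluster start, from triggering a false alarm.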

Phase 4 — Documentation

With the RC stable, we wrote up the playbook: the steps, the gotchas, the decision points. Kyros now owns it for the next major upgrade.

What changed

PyTorch DDP unblocked

What was impossible on 14.3 LTS is now standard on 16.4 LTS. The distributed-training roadmap is moving again.

Modern deployment tooling

Packages now flow from Azure Artifacts feeds via Databricks Asset Bundles. No more manual drag-and-drop.
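As a rough sketch of how a private index can be wired into a cluster, the snippet below builds the relevant part of a cluster spec as a Python dict. The shape mirrors the `spark_version` and `spark_env_vars` fields a bundle's job cluster would declare. The secret scope and key names are hypothetical, the runtime string is an assumed example, and a full spec needs more fields (node type, worker count, and so on). This is not Kyros's actual configuration.

```python
"""Sketch: cluster spec fragment that points pip at a private index."""

import json


def cluster_spec(spark_version: str, secret_scope: str, secret_key: str) -> dict:
    """Build a partial cluster spec that sets DATABRICKS_PIP_INDEX_URL."""
    # Referencing a secret keeps the feed URL (and any embedded token)
    # out of the bundle source. Scope and key names are hypothetical.
    secret_ref = "{{secrets/" + secret_scope + "/" + secret_key + "}}"
    return {
        "spark_version": spark_version,
        "spark_env_vars": {
            "DATABRICKS_PIP_INDEX_URL": secret_ref,
        },
    }


spec = cluster_spec(
    spark_version="16.4.x-scala2.12",  # assumed runtime key for DBR 16.4 LTS
    secret_scope="pkg-feeds",          # hypothetical secret scope
    secret_key="pip-index-url",        # hypothetical secret key
)
print(json.dumps(spec, indent=2))
```

With the variable set at the cluster level, library installs on that cluster resolve against the private feed without each pipeline having to pass an index URL itself.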

Runtime support runway extended

End of support moves from 2027 to 2028. More importantly, Kyros now has a documented process for the next jump.

Zero production downtime

Both upgrades shipped in sync, with no data loss and no dropped runs.

"The team at Qwertee has helped us tackle highly specific and complex problems with impressive technical depth, the kind most teams wouldn't know where to start with. They've proven themselves as dependable partners we're glad to bring our hardest problems to."
— Rob Chin, CTO, Kyros Insights

If this sounds familiar

If you're staring at a Databricks or MLFlow upgrade with production pipelines that can't go down, we've done this before. Happy to compare notes — even if you're not looking to bring anyone in.
