Healthcare AI Startup

Ship faster without simplifying your architecture

Your platform runs on Kubernetes with dozens of microservices. Every code change requires rebuilding containers, pushing images, and redeploying to a cluster. Feedback loops stretch from seconds to minutes, and developers spend more time waiting than building. How do you accelerate delivery without simplifying the architecture that makes your platform work?


Reducing friction in a microservices environment

A healthcare AI startup that runs its clinical platform on AKS made the right architectural choices for production: microservices, a service mesh, and Kubernetes-native deployments. Those choices came with a problem in the development workflow. As the number of interdependent services grew, the inner development loop became a bottleneck: a single code change triggered a full container rebuild, image push, and cluster redeployment before a developer could validate anything. In a team of 25-30 engineers shipping ML and backend services in parallel, that friction multiplied across dozens of daily iterations. The platform architecture was sound, but the development workflow around it was not.

Quick facts

Clinical AI Startup

Platform for clinical data analysis

Canadian ML platform for clinical decision support, operating in hospital environments across Canada and the USA. The platform processes patient vitals, lab results, clinical notes, and historical records in HIPAA-sensitive contexts.

Faster iteration

Feedback loops cut from minutes to seconds

By introducing cluster-integrated local development with Telepresence, developers could test changes against real cluster services without rebuilding containers. Iteration time per change dropped from several minutes to near-immediate feedback.

Telepresence + Internal DNS

Local development intercepting live Kubernetes services means developers test against real cluster dependencies, databases, APIs, and service mesh without replicating the entire environment locally or waiting for a redeploy cycle.

"Before optimization, even small changes required full container rebuilds and redeployments. It slowed the iteration significantly. After restructuring our development workflow, engineers could test locally while interacting with real cluster services. It noticeably improved delivery speed."

Senior Platform Engineer, Clinical AI Startup

What we did for the Clinical AI Startup

Diagnosing the development workflow bottleneck

Microservices architecture optimizes for production reliability and independent deployability, but it does not optimize for developer iteration speed by default. Each service boundary that improves production isolation adds a coordination cost to local development. In a platform with dozens of interdependent services, service mesh, and Kubernetes-native networking, the gap between "how the application runs in production" and "how a developer can test a change locally" becomes a daily tax on engineering velocity. Before proposing any tooling change, we mapped the existing workflow precisely to understand where time was being lost and why.

  1. Development loop audit and bottleneck mapping: We analyzed the existing inner development loop end-to-end: code change, container build, image push, cluster deployment, and validation. For most services, a single iteration took several minutes, not because any individual step was slow, but because the steps were sequential and mandatory regardless of change scope. A one-line logic fix triggered the same pipeline as a significant feature change. Multiplied across the team's daily change volume, the cumulative waste was significant. The audit also identified that developers had begun working around the problem, informally maintaining local mock environments that diverged from real cluster behavior, introducing a secondary risk of changes that passed local validation but failed against actual service dependencies.
  2. Telepresence implementation for cluster-integrated local development: We introduced Telepresence to replace the rebuild-push-redeploy cycle for inner loop development. Telepresence intercepts traffic for a specific Kubernetes service and redirects it to a locally running process, while the rest of the cluster continues operating normally. A developer modifying a single service runs it locally, and the cluster routes traffic to their local runtime as if it were the deployed version. Changes are testable in seconds, against real cluster dependencies, databases, downstream services, and service mesh policies without a container rebuild. We pushed back on an initial suggestion to solve the problem with faster CI/CD pipelines: build optimization addresses the outer loop, not the inner one, and the team's bottleneck was emphatically in the inner loop.
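The intercept workflow described above can be sketched with the Telepresence CLI. The service and namespace names (`vitals-api`, `dev-alice`) are hypothetical, and flags reflect recent Telepresence 2.x releases:

```shell
# Connect the local machine to the cluster network, scoped to a
# development namespace (name is hypothetical)
telepresence connect --namespace dev-alice

# See which workloads in the namespace can be intercepted
telepresence list

# Intercept traffic for the hypothetical "vitals-api" Deployment and route
# it to a process on local port 8080; the pod's environment variables are
# written to a file so the local process can load them
telepresence intercept vitals-api --port 8080:http --env-file ./vitals-api.env

# Start the service locally (e.g. with a hot-reloading runtime); requests
# to the in-cluster vitals-api now reach this local process, while the rest
# of the cluster keeps running unchanged

# When finished, stop the intercept and disconnect
telepresence leave vitals-api
telepresence quit
```

Because only the one service is intercepted, every downstream dependency the local process calls is the real in-cluster version, not a mock.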

Establishing isolated development domains

Shared development environments create a different kind of friction, one that is less visible but equally damaging to delivery speed. When multiple developers deploy work-in-progress changes to a shared staging namespace, they interfere with each other's validation. A broken service deployed by one engineer blocks testing for everyone else in that environment. A solution to inner-loop speed cannot be allowed to create an outer-loop coordination problem.

  1. Feature-based development namespaces with internal DNS routing: We introduced isolated development domains mapped to Kubernetes namespaces, with internal DNS configured to route each developer's traffic to their own namespace context. Each engineer could deploy and validate work independently without affecting shared environments or other developers' in-progress changes. The namespace isolation also aligned with the security boundaries already established through the Istio deployment. Traffic stayed within defined perimeters rather than bleeding across service boundaries in ways that would have been inconsistent with the platform's HIPAA-sensitive posture.
  2. Decoupling local runtime from cluster-wide redeployments: With Telepresence handling service interception and namespace isolation containing environment collisions, we decoupled the local development runtime from the cluster redeployment cycle entirely. Container rebuilds became a CI/CD concern triggered by merged, reviewed code rather than a prerequisite for hypothesis testing. Engineers working on ML service logic, API changes, or integration behavior could iterate locally at near-zero cycle time, then submit changes for the full pipeline only when confident in the outcome. The change in developer experience was immediate: the waiting that had been normalized as part of the job was largely eliminated.
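A minimal sketch of the per-developer namespace pattern described above, assuming hypothetical names (`dev-alice`, `vitals-api`) and a kustomize-style overlay directory:

```shell
# Create an isolated development namespace for one engineer
kubectl create namespace dev-alice

# Deploy only that engineer's work-in-progress manifests into it
kubectl apply -n dev-alice -f ./k8s/overlays/dev/

# Cluster-internal DNS scopes service discovery per namespace, so the same
# service name resolves to a different instance depending on context:
#   vitals-api.dev-alice.svc.cluster.local -> Alice's in-progress version
#   vitals-api.staging.svc.cluster.local   -> the shared staging version
kubectl run -n dev-alice dns-check --rm -i --restart=Never \
  --image=busybox -- nslookup vitals-api.dev-alice.svc.cluster.local
```

Services the developer has not redeployed simply resolve to the shared cluster versions, so the isolated namespace stays lightweight.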

Kubernetes workflow optimization: FAQ

Why not just use Docker Compose for local development?

Because Docker Compose replicates containers, not cluster behavior.

In a small system with simple service dependencies, Docker Compose provides a reasonable local approximation. In a platform running Istio service mesh, AKS-native networking, internal DNS, and Kubernetes-specific configuration, Docker Compose produces a local environment that behaves differently from the cluster in ways that matter.

Developers testing against a Docker Compose environment are testing against a simplified model, and the gap between that model and production is where integration bugs live. Telepresence preserves real cluster behavior by keeping the cluster running and intercepting only the service under development.

Can Telepresence interception affect production traffic?

No, interception is scoped to defined development contexts and never touches production traffic.

Telepresence operates within specific namespaces and requires explicit configuration to intercept a service. Production namespaces are not exposed to development interception. The isolation model we implemented for this client used separate development namespaces with DNS routing, keeping all local development traffic within bounded, non-production contexts.

The Istio policies already in place provided an additional enforcement layer, ensuring development traffic could not inadvertently reach production services.
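One way this scoping can be expressed, assuming hypothetical context, namespace, and user names: the kubeconfig context pins a developer to a development namespace, and RBAC confirms that production is out of reach:

```shell
# Bind a developer's kubeconfig context to the development namespace
# (cluster, namespace, and user names are hypothetical)
kubectl config set-context dev-alice --cluster=aks-cluster \
  --namespace=dev-alice --user=alice
kubectl config use-context dev-alice

# Telepresence connects within that context; intercepts require an explicit
# command naming a workload in the connected namespace
telepresence connect --namespace dev-alice
telepresence intercept vitals-api --port 8080:http

# RBAC enforces the boundary: a developer without rights in the production
# namespace cannot modify (and therefore cannot intercept) its workloads
kubectl auth can-i create deployments --namespace production
```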

Doesn't a faster CI/CD pipeline solve the same problem?

No, they address different parts of the delivery workflow, and both matter.

CI/CD pipelines optimize the outer loop: the path from a merged, reviewed change to a deployed artifact in a target environment. Inner loop optimization addresses the iteration cycle before a change is ready for review: the feedback a developer needs to know whether their approach works at all. Mixing the two leads to solving the wrong problem.

Faster CI/CD does not help a developer who is rebuilding containers to test a hypothesis. Telepresence does not replace the validation, testing, and deployment automation that CI/CD provides. This client needed both, and we addressed the inner loop specifically because that was where time was being lost.

How does namespace isolation keep developers from blocking each other?

By giving each developer their own cluster context that shares infrastructure but not deployment state.

In a shared staging namespace, a broken deployment from one engineer becomes everyone's problem: blocked tests, false failures, and the coordination overhead of figuring out whose change caused what. Namespace isolation with internal DNS routing means each developer's work in progress exists in its own context. Services they haven't modified resolve to the shared cluster versions.

Services they are actively developing route to their local or namespace-specific instance. The result is parallel development without interference, a prerequisite for maintaining delivery speed in a team of the client's size.

What is the inner development loop, and why does it matter?

The inner loop is the cycle a developer repeats dozens of times per day: change, run, observe, adjust.

It is the tightest feedback cycle in software development, and its speed directly determines how many hypotheses an engineer can test in a working session. Pipeline speed matters for the outer loop, getting a validated change to production quickly. But a developer whose inner loop takes five minutes per iteration will make fundamentally different decisions about how to explore a problem than one whose inner loop takes ten seconds.

On a platform as complex as this client's, compressing the inner loop had a greater impact on overall delivery velocity than any CI/CD optimization would have.


We’d love to hear from you

Ready to accelerate delivery without simplifying your architecture?

Talk to our team about your needs.

Contact us