When hospitals lost access to patient records and flights were grounded in the wake of CrowdStrike’s 2024 software update mishap, it wasn’t the result of a cyberattack or ransomware. It was a routine security patch gone wrong: one that cost over $5 billion in disruptions and damage.
It also revealed something many outside of engineering circles don’t yet grasp: modern infrastructure is balanced on top of systems no one fully owns or understands. No breached firewalls, and no exotic exploits. Just failure cascading through interdependent systems no one had planned to fail.
Karan Luniya, a senior software engineer at DoorDash, sees that fragility less as a fluke than as a design problem. And he argues that most industries are underestimating just how exposed they are. “The danger is in assuming your systems will behave well under stress, without knowing exactly how they’ll fail.” That assumption, he argues, is what’s breaking business continuity across sectors.
From logistics platforms handling millions of deliveries, to streaming services broadcasting live to millions of viewers, Luniya has spent his career building infrastructure that works when things go wrong. Recognized as an industry expert, he serves as an associate editor for the SARC Journal of Economics and Business Management and the Journal of Economics Intelligence and Technology, contributing to the broader discourse on operational continuity and systems resilience. He argues that many organizations are mistaking historical stability for readiness—and may be far less prepared than they realize.
Build for Failure, Instead of Rebuilding Afterwards
The typical disaster recovery mindset is reactive: something breaks, then a war room forms. But with infrastructure disruptions now ranked among the most severe global risks, Luniya argues it’s time to stop treating outages like rare surprises.
In 2024, Luniya led a high-stakes migration at DoorDash, moving 200 tebibytes of delivery data out of a brittle legacy system. Rather than relying on cross-team firefighting during cutover, Luniya designed an ingestion pipeline that could independently parse a dozen mismatched data tables, resolve schema conflicts on the fly, and shift data without downtime. The project wrapped five times faster than projected.
Most importantly, it embedded fault-tolerance into the design. Luniya introduced a hot-cold storage design, allowing fallback layers to kick in when real-time systems hit snags. Instead of collapsing, services degraded gracefully. “It’s very easy to think of resilience like insurance,” he says. “But in architecture, it’s really about shaping system behavior so failure is something you can manage.”
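The article does not describe how the hot-cold design was implemented, so the following is only a minimal sketch of the general idea: a read tries the real-time ("hot") tier first and falls back to a slower ("cold") tier when the hot tier is unavailable. Every name here (HotStoreUnavailable, read_with_fallback, the stand-in stores) is hypothetical and chosen for illustration.

```python
from typing import Callable, Tuple


class HotStoreUnavailable(Exception):
    """Raised by the (hypothetical) real-time store when it cannot serve a read."""


def read_with_fallback(
    key: str,
    hot_read: Callable[[str], str],
    cold_read: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the hot store first; fall back to the cold store if it is unavailable.

    Returns the value plus the name of the tier that served it, so callers
    can tell a degraded (possibly stale) answer from a fresh one.
    """
    try:
        return hot_read(key), "hot"
    except HotStoreUnavailable:
        return cold_read(key), "cold"


# Illustrative stand-ins for the two tiers (assumptions, not a real system).
COLD_DATA = {"order-1": "delivered"}


def failing_hot_read(key: str) -> str:
    raise HotStoreUnavailable("real-time store is down")


def cold_read(key: str) -> str:
    return COLD_DATA[key]


# The hot tier is down, so the read is served from the cold tier instead of failing.
value, tier = read_with_fallback("order-1", failing_hot_read, cold_read)
```

Returning the tier alongside the value is one way to make the degradation visible: the service keeps answering, but callers and dashboards can see that it is running on the fallback layer.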
That means building for recovery upfront: workflows that retry cleanly, APIs that fail in predictable ways, and rollback plans that are tested like fire drills, rather than bolted on at the last minute.
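What "retrying cleanly" can look like in practice is sketched below, under stated assumptions: a bounded retry loop with exponential backoff, wrapped around an operation that is idempotent because it is keyed by a request ID. The names (TransientError, retry, record_delivery) and the in-memory ledger are hypothetical and are not drawn from any system described here.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure that is expected to succeed if the operation is tried again."""


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on TransientError with exponential backoff.

    Any other exception, or a TransientError on the final attempt, is
    re-raised, so failure stays predictable for the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Hypothetical idempotent operation: results are keyed by request ID,
# so repeating a request never applies the side effect twice.
processed: Dict[str, str] = {}
calls = {"count": 0}


def record_delivery(request_id: str) -> str:
    calls["count"] += 1
    if request_id in processed:
        return processed[request_id]  # already applied; the retry is a no-op
    processed[request_id] = "recorded"
    if calls["count"] == 1:
        # Simulate the write succeeding but the acknowledgement being lost.
        raise TransientError("acknowledgement lost after write")
    return processed[request_id]


# A no-op sleep keeps the demonstration instant.
result = retry(lambda: record_delivery("req-42"), sleep=lambda _: None)
```

The first call writes the record and then fails; the retry finds the record already present and returns it, so the delivery is recorded once even though the operation ran twice.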
Predictability Is Power
Before joining DoorDash, Luniya led infrastructure at Conviva, which delivers real-time analytics to streaming giants like Hulu and Sky. If analytics don’t sound mission-critical, consider this: when data is delayed or lost, platforms can’t spot performance issues or keep subscribers from churning.
To solve this, Luniya moved the system from a monolith to segmented client-specific clusters, each with automatic failover. This not only added redundancy but, more critically, kept performance consistent, even under duress.
“That kind of determinism is really underrated,” Luniya explains. “It gives teams the confidence to act under pressure. More importantly, it means customers don’t notice when something’s wrong.”
Predictability is what gives organizations leverage. They can alert on real signals and fix issues before customers feel them. Without it, every incident becomes guesswork, and recovery turns into damage control.
Clarity Over Complexity
The weakest point in most infrastructure isn’t a server or a line of code—it’s a lack of clarity. Who owns what? Where does the failure stop? Can it be reproduced? Studies show that nearly 85% of organizations suffered multiple data-loss incidents last year, and half reported significant business disruption as a result. Not because the tech failed, but because people weren’t ready.
At DoorDash, Luniya has pushed for practices that eliminate that uncertainty. He shortened CI build times by over 60%, redesigned service boundaries to prevent cascading failures, and pushed for environments where failure modes were simulated as a matter of habit, rather than just after an incident. One rule of thumb he shares: if you need to page five people to figure out what broke, your system isn’t resilient. It’s confusing. And as systems grow, it only gets worse.
His advice often centers on the basics: well-documented API contracts, layered observability, and workflows that can be safely retried. These may not stop outages, but they stop them from spreading. They create systems that engineers trust, and that customers don’t have to worry about.
Rethinking Infrastructure for a Fragile World
It used to be that infrastructure was something only tech companies worried about. Today, any business that relies on software, from banks to hospitals to logistics networks, has a direct stake in how that infrastructure behaves under stress. And increasingly, they’re relying on pieces they don’t own: public cloud platforms and a laundry list of SaaS tools with third-party APIs. That makes resilience a shared problem that can’t be patched over with dashboards or compliance checklists.
With the global network infrastructure market projected to exceed $100 billion by 2032, Luniya says the companies that do well in the worldwide digital transformation will be the ones that treat resilience as a first principle.
“Every system breaks eventually,” says Luniya. “The smart ones are built to break in ways you can control.”