To most people, artificial intelligence still looks like magic. But as the AI infrastructure market barrels toward a projected $200 billion by 2028, it’s clear that the heavy lifting isn’t happening in model design labs. What matters in deployment is reliability and cost at scale, both of which shape the user experience that makes the magic possible. And these constraints don’t disappear with better models; in many cases, they only get harder.
“Large-scale AI deployments are entirely shored up on infrastructure,” says Smarth Behl, a leading ML infrastructure engineer and IEEE Senior Member who has led infrastructure efforts across some of the most demanding AI deployments on the planet. “Most models today are good enough. What’s difficult is running them to meet the surge in demand and people’s expectations.”
Behl’s work spans AI’s most demanding use cases, ranging from large-scale content generation to real-time recommendation systems. And his message is consistent: model performance means little if the system around it can’t keep up.
From Research Breakthroughs to Production Bottlenecks
The industry is now facing a different kind of bottleneck. Powerful models are increasingly accessible, thanks to open weights and foundation model APIs. What’s missing is the scaffolding required to deploy them in live environments, under unpredictable load, with tight latency guarantees and real-world constraints.
“The actual bottleneck for most production AI systems is working as fast as users expect an AI product to work,” Behl explains. “There’s this assumption baked into the user experience that AI should feel instant. That assumption is entirely dependent on infrastructure.”
This is particularly true for platforms that serve millions of users simultaneously. A single request might trigger dozens of model evaluations—ranking, filtering, reranking, personalization—all with millisecond budgets. Any slowdown or outage has immediate consequences for user experience and revenue.
These challenges are being tackled through orchestration systems that coordinate multiple AI models across asynchronous pipelines. Designed to manage real-time demand while preserving throughput and accuracy, these frameworks help ensure that generative outputs are delivered within strict latency budgets. As companies scale AI deployments, this kind of infrastructure becomes essential for meeting user expectations without overwhelming backend systems.
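To make that pattern concrete, here is a minimal sketch of a latency-budgeted pipeline, assuming sequential stages with a hypothetical 50 ms budget each; the stage names, timings, and fallback behavior are illustrative, not a description of any specific production system.

```python
# Sketch: several model stages run under a strict per-stage latency budget,
# degrading to the unranked input when a stage misses its deadline.
# All names and timings are hypothetical.
import asyncio
import random

STAGE_BUDGET_S = 0.050  # assumed 50 ms budget per stage


async def call_model(stage: str, items: list) -> list:
    """Stand-in for a remote model call; sleeps to simulate inference."""
    await asyncio.sleep(random.uniform(0.005, 0.080))
    return sorted(items)  # pretend the model reordered the candidates


async def stage_with_budget(stage: str, items: list) -> list:
    """Run one stage, falling back to its input on timeout."""
    try:
        return await asyncio.wait_for(call_model(stage, items), STAGE_BUDGET_S)
    except asyncio.TimeoutError:
        return items  # degrade gracefully instead of stalling the request


async def handle_request(candidates: list) -> list:
    # Stages run sequentially because each consumes the previous output;
    # independent stages could instead be fanned out with asyncio.gather.
    for stage in ("ranking", "filtering", "reranking", "personalization"):
        candidates = await stage_with_budget(stage, candidates)
    return candidates


print(asyncio.run(handle_request(["item-b", "item-a", "item-c"])))
```

The key design choice is that a missed deadline costs a little quality, not the whole response, so the end-to-end budget holds even when one stage misbehaves.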
Behl has also explored these production realities through thought leadership, notably in his HackerNoon article, Why High-Performance AI/ML is Essential in Modern Cybersecurity. In it, he draws parallels between the need for low-latency AI in security environments and the broader demands of real-time AI applications, reinforcing how infrastructure readiness is as critical as model sophistication.
Latency Is the Business Metric AI Can’t Ignore
Of all the infrastructure concerns, latency is the most directly tied to business performance. Small improvements in response time, on the order of milliseconds, can dramatically impact engagement and revenue. In ad tech and e-commerce, this relationship has long been established: more than a decade ago, Amazon found that every 100 milliseconds of latency could reduce sales by 1%. Today’s users are even less patient.
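To see why those milliseconds matter, here is a back-of-the-envelope calculation using the 1%-per-100 ms relationship cited above; the revenue figure and the size of the latency regression are assumptions chosen purely for illustration.

```python
# Hypothetical numbers, for illustration only.
annual_revenue = 500_000_000   # assumed $500M/year platform
added_latency_ms = 250         # assumed regression from a heavier model
loss_rate_per_100ms = 0.01     # the ~1% per 100 ms relationship cited above

estimated_loss = annual_revenue * loss_rate_per_100ms * (added_latency_ms / 100)
print(f"~${estimated_loss:,.0f} in lost sales per year")  # ~$12,500,000
```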
In high-demand environments, delivering machine learning-driven recommendations at scale requires infrastructure that can handle thousands of inference calls per second. These systems must support rapid personalization without adding perceptible latency, particularly for platforms serving small businesses or dynamic marketplaces. The challenge lies in maintaining accuracy and responsiveness even as demand spikes, which makes it critical to design infrastructure that is both resilient and performance-optimized from the start.
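One widely used technique for that throughput problem is dynamic micro-batching, where requests wait a few milliseconds so a single batched model call can serve many of them. The simplified sketch below uses Python’s asyncio; the batch size, wait window, and function names are assumptions, not any particular serving framework’s API.

```python
# Sketch: a dynamic micro-batcher that trades a few milliseconds of queueing
# delay for much higher inference throughput. Sizes and timings are
# illustrative assumptions, not tuned values.
import asyncio

MAX_BATCH = 32        # assumed batch-size cap
MAX_WAIT_S = 0.005    # wait at most 5 ms to fill a batch


async def batch_infer(inputs):
    """Stand-in for one batched model call (amortizes per-call overhead)."""
    await asyncio.sleep(0.010)
    return [f"score({x})" for x in inputs]


async def batcher(queue):
    while True:
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        # Keep collecting requests until the batch fills or the window closes.
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = await batch_infer([inp for inp, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)  # wake each waiting caller with its result


async def predict(queue, user_input):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((user_input, fut))
    return await fut


async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(predict(queue, f"user-{i}") for i in range(8))))
    task.cancel()


asyncio.run(main())
```

The trade-off is explicit: each request pays up to MAX_WAIT_S of queueing delay in exchange for far better hardware utilization, which is why serving systems tune those two knobs against the overall latency budget.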
In generative AI, latency takes on a different dimension. A clever image caption or chatbot response loses value if it takes too long to appear. AI that feels slow, says Behl, is indistinguishable from AI that doesn’t work.
And that’s where many teams are getting tripped up. They’re investing in ever-larger models while neglecting the systems that make those models usable. “We’re still in the early days of figuring out how to deploy AI,” he says. “Most companies are just beginning to hit the operational debt that comes with putting these models into production.”
Systems Engineering Is the Real Frontier
The idea that models are the heart of AI is fading. Real-world AI is a performance-sensitive, infrastructure-intensive product, and its success depends far more on systems maturity than on model novelty. Companies that treat it that way will be better positioned to compete, especially as the market becomes increasingly driven by the data center advantage.
That means building infrastructure teams with the same seriousness usually reserved for research labs. And it means accepting that deployment, not development, is the hard part.
“If you want AI to work in the real world,” Behl says, “you have to build for the real world. And the real world moves faster and breaks more things than anything a lab environment prepares you for.”