DigitalOcean Blog

At DigitalOcean, we’re committed to providing high-performance infrastructure for the next generation of AI, which is why we’ve been focused on hosting frontier Large Language Models (LLMs) on frontier GPUs, including AMD GPUs. We see inference performance as an intricate systems-level challenge. For frontier open-weight models, achieving peak output speed is not just about the raw hardware. It a…

ai · deep-learning · machine-learning · nlp

Earlier this year, we needed to hire a cohort of engineers in Seattle, fast. We had a product launching at our marquee conference, Deploy, a hard deadline, and a clear picture of what the work would actually require. What we didn’t want was an interview process designed for a world that no longer exists. So we rebuilt it from scratch and opened a brand-new office in Bellevue for everyone we hire…

Most teams running inference at scale do not fail because they cannot find a “good” model. They fail because they ship a routing policy that looks fine in a playground, but drifts the moment it sees real prompts, real latency tails, and real per-token cost. The routing policy breaks on the prompts you never tested and your users find out before you do. Now you can use Model Evaluations, available…

ai · machine-learning · nlp

Deploy 2026 came and went, and we’re still buzzing. For one day at Convene 100 Stockton in San Francisco, developers, startup founders, customers, and partners filled the room to talk about a shared challenge: how to build and scale AI products without unnecessary complexity. Conversations moved from infrastructure to inference costs, production workloads, vector databases, and what teams actuall…

ai · cloud-computing · infrastructure · machine-learning

Building an AI-native application requires a data layer that can do two things at once: handle the structured, transactional queries your application runs on, and understand meaning well enough to power semantic search across unstructured content. An AI application needs both: precise SQL for account balances and transaction records, and vector search to surface conceptually related patterns, an…

ai · machine-learning · nlp

The growth of generative AI isn’t driven solely by AI companies with proprietary models. Open-source AI is reshaping the developer ecosystem, fueled by a growing community of builders. But what does it take to go from open models to production-ready agentic AI, and what do developers need to know to get there? This question was the focus of the DigitalOcean Deploy session, “Open by Design: How NV…

ai · generative-ai · infrastructure · machine-learning

Introduction
Inference demand is growing fast, and it’s only accelerating. By 2030, inference is expected to account for the majority of AI compute globally. But scaling inference isn’t just a hardware problem. Most teams discover too late that a significant portion of their compute spend is avoidable, primarily because their systems are silently repeating work they have already done, recomputing…

ai · machine-learning

The Problem: Inference Gets Hard at Scale
If you’ve shipped an AI feature to production, you already know: the hard part isn’t making a model respond to a prompt. The hard part is making it respond reliably, at scale, across multiple models, without burning through your budget. The moment real users show up, you’re dealing with GPU resource contention, traffic unpredictability (a single ente…

ai · machine-learning

Getting your hands on a capable AI model is the easy part now. Every team can reach the same frontier models through an API, so a strong model is not what sets a product apart. What separates a working product from a demo is everything around the model. You have to measure whether the agent is actually doing its job, then keep grinding on reliability until it stops making expensive mistakes in fr…

ai · generative-ai · machine-learning

Coding agents today have a massive spending problem. Every request, whether you’re designing system architecture or writing a single-line docstring, often gets routed to the same expensive frontier model. The result: unnecessary token usage, higher inference costs, and little awareness of task complexity or budget constraints. This high cost stems from a “one-size-fits-all” approach to model usag…

ai · machine-learning · nlp

At Deploy 2026, we introduced the DigitalOcean AI-Native Cloud, built for the inference era. Batch Inference on the DigitalOcean Inference Engine enables high-volume asynchronous workloads. As developers move from AI prototypes to production-scale applications, cost and rate limits often become bottlenecks. Batch Inference addresses these hurdles by allowing you to process high…

ai · machine-learning · nlp

Traffic doesn’t spike on a schedule. A product launch, a viral moment, or a flash sale can send request volume through the roof in seconds, long before your CPU metrics catch up. That gap is where performance suffers. Today, we’re excited to announce that request-based autoscaling on DigitalOcean App Platform is now generally available. Your apps can now automatically scale based on live HTTP tra…

Most teams building on LLMs today make a single model decision and apply it uniformly across every request. They reach for a frontier model not because every task demands it, but because building the infrastructure to do anything smarter is hard, time-consuming, and easy to get wrong. When the tooling isn’t there, the path of least resistance is to use a single model, even if it means that you en…

ai · autonomous-systems · machine-learning · nlp

Everyone calling an LLM API has access to the same models. So what actually sets technical teams apart? It’s everything around the model: the routing logic, the live data pipelines, and the ability to scale from prototype to production without ever rewriting your code. Which LLM tops a benchmark matters less than what becomes possible when infrastructure stops being an afterthought, when one …

ai · cloud-computing · computer-science · machine-learning · nlp
Vinay Kumar · Chief Product & Technology Officer
5/5/2026

I’ve spent the last fifteen years building cloud services: in the early days of AWS, building S3 and EBS; helping launch Oracle Cloud Infrastructure from inception; and now building the agentic cloud at DigitalOcean for AI-natives. Every cloud I’ve worked on was designed for the workloads of its era. Those clouds were built for human-centric SaaS applications: a few users, a handful of requests per sessi…

ai · cloud-computing · machine-learning · technology

Welcome to What’s New on the DigitalOcean Inference Engine, your weekly roundup of the latest inference updates at DigitalOcean.
Week of April 27
OpenAI’s GPT-5.5 is now available across DigitalOcean’s inference cloud products, bringing a new level of autonomous, agent-like intelligence to production AI workflows. Designed to go beyond single-prompt responses, GPT-5.5 can plan, reason, and execute…

ai · machine-learning · nlp

The AI industry has a compounding bottleneck, and it isn’t the models. It’s inference. What used to be a single model call has become a system of continuous interaction. Applications now orchestrate multiple models, retrieve and synthesize data, execute tools, and repeat this cycle in production. These are no longer stateless requests. They are dynamic systems that behave more like infrastructure…

ai · deep-learning · machine-learning

Today at Deploy, we are announcing the general availability of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B on DigitalOcean Serverless Inference. On DeepSeek V3.2 and Qwen 3.5 397B, we deliver the #1 output speed among all providers tested by Artificial Analysis. On DeepSeek V3.2 specifically, that translates to 230 output tokens per second and sub-second Time-to-First-Token (TTFT) for 10,000 inpu…

ai · deep-learning · machine-learning

Getting a model to answer 10 inference requests concurrently is tricky but manageable; getting it to handle 2,000 engineers hitting a coding assistant with long contexts, all day, without runaway costs, is where teams stall. A working endpoint is only the beginning. Teams need to identify the supporting hardware and wire up the right components: serving, scaling, observability, and cost guardra…

ai · deep-learning · machine-learning

In large-scale cloud environments, unpredictable hypervisor crashes carry real operational cost. Traditional reactive monitoring, which relies on static thresholds and post-hoc alerts, was once the industry standard, but it misses the non-linear, stochastic signals that precede hardware failure. In an era where high availability is the norm, the transition from reactive observation …

ai · cloud-computing · machine-learning · technology