distributed-systems

DEV Community
Carlos Talavera
18h ago

“Distributed systems” is a commonly used concept today. Perhaps the first time you read it, it sounds daunting, and while there are plenty of challenges, the concept itself is simple and it might even give you more clarity when it comes to building this kind of system. Let’s start from… the beginning. When software went from being a possibility to being a reality (a virtual one, of course), things…

computer-science, distributed-systems
DEV Community

DynamoDB Global Tables replicate data across regions in seconds, but replication is still asynchronous. That means a simple read from a replica region can occasionally return stale data, which is acceptable in most applications, as the user doesn’t require the latest available data all the time. In some systems, however, stale reads can break important processes and the stability of a platform. So the quest…

cloud-computing, computer-science, distributed-systems
DEV Community

Introduction Picture two doctors updating the same patient record at the same time - one in São Paulo, the other in London. Both are offline. When connectivity returns, whose changes prevail? This is not a hypothetical. It is the everyday reality of distributed systems: multiple nodes, no shared clock, no guaranteed network. The conventional answer has long been locking - one node waits while ano…

computer-science, distributed-systems
DEV Community

This is not a journey of starting over. For most of my life, I believed change meant restarting—health, career, finances, relationships. Every time, it began with motivation and intensity. And almost every time, it faded within weeks. Then the cycle would repeat. Not this time. I’m not restarting anything. I’m continuing from exactly where I am, even if it’s messy, inconsistent, or incomplete. Th…

computer-science, distributed-systems
DEV Community

After 18 months of battling 2.1-second p99 message processing latency with RabbitMQ 3.12, our team migrated to Kafka 4.0 and slashed that metric by 55% to 945ms, while reducing infrastructure costs by 32% in the first quarter post-migration. This isn't a marketing pitch—it's a data-backed breakdown of why we made the switch, how we executed the migration without downtime, and the exact benchmarks…

computer-science, distributed-systems
DEV Community

The transition from isolated conversational models to collaborative artificial intelligence swarms requires a fundamental shift in how machines transmit data. When developers design multi-agent systems, they must select a transport mechanism that supports autonomous, low-latency, and reliable data exchange. While the Model Context Protocol standardizes an agent's local tool access and Agent-to-Ag…

ai, distributed-systems, machine-learning
DEV Community

TL;DR Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two 3-node Lambda Cloud clusters: 3x A100 SXM4 (x86_64) and 3x GH200 (aarch64, 64k pages, Grace kernel 6.8.0-1013-nvidia-64k). Same Fleet + agent + straggler-sink stack on both. One straggler per cluster, injected by removing the matmul workload from one node. A100 GH200 Region us-east-1 us-east-3 Kernel 6.8.0-60…

computer-science, distributed-systems
DEV Community

AI agents are distributed systems. They fan out across LLM calls, tool invocations, memory lookups, and multi-step reasoning loops — often asynchronously. But until recently, the observability tooling hadn't caught up. You'd get logs, maybe a dashboard, but no trace of what actually happened across a full agent run. That's the gap Jaeger v2 is positioned to close — and it's not a stretch. What ac…

ai, distributed-systems, machine-learning
DEV Community

From HTTP Chaos to Kafka: How We Fixed Inter-Service Communication in a NestJS Microservices Platform A technical deep-dive into replacing synchronous HTTP calls with Kafka-based async messaging — covering architecture decisions, NestJS implementation, Redis caching, and BullMQ for background processing. The Problem: Synchronous HTTP in a Distributed System On a production NestJS microservices pl…

computer-science, distributed-systems
Metadata
Murat (noreply@blogger.com)
9d ago

Continuing with notes from the BugBash talks. Yes, all of this goodness, including Will Wilson's keynote, was before lunch the first day. Where all the ladders start Peter Alvaro, Associate Professor of Computer Science @ UC Santa Cruz In this talk, Peter reflects back on his 20 years of distributed systems work. The cover image is Don Quixote (which is Peter) attacking the windmill (robust dist…

computer-science, distributed-systems
DEV Community

Design Netflix/YouTube: A Complete System Design Interview Walkthrough You're sitting across from your interviewer, and they drop the question: "Design a video streaming platform like Netflix or YouTube." Your heart rate spikes. This isn't just about storing videos and serving them up. You're looking at one of the most complex distributed systems on the planet, handling billions of hours of conte…

computer-science, distributed-systems
Hacker News
Claudio Basile; Kat Ko; Ben Wilson; Lee Howes; Bill Jia; Joe Pamer; Michael Voznesensky; Robert Hundt
11d ago

The challenges of building for modern AI infrastructure have fundamentally shifted. The modern frontier of machine learning now requires leveraging distributed systems, spanning thousands of accelerators. As models scale to run on clusters of O(100,000) chips, the software that powers these models must meet new demands for performance, hardware portability, and reliability. At Google, our Tensor …

ai, computer-science, distributed-systems, machine-learning
DEV Community

Raft solves the hardest problem in distributed systems: keeping replicas synchronized while nodes fail. What We're Building We are dissecting the Raft consensus protocol to understand how a cluster maintains a single source of truth. Unlike Paxos, Raft is designed to be human-readable and easier to implement correctly. Our scope is not building a complete key-value store, but modeling the core st…

computer-science, distributed-systems
DEV Community

Building distributed systems in Python? Here is how python-cqrs tackles consistency with orchestrated sagas, the mediator pattern, and a transactional outbox—without preaching theory for ten pages first. TL;DR Commands and queries stay in plain handlers: nothing in the handler depends on HTTP, Kafka, or CLI. Sagas: persisted state, automatic compensation, recovery after crashes; see the docs for …

computer-science, distributed-systems
DEV Community

If you put two sidecars in a pod and ask them to talk to each other over HTTP, sooner or later one of them crashes mid-request and you lose a message. If you do it enough times, you reinvent a message bus. This post is about the small in-pod message bus we ended up writing for k8s4claw, a Kubernetes operator for AI agent runtimes. The bus sits between channel sidecars (Slack, Discord, Webhook) a…

computer-science, distributed-systems, software-engineering
Cryptology and Data Security
Cryptology and Data Security Research Group
14d ago

David Lehnherr has successfully defended his Ph.D. thesis on 8 December 2025; the thesis is titled “Simplicial Structures for Epistemic Reasoning in Multi-agent Systems”. As the title reveals, this work is truly interdisciplinary and relates to logic and to distributed computing, spanning the fields between the research groups on Logic and Theory and Distributed Computing and Cryptography.

computer-science, distributed-systems, logic
Cryptology and Data Security
Cryptology and Data Security Research Group
14d ago

The asymmetric trust model lets each participant in a distributed system make its own trust assumptions about others, captured by an asymmetric quorum system. This contrasts with ordinary, symmetric quorum systems and threshold models, where trust assumptions are uniformly shared among participants. Fundamental problems like reliable broadcast and consensus are unsolvable in the asymmetric model …

computer-science, distributed-systems
DEV Community

In distributed systems, what we actually want is not “the correct time.” We want two things: Ordering: don’t lose which came first. Replay: later, explain the same decision with the same grounds and procedure. But in real systems we casually lean on created_at (wall clock). And then everything breaks: NTP adjustments / VM migration / suspend-resume makes clocks jump backward or forward Promethe…

computer-science, distributed-systems
DEV Community

Introduction I was debugging Istio routing the other day, and honestly, I had a moment where I felt a bit "creeped out." You tweak a VirtualService YAML file, hit kubectl apply, and within seconds, the routing rules across hundreds of Envoy proxies scattered throughout the cluster switch over perfectly. There's no process restart. You aren't running nginx -s reload. Rolling out configuration ch…

computer-science, distributed-systems
DEV Community

With over two billion active users, WhatsApp is a masterclass in distributed systems engineering. To achieve massive concurrency and near-zero latency, the platform relies on a specialized stack that balances legacy reliability with bleeding-edge innovation. As a systems architect, I’ve analyzed the 24 essential components that allow this ecosystem to function seamlessly. 1. The Backend (Server) …

computer-science, distributed-systems, software-engineering