distributed-systems

DEV Community

You can do WHAT with a Kafka proxy?

Stéphane Derosiaux

3h ago

At Current 2026, I realized that nobody knows exactly what a Kafka proxy can do. Most engineers and architects think it's just a reverse proxy for Kafka (think nginx) that handles routing and bridges legacy or non-native clients to the cluster. That's not it. It's barely the start of it. Encryption: For instance, an engineer at a UK building society had a hard requirement: encrypt pers…

computer-science, distributed-systems

DEV Community

The Disconnected Edge: How We Solved In-Flight Data Sync at 35,000 Feet

Shubham

1d ago

When most engineers think about rolling out a modern streaming or web application, they visualize a standard cloud-native environment: a global CDN, elastic load balancers, and a continuous pipeline pushing updates to infinite resources. But what happens when your deployment target is an isolated, battery-powered hardware device flying inside a metal tube at 35,000 feet? At AirFi, operating a ne…

computer-science, distributed-systems

DEV Community

Boosting Observability in NestJS with RedisX Metrics

Suren Krmoian

2d ago

Observability isn't just a buzzword; it's a necessity, especially when diving into distributed systems. If you're using NestJS, you might want to take a look at RedisX. It's a modular toolkit that can boost the observability of your applications. A standout feature? The Metrics Plugin. It meshes well with Prometheus, delivering insights into Redis operations in your NestJS setup. Getting RedisX M…

computer-science, distributed-systems, software-engineering

DEV Community

The 7 People Who Control The Internet Clock

Sam Chen

3d ago

A Deep‑Dive Companion to The Pattern Episode. Welcome back, fellow engineers and curious minds. I’m The Systems Analyst, and after you’ve listened to the latest The Pattern episode, I wanted to give you a tangible, on‑the‑ground look at the invisible heartbeat that keeps everything from your phone’s alarm to high‑frequency trading platforms humming in…

computer-science, distributed-systems

Capgemini

Insights from the field: Lessons from real world distributed Cloud deployments

sharmisthanaskar

5d ago

As distributed cloud adoption accelerates, many organizations find themselves stuck between experimentation and scale.

cloud-computing, computer-science, distributed-systems, technology

DEV Community

GPU autoscaling on Kubernetes with KEDA: building an external scaler with NVML

Bruno Santos

7d ago

If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler. The problem with default autoscalin…

cloud-computing, computer-science, distributed-systems

DEV Community

Consensus Protocols in Distributed Systems

Sai Chakradhar Rao Mahendrakar

8d ago

A Complete Learning Guide: Intermediate to Advanced. Table of Contents: Foundation (What is Consensus and Why Is It Hard?); Core Concepts & Terminology; Paxos (The Classic Protocol); Raft (The Understandable Protocol); Byzantine Fault Tolerance (BFT); Other Notable Protocols; Real-World Systems; Scalability & Performance Considerations; Trade-Off Analysis; Design …

computer-science, distributed-systems

DEV Community

Why Squirix uses a strict client/server architecture for a .NET distributed cache

Alex E

8d ago

Squirix 0.1.0 is an early preview of a .NET distributed cache. A typed client SDK talks to a remote server over gRPC; the server owns state, routing, durability, and operational endpoints. This is the direction I am validating in 0.1.0 — not a claim that every cache must work this way. Embedded designs are fine for many workloads. Squirix targets a different shape: the application stays a client;…

computer-science, distributed-systems

DEV Community

TrueTime: Bounding Clock Uncertainty

rishabh pahwa

8d ago

Your typical clock synchronization protocol like NTP provides a timestamp, but it can't guarantee that event A truly happened before event B if they occurred on different machines. Spanner's TrueTime solves this by providing time as an interval, not a point, ensuring global serializability even across continents. When your distributed system relies on timestamps from different servers, you're bui…

computer-science, distributed-systems

DEV Community

Learning, Experimenting - Concurrency in Go - Part 2

Manish

8d ago

Refresher: I'm building a distributed chunked filestore in Go, and I set up a post for Part 1 here. That part dealt with uploading a file; this post is about downloads. Setup Requirements: User hits our endpoint with the filename/fileid. We use this fileid to get a list of chunks. Our retrieve mechanism only depends on this list of chunks. We want to be able to retrieve the associated chunks in par…

computer-science, distributed-systems, programming-languages

DEV Community

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global I

Rizwan Saleem

12d ago

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global IoT. Edge computing is not just about pushing logic to the far end; it’s about orchestrating a cohesive flow where data is ingested, processed, and acted upon with millisecond latency, while preservin…

computer-science, distributed-systems

PhilPapers: Recent additions to PhilArchive

Bitla, Narender ; Deshpande, Akshay ; Dulam, Murali Shankar & Saha, Sumit: Cross-Cloud Performance Benchmarking and Optimization

12d ago

_Cross-Cloud Systems Measurement Report_. 2021. Public cloud providers expose similar high-level resources but differ in processor generations, storage paths, network locality, virtualization overhead, accelerator availability, and pricing rules. These differences make direct comparison difficult for teams that operate analytics, web services, and machine-learning pipelines across providers. This p…

computer-science, distributed-systems

PhilPapers: Recent additions to PhilArchive

Annamali Sekar, Mythili ; Dulam, Murali Shankar ; Mazumder, Abhirup & Kannan, Kabilan: Governing Distributed Systems with Intelligent Agents

12d ago

_Autonomic Distributed Systems Governance Bulletin_. 2021. Large distributed systems are now operated through layers of schedulers, container controllers, service meshes, monitoring pipelines, and human runbooks. These mechanisms improve scale, but they also create governance problems: local controllers can fight one another, remediation rules may violate service-level or compliance constraints, an…

computer-science, distributed-systems

DEV Community

High-performance AI agents are distributed systems

Kirti Rathore

12d ago

"Codex took 6 hours to implement this seemingly simple refactor". "I think Research mode on Perplexity is stuck." We all know LLM APIs are slow, and are content with staring at a spinner while the model slowly emits tokens. But what happens when you're building AI agents that need to be low latency? We hit this while building FixBugs, an AI debugging agent that reads bug reports, logs, code, scr…

ai, computer-science, distributed-systems, machine-learning

Hacker News

thunderbolt-ibverbs: We have InfiniBand at home

13d ago

I spent the past few weeks building a Linux kernel module that makes ordinary USB4/Thunderbolt ports on AMD mini PCs pretend to be InfiniBand devices. The goal is simple: let existing AI runtimes like vLLM/RCCL split inference or training across multiple boxes at home, without buying enterprise networking gear. TL;DR. We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two…

ai, computer-science, distributed-systems, machine-learning

DEV Community

Where Tensor-Parallel Inference Hits the NVLink Wall

member_2e5ba30f

15d ago

2026-05-31 · GPU / distributed systems. Tensor parallelism splits each layer across GPUs, so every forward pass pays for an all-reduce over the network fabric. On a single node that fabric is NVLink/NVSwitch, and how close you get to its theoretical budget decides whether TP helps or hurts. This post measures it on 4× H100 and explains where th…

computer-science, distributed-systems

DEV Community

Building a Reproducible Offline-First Data Sync Engine for Edge Analytics

Rizwan Saleem

15d ago

In modern analytics, reliability and speed matter as much as correctness. I recently led a project to design and ship an offline-first data synchronization engine that enables edge devices to collect, process, and reconcile analytics data even when th…

computer-science, distributed-systems

DEV Community

Surviving Global Vendor Outages: Federated Cellular Architecture with EKS, AKS, and Istio

Cláudio Filipe Lima Rapôso

15d ago

Monolithic multi-region architectures inherently rely on vendor-specific global control planes. When a catastrophic degradation strikes an underlying identity service or networking fabric within a single cloud provider, all regional partitions fail concurrently. Relying exclusively on Amazon Web Services (AWS) or Microsoft Azure caps the maximum theoretical availability of a platform to the opera…

cloud-computing, computer-science, distributed-systems

DEV Community

Building a DevOps Incident Investigator with Coral SQL — From 15 Minutes to 15 Seconds

Khadirullah Mohammad

15d ago

🏴‍☠️ Built for the Pirates of the Coral-bean hackathon by WeMakeDevs | May 25–31, 2026. TL;DR: Built a DevOps Incident Investigator using Coral SQL that correlates GitHub PRs, Sentry incidents, and Slack incident context using a single SQL query. Coral turns operational debugging into a single SQL query across distributed systems. Results: 📉 Incident triage reduced from ~15 minutes to ~15 seconds…

computer-science, distributed-systems, software-engineering

DEV Community

When Two Containers on the Same Host Are Shouting Through a Load Balancer

Samar Prakash

16d ago

Building a Unix-Domain-Socket IPC server for ECS-on-EC2 services that need to talk fast, cheap, and reliably. A while back I was looking at a flamegraph of a service that, on paper, should not have been having any performance problems. The producer and the consumer were the same Docker image's worth of trouble, colocated on the same EC2 host, in the same ECS cluster, sharing the same instance typ…

computer-science, distributed-systems
