distributed-computing

DEV Community

The 7 People Who Control The Internet Clock

Sam Chen

5h ago

A Deep‑Dive Companion to The Pattern Episode. Welcome back, fellow engineers and curious minds. I’m The Systems Analyst, and after you’ve listened to the latest The Pattern episode, I wanted to give you a tangible, on‑the‑ground look at the invisible heartbeat that keeps everything from your phone’s alarm to high‑frequency trading platforms humming in…

computer-science, distributed-systems

Capgemini

Insights from the field: Lessons from real world distributed Cloud deployments

sharmisthanaskar

2d ago

As distributed cloud adoption accelerates, many organizations find themselves stuck between experimentation and scale.

cloud-computing, computer-science, distributed-systems, technology

DEV Community

GPU autoscaling on Kubernetes with KEDA: building an external scaler with NVML

Bruno Santos

4d ago

If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler. The problem with default autoscalin…

cloud-computing, computer-science, distributed-systems

DEV Community

Consensus Protocols in Distributed Systems

Sai Chakradhar Rao Mahendrakar

5d ago

A Complete Learning Guide: Intermediate to Advanced. Table of Contents: Foundation (What is Consensus and Why Is It Hard?); Core Concepts & Terminology; Paxos (The Classic Protocol); Raft (The Understandable Protocol); Byzantine Fault Tolerance (BFT); Other Notable Protocols; Real-World Systems; Scalability & Performance Considerations; Trade-Off Analysis; Design …

computer-science, distributed-systems

DEV Community

Why Squirix uses a strict client/server architecture for a .NET distributed cache

Alex E

5d ago

Squirix 0.1.0 is an early preview of a .NET distributed cache. A typed client SDK talks to a remote server over gRPC; the server owns state, routing, durability, and operational endpoints. This is the direction I am validating in 0.1.0 — not a claim that every cache must work this way. Embedded designs are fine for many workloads. Squirix targets a different shape: the application stays a client;…

computer-science, distributed-systems

DEV Community

TrueTime: Bounding Clock Uncertainty

rishabh pahwa

5d ago

Your typical clock synchronization protocol like NTP provides a timestamp, but it can't guarantee that event A truly happened before event B if they occurred on different machines. Spanner's TrueTime solves this by providing time as an interval, not a point, ensuring global serializability even across continents. When your distributed system relies on timestamps from different servers, you're bui…

computer-science, distributed-systems

DEV Community

Learning, Experimenting - Concurrency in Go - Part 2

Manish

5d ago

Refresher: I'm building a distributed chunked filestore in Go, and I set up a post for Part 1 here. That part dealt with uploading a file; this post is about downloads. Setup Requirements: the user hits our endpoint with the filename/fileid; we use this fileid to get a list of chunks; our retrieve mechanism depends only on this list of chunks; we want to be able to retrieve the associated chunks in par…

computer-science, distributed-systems, programming-languages

DEV Community

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global I

Rizwan Saleem

9d ago

Edge computing is not just about pushing logic to the far end; it’s about orchestrating a cohesive flow where data is ingested, processed, and acted upon with millisecond latency, while preservin…

computer-science, distributed-systems

PhilPapers: Recent additions to PhilArchive

Bitla, Narender ; Deshpande, Akshay ; Dulam, Murali Shankar & Saha, Sumit: Cross-Cloud Performance Benchmarking and Optimization

9d ago

_Cross-Cloud Systems Measurement Report_. 2021. Public cloud providers expose similar high-level resources but differ in processor generations, storage paths, network locality, virtualization overhead, accelerator availability, and pricing rules. These differences make direct comparison difficult for teams that operate analytics, web services, and machine-learning pipelines across providers. This p…

computer-science, distributed-systems

PhilPapers: Recent additions to PhilArchive

Annamali Sekar, Mythili ; Dulam, Murali Shankar ; Mazumder, Abhirup & Kannan, Kabilan: Governing Distributed Systems with Intelligent Agents

9d ago

_Autonomic Distributed Systems Governance Bulletin_. 2021. Large distributed systems are now operated through layers of schedulers, container controllers, service meshes, monitoring pipelines, and human runbooks. These mechanisms improve scale, but they also create governance problems: local controllers can fight one another, remediation rules may violate service-level or compliance constraints, an…

computer-science, distributed-systems

DEV Community

High-performance AI agents are distributed systems

Kirti Rathore

10d ago

"Codex took 6 hours to implement this seemingly simple refactor". "I think Research mode on Perplexity is stuck." We all know LLM APIs are slow, and are content with staring at a spinner while the model slowly emits tokens. But what happens when you're building AI agents that need to be low latency? We hit this while building FixBugs, an AI debugging agent that reads bug reports, logs, code, scr…

ai, computer-science, distributed-systems, machine-learning

Hacker News

thunderbolt-ibverbs: We have InfiniBand at home

10d ago

I spent the past few weeks building a Linux kernel module that makes ordinary USB4/Thunderbolt ports on AMD mini PCs pretend to be InfiniBand devices. The goal is simple: let existing AI runtimes like vLLM/RCCL split inference or training across multiple boxes at home, without buying enterprise networking gear. TL;DR. We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two…

ai, computer-science, distributed-systems, machine-learning

DEV Community

Where Tensor-Parallel Inference Hits the NVLink Wall

member_2e5ba30f

12d ago

2026-05-31 · GPU / distributed systems. Tensor parallelism splits each layer across GPUs, so every forward pass pays for an all-reduce over the network fabric. On a single node that fabric is NVLink/NVSwitch, and how close you get to its theoretical budget decides whether TP helps or hurts. This post measures it on 4× H100 and explains where th…

computer-science, distributed-systems

DEV Community

Building a Reproducible Offline-First Data Sync Engine for Edge Analytics

Rizwan Saleem

12d ago

In modern analytics, reliability and speed matter as much as correctness. I recently led a project to design and ship an offline-first data synchronization engine that enables edge devices to collect, process, and reconcile analytics data even when th…

computer-science, distributed-systems

DEV Community

Surviving Global Vendor Outages: Federated Cellular Architecture with EKS, AKS, and Istio

Cláudio Filipe Lima Rapôso

12d ago

Monolithic multi-region architectures inherently rely on vendor-specific global control planes. When a catastrophic degradation strikes an underlying identity service or networking fabric within a single cloud provider, all regional partitions fail concurrently. Relying exclusively on Amazon Web Services (AWS) or Microsoft Azure caps the maximum theoretical availability of a platform to the opera…

cloud-computing, computer-science, distributed-systems

DEV Community

Building a DevOps Incident Investigator with Coral SQL — From 15 Minutes to 15 Seconds

Khadirullah Mohammad

12d ago

🏴‍☠️ Built for the Pirates of the Coral-bean hackathon by WeMakeDevs | May 25–31, 2026. TL;DR: Built a DevOps Incident Investigator using Coral SQL that correlates GitHub PRs, Sentry incidents, and Slack incident context using a single SQL query. Coral turns operational debugging into a single SQL query across distributed systems. Results: 📉 Incident triage reduced from ~15 minutes to ~15 seconds…

computer-science, distributed-systems, software-engineering

DEV Community

When Two Containers on the Same Host Are Shouting Through a Load Balancer

Samar Prakash

13d ago

Building a Unix-Domain-Socket IPC server for ECS-on-EC2 services that need to talk fast, cheap, and reliably. A while back I was looking at a flamegraph of a service that, on paper, should not have been having any performance problems. The producer and the consumer were the same Docker image's worth of trouble: colocated on the same EC2 host, in the same ECS cluster, sharing the same instance typ…

computer-science, distributed-systems

UC Davis Computer Architecture

CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST

Jason Lowe-Power (jlowepower@ucdavis.edu)

13d ago

Large-scale AI training and inference require hundreds of gigabytes to terabytes of DRAM with high peak-to-average utilization ratios, resulting in overprovisioning. In cloud computing, DRAM constitutes a significant share of the cost. Yet, as shown by recent articles, DRAM is heavily underutilized. Memory disaggregation is a solution to both these problems. With the advent of the CXL protocol, …

cloud-computing, computer-science, distributed-systems, technology

DEV Community

When the Event Log Became a Liability: What Happened When We Treated Events Like Garbage

Lillian Dube

13d ago

The Problem We Were Actually Solving: The Treasure Hunt Engine is a multiplayer game where players dig for virtual gems, craft tools, and compete on leaderboards. Every action produces an event: dig_started, gem_found, score_updated, inventory_cleared. We needed a system that could ingest, deduplicate, and propagate these events to every player in under 200 ms while guaranteeing no double-counting…

computer-science, distributed-systems

DEV Community

Event-Driven Architectures with Apache Kafka: Powering the Next Generation of Banking Transformation Through Agentic AI and Real-Time Analytics

Anil Mandloi

13d ago

It is no secret that in today's fast-moving banking industry, customers require immediate replies, the regulatory environment mandates strict compliance, and there is always an element of uncertainty about market conditions, which change every single moment. In such conditions, traditional request/response architectures appear to be obsolete. Indeed, today banks have access to huge amounts of dat…

ai, computer-science, distributed-systems, machine-learning
