distributed-computing

DEV Community

The 7 People Who Control the Internet Clock – A Deep‑Dive Companion to The Pattern Episode. Welcome back, fellow engineers and curious minds. I’m The Systems Analyst, and after you’ve listened to the latest The Pattern episode, I wanted to give you a tangible, on‑the‑ground look at the invisible heartbeat that keeps everything from your phone’s alarm to high‑frequency trading platforms humming in…

computer-science, distributed-systems
Capgemini

As distributed cloud adoption accelerates, many organizations find themselves stuck between experimentation and scale. The post "Insights from the field: Lessons from real world distributed Cloud deployments" appeared first on Capgemini.

cloud-computing, computer-science, distributed-systems, technology
DEV Community

If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler. The problem with default autoscalin…

cloud-computing, computer-science, distributed-systems
DEV Community
Sai Chakradhar Rao Mahendrakar
5d ago

Consensus Protocols in Distributed Systems: A Complete Learning Guide — Intermediate to Advanced. Table of Contents: Foundation — What is Consensus and Why Is It Hard?; Core Concepts & Terminology; Paxos — The Classic Protocol; Raft — The Understandable Protocol; Byzantine Fault Tolerance (BFT); Other Notable Protocols; Real-World Systems; Scalability & Performance Considerations; Trade-Off Analysis; Design …

computer-science, distributed-systems
DEV Community

Squirix 0.1.0 is an early preview of a .NET distributed cache. A typed client SDK talks to a remote server over gRPC; the server owns state, routing, durability, and operational endpoints. This is the direction I am validating in 0.1.0 — not a claim that every cache must work this way. Embedded designs are fine for many workloads. Squirix targets a different shape: the application stays a client;…

computer-science, distributed-systems
DEV Community

Your typical clock synchronization protocol like NTP provides a timestamp, but it can't guarantee that event A truly happened before event B if they occurred on different machines. Spanner's TrueTime solves this by providing time as an interval, not a point, ensuring global serializability even across continents. When your distributed system relies on timestamps from different servers, you're bui…

computer-science, distributed-systems
DEV Community

Refresher: I'm building a distributed chunked filestore in Go, and I set up a post for Part 1 here. That part dealt with uploading a file; this post is about downloads. Setup Requirements: User hits our endpoint with the filename/fileid; we use this fileid to get a list of chunks; our retrieve mechanism only depends on this list of chunks; we want to be able to retrieve the associated chunks in par…

computer-science, distributed-systems, programming-languages
DEV Community

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global IoT. Edge computing is not just about pushing logic to the far end; it’s about orchestrating a cohesive flow where data is ingested, processed, and acted upon with millisecond latency, while preservin…

computer-science, distributed-systems
PhilPapers: Recent additions to PhilArchive

_Cross-Cloud Systems Measurement Report_. 2021. Public cloud providers expose similar high-level resources but differ in processor generations, storage paths, network locality, virtualization overhead, accelerator availability, and pricing rules. These differences make direct comparison difficult for teams that operate analytics, web services, and machine-learning pipelines across providers. This p…

computer-science, distributed-systems
PhilPapers: Recent additions to PhilArchive

_Autonomic Distributed Systems Governance Bulletin_. 2021. Large distributed systems are now operated through layers of schedulers, container controllers, service meshes, monitoring pipelines, and human runbooks. These mechanisms improve scale, but they also create governance problems: local controllers can fight one another, remediation rules may violate service-level or compliance constraints, an…

computer-science, distributed-systems
DEV Community

"Codex took 6 hours to implement this seemingly simple refactor". "I think Research mode on Perplexity is stuck." We all know LLM APIs are slow, and are content with staring at a spinner while the model slowly emits tokens. But what happens when you're building AI agents that need to be low latency? We hit this while building FixBugs, an AI debugging agent that reads bug reports, logs, code, scr…

ai, computer-science, distributed-systems, machine-learning
Hacker News

I spent the past few weeks building a Linux kernel module that makes ordinary USB4/Thunderbolt ports on AMD mini PCs pretend to be InfiniBand devices. The goal is simple: let existing AI runtimes like vLLM/RCCL split inference or training across multiple boxes at home, without buying enterprise networking gear. TL;DR. We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two…

ai, computer-science, distributed-systems, machine-learning
DEV Community

Where tensor-parallel inference hits the NVLink wall. 2026-05-31 · GPU / distributed systems. Tensor parallelism splits each layer across GPUs, so every forward pass pays for an all-reduce over the network fabric. On a single node that fabric is NVLink/NVSwitch — and how close you get to its theoretical budget decides whether TP helps or hurts. This post measures it on 4× H100 and explains where th…

computer-science, distributed-systems
DEV Community

Building a Reproducible Offline-First Data Sync Engine for Edge Analytics. In modern analytics, reliability and speed matter as much as correctness. I recently led a project to design and ship an offline-first data synchronization engine that enables edge devices to collect, process, and reconcile analytics data even when th…

computer-science, distributed-systems
DEV Community

Monolithic multi-region architectures inherently rely on vendor-specific global control planes. When a catastrophic degradation strikes an underlying identity service or networking fabric within a single cloud provider, all regional partitions fail concurrently. Relying exclusively on Amazon Web Services (AWS) or Microsoft Azure caps the maximum theoretical availability of a platform to the opera…

cloud-computing, computer-science, distributed-systems
DEV Community

🏴‍☠️ Built for the Pirates of the Coral-bean hackathon by WeMakeDevs | May 25–31, 2026. TL;DR: Built a DevOps Incident Investigator using Coral SQL that correlates GitHub PRs, Sentry incidents, and Slack incident context using a single SQL query. Coral turns operational debugging into a single SQL query across distributed systems. Results: 📉 Incident triage reduced from ~15 minutes to ~15 seconds…

computer-science, distributed-systems, software-engineering
DEV Community

Building a Unix-Domain-Socket IPC server for ECS-on-EC2 services that need to talk fast, cheap, and reliably. A while back I was looking at a flamegraph of a service that, on paper, should not have been having any performance problems. The producer and the consumer were the same Docker image's worth of trouble — colocated on the same EC2 host, in the same ECS cluster, sharing the same instance typ…

computer-science, distributed-systems
UC Davis Computer Architecture

Large-scale AI training and inference require hundreds of gigabytes to terabytes of DRAM with high peak-to-average utilization ratios, resulting in overprovisioning. In cloud computing, DRAM constitutes a significant share of the cost. Yet, as shown by recent articles, DRAM is heavily underutilized. Memory disaggregation is a solution to both these problems. With the advent of the CXL protocol, …

cloud-computing, computer-science, distributed-systems, technology
DEV Community

The Problem We Were Actually Solving. The Treasure Hunt Engine is a multiplayer game where players dig for virtual gems, craft tools, and compete on leaderboards. Every action produces an event: dig_started, gem_found, score_updated, inventory_cleared. We needed a system that could ingest, deduplicate, and propagate these events to every player in under 200 ms while guaranteeing no double-counting…

computer-science, distributed-systems
DEV Community

It is no secret that in today's fast-moving banking industry, customers require immediate replies, the regulatory environment mandates strict compliance, and there is always an element of uncertainty about market conditions, which change every single moment. In such conditions, traditional request/response architectures appear to be obsolete. Indeed, today banks have access to huge amounts of dat…

ai, computer-science, distributed-systems, machine-learning