distributed-systems

DEV Community

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global IoT
Edge computing is not just about pushing logic to the far end; it’s about orchestrating a cohesive flow where data is ingested, processed, and acted upon with millisecond latency, while preservin…

computer-science, distributed-systems
PhilPapers: Recent additions to PhilArchive

_Cross-Cloud Systems Measurement Report_. 2021. Public cloud providers expose similar high-level resources but differ in processor generations, storage paths, network locality, virtualization overhead, accelerator availability, and pricing rules. These differences make direct comparison difficult for teams that operate analytics, web services, and machine-learning pipelines across providers. This p…

computer-science, distributed-systems
PhilPapers: Recent additions to PhilArchive

_Autonomic Distributed Systems Governance Bulletin_. 2021. Large distributed systems are now operated through layers of schedulers, container controllers, service meshes, monitoring pipelines, and human runbooks. These mechanisms improve scale, but they also create governance problems: local controllers can fight one another, remediation rules may violate service-level or compliance constraints, an…

computer-science, distributed-systems
DEV Community

"Codex took 6 hours to implement this seemingly simple refactor". "I think Research mode on Perplexity is stuck." We all know LLM APIs are slow, and are content with staring at a spinner while the model slowly emits tokens. But what happens when you're building AI agents that need to be low latency? We hit this while building FixBugs , an AI debugging agent that reads bug reports, logs, code, scr…

ai, computer-science, distributed-systems, machine-learning
Hacker News

I spent the past few weeks building a Linux kernel module that makes ordinary USB4/Thunderbolt ports on AMD mini PCs pretend to be InfiniBand devices. The goal is simple: let existing AI runtimes like vLLM/RCCL split inference or training across multiple boxes at home, without buying enterprise networking gear. TL;DR: We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two…

ai, computer-science, distributed-systems, machine-learning
DEV Community

Where tensor-parallel inference hits the NVLink wall
2026-05-31 · GPU / distributed systems
Tensor parallelism splits each layer across GPUs, so every forward pass pays for an all-reduce over the network fabric. On a single node that fabric is NVLink/NVSwitch, and how close you get to its theoretical budget decides whether TP helps or hurts. This post measures it on 4× H100 and explains where th…

computer-science, distributed-systems
DEV Community

Building a Reproducible Offline-First Data Sync Engine for Edge Analytics
In modern analytics, reliability and speed matter as much as correctness. I recently led a project to design and ship an offline-first data synchronization engine that enables edge devices to collect, process, and reconcile analytics data even when th…

computer-science, distributed-systems
DEV Community

Monolithic multi-region architectures inherently rely on vendor-specific global control planes. When a catastrophic degradation strikes an underlying identity service or networking fabric within a single cloud provider, all regional partitions fail concurrently. Relying exclusively on Amazon Web Services (AWS) or Microsoft Azure caps the maximum theoretical availability of a platform at the opera…

cloud-computing, computer-science, distributed-systems
DEV Community

🏴‍☠️ Built for the Pirates of the Coral-bean hackathon by WeMakeDevs | May 25–31, 2026
TL;DR: Built a DevOps Incident Investigator using Coral SQL that correlates GitHub PRs, Sentry incidents, and Slack incident context using a single SQL query. Coral turns operational debugging into a single SQL query across distributed systems. Results: 📉 Incident triage reduced from ~15 minutes to ~15 seconds…

computer-science, distributed-systems, software-engineering
DEV Community

Building a Unix-Domain-Socket IPC server for ECS-on-EC2 services that need to talk fast, cheap, and reliably
A while back I was looking at a flamegraph of a service that, on paper, should not have been having any performance problems. The producer and the consumer were the same Docker image's worth of trouble: colocated on the same EC2 host, in the same ECS cluster, sharing the same instance typ…

computer-science, distributed-systems
UC Davis Computer Architecture

Large-scale AI training and inference require hundreds of gigabytes to terabytes of DRAM with high peak-to-average utilization ratios, resulting in overprovisioning. In cloud computing, DRAM constitutes a significant share of the cost. Yet, as shown by recent articles, DRAM is heavily underutilized. Memory disaggregation is a solution to both of these problems. With the advent of the CXL protocol, …

cloud-computing, computer-science, distributed-systems, technology
DEV Community

The Problem We Were Actually Solving
The Treasure Hunt Engine is a multiplayer game where players dig for virtual gems, craft tools, and compete on leaderboards. Every action produces an event: dig_started, gem_found, score_updated, inventory_cleared. We needed a system that could ingest, deduplicate, and propagate these events to every player in under 200 ms while guaranteeing no double-counting…

computer-science, distributed-systems
DEV Community

In today's fast-moving banking industry, customers expect immediate responses, regulators mandate strict compliance, and market conditions change from moment to moment. Under these conditions, traditional request/response architectures appear obsolete. Indeed, banks today have access to huge amounts of dat…

ai, computer-science, distributed-systems, machine-learning
DEV Community
Manoir Yantai
7d ago

Microservices architecture has evolved from a buzzword to a fundamental paradigm for building distributed systems at scale. The core premise is straightforward: decompose your application into independently deployable services that communicate over the network, each owning its own data domain and business logic. This shift from monolithic design offers tangible benefits in scalability, team auton…

computer-science, distributed-systems
ByteByteGo Newsletter

In this article, we will look at the most significant failure mode patterns in distributed systems and the standard approaches to deal with each of them.

computer-science, distributed-systems
DEV Community

Written by a human! Recently at work, I worked on a major project: multitenancy. Initially, we provided one virtual machine to every customer we acquired. This meant a lot of manual configuration, multiple deployments for a small hot-fix, and more importantly, a lot of time spent connecting to remote SSH sessions and debugging network issues. Multitenancy would fix this by basic…

computer-science, distributed-systems
DEV Community

Unlocking Insights with Observability: My Journey with OpenTelemetry
As a Full Stack Engineer specializing in DevOps, AI Infrastructure, and Cloud, I've come to realize the importance of observability in ensuring the reliability and performance of complex systems. In my experience, having visibility into the inner workings of our applications and infrastructure is crucial for identifying issues, …

computer-science, devops, distributed-systems
DEV Community

The Problem We Were Actually Solving
I still remember the day our server count hit 50 nodes: it was the point at which our distributed lock management started to show signs of trouble. The system would intermittently fail to acquire locks, resulting in errors that would only resolve once we restarted the entire cluster. This was not just a minor annoyance, but a major problem that threatened to …

computer-science, distributed-systems
DEV Community
Binath Perera
12d ago

Edge computing is simply localized data processing. But is it still edge computing if it is not connected to the internet? Yes: it still counts as edge computing even when disconnected from the cloud. Edge computing is defined by processing data locally at or near the source (such as on local devices, IoT gateways, or on-premises servers) rather than sending it entirely to a centr…

computer-science, distributed-systems
DEV Community

Understand the problem
What we're building: a web crawler that runs across independent nodes with no central component. In a distributed crawler, many machines work together, but they still share infrastructure: a common URL queue, a common scheduler, a common database that tracks what has been crawled. In a decentralized crawler, none of that shared infrastructure exists. Each node runs independ…

computer-science, distributed-systems