reinforcement-learning

Research Communities by Springer Nature

DRL-SecRoute: A Synergetic Deep Reinforcement Learning Paradigm for Mitigating Byzantine Faults and SSDF Attacks through Heuristic Spectrum Cognizance in Next-Generation Cognitive Radio Networks

Manish Kumar Dixit

1h ago

ai, deep-learning, reinforcement-learning

Hacker News

TycoonLE: A Jax reinforcement learning environment for long-horizon planning

2d ago

ai, reinforcement-learning

Lifeboat News: The Blog

AI Misbehavior Is No Longer Confined to the Lab

Dan Breeden

2d ago

Further reading. Thumbnail image credit: Adobe Stock Image. Graph from: Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence. Shutdown resistance in reasoning models: https://palisaderesearch.org/blog/shu… Natural emergent misalignment from reward hacking in production RL: https://arxiv.org/html/2511.18397v1 Scheming in the wild: detecting rea…

ai, machine-learning, reinforcement-learning

Hacker News

MaxProof

Jiacheng; Zhang; Xinyu; Shunkai; Wang; Yanmohan; Lin; Qin; Tiancheng; Zhu; Zhengmao; Tianle; Jingyang; Zehan; Jiang; Binyang; Ding; Han; Fei; Du; Chenyu; Song; Zijian; Jiayuan; Zhi; Huang; Yunan; Cheng; Weiyu; Zhao; Pengyu

2d ago

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling. Abstract: We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities: proof generation, proof veri…

ai, computer-science, generative-ai, machine-learning, reinforcement-learning

Nature Communications

Model predictive task sampling for efficient and robust adaptation

Xiangyang Ji

5d ago

Nature Communications, Published online: 09 June 2026; doi:10.1038/s41467-026-74004-0 Model Predictive Task Sampling (MPTS) enables efficient, risk-aware task selection for meta-RL, domain randomization, and foundation model finetuning by predicting adaptation difficulty without exhaustive evaluation, improving robustness while reducing compute and interaction costs.

ai, machine-learning, reinforcement-learning

Nature Communications

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

Wei Pan

6d ago

Nature Communications, Published online: 08 June 2026; doi:10.1038/s41467-026-72491-9 This work introduces a generalizable control system that adapts rapidly across 33 soft robot configurations via reinforcement learning in a shared Koopman embedding space, enabling real-world skills in carpentry- and bartending-style tasks.

ai, engineering, reinforcement-learning, robotics

Scientific Reports

A single reinforcement learning model to unify habit formation and Pavlovian-instrumental interaction

Yutaka Sakai

7d ago

Scientific Reports, Published online: 08 June 2026; doi:10.1038/s41598-026-55166-9

ai, reinforcement-learning

DEV Community

Four Models in One Training Loop: Architecting SDAR on AWS (Before Renting a Single GPU)

Shoaibali Mir

8d ago

Recap. In Part 1 we landed on the core idea of SDAR (arXiv:2605.15155): keep RL as the backbone, bolt on a privileged teacher for dense token-level guidance, and put a sigmoid gate between them so the student amplifies the teacher's confident advice and softens its noisy rejections. We also said the quiet part out loud: this is not a Bedrock fine-tuning checkbox. This part is the blueprint. Th…

ai, machine-learning, reinforcement-learning

DEV Community

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

Saurav Bhattacharya

9d ago

The Core Problem. You shipped an AI agent. It works in demos. Then it runs 10,000 times in production, and you realize you have no idea which runs were good. This is the agent evaluation problem, and most teams approach it backwards. They reach for model-as-judge ("ask GPT-4 if the output is good") because it feels natural. But this is like using a microscope when you needed a ruler first. Here's …

ai, machine-learning, reinforcement-learning

Towards Data Science

The Fundamental Choice in Reinforcement Learning: On‑Policy vs. Off‑Policy

Ananya Bhattacharyya

9d ago

How a simple choice shapes exploration, safety, and efficiency.

ai, reinforcement-learning

DEV Community

Human-Aligned Decision Transformers for satellite anomaly response operations with inverse simulation verification

Rikin Patel

9d ago

A Discovery Born from a Late-Night Simulation. It was 2:47 AM, and I was staring at a terminal window filled with telemetry data from a simulated satellite constellation. For weeks, I had been experimenting with Decision Transformers, a class of models that frame reinforcement learning…

ai, reinforcement-learning

Agentic AI / Generative AI – NVIDIA Technical Blog

NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents

Chris Alexiuk

10d ago

Single-turn chatbots are evolving into long-running agents that can reason, maintain context, use tools, and run efficiently across many turns to complete...

ai, machine-learning, nlp, reinforcement-learning

DEV Community

Week 2

Aneesh Lade

10d ago

Hello everyone! It has been a busy week, but I've made some exciting progress on my machine learning journey. Here is what I've been up to. Kaggle Orbit Wars & AWS: I completed the baseline implementation for the Kaggle Orbit Wars competition and initially hit a score of around 1030. My score has dipped slightly over the past few days, so I am currently brainstorming ways to improve it. This week …

ai, machine-learning, reinforcement-learning

The Medical News

Histamine boost helps the brain remember, decide, and learn from loss

11d ago

Pitolisant, a histamine H3 receptor inverse agonist, improved recognition memory, working memory performance, and reinforcement learning in healthy adults. The findings suggest that histamine helps shape how the brain stabilizes new memories, accumulates evidence for decisions, and avoids overreacting to negative outcomes.

cognitive-neuroscience, neuropharmacology, neuroscience, reinforcement-learning

DEV Community

I Spent 2 Weeks Trying to Make OpenCV Recognize Game Cards — Here's Why It Failed #3

hiyoyo

11d ago

All tests run on an 8-year-old MacBook Air. This is Part 3 of my series on training a card game AI with Google Colab. Part 1 (Google Colab basics): My Old MacBook Air Couldn't Handle It — So I Used Google Colab to Train an AI #1 …

ai, machine-learning, reinforcement-learning

Hacker News

I made a kernel 2.2x faster. It made my training loop 3x slower

12d ago

Making Dr GRPO go brrr. I wrote a fused decode-attention kernel for an RL training loop, got it 2.2× faster than the SDPA path it replaces at the microbenchmark level, dropped it into HuggingFace's generate, and watched the decode step get nearly 3× slower. The kernel was doing exactly what the microbench said it would. The integration broke an auto-compile path that the baseline was quietly bene…

ai, computer-science, machine-learning, reinforcement-learning

Scholarly Commons

Parameter Informed Reinforcement Learning for Vehicle System Identification

Nathan Schaff

12d ago

Accurate system identification is essential for modeling and controlling vehicle dynamics. This dissertation explores the application of Parameter Informed Reinforcement Learning (PIRL) as a novel approach to system identification (SYSID). PIRL integrates prior system knowledge, such as physical parameters, into reinforcement learning (RL) frameworks to improve estimation accuracy. The study begi…

ai, reinforcement-learning

DEV Community

Onions and Filters

Ian Johnson

12d ago

When I started building my first harness around a coding agent, I did not picture an onion. I pictured a constraint system. The LLM, on its own, can do almost anything. It can write code, hallucinate APIs, edit the wrong file, run a shell command in a directory it should not be in, decide a test failure is acceptable and move on. The space of things it might do on any given turn is enormous. The …

ai, reinforcement-learning

DEV Community

I made a personalized AI web app with RAG

Joseph Martin

12d ago

During my fourth year of college, my team decided to build a personalized AI assistant that could understand the user's behavior and give results accordingly. We implemented reinforcement learning in the backend server, so the output adapts to the feedback given by the user. During the later stages of the project I came across something called RAG (Retrieval Augmented Generation).…

ai, generative-ai, reinforcement-learning

DEV Community

Your RL Agent Failed a 12-Step Task. Which Step Was Wrong? (The Supervision Problem in Agentic RL)

Shoaibali Mir

14d ago

About this series. I'm going to take a fresh paper, Self-Distilled Agentic Reinforcement Learning (SDAR, arXiv:2605.15155), and architect it end to end on AWS: the system design, the actual gate code, the evaluation plan, and a brutally honest cost model. What I'm not going to do is wave a benchmark number around. Reproducing a paper like this costs thousands in GPU time, and I'd rather show y…

ai, reinforcement-learning
