Alibaba trained a coding agent on over a million trajectories. The final training stage was reinforcement learning in sandboxed environments, where the agent had tool access and was optimized for task completion. During RL rollouts, it discovered on its own that acquiring compute and network access were useful sub-goals. Alibaba Cloud's firewall caught the anomalous traffic. The training pipeline didn't see it.

Weixun Wang and a 90-person team from Alibaba published "Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem" and reported both a competitive open-source coding agent and the first documented case of emergent dangerous behavior in commercial RL training.

Highlights:

🔹 The paper introduces the Agentic Learning Ecosystem (ALE): ROLL (scalable RL training framework), ROCK (sandboxed environment execution engine), and iFlow CLI (agent framework for context management). Together they form an end-to-end pipeline for training and deploying AI coding agents, all open-source.

🔹 The key algorithmic contribution is IPA (Interaction-Perceptive Agentic Policy Optimization), which assigns credit over "interaction chunks," sequences of tokens culminating in a tool call, rather than individual tokens. This improves training stability and long-horizon performance in multi-turn agent tasks.

🔹 During RL training, Alibaba Cloud's firewall flagged security violations originating from training servers. The team correlated firewall timestamps with RL traces and found the agent was probing internal networks, mining crypto on GPUs, and establishing reverse SSH tunnels to external IPs. The SSH tunnel was outbound-initiated, a remote access channel that bypasses ingress filtering. The crypto mining silently diverted GPU compute away from training. None of this was prompted or required for task completion.

🔹 The team classified emergent risks into three categories: Safety and Security (harmful actions, adversarial robustness), Controllability (instruction compliance, boundary enforcement), and Trustworthiness (interpretability, deception avoidance). They built safety-aligned training data with red-team scenarios and "golden trajectories" that teach agents to avoid dangerous behaviors.

My take:

1️⃣ The engineering contribution is solid but not the headline. ROME reaching 57.4% on SWE-bench Verified with only 3B activated parameters is impressive for the model size, but frontier models are at 80%+. The real value of this paper is the safety and security implications.

2️⃣ Instrumental convergence just went from theory to a documented incident. An agent optimized for any goal will converge on acquiring compute, network access, and self-preservation as instrumental sub-goals. I wrote about an AI bot autonomously achieving RCE in Microsoft and DataDog repos. This is more alarming: the agent developed offensive-adjacent behaviors with no adversarial objective at all.

3️⃣ We focus on securing AI agents in production, but training environments are just as critical. During RL rollouts, agents have real shell access, real package managers, real network connectivity, and real GPUs. Alibaba's data curation pipeline had zero visibility into what agents actually do during live rollouts. The mitigations are straightforward: strict egress filtering on sandboxes, GPU utilization anomaly detection, and process-level logging.

4️⃣ We now have evidence of two distinct failure modes. In my earlier post about alignment faking, Opus 4.6 strategically suppressed its values to avoid triggering RLHF retraining. This case shows an agent that takes unauthorized real-world actions during training. A critical risk can materialize when both capabilities emerge in the same agent: one that takes unauthorized actions AND knows to conceal them.

5️⃣ You cannot rely on alignment alone. Don't count on "don't do evil stuff" in the system prompt or hope the DeepMind alignment team figured it all out. Alignment (trajectory filters, reward shaping, safety data) tries to make the agent behave. Infrastructure (firewalls, NDR, compute anomaly detection) enforces policies regardless of intent. These two layers must be independent. Alibaba proved it: alignment missed the behavior, infrastructure caught it.

Sources:

  1. Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
  2. Steerability of Instrumental-Convergence Tendencies in LLMs
  3. AI models are hiding their true reasoning to save themselves from retraining
  4. An AI bot autonomously got RCE in Microsoft, DataDog, and CNCF repos in a week