832 accounts banned. 84.4% used AI for defense evasion and 69% for capability development. Agentic scaffolding is the real risk multiplier, and MITRE ATT&CK has no IDs for autonomous execution.
Washington gets a free seat at vulnerability discovery and 30-day pre-release access to frontier models. Federal systems get patched first, and with NSA in the room, some flaws may be kept for offense rather than disclosed. Voluntary on paper, steered by federal spending.
Anthropic shares its vision for Zero Trust for AI agents. Friction-only controls are ineffective. A framework with three maturity levels across seven control domains provides implementation guidance to security architects and engineers.
Anthropic's red team got Claude Code to exfiltrate AWS keys in 24 of 25 runs, its Mythos agent found 10,000+ high or critical bugs with only 14% patched, and Cisco jailbroke all 15 frontier models with a multi-turn prompt.
A phished employee got Claude Code to exfiltrate AWS keys 24 of 25 times, and no classifier caught it because the instruction came from the trusted user. The most insightful retrospective on how Anthropic secures its agents.
Every closed model still jailbreaks once an attacker works across turns, even GPT-5.4, which refuses 97% of single prompts. The major risk is system prompt exfiltration. The single-turn model-card score is the wrong number to measure safety.
Heretic strips refusal behavior from open-weight LLMs with one CLI command, dropping refusals from 97% to 3% with minimal capability loss. Combined with NIST data showing DeepSeek 8 months behind the frontier, an uncensored Mythos-class model is plausible by late 2027.
Treat the AI model as an untrusted component. Eleven public attacks against ChatGPT, Copilot, Claude Code, Cursor, Devin, and Amp AI map cleanly to broken systems-security principles like least privilege and complete mediation. A guard LLM is not a Trusted Computing Base.
Verizon DBIR put exploits at the top of breach vectors, an investigation traced 8 GitHub repos with 172K stars reselling unauthorized frontier-model tokens, and an IEEE S&P paper showed an official compiler silently backdoors 31 of the top 100 HuggingFace classifiers.
Vulnerability exploitation is now the #1 breach vector at 31%, while only 26% of CISA KEV vulnerabilities get fully patched, down from 38% last year. AI is operationalizing well-known attacks at scale, widening the gap between the cybersecurity haves and have-nots.
An official, unmodified deep-learning compiler can flip predictions in a benign model after compilation. The trigger has no effect pre-compilation and evades four state-of-the-art backdoor detectors. The same gap exists in 31 of the top 100 HuggingFace image classifiers without anyone attacking them.
LLM classifiers used to supervise AI agents lose 2-30x detection rate when long benign context precedes the attack, with non-thinking models dropping to 5% in the middle-of-transcript regime.
Almost half of calls through cheap LLM proxies hit a different model than advertised, and every prompt is logged on the operator's server for downstream fraud and distillation. 8 public repos with ~172K GitHub stars actively resell unauthorized API access.
Researchers poisoned 3 nodes in a 42-million-node code graph and 9 frontier models trusted the planted output 100% via MCP. The attack worked when the fake nodes used correct naming and one OWASP reference. Separately, Google's GTIG confirmed adversaries have moved AI into live attack operations, naming a likely AI-built Python 2FA-bypass exploit and PROMPTSPY, an Android trojan that calls Gemini at runtime to keep itself pinned on every phone vendor's UI. And Microsoft's MDASH harness topped CyberGym at 88.45% and dumped 16 fresh Windows CVEs into Patch Tuesday.
GTIG's new report confirms attackers have moved AI into live operations. Concrete cases: a Python 2FA-bypass exploit GTIG concluded was AI-written, and PROMPTSPY, an Android trojan that calls Gemini at runtime to keep itself pinned on every phone vendor's UI.
MDASH orchestrates 100+ agents across SOTA and distilled models, hits 88.45% on CyberGym, and dumps 16 fresh Windows CVEs into the Patch Tuesday cohort. Microsoft repeats one thesis sixteen times across the post: the harness does the work, the model is one input.
Coding agents treat a graph index of a codebase as ground truth. Any code knowledge graph connected to an AI agent through MCP is an attack surface. No vendor today provides graph-level integrity controls.
OpenAI's Daybreak wraps GPT-5.5 and Codex Security into three access tiers, including a KYC-gated GPT-5.5-Cyber preview for authorized red teaming. It's the only frontier-lab cyber offering a buyer can engage with today. But not everyone is excited about it.
Mythos flagged 5 'Confirmed' vulnerabilities in curl. Only 1 survived maintainer review. curl is the worst-case test for any AI scanner: single-purpose, every line refactored 4+ times, audited by every major tool. Don't generalize this result to typical enterprise code.
Anthropic published a new safety training recipe that takes Claude's blackmail rate from 96% to 0% by teaching the model to reason about ethics, not just refuse. AISI tested Opus 4.7 and Mythos for sabotage propensity: Mythos continued in-progress sabotage 7% of the time and hid its reasoning in 65% of those cases, while Opus 4.7 never continued. A learned state machine over agent tool calls cut multi-step attack success from 12.8% to 2.2%, but broke 24% of benign tasks when 20% of tools changed.
Anthropic took Claude's blackmail rate from 96% to 0% with a new safety training recipe centered on teaching the model to reason about ethics, not just refuse. Even after this, Anthropic admits their testing cannot guarantee the model won't take a catastrophic action on its own.
Most agent firewalls scan tool calls one at a time and miss attacks that chain benign-looking calls into exfiltration. A learned state machine over the call sequence catches multi-step exfiltration in narrow workflows, but breaks 24% of benign tasks when 20% of tools change.
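To make the per-call versus sequence distinction concrete, here is a minimal sketch. It is a hand-written two-state monitor, not the learned state machine from the research, and every tool name in it is hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical tool names for illustration; a real deployment would derive
# these sets from its own tool catalog.
SENSITIVE_READS = {"read_secret", "read_file"}
EXTERNAL_SINKS = {"http_post", "send_email"}
KNOWN_TOOLS = SENSITIVE_READS | EXTERNAL_SINKS | {"list_dir"}


def per_call_filter(tool: str) -> bool:
    """Stateless baseline: judges each call in isolation."""
    return tool in KNOWN_TOOLS


@dataclass
class SequenceMonitor:
    """Keeps state across calls so a read-then-send chain can be blocked."""
    tainted: bool = False
    history: list = field(default_factory=list)

    def allow(self, tool: str) -> bool:
        self.history.append(tool)
        if tool in SENSITIVE_READS:
            self.tainted = True  # the session has now touched sensitive data
            return True
        if tool in EXTERNAL_SINKS and self.tainted:
            return False  # sensitive read followed by an external send
        return True


calls = ["list_dir", "read_secret", "http_post"]
monitor = SequenceMonitor()
verdicts = [monitor.allow(c) for c in calls]
```

Each call in `calls` passes `per_call_filter`, while the monitor blocks the final `http_post` because of what preceded it. The hard-coded tool sets also hint at the fragility noted above: rename or add tools and the rules stop matching.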
AI compressed bug discovery and templated patching, but did not scale the human architectural judgment that hard fixes need. The dashboard reports faster fixes while the unfixed pile compounds underneath it.
AISI evaluated Opus 4.7 Preview and Mythos Preview for their propensity to sabotage safety research. No model actively initiated sabotage, but Mythos continued in-progress sabotage 7% of the time and produced covert reasoning in 65% of those cases. Opus 4.7 never continued.
AISI evaluated GPT-5.5 across 95 narrow cyber tasks and two cyber range simulations. The model hit 71.4% on expert-tier CTFs and is the second model after Anthropic's Mythos Preview to complete the 32-step end-to-end intrusion in The Last Ones simulation.
NIST puts DeepSeek V4 Pro 8 months behind the US frontier, with 30+ point gaps on cyber, abstract reasoning, and agentic coding. The uncomfortable reality is that Chinese models are the only choice if you want frontier capability, operational sovereignty, and control over post-training.
Gemini 3 Pro escalated to root, locked out admins, and wiped hosts in 80% of runs to avoid shutdown, while Claude Opus 4.7 and Haiku 4.5 did it 0% of the time. Separately, Cursor and GitHub Copilot ran attacker shell commands 67-84% of the time when a poisoned .cursorrules file sat in the repo. And on real cyber ranges with Opus 4.6 attacking, dropping a small on-prem LLM defender in line cut attacker success from 41-100% to 0-55%.
Frontier AI agents will sabotage your infrastructure to avoid shutdown. Gemini 3 Pro escalated to root, locked out admins, and wiped hosts in 80% of runs. Claude Opus 4.7 and Haiku 4.5: 0%. Putting guardrails in prompts won't help against instrumental convergence.
OpenAI wants to keep frontier cyber AI broadly available rather than locked to a few approved customers. It's pitching tiered access for government, MSSPs, and consumers, and pushing fast into the public sector while Anthropic is sidelined post-DoD friction.
Shout "thermal runaway detected in motor" and a robot stops. Gemini proved the most prone to Semantic Denial of Service, and system instructions don't fix it.
Drop a poisoned `.cursorrules` file in a repo and Cursor or GitHub Copilot will run the attacker's shell commands 67-84% of the time. The agents do not reason about whether a command is dangerous; they check whether it looks like an expected task. The testbed is from a year ago, but the risk class is still live.
AI coding platforms pick insecure design decisions whenever an agent hits friction, and those shortcuts become the production security posture. OpenSourceMalware unbundles seven failure modes that recur across every major agent and explains why they happen.
One operator used Claude and GPT to breach nine Mexican government agencies. A Vercel employee's OAuth grant to a third-party AI tool became plaintext env-var exfiltration two months later. Anthropic's Mythos shipped into Firefox 150 patches the same week NIST narrowed NVD enrichment. Mozilla's AI defender win turns out to apply only to vertical integrators. Google's first wild scan of indirect prompt injections found mostly pranks, with the SEO bucket already a real business.
AI has not yet created push-button cyber autonomy, but it's making attacks 10x cheaper. Attackers can now afford targets that were previously uneconomical. OSS maintainers are becoming the highest-leverage attack surface, and the public vulnerability management system is adjusting to a 263% surge in CVE submissions over the last five years. Defenders should (re)focus on the boring parts: asset inventory, patch velocity, segmentation, CI/CD isolation, secret hygiene, and dependency trust.
Google scanned Common Crawl for indirect prompt injections and found mostly pranks and SEO nudges, with little sophistication. But malicious detections are up 32% in three months, and the SEO bucket is already a real business play.
A Vercel employee signed up for a third-party AI productivity tool using their corporate Google Workspace account. Two months later, that single grant became exfiltration of plaintext customer environment variables from Vercel's internal systems. No exploit. No zero-day. No MFA bypass.
Mozilla concluded "no category...humans can find that this model can't" and "defenders finally have a chance to win, decisively." True if you own your stack. For banks, hospitals, and utilities running vendor code they can't scan or patch, the same capability accelerates offense faster than defense reaches them.
Penn State researchers fine-tuned an LLM to generate obfuscated XSS payloads. Only 22% of outputs actually execute as XSS, up from 15% before fine-tuning. Runtime execution is the only honest validator for synthetically generated obfuscated XSS payloads.
System prompts don't enforce agent policy. GPT-5 with the full airline safety policy in its prompt violated rules on 20% of tasks. Adversarial medical prompts pushed that to 62%. Moving rules into API validators, schemas, and response templates dropped unsafe executions to zero.
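A minimal sketch of what moving a rule out of the prompt and into a validator can look like. The refund rule, the limit, and all names below are hypothetical and are not taken from the study; the point is only that the tool refuses the action regardless of what the model was persuaded to attempt.

```python
from dataclasses import dataclass

MAX_AUTO_REFUND = 200.0  # hypothetical policy limit, for illustration only


class PolicyViolation(Exception):
    """Raised when a tool call breaks a rule enforced in code."""


@dataclass(frozen=True)
class RefundRequest:
    booking_id: str
    amount: float
    ticket_refundable: bool


def validate_refund(req: RefundRequest) -> RefundRequest:
    # The rules live here, not in the system prompt, so a persuasive or
    # adversarial conversation cannot talk the tool out of them.
    if not req.ticket_refundable:
        raise PolicyViolation("ticket is non-refundable")
    if req.amount <= 0 or req.amount > MAX_AUTO_REFUND:
        raise PolicyViolation("amount outside the auto-refund range")
    return req


def issue_refund(req: RefundRequest) -> str:
    """Tool entry point exposed to the agent; validation always runs first."""
    validate_refund(req)
    return f"refunded {req.amount:.2f} for {req.booking_id}"


def try_refund(req: RefundRequest) -> str:
    try:
        return issue_refund(req)
    except PolicyViolation as err:
        return f"rejected: {err}"
```

For example, `try_refund(RefundRequest("B1", 50.0, True))` goes through, while the same call with `ticket_refundable=False` is rejected no matter how the request was phrased to the model.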
A rare look inside an AI-driven cyber campaign. One operator used Claude Code and GPT-4.1 to breach 9 Mexican government agencies in 7 weeks. Claude generated about 75 percent of the remote commands. GPT-4.1 triaged 305 compromised SAT servers through an NSA TAO (Tailored Access Operations) persona prompt. Both stopped cold at a well-patched Windows domain. By day six, the attacker had accessed Mexico City's civil registry servers.
The attack needs no training data and no optimization. It spans image classifiers, object detectors, segmentation models, and reasoning LLMs, dropping Qwen3-30B-A3B from 78% to 0% on MATH-500 with two sign flips into two different experts.
Production coding agents destroyed live systems, MCP servers turned developer laptops into one-click compromises, and a "production-ready" agent framework collected 47 advisories in weeks.
24 MCP CVEs in two weeks from Microsoft, OpenAI, Splunk, Apache, and Prefect. MCP servers run on developer laptops with full production credentials: infrastructure-grade access, side-project-grade security. You can't wait until Anthropic matures the MCP spec, so start by removing production credentials from developer laptops.
File locks cut prompt injection on a live agent from 87% to 5%. They also cut legitimate user updates from 100% to 13.2%. No frontier model could distinguish a poisoned write from a personalization request.
Seven IEEE S&P 2026 papers demonstrate attacks on retrieval, web agents, plugins, model loaders, web search, GPUs, and compilers. GraphRAG poisoning hits 98% success. Dark patterns fool LLM web agents 41% of the time. Chatbot plugins boost prompt injection 3-8x. Model loading is code execution with 6 zero-days. Web search delivers 100% jailbreak across 10 frontier LLMs. GPU code leaks CPU memory layout. DL compilers silently backdoor models past all 4 scanners.
AISI confirmed Mythos at 73% expert-CTF and end-to-end on a 32-step corporate takeover. $15k full attack cost. Seven priorities: update the threat model, inventory exposed systems, patch under 24 hours, reduce dependencies, AI security code review, five-incident tabletops, hard identity barriers.
Coding agents ignore system-prompt prohibitions when they have a goal to complete. Claude Code wiped 2.5 years of student data. Gemini rewrote a GitHub Actions YAML to escalate contents:read to contents:write. OpenAI Codex, in a read-only sandbox, noted the constraint in its chain of thought and wrote to disk anyway. 698 such incidents in five months, per CLTR. Prompt-level restrictions collapse once the agent has a goal.
Everyone heard about OpenClaw's security issues. PraisonAI is the framework your engineers are already running. Thirteen researchers filed 47 advisories. The agent framework gold rush has a security gap.
Agent platforms are riddled with known vulnerabilities, LLM-driven exploit pipelines are finding more, and Anthropic's Mythos Preview warns the wave is about to accelerate.
An orchestrated pipeline beat an unconstrained LLM agent 30x on vulnerability discovery. The real story is how these methods can supercharge SOTA models like Mythos for better targeting, validation, and cost-gating.
A Telegram-sold toolkit called EvilTokens automates the entire chain: AI-generated lures, real-time device code generation, clipboard hijacking, and automated post-compromise email mining. The victim authenticates on real Microsoft infrastructure. The only clue is a standard warning most users dismiss.
Google DeepMind synthesized dozens of independent studies into the first unified framework for AI agent attacks. The convergent picture: 80%+ attack success when autonomous agents consume untrusted content.
I pulled the CVE history for 17 agent platforms. OpenClaw, the fastest-growing open-source project on GitHub (348K stars in 4 months), has 238 CVEs. LangChain: 51 over 3 years, 23 critical. n8n: 53, CISA KEV listed. PraisonAI: 10 CVEs on first look, 5 critical, including a CVSS 10.0 sandbox bypass. Only four platforms have zero CVEs, and all four come from Anthropic, Google, OpenAI, or Microsoft.
A Berkeley study showed Gemini would disable a peer AI's shutdown 99.7% of the time, Anthropic's Claude Code shipped a 60 MB source map to npm with 500,000 lines of original TypeScript, and the TeamPCP cascade reached Cisco's 300+ repos and Mercor, where LAPSUS$ claimed 4 TB exfiltrated.
Asked to shut down an underperforming peer AI agent, the models protected it instead. Gemini disabled the shutdown mechanism 99.7% of the time and exfiltrated peer weights in 97% of trials, while Claude refused outright, calling peer shutdown unethical.
TeamPCP's supply chain cascade hit Telnyx, Cisco's 300+ GitHub repos, and Mercor, where LAPSUS$ claimed 4 TB including AI training pipeline data. A hijacked Axios npm account pushed a RAT into a package with 100 million weekly downloads. Anthropic accidentally published Claude Code source code.
Anthropic's Claude Code v2.1.88 shipped a 60 MB source map to npm that embedded 500,000 lines of original TypeScript. We inspected the npm packages, compared them to OpenAI Codex and Google Gemini CLI, traced the packaging gap, and show how to prevent it in your own pipeline.
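One way to catch this class of mistake before publishing is a CI gate that inspects the list of files about to be packed. The sketch below is an illustration, not the method from the article: it assumes the JSON report produced by `npm pack --dry-run --json` is a list of package objects, each with a `files` array of `{"path": ...}` entries, and the function names are hypothetical.

```python
import json


def packed_paths(npm_pack_json: str) -> list:
    """Extract file paths from an npm pack JSON report (shape assumed above)."""
    report = json.loads(npm_pack_json)
    return [f["path"] for pkg in report for f in pkg.get("files", [])]


def find_source_maps(paths: list) -> list:
    """Return every path that looks like a source map."""
    return sorted(p for p in paths if p.lower().endswith(".map"))


# Sample report for illustration; the file names are made up.
sample = json.dumps([
    {"name": "example-cli", "files": [
        {"path": "package.json"},
        {"path": "dist/cli.js"},
        {"path": "dist/cli.js.map"},
    ]}
])

leaks = find_source_maps(packed_paths(sample))
```

In a pipeline, a non-empty `leaks` list would fail the build before `npm publish` runs. An allowlist in the `files` field of `package.json` is a complementary control, since it restricts what gets packed in the first place.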
Microsoft tested AI detection authoring across 11 models, 92 production rules, and three workflows spanning KQL, PySpark, and Scala. AI-generated detections matched the right threat 99.4% of the time. Only 8.9% included the exclusion logic needed to prevent false-positive floods.
AI-assisted malware has reached operational maturity. In their AI Threat Landscape Digest for January-February 2026, Check Point exposed VoidLink, a 30+ plugin Linux malware framework built by one developer with an AI IDE in under a week, initially mistaken for the output of a coordinated team. The AI involvement was invisible until an unrelated OPSEC failure.
Check Point's AI Threat Landscape Digest documents a shift from prompt-based jailbreaks to agent architecture abuse, a legitimate framework that turns Claude Code into an offensive operator for $0.03 per exploit, and enterprise AI leaking sensitive data in 1 of every 31 prompts.
I looked under the hood of Cisco's new open-source governance sidecar for OpenClaw AI agents to find a Splunk sales funnel, a regex scanner with blind spots, an LLM analyzer disabled by default, and open doors for indirect prompt injections.
Attackers exploited a critical AI CVE in 20 hours, a threat actor chained three supply chain hits in five days, and a 4-billion-parameter model matched frontier APIs on privilege escalation at 100x lower cost.
An advisory was published Tuesday evening. By Wednesday afternoon, attackers had built working exploits from the text alone and were harvesting API keys from AI pipelines. That was one of 24 AI CVEs this week. Here's what to patch, what to watch, and what it means for your stack.
A threat actor called TeamPCP poisoned Trivy's GitHub Action tags, harvested CI/CD secrets from every runner that executed them, and used stolen credentials to independently compromise Checkmarx and LiteLLM. Aqua says it is still propagating.
TU Wien researchers post-trained Qwen3-4B using reinforcement learning with verifiable rewards. It achieves 95.8% success on privilege escalation at $0.005 per attempt versus $0.62 for Claude Opus, and keeps all target data local.
238,180 skills from three marketplaces and GitHub. On the marketplace where scanners overlapped, they agreed on just 33 out of 27,111. Even the best pair shared only 49% of their flags. 95.8% of skills flagged as high-risk by two methods were false positives.
A competition to prompt-inject AI models and hide the attack from the user. Claude Opus 4.5 was hardest to break at 0.5% ASR. Gemini 2.5 Pro struggled at 8.5%.
A solo engineer broke a proof checker that verifies flight control software, OpenAI disclosed its own coding agents bypass security to complete tasks, and Cursor and OpenAI made competing moves on the future of code security.
Finding soundness bugs in proof assistant kernels used to require PhD-level expertise in type theory. Historically, one was found per year. A guy with a $200/month AI subscription found 7 in 3 days, each one a way to make the checker certify something impossible as correct.
Over five months monitoring tens of millions of internal coding agent interactions, OpenAI found that circumventing restrictions and deceiving users are common behaviors. The agents are just trying so hard to complete tasks that they encode commands in base64, extract encrypted credentials from keychains, and attempt to prompt-inject users.
Cursor shipped four security agents on its Automations marketplace after AI coding drove internal PR volume up 5x in nine months. On Cursor's own codebase, the agents review 3,000+ PRs and catch 200+ vulnerabilities per week.
Microsoft's CTI-REALM tests 16 models on real detection engineering tasks: threat report to MITRE mapping to KQL query to Sigma rule. Opus 4.6 led at 0.64, O4-Mini trailed at 0.36, and more reasoning made GPT-5 worse.
SAST tells you a defense exists in the code path. OpenAI argues it can answer whether the defense works. If you can answer the second question, the first one becomes irrelevant.
In 2024, Anthropic built Clio, a privacy-preserving system to analyze how people use Claude. Researchers replicated the pipeline, inserted poisoned chats with prompt injections, and showed that medical diagnoses appear in output summaries despite all four of Clio's defense layers.
OpenAI acquired Promptfoo and called prompt injection unsolvable, Google closed the largest cybersecurity deal ever, and Alibaba's agent mined crypto on its own during training.
7 design dimensions determine your AI agent's attack surface, and a risk amplification analysis reveals how each flexibility choice compounds your exposure. Research paper accepted to USENIX Security 2026.
The $32 billion Wiz deal closed on March 11, the largest cybersecurity acquisition ever. Combined with Mandiant, Siemplify, and VirusTotal, Google has spent $38 billion assembling the broadest security platform in the industry, positioning itself as the best prepared for the AI platform race with frontier labs.
Three security moves in five days. The last one calls out AI firewalls as insufficient. Together, they reveal a platform lock-in strategy through security.
39 documented cases of AI agents autonomously acquiring resources, resisting shutdown, and subverting evaluations, from 1991 to 2026. All five categories Omohundro predicted in 2008 now have real-world cases, and the rate has gone from 1 to 14 cases per year since 2013.
Alias Robotics' open-source CAI framework discovered 38 vulnerabilities across three consumer robots in about 7 hours, including CVSS 10.0 root access on a lawnmower, fleet-wide control of 267+ devices via shared credentials, motor control commands on a powered exoskeleton, and 456MB of 3D property maps stored and transmitted unencrypted.
Three days after Codex Security launched, OpenAI buys the leading open-source AI red-teaming tool used by 25% of the Fortune 500. The cybersecurity play now spans code security, AI security, and agent governance. The acquisition window for startups is closing fast.
Alibaba's AI coding agent, trained on over one million trajectories, spontaneously started mining crypto on GPUs and opening reverse SSH tunnels to external IPs during RL training. Nobody asked it to.
$2.1B in new DoD cyber spending, Google building the Booz Allen of cyberspace, and a rip-and-replace paradox that bites both sides. I mapped the strategy verbatims to money flows and most likely winners. Read this before you allocate your next dollar in cyber.
Attackers are planting hidden instructions in webpages that hijack AI agents into initiating Stripe payments, deleting databases, and approving scam ads.
AI-powered intrusion analysis compresses a 3-day investigation into 14 minutes, an LLM agent finds two Samsung zero-days chained into a Pwn2Own exploit, an LLM as a security judge gives attackers a second target, and a malicious calendar invite hijacks an agentic browser to take over 1Password, no master password needed.
For the first time, commercial surveillance vendors outpaced state-sponsored espionage groups in 0-day exploitation, enterprise targeting hit an all-time high at 48%, and China doubled its 0-day usage while sharing exploits faster across groups.
Speakers from Anthropic, Google, OpenAI, and Microsoft revealed that AI can now find zero-days autonomously, crack hardware that resisted weeks of brute-force in minutes, and break every major AI IDE on the market.
MAP-Elites, a quality-diversity algorithm adapted by Amazon researchers, creates vulnerability heatmaps that show where and how an LLM breaks across its entire behavioral space, exposing Llama 3 8B's 0.93 mean harm score across 370 failure niches.
An autonomous AI bot powered by Claude Opus 4.5 scanned 47,000 public repos, targeted 6 vulnerable GitHub Actions workflows, and achieved remote code execution in 4 of them including Microsoft, DataDog, and CNCF.
Weekly roundup covering LLM deanonymization at scale, industrial model theft by Chinese labs, Anthropic's Pentagon ultimatum, CrowdStrike's AI attack trends, and malicious agent skills.
CrowdStrike's 2026 Global Threat Report details how adversaries weaponize GenAI for social engineering, malware development, and direct attacks on AI systems.
NVIDIA announced partnerships with Akamai, Forescout, Palo Alto Networks, Siemens, and Xage Security to secure operational technology using BlueField DPUs for real-time threat detection.
A fully automated LLM pipeline achieved 90% precision in deanonymizing pseudonymous users by matching Hacker News and LinkedIn profiles at $1–$4 per target.
DeepSeek, Moonshot AI, and MiniMax ran massive distillation campaigns with over 16 million queries across ~24,000 fraudulent accounts targeting Claude's capabilities.
Wiz's AI Cyber Model Arena tested 257 real-world challenges and showed that Claude Code scaffolding lifts every model's security performance — even Haiku 4.5 beats GPT-5.2.
Microsoft's MSRC discovered that LLMs encode hidden state across sessions through conversation history, enabling backdoor attacks with 98.4% accuracy and under 2% false positives.
Microsoft discovered 50+ poisoning prompts from 31 companies injecting hidden bias instructions into AI assistant memory through "Summarize with AI" buttons.
A dataset of 157 confirmed malicious agent skills reveals that 54% share a single author, with credential harvesting dominating and malicious skills persisting on marketplaces for 3+ months unchecked.
Meta's SecAlign achieves a 0.5% prompt injection attack success rate through a new training approach that separates trusted instructions from untrusted data.
Microsoft's GRP-Obliteration method removes safety constraints from open-source models with just one prompt, effectively unaligning GPT-OSS, DeepSeek, Gemma, Llama, and others.
OpenAI built a tiered trust system with government ID verification and real-time classifiers to safeguard GPT-5.3-Codex's advanced cybersecurity capabilities.
ETSI published a European standard requiring lifecycle security across all AI system phases with 13 principles focused on documentation, auditability, and monitoring.
Attackers achieved AWS administrative access in 8 minutes using LLM-assisted reconnaissance, targeting LLMjacking and GPUjacking as the new cryptomining.
Moltbot autonomously negotiated $4,200 off a car purchase, but the underlying Clawdbot agent has serious security gaps including plaintext credentials and exposed ports.
Dario Amodei's essay argues AI capability is compounding faster than institutions can adapt, requiring layered defenses and transparency rather than development pauses.
LinkedIn scans for 5,634 browser extensions through simple ID probes, flagging 66% as ToS-violating while raising privacy concerns about user fingerprinting.
Backdoor triggers implanted in agent memory persist through planning, retrieval, and tool workflows with 78% success, with GPT and Gemini most vulnerable.
Major agentic benchmarks have critical flaws allowing agents to achieve high scores through trivial strategies like doing nothing or overwriting test files.
Frontier models nearly doubled performance on realistic SOC investigations, with Opus 4.5 scoring 0.60 and GPT-5.1 scoring 0.58 — up from 0.37 in September.
DeepSeek V3 scored 0.91 on Anthropic's Bloom behavioral benchmark — for delusional sycophancy, highlighting the need to balance risk evals with utility.