Published: 2026-05-14
Categories: AI Security, Vulnerability Management, Threat Intelligence
The AI/ML Exploit Window
Executive Summary
The exploit window — the interval between public disclosure of a software vulnerability and its first active exploitation in the wild — has been shrinking across all of software for years. Within AI and machine learning infrastructure, that contraction has reached a structural extreme. Vulnerabilities in LLM inference servers, agent orchestration frameworks, AI gateway proxies, and model hosting platforms are now being weaponized within hours of their public disclosure, and in a growing fraction of cases, on the same day their CVE is published.
This acceleration is not accidental. It reflects the convergence of three mutually reinforcing forces: the rapid expansion of exposed AI infrastructure on the public internet, the structural characteristics of AI-specific software that concentrate high-value targets behind small code surfaces, and the use of AI itself as an offensive accelerant that compresses the time required to go from published advisory to working exploit. The result is a category of infrastructure where the traditional calculus of vulnerability management (scan, prioritize, schedule remediation within a reasonable window) has been rendered obsolete.
This paper documents the nature and scale of this challenge through a detailed examination of incidents from late 2025 and early 2026. The case studies examined include the weaponization of CVE-2026-33626 in LMDeploy within 13 hours of GitHub advisory publication; the LiteLLM supply chain compromise that delivered credential-stealing payloads through a package downloaded millions of times per day; the “Bleeding Llama” memory-leak vulnerability affecting more than 300,000 exposed Ollama servers; the persistent exploitation of the Ray framework’s unauthenticated job API; and the CVSS 10.0 remote code execution vulnerability in Flowise discovered under active exploitation conditions. Together these incidents illuminate a consistent pattern: AI infrastructure is being treated by attackers as high-priority, high-reward territory, and existing organizational defenses are not calibrated for the speed at which these attacks materialize.
The paper concludes with a structured defensive framework aligned to CSA’s AI Controls Matrix (AICM), MAESTRO threat modeling approach, and Cloud Controls Matrix (CCM), covering the technical controls, operational practices, and governance postures that organizations must adopt to operate safely in this environment.
1. Introduction: When Minutes Replace Days
1.1 The Collapse of the Exploit Window
For most of the history of software security, the interval between vulnerability disclosure and exploit weaponization was measured in weeks. Defenders had time — not infinite time, but time — to assess the severity of a reported flaw, validate a patch, test it in staging, and deploy it to production before adversaries could turn an advisory document into a reliable attack tool. That margin was never a guarantee, but it was a structural feature of the vulnerability ecosystem that organizations could reason about and build programs around.
That feature no longer applies uniformly. VulnCheck’s 2026 Exploit Intelligence Report found that 28.96% of vulnerabilities added to its Known Exploited Vulnerabilities dataset in 2025 were exploited on or before the day their CVE was published — up from 23.6% in 2024 [1]. Across the broader vulnerability population, CyberMindr and other tracking firms observed that the mean time to exploit a disclosed vulnerability fell from approximately 32 days in 2022 to approximately 5 days by 2025 [2]. This compression has not been uniform: it is concentrated in categories where attackers have clear financial or strategic motivation and where exploit development has been commoditized or automated.
AI and machine learning infrastructure sits squarely at the intersection of both factors. The value density of what runs on LLM infrastructure — proprietary model weights, customer conversation data, cloud credentials embedded in environment variables, access tokens providing API routing to paid services — makes it a high-reward target. At the same time, the structural characteristics of AI-specific software (discussed in detail in Section 2) make exploit development particularly tractable for AI-assisted automation tools, enabling even modestly resourced attackers to move from advisory to working weaponized code at machine speed.
1.2 Scope and Purpose of This Paper
This paper focuses specifically on infrastructure-layer vulnerabilities in AI and ML systems — the servers, frameworks, proxies, and pipelines that deploy, run, and interconnect language models — rather than on model-layer attacks such as prompt injection or adversarial inputs. This distinction matters because infrastructure-layer vulnerabilities behave like conventional software vulnerabilities in most respects, but they affect systems that are being deployed at unprecedented speed by organizations with widely varying security maturity. The combination of familiar vulnerability classes with unfamiliar deployment contexts creates conditions in which well-understood attack patterns succeed far more reliably than they would against mature infrastructure.
The incidents documented here are drawn from publicly reported sources and represent a cross-section of the AI infrastructure stack: inference servers (LMDeploy, Ollama), orchestration frameworks (Ray, Flowise), AI gateway proxies (LiteLLM), and the supply chain dependencies that connect them. The analysis draws on threat intelligence published by Sysdig, Oligo Security, Cyera, Datadog Security Labs, Snyk, Trend Micro, OWASP’s GenAI Security Project, and VulnCheck, among others.
2. Why AI Infrastructure Is Uniquely Vulnerable
2.1 Structural Characteristics of the Attack Surface
AI infrastructure exhibits several structural characteristics that differentiate it from conventional enterprise software and make it a particularly productive target for rapid exploitation.
The first characteristic is the concentration of high-value secrets in the runtime environment. Language model deployments routinely embed cloud provider credentials, database connection strings, API keys for commercial model providers, and system prompt configurations directly in process environment variables. When an attacker achieves any form of memory-disclosure or code execution against an inference server, they are not merely compromising a computational resource — they are accessing a concentrated credential store that may provide lateral movement paths to cloud infrastructure, downstream databases, and adjacent AI services. CVE-2026-7482 in Ollama, described in Section 3.3, illustrates this pattern: the vulnerability allowed unauthenticated attackers to read process memory from an exposed inference server, directly yielding environment variables, API keys, and concurrent user conversation data.
The second characteristic is the breadth and speed of deployment without corresponding security review. The AI infrastructure ecosystem has grown faster than the organizational processes that typically gate software into production environments. Open-source inference servers, agent frameworks, and orchestration tools are frequently deployed by individual developers or small teams outside of formal change management processes. This means that exposed endpoints frequently reach the public internet before security teams are aware they exist, and that remediation cycles involve coordination challenges that did not exist in more controlled deployment environments.
The third characteristic is the absence of authentication as a default in many components. A significant fraction of AI infrastructure components were originally designed for internal use in research or development contexts and ship with no authentication requirement by default. Ollama, for example, defaults to binding its API server to localhost only; however, the OLLAMA_HOST=0.0.0.0 configuration that exposes the API to all network interfaces is common in practice and results in significant internet exposure in cloud-hosted deployments [3]. The Ray framework’s Jobs API requires no authorization by default, a condition that has been actively exploited since at least 2023 despite the severity being disputed by its maintainers [4]. When these components are deployed to internet-facing environments — as network scanning data confirms they frequently are — the result is an effectively open attack surface.
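The binding condition described above can be audited mechanically. The following is a minimal sketch, not Ollama's own parsing logic: the helper name binds_beyond_loopback is hypothetical, and it assumes the OLLAMA_HOST value is a bare host, a host:port pair, or a scheme://host:port URL.

```python
import ipaddress
from urllib.parse import urlsplit


def binds_beyond_loopback(ollama_host: str) -> bool:
    """Return True if an OLLAMA_HOST-style value listens on more than loopback.

    Assumption: the value is a bare host, host:port, or scheme://host:port.
    """
    value = ollama_host.strip()
    if not value:
        return False  # unset or empty: the localhost-only default applies
    if "://" not in value:
        value = "//" + value  # let urlsplit treat the value as a network location
    host = urlsplit(value).hostname or ""
    if host == "localhost":
        return False
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (a DNS name or an empty host): flag for manual review.
        return True


print(binds_beyond_loopback("0.0.0.0:11434"))
print(binds_beyond_loopback("127.0.0.1:11434"))
```

A check of this kind could run in a deployment pipeline against rendered environment files, so that a wide binding is flagged before the service starts rather than after a scanner finds it.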
The fourth characteristic is the complexity and opacity of dependency chains. AI application development involves dependency graphs of substantial depth, incorporating frameworks, quantization libraries, model loader utilities, vision-language components, and orchestration middleware that may individually contain vulnerabilities with no connection to AI functionality. These dependencies are often version-pinned loosely or not at all, creating exposure to both direct vulnerabilities in the components and to supply chain attacks targeting the distribution channels through which those components are delivered.
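Loose pinning is straightforward to detect in a requirements file. The sketch below is illustrative only: the function name loosely_pinned and the sample package list are assumptions, and the pattern covers the common name==version form rather than the full requirement-specifier grammar.

```python
import re

# Matches "name==version" with optional extras and an optional environment marker.
# Wildcards such as "1.*" and range operators such as ">=" do not match.
_EXACT_PIN = re.compile(
    r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[A-Za-z0-9._+!-]+\s*(;.*)?$"
)


def loosely_pinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned to one exact version."""
    flagged = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _EXACT_PIN.match(line):
            flagged.append(line)
    return flagged


sample = "litellm>=1.0\nray[default]==2.9.0\nollama\nnumpy==1.*\n"
print(loosely_pinned(sample))
```

An exact version pin narrows exposure to unexpected upgrades but does not by itself detect a tampered artifact published under a pinned version; pip's hash-checking mode (--require-hashes) addresses that separate case.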
2.2 The Role of AI in Accelerating Offense
The acceleration of exploit timelines is not solely a function of AI infrastructure being targeted — it is also a function of AI tooling being used to accelerate the offense. This creates a self-reinforcing dynamic in which the same technology category being attacked is also providing attackers with improved capability.
CVE-Genie, a multi-agent LLM framework described in a September 2025 research paper published on arXiv, demonstrated the automation of exploit reproduction at scale. The system reproduced 51% of CVEs published between June 2024 and May 2025 (428 of 841 tested vulnerabilities) with verifiable working exploits, at an average cost of $2.77 per CVE and within a hard budget of 45 minutes per attempt [5]. These results indicate that exploit generation costs have dropped to a level accessible to nearly any motivated attacker, a qualitative shift with no parallel in the pre-AI era of vulnerability research.
HexStrike-AI, an LLM orchestration framework for offensive security operations documented by Check Point Research and reported across multiple threat intelligence channels in late 2025, embodies this trend at a higher level of integration. The framework provides more than 150 specialized AI agents that can autonomously scan, exploit, and maintain persistence inside targets, coordinated through a Model Context Protocol server that allows commercial LLMs to orchestrate the entire offensive workflow without human micromanagement [6]. Check Point Research documented the tool in connection with exploitation of specific recently disclosed CVEs, with exploitation times dropping from days to under ten minutes for certain vulnerabilities — an acceleration attributable to the automated orchestration of scanning, exploitation, and post-exploitation actions [6].
These developments represent a categorical shift in the threat landscape. Historically, the skill and resource requirements for exploit development created a natural filter: only well-resourced threat actors could consistently convert published advisories into reliable weaponized code quickly. That filter has been substantially degraded. Commodity attackers now operate at exploit generation speeds that historically required nation-state investment in tooling and expertise, which significantly lowers the barrier to entry for sophisticated attacks against AI infrastructure.
3. Incident Analysis: Six Months of AI Infrastructure Exploitation
3.1 CVE-2026-33626: LMDeploy SSRF Weaponized in 13 Hours
LMDeploy is an open-source toolkit from Shanghai AI Laboratory’s InternLM project designed for efficient deployment and serving of large language models, with particular support for vision-language models. CVE-2026-33626 is a Server-Side Request Forgery vulnerability with a CVSS score of 7.5 affecting the toolkit’s image loader: the load_image() function in lmdeploy/vl/utils.py fetches arbitrary URLs supplied by callers without validating whether the target address falls within private or internal IP ranges [7][8].
The significance of this vulnerability lies not only in its technical characteristics but in the speed with which it was operationally exploited. Sysdig’s Threat Research Team documented the first exploitation attempt against their honeypot infrastructure at 03:35 UTC on April 22, 2026 — 12 hours and 31 minutes after the GitHub security advisory was published [8][21]. This is not a theoretical timing claim: Sysdig captured the complete attack session, comprising ten distinct requests unfolding over approximately eight minutes.
The sophistication of the exploitation attempt belied the short time available for its preparation. The attacker used the vision-language image loader as a generic HTTP SSRF primitive, systematically probing internal network resources behind the model server: the AWS Instance Metadata Service (IMDS) endpoint (commonly used to extract cloud credentials in EC2-hosted deployments), a Redis instance, a MySQL database, a secondary HTTP administrative interface, and an out-of-band DNS exfiltration endpoint. The attacker alternated between multiple vision-language model identifiers (internlm-xcomposer2 and OpenGVLab/InternVL2-8B), apparently to reduce the likelihood of triggering anomaly detection based on model selection patterns. The pattern reflected preparation: a systematic SSRF technique aimed at the highest-value targets commonly reachable in cloud-hosted AI deployments and executed within a disciplined eight-minute window.
CVE-2026-33626 affects all LMDeploy versions through 0.12.0 with vision-language support enabled. The vulnerability illustrates a recurring pattern in AI infrastructure: a function designed for a legitimate AI workflow purpose — fetching image inputs for vision-language models — can be repurposed as a reconnaissance and exfiltration primitive against the infrastructure hosting the AI service.
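The missing check described in this section, validating whether a fetch target falls within private or internal address ranges, can be sketched as follows. This is an illustrative example and not the LMDeploy project's patch: the names is_fetch_allowed and resolve_all are hypothetical, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_all(host: str) -> list[str]:
    """Resolve a host name to every address the system resolver returns."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_fetch_allowed(url: str, resolver=resolve_all) -> bool:
    """Allow only http(s) URLs whose every resolved address is globally routable."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False  # unresolvable hosts are rejected
    if not addresses:
        return False
    # is_global is False for loopback, private, and link-local ranges,
    # which covers the 169.254.169.254 metadata address.
    return all(ipaddress.ip_address(addr).is_global for addr in addresses)


print(is_fetch_allowed("http://169.254.169.254/latest/meta-data/"))
print(is_fetch_allowed("http://127.0.0.1:6379/"))
```

A pre-fetch check like this is only part of a defense. The fetch should connect to the address that was validated, and any redirect should be validated again, because a host name can resolve differently between the check and the request. Network-level egress restrictions remain the more durable control.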
3.2 LiteLLM Supply Chain Compromise (March 2026)
LiteLLM is a Python library that provides a unified interface for interacting with more than 100 LLM providers, functioning in many deployments as a centralized gateway and proxy for AI API calls. PyPI download statistics confirm approximately 3.4 million daily downloads at the time of the compromise — a volume that placed the malicious payload within reach of a substantial fraction of active AI development environments.
On March 24, 2026, a threat actor group identified as TeamPCP published two backdoored versions of the litellm package — versions 1.82.7 and 1.82.8 — to PyPI. The attack vector was not a compromise of LiteLLM’s codebase directly but rather a supply chain compromise of LiteLLM’s CI/CD pipeline. LiteLLM’s build process incorporated Trivy, an open-source container scanning tool, retrieved from an apt repository without a pinned version. TeamPCP compromised the Trivy GitHub Action to exfiltrate the PYPI_PUBLISH token from the GitHub Actions runner environment, obtaining the credentials necessary to publish packages as the legitimate LiteLLM organization [9].
The malicious payload embedded in the compromised versions was multi-stage and designed for persistence. Version 1.82.8 contained a malicious .pth file — litellm_init.pth — that executes automatically on every Python process startup when the litellm package is installed in the environment. On each startup, the payload harvested SSH keys, cloud provider credentials, and other secrets, then transmitted them to an attacker-controlled endpoint. A separate lateral movement component targeted Kubernetes clusters accessible from compromised environments, and an RCE backdoor provided persistent remote access [9][11]. The affected versions remained live on PyPI until discovery — long enough to reach an estimated tens of thousands of Python environments given the package’s download velocity.
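The .pth persistence mechanism relies on documented interpreter behavior: at startup, Python's site module executes any line in a .pth file that begins with "import" followed by a space or tab. A minimal audit sketch follows; the function name executable_pth_lines is hypothetical, and the directory to scan is supplied by the caller.

```python
from pathlib import Path


def executable_pth_lines(site_dir: str) -> dict[str, list[str]]:
    """Map each .pth file in site_dir to the lines Python would execute at startup."""
    findings: dict[str, list[str]] = {}
    for pth in sorted(Path(site_dir).glob("*.pth")):
        text = pth.read_text(encoding="utf-8", errors="replace")
        # The site module executes lines starting with "import" plus a space or tab;
        # other non-comment lines are treated as path entries.
        hits = [ln for ln in text.splitlines() if ln.startswith(("import ", "import\t"))]
        if hits:
            findings[pth.name] = hits
    return findings
```

The directories to scan can be obtained from site.getsitepackages() in most environments. Some legitimate packages also ship executable .pth lines, so the output is a review list rather than a verdict.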
The LiteLLM incident is notable for several reasons beyond its immediate impact. It demonstrates that supply chain attacks targeting AI gateway components can achieve exceptionally broad reach through a single compromise, because AI gateway proxies are architectural chokepoints through which all model API traffic flows. Organizations that had deployed LiteLLM as a unified proxy for their LLM API consumption had, in effect, placed a single highly trusted component at the center of their AI infrastructure — a concentration of trust that created equally concentrated supply chain risk.
3.3 Bleeding Llama: CVE-2026-7482 and Unauthenticated Memory Disclosure in Ollama
Ollama is an open-source framework for running large language models locally, with more than 300,000 internet-accessible instances reported as of early 2026 [12]. It is widely used for both personal inference and as the serving layer in enterprise AI deployments. CVE-2026-7482, dubbed “Bleeding Llama” by the Cyera researchers who discovered and disclosed it, is a critical (CVSS 9.1) out-of-bounds heap read vulnerability in Ollama’s GGUF model quantization pipeline [12].
The technical mechanism involves a malformed GGUF file in which the declared tensor offset and size exceed the file’s actual length. When a file meeting this condition is submitted to the /api/create endpoint, Ollama’s model loader reads past the allocated heap buffer, leaking adjacent process memory to the calling client. This endpoint is accessible to any unauthenticated client on the network when Ollama is deployed with wide interface binding — a configuration that is common in practice despite Ollama’s default being localhost-only — or where no external authentication layer has been implemented [3][12].
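The class of check whose absence produces this kind of out-of-bounds read can be illustrated in a few lines. The sketch below is not Ollama's code (Ollama is not written in Python) and does not parse the real GGUF format; TensorEntry is a hypothetical, simplified record used only to show the bounds comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorEntry:
    """Hypothetical, simplified view of one tensor record in a model file header."""
    name: str
    offset: int  # declared start of the tensor data, in bytes from the file start
    size: int    # declared length of the tensor data, in bytes


def out_of_bounds_tensors(entries: list[TensorEntry], file_length: int) -> list[str]:
    """Return the names of tensors whose declared extent is not inside the file."""
    bad = []
    for entry in entries:
        end = entry.offset + entry.size
        if entry.offset < 0 or entry.size < 0 or end > file_length:
            bad.append(entry.name)
    return bad


entries = [TensorEntry("ok", 128, 1024), TensorEntry("overrun", 1024, 4096)]
print(out_of_bounds_tensors(entries, file_length=2048))
```

Python integers do not overflow, so the addition above is safe as written. In a language with fixed-width integers, the same check would also need to guard the offset-plus-size sum against wraparound.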
The practical consequence of this vulnerability is that an unauthenticated attacker with network access to an exposed Ollama instance can extract arbitrary segments of the inference server’s process memory. In common deployment configurations, inference server process memory contains the information most valuable to attackers: system prompt configurations defining the behavior of the deployed model, API keys and credentials embedded in environment variables at startup, active user conversation data from concurrent sessions, and model-specific configuration. All of this is accessible without credentials, without elevated privileges, and without interacting with any functionality that a conventional network security monitoring system would flag as anomalous.
Beyond the specific technical vulnerability, the Bleeding Llama disclosure underscores a systemic characteristic of AI inference server deployments: the no-auth-by-default design philosophy that pervades the category. This philosophy reflects the original design intent (personal and research use where network exposure is not anticipated), but it creates severe risk when the same software is deployed in production environments with internet exposure. runZero’s scanning data documented more than 300,000 Ollama instances with public internet exposure at the time of the disclosure [12].
3.4 ShadowRay and the Persistent Exploitation of Ray Framework
The Ray distributed computing framework, widely used by organizations including OpenAI, Uber, and Amazon for scaling AI and ML workloads, has been the subject of active exploitation campaigns since at least late 2023. The core vulnerability — CVE-2023-48022, a lack of authentication in the Ray Jobs API that enables remote code execution — carries a CVSS score of 9.8 [13]. Despite this severity, the vulnerability has a “disputed” status in its CVE record, reflecting disagreement between the vendor (who characterizes unauthenticated cluster deployment as an intended feature of a tool designed for controlled environments) and security researchers (who document its active exploitation against internet-exposed clusters).
Oligo Security’s initial disclosure of the ShadowRay campaign, published in March 2024, documented hundreds of compromised Ray clusters across a range of industries including education, manufacturing, and financial services. Attackers used the unauthenticated Jobs API to submit malicious jobs, execute arbitrary code on worker nodes, and exfiltrate database credentials, proprietary model weights, training data, and source code [13]. By November 2025, Oligo had documented the continuation of this activity, now tracked as MITRE campaign C0045, with an evolved toolset designated ShadowRay 2.0 [14].
The ShadowRay 2.0 campaign introduced technical innovations that increase both persistence and detection resistance. The most significant is the exploitation of Ray’s “Detached Actors” functionality — a legitimate framework feature allowing computation to continue beyond the lifecycle of the job that spawned it. Attackers deploy malicious Actor processes designed to detach from their parent jobs: even when a security team identifies and terminates the visible job entry in the Ray dashboard, the malicious Actor process continues executing in the background [14]. This persistence mechanism exploits a feature that exists specifically for legitimate long-running computation use cases, making it difficult to distinguish from authorized workloads without detailed process-level instrumentation.
The ShadowRay 2.0 campaign also involved payloads that analysts characterized as consistent with LLM-assisted generation, adapted in response to observed defenses — a pattern suggesting that AI tooling is being applied offensively against AI infrastructure. The longevity of this campaign, persisting for over two years against infrastructure where the underlying vulnerability remains unpatched, illustrates the practical consequence of treating AI-specific authorization weaknesses as out-of-scope for vulnerability management programs.
3.5 Flowise CVE-2025-59528: CVSS 10.0 RCE in AI Agent Framework
Flowise is an open-source low-code platform for building customized AI agent workflows and chatflows, with between 12,000 and 15,000 instances exposed to the public internet as of April 2026 [15]. CVE-2025-59528 is a code injection vulnerability in Flowise’s CustomMCP node carrying the maximum CVSS score of 10.0. The vulnerability exists because the node accepts user-supplied configuration for connecting to external MCP servers, and the configuration processing path executes JavaScript code through a Function() constructor without any input validation or sanitization.
VulnCheck’s Canary network detected first-in-the-wild exploitation of CVE-2025-59528 on April 7, 2026 [15]. Successful exploitation requires only an API token and provides full Node.js runtime privileges on the host server, including access to the child_process module for arbitrary command execution and the fs module for full filesystem access. With more than 12,000 publicly exposed instances and no authentication requirement beyond an API token, which may be absent in misconfigured or default deployments, the attack surface at the time of initial exploitation was substantial.
The Flowise vulnerability illustrates the risk created when low-code AI development tools expose powerful execution primitives to user-supplied configuration. Agent frameworks by design must execute code and invoke tools on behalf of users and automated agents; the security controls that determine which code and which tools are accessible must be correspondingly rigorous. CVE-2025-59528 represents a failure of that boundary — a case where the same flexibility that makes Flowise useful for building AI workflows also made it an effective platform for arbitrary code execution by anyone who could reach its API.
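The boundary failure described here is a choice between evaluating user-supplied configuration as code and parsing it as inert data. Flowise is a Node.js application, so the Python sketch below is an analogy rather than a fix for the product: the function name parse_server_config and the two-key schema are assumptions made for illustration.

```python
import json
from urllib.parse import urlsplit

ALLOWED_KEYS = {"url", "headers"}  # hypothetical schema, for illustration only


def parse_server_config(raw: str) -> dict:
    """Parse user-supplied connection settings as inert JSON data, never as code."""
    config = json.loads(raw)  # raises a ValueError subclass on anything that is not JSON
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")
    unknown = set(config) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    url = config.get("url")
    if not isinstance(url, str) or urlsplit(url).scheme not in ("http", "https"):
        raise ValueError("url must be an http(s) URL")
    headers = config.get("headers", {})
    if not isinstance(headers, dict) or not all(isinstance(v, str) for v in headers.values()):
        raise ValueError("headers must map strings to strings")
    return {"url": url, "headers": headers}
```

Because the input is only ever parsed, a payload written as executable code is rejected as malformed data instead of being run with the privileges of the host process.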
3.6 GreyNoise and Systematic AI Infrastructure Reconnaissance
The individual incident case studies above represent documented exploitation events. Understanding the broader context requires considering the systematic reconnaissance activity that precedes targeted exploitation. GreyNoise’s honeypot infrastructure, which operates Ollama instances exposed to the public internet specifically to capture and classify attack traffic, recorded 91,403 attack sessions between October 2025 and January 2026 [16].
Analysis of this data revealed two distinct campaign patterns. The first, running from October 2025 through January 2026, originated from 62 source IPs across 27 countries and focused on SSRF techniques hunting for misconfigured proxy servers that might provide access to commercial AI API services without payment. The attacks systematically tested both OpenAI-compatible API formats and Google Gemini formats, suggesting intent to monetize discovered access by routing requests through compromised proxies. The second campaign, more concentrated in both source and time, generated 80,469 sessions from just two source IPs over eleven days beginning December 28, 2025. Both source IPs in the second campaign carried prior GreyNoise attribution across more than 200 other CVE exploitation campaigns, including multiple network appliance vulnerabilities — suggesting that AI infrastructure reconnaissance is being integrated into the operational toolkits of established threat actor groups rather than being conducted by AI-specialist attackers [16].
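The concentration of the second campaign is easier to see as arithmetic. The short calculation below uses only the figures quoted above and assumes the 91,403-session total includes both campaigns; the session count for the first campaign alone is not stated in the source, so the remainder is reported as all other sessions.

```python
total_sessions = 91_403    # all recorded sessions, October 2025 through January 2026
second_campaign = 80_469   # sessions from the two source IPs
campaign_days = 11

share = second_campaign / total_sessions
remainder = total_sessions - second_campaign
per_day = second_campaign / campaign_days

print(f"share of all sessions: {share:.1%}")
print(f"sessions outside the second campaign: {remainder:,}")
print(f"average sessions per day in the second campaign: {per_day:,.0f}")
```

Under that assumption, two source IPs account for roughly 88% of all recorded sessions, at an average of about 7,300 sessions per day.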
GreyNoise’s 2026 State of the Edge Report adds important context: more than 52% of remote code execution attempts against monitored infrastructure originated from IP addresses with no prior history in GreyNoise’s dataset, indicating that reputation-based defenses have fundamental coverage limitations [17]. This matters for AI infrastructure because many organizations rely on IP reputation feeds and geographic blocking as first-line controls against API abuse. The data suggests these controls are structurally insufficient against the threat landscape targeting AI endpoints.
4. The Weaponization Pipeline: From Advisory to Attack
4.1 How Advisory Text Becomes Exploit Code
A consistent theme across the incidents documented in Section 3 is the speed with which public advisory information was converted to working attack tools. Understanding this conversion pipeline illuminates why the timeline has collapsed and why organizations cannot rely on advisory-to-patch intervals as a planning assumption.
The components required for exploit development against most AI infrastructure vulnerabilities are now widely available and rapidly improvable using AI assistance. A typical published CVE advisory for an AI infrastructure component includes the affected software version, the name of the vulnerable function or endpoint, a description of the vulnerability class (SSRF, memory disclosure, command injection), and sometimes the specific parameter or file format involved. In the current environment, this information is sufficient to enable LLM-based exploit generation tools to produce a working proof-of-concept within minutes. The structural characteristics of AI infrastructure vulnerabilities — they tend to be straightforward web API or file processing flaws rather than complex memory corruption chains — make them particularly tractable for this class of tool.
The CVE-2026-33626 timeline illustrates this concretely. The GitHub advisory, published at approximately 15:04 UTC on April 21, 2026 (12 hours and 31 minutes before the first recorded exploitation attempt at 03:35 UTC on April 22), identified the vulnerable function (load_image()), described the SSRF class, noted that no IP validation was performed, and named the specific endpoint types accessible through the vision API. An attacker with a working understanding of SSRF technique and access to an AI code generation tool could produce a functioning SSRF exploit targeting the most commonly exposed internal resources (IMDS, Redis, common databases) well within that interval [8].
4.2 The Role of Internet-Wide Scanning
Exploit development is only one component of the weaponization pipeline. Attackers also require target identification — knowledge of which organizations are running the vulnerable software and where it is accessible. This capability has been commoditized through internet-wide scanning services including Shodan, Censys, and FOFA, all of which index internet-accessible services continuously and maintain searchable databases of exposed endpoints.
The consequence is that the moment a vulnerability is disclosed for a widely deployed AI infrastructure component, attackers can query existing scan databases to identify exposed instances at scale. The 300,000+ Ollama instances identified by runZero [12], the 12,000–15,000 Flowise instances identified by threat researchers prior to CVE-2025-59528 exploitation [15], and the Ray clusters documented by Oligo Security [13] are all examples of publicly accessible scan data that effectively pre-populated the target list for subsequent exploitation campaigns. Organizations do not need to be specifically targeted: exposure to internet-wide scanning means that any instance running vulnerable software becomes part of a target set that is assembled automatically.
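Defenders can apply the same logic to their own address space before a scanner does. The following sketch checks whether commonly used default ports answer on a given host. The port table reflects commonly documented defaults and should be adjusted to actual deployments; the function names are hypothetical, and the check is intended only for hosts the operator owns or is authorized to test.

```python
import socket

# Commonly documented default ports; adjust to match actual deployments.
DEFAULT_PORTS = {"ollama": 11434, "ray-dashboard": 8265, "flowise": 3000}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exposed_services(host: str, ports: dict[str, int] = DEFAULT_PORTS) -> list[str]:
    """Return the names of services whose port accepts connections on the host."""
    return [name for name, port in ports.items() if is_port_open(host, port)]
```

Run from a vantage point outside the organization's network, a check like this approximates what an internet-wide scanner would record for the same host. An open port shows reachability only; it does not confirm which software or version is listening.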
4.3 AI as an Offensive Force Multiplier
The OWASP GenAI Exploit Round-up Report for Q1 2026 observes that the period covered — January through April 2026 — demonstrates a clear transition from theoretical risk to operational exploitation across AI infrastructure [18]. This transition reflects not only the maturation of specific AI infrastructure attack techniques but the role of AI tooling in enabling attackers with limited resources to operate at speeds and scales previously beyond their reach.
The HexStrike-AI framework documented by Check Point Research represents the logical endpoint of this trend [6]. A threat actor with access to a commercial LLM and the HexStrike-AI framework can direct a coordinated offensive operation — scanning, vulnerability identification, exploitation, persistence, lateral movement — without deep technical expertise in any individual component. The LLM orchestration layer handles the tactical decision-making; the 150+ specialized agents handle the execution. This architectural pattern transforms AI model access into operational capability in a way that dramatically lowers the barrier to entry for sophisticated attacks against AI infrastructure.
5. Defensive Framework and Recommendations
5.1 Reframing the Vulnerability Management Paradigm
The incidents documented in this paper collectively demonstrate that traditional vulnerability management processes are structurally misaligned with the threat environment facing AI infrastructure. A process designed around a 30-day or 60-day remediation window provides no meaningful protection against vulnerabilities weaponized within hours of disclosure — the class of vulnerability that increasingly characterizes AI infrastructure attacks. Organizations operating AI infrastructure must adopt a materially different operational posture.
The fundamental shift required is from a scheduled, score-driven model to a continuous, context-driven model. This means continuous monitoring of vulnerability feeds specific to AI infrastructure components, standing procedures for emergency patching that bypass normal change management cycles when exploitation timelines demand it, and compensating controls that can be deployed more quickly than patches when patching is not immediately feasible.
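A context-driven model can be expressed as a small decision function that maps deployment context, rather than severity score alone, to a response window. The sketch below is illustrative: the Finding fields and every hour threshold are hypothetical values chosen for the example and are not recommendations drawn from the incidents above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve_id: str
    internet_exposed: bool   # is the affected service reachable from the internet?
    exploited_in_wild: bool  # has exploitation been reported by an intelligence source?
    holds_secrets: bool      # does the process hold credentials or user data?


def remediation_deadline_hours(f: Finding) -> int:
    """Map deployment context to a response window. All thresholds are illustrative."""
    if f.internet_exposed and f.exploited_in_wild:
        return 4  # emergency path: patch or apply a compensating control
    if f.internet_exposed or f.exploited_in_wild:
        return 24 if f.holds_secrets else 72
    return 24 * 30  # internal and not known to be exploited: scheduled cycle


print(remediation_deadline_hours(Finding("CVE-2026-33626", True, True, True)))
```

The point of the structure is that the same CVE yields different deadlines for an internet-facing inference server and an isolated development instance, which a purely score-driven schedule cannot express.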
5.2 Network Architecture and Exposure Reduction
The most consistently impactful control across all the incidents examined in this paper would have been the elimination of unnecessary internet exposure. Ollama’s 300,000 exposed instances [12], Flowise’s 12,000–15,000 exposed instances [15], and the Ray clusters identified across multiple exploitation campaigns [13] all represent deployment decisions — or deployment oversights — that created the precondition for exploitation. Controls that address exposure directly are more durable than controls that depend on patch timing.
AI inference servers, orchestration frameworks, and model hosting APIs should not be directly internet-accessible unless there is a specific, reviewed business requirement for public access. Where external access is required, it should be mediated through authenticated API gateways with rate limiting, request inspection, and anomaly detection. Internal-use-only components — Ray dashboards, Ollama APIs serving internal workflows, MLflow experiment tracking servers — should be isolated to internal networks with strict egress controls that prevent them from initiating outbound connections to arbitrary internet destinations, which would reduce the practical utility of SSRF vulnerabilities even when they exist in the deployed software.
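As a complement to network-level egress controls, an inference server or gateway can refuse to fetch user-supplied URLs that resolve to internal or link-local addresses. The following is a minimal sketch under stated assumptions, not a complete SSRF defense: the function names `is_blocked_address` and `is_safe_fetch_target` are illustrative, and because the check and the later fetch resolve the hostname separately, a real deployment would also need to pin the connection to the validated address to resist DNS rebinding.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_address(ip_text: str) -> bool:
    """Return True for addresses an inference server should not fetch from."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:169.254.169.254) before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254, the cloud metadata address
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_safe_fetch_target(url: str, resolve=socket.getaddrinfo) -> bool:
    """Allow only http(s) URLs whose every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
        infos = resolve(parsed.hostname, port, type=socket.SOCK_STREAM)
    except (OSError, ValueError):
        # Unresolvable host or malformed port: fail closed.
        return False
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(is_blocked_address(a) for a in addresses)
```

The `resolve` parameter exists only so the check can be exercised without network access; in normal use the default `socket.getaddrinfo` applies. The function fails closed: any hostname that resolves to even one blocked address is rejected.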
Zero Trust network segmentation, applied to AI infrastructure, directly addresses the root exposure condition documented across multiple incidents in this paper — unauthenticated endpoints accessible from the public internet. CSA’s Zero Trust guidance, including the organization’s work on Zero Trust for Critical Infrastructure, establishes the principle that no component should be trusted based solely on network location. Applied to AI infrastructure, this means that inference servers, orchestration layers, vector databases, and model storage should each be treated as potentially compromised and required to authenticate and authorize every request they process, regardless of source.
5.3 Authentication and Secrets Management
The default-unauthenticated design philosophy present in much of the AI infrastructure ecosystem requires active remediation by deployers. Organizations should treat any AI infrastructure component that ships without authentication as requiring explicit authentication configuration before deployment to any environment reachable from the internet or from networks containing sensitive resources.
Critically, the secrets embedded in AI infrastructure runtime environments must be managed with the same rigor as secrets in other high-value systems. API keys for commercial model providers, cloud provider credentials, database connection strings, and system prompt configurations should be provisioned through dedicated secrets management systems (such as HashiCorp Vault, AWS Secrets Manager, or equivalent) with automatic rotation policies, rather than embedded statically in environment variables at deployment time. The Bleeding Llama vulnerability’s capacity to extract environment variables from process memory [12], and the LiteLLM malware’s systematic harvest of cloud credentials at Python startup [11], both target the same architectural weakness: treating the inference server process as a trusted, long-lived holder of high-value secrets.
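One way to avoid treating the process as a long-lived holder of static credentials is to fetch each secret on demand and discard it after a short time-to-live. The sketch below assumes a hypothetical `fetch` callable standing in for a call to a secrets manager client; the class name `ShortLivedSecret` and its parameters are illustrative and do not correspond to any specific product's API. The cached value still resides in process memory until it expires, so this narrows the exposure window rather than eliminating it.

```python
import time
from typing import Callable, Optional


class ShortLivedSecret:
    """Fetch a secret on demand and re-fetch it once a TTL has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], str],          # hypothetical secrets-manager call
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached secret, refreshing it if missing or expired."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Paired with rotation on the secrets manager side, a short TTL means a credential recovered from process memory or the environment has a limited useful lifetime.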
5.4 Supply Chain Security for AI Dependencies
The LiteLLM incident demonstrates that the attack surface for AI infrastructure extends through its entire dependency chain to CI/CD pipeline security, PyPI package integrity, and the version-pinning discipline of every component in the stack. Organizations should apply supply chain security controls to AI Python dependencies with the same rigor they would apply to production application dependencies.
Specific controls include: pinning all AI infrastructure dependencies to specific versions and verifying hashes at installation time; monitoring for newly published versions of critical dependencies using automated alerting; reviewing the CI/CD pipeline security of high-trust AI infrastructure packages (particularly those with access to API credentials or Kubernetes clusters); and operating an internal artifact mirror for production AI dependencies to eliminate direct PyPI dependency in production environments. The Datadog Security Labs analysis of the LiteLLM compromise documents the specific CI/CD pipeline vulnerability pattern — unpinned tool dependencies in GitHub Actions runners — that enabled the TeamPCP attack [9], providing a concrete template for defensive review.
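The hash verification step can be sketched as follows. This is an illustrative check for artifacts pulled into an internal mirror, with the function names `sha256_of` and `matches_pin` chosen for this example; pip's hash-checking mode (`--require-hashes`) performs an equivalent comparison at install time when every requirement carries a pinned hash.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: Path, pinned_sha256: str) -> bool:
    """Return True only if the file's digest equals the pinned digest."""
    expected = pinned_sha256.strip().lower()
    return hmac.compare_digest(sha256_of(path), expected)
```

A mirror ingestion job would call `matches_pin` for each downloaded wheel or sdist and reject any artifact whose digest differs from the value recorded when the version was reviewed.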
5.5 Detection and Response for AI-Specific Attack Patterns
Several of the attack patterns documented in this paper have characteristics that make them detectable through instrumentation that is not part of standard AI infrastructure monitoring configurations. Deploying this instrumentation is a prerequisite for timely incident response.
SSRF attacks against AI vision APIs, of the type used in CVE-2026-33626 exploitation, generate internal HTTP requests from inference server processes to unexpected destinations. Process-level network monitoring can detect model serving processes making connections to the AWS IMDS endpoint, to internal database addresses, or to arbitrary external DNS resolvers. eBPF-based solutions are well suited to containerized inference environments because they provide process-level telemetry without requiring kernel module installation. Legitimate inference workloads rarely produce these connection patterns, so they should generate high-confidence alerts.
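The alerting logic described above can be sketched as a small classifier over connection events. The `ConnectionEvent` record is a hypothetical shape for whatever a process-level sensor emits, and the label strings and allowlist are assumptions for illustration; the only fixed value is the well-known link-local metadata address 169.254.169.254.

```python
import ipaddress
from dataclasses import dataclass
from typing import Sequence, Union

IMDS_ADDRESS = ipaddress.ip_address("169.254.169.254")
Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


@dataclass(frozen=True)
class ConnectionEvent:
    """Hypothetical outbound-connection record from a process-level sensor."""
    process: str
    dest_ip: str
    dest_port: int


def classify(event: ConnectionEvent, allowed: Sequence[Network]) -> str:
    """Label an outbound connection made by a model serving process."""
    ip = ipaddress.ip_address(event.dest_ip)
    # The metadata service is checked first so an allowlist cannot mask it.
    if ip == IMDS_ADDRESS:
        return "alert:metadata-service"
    if any(ip in net for net in allowed):
        return "ok"
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return "alert:internal-destination"
    return "alert:unexpected-external"
```

The allowlist would hold the small set of networks an inference process legitimately talks to, such as a model store or a logging endpoint; everything else produces an alert label that a detection pipeline can route.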
Memory disclosure attacks against inference servers, such as CVE-2026-7482, are detectable at the API boundary: legitimate model loading requests involve GGUF files with internally consistent tensor manifests, while exploit files by definition contain manifest inconsistencies. API-layer validation of GGUF file metadata before processing is both a preventive control and a detection mechanism.
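One plausible form of manifest inconsistency is a declared tensor byte range that extends past the data actually present in the file. The sketch below checks for that condition against a deliberately simplified manifest; `TensorEntry` and its fields are assumptions for illustration and do not reproduce the real GGUF on-disk layout, which a production validator would need to parse first.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TensorEntry:
    """Simplified, hypothetical view of one tensor's manifest entry."""
    name: str
    offset: int  # byte offset into the tensor data section
    nbytes: int  # declared size in bytes


def manifest_problems(
    entries: Sequence[TensorEntry], data_section_size: int
) -> List[str]:
    """Return a description of every entry whose declared range is invalid."""
    problems = []
    for entry in entries:
        if entry.offset < 0 or entry.nbytes <= 0:
            problems.append(f"{entry.name}: non-positive size or negative offset")
            continue
        end = entry.offset + entry.nbytes
        if end > data_section_size:
            problems.append(
                f"{entry.name}: range ends at byte {end}, "
                f"past data section of {data_section_size} bytes"
            )
    return problems
```

An empty result means the declared ranges fit within the file; any non-empty result is grounds both to reject the upload and to log the request as a likely exploitation attempt.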
Supply chain malware of the type deployed through LiteLLM 1.82.7 and 1.82.8 executes at Python startup, generating outbound connections to exfiltration endpoints that predate any legitimate model request. Egress monitoring at the container or process level that flags outbound connections from Python processes to unexpected destinations would have surfaced this behavior within seconds of the infected package loading.
5.6 Vulnerability Intelligence and Rapid Response Programs
Given the exploitation timelines documented in this paper, organizations operating significant AI infrastructure should establish dedicated vulnerability intelligence programs covering the specific components in their deployment. This means subscribing to GitHub security advisory feeds for each open-source AI infrastructure component, monitoring NVD and CISA’s Known Exploited Vulnerabilities catalog with automated alerting for relevant CVEs, and integrating threat intelligence feeds from organizations that track AI infrastructure exploitation specifically.
The distinction between CVSS score and active exploitation status is critical. CVE-2026-33626 carried a CVSS score of 7.5 — classified as “high” but not “critical” — and was exploited within 13 hours [8]. A prioritization system that would have deferred this vulnerability to a next-week remediation cycle based on its score would have left organizations exposed through the entire active exploitation window. Active exploitation status, not CVSS score alone, should be the primary triage criterion for emergency response.
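The triage rule can be expressed as a sort key in which exploitation status outranks score. In this sketch the `Finding` record and `triage_order` function are illustrative names, and `CVE-0000-0001` is a made-up placeholder for a higher-scored vulnerability with no known exploitation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Finding:
    cve_id: str
    cvss: float
    actively_exploited: bool  # e.g. listed in KEV or reported by a threat feed


def triage_order(findings: Iterable[Finding]) -> List[Finding]:
    """Sort exploited findings first, then by descending CVSS score."""
    return sorted(findings, key=lambda f: (not f.actively_exploited, -f.cvss))


queue = triage_order([
    Finding("CVE-0000-0001", 9.8, False),   # hypothetical placeholder entry
    Finding("CVE-2026-33626", 7.5, True),   # score and status from this paper
])
print([f.cve_id for f in queue])
```

Under this ordering the 7.5-scored but actively exploited CVE-2026-33626 is handled before the 9.8-scored placeholder, which a purely score-driven queue would have reversed.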
Organizations should maintain pre-approved emergency change procedures for AI infrastructure patching that allow deployment of critical security updates within hours rather than days. These procedures should include communication chains, testing protocols, and rollback capabilities that can be executed outside of normal change management windows.
6. CSA Resource Alignment
6.1 AI Controls Matrix (AICM)
The vulnerabilities and attack patterns documented in this paper map directly to multiple domains within CSA’s AI Controls Matrix (AICM), the comprehensive framework for AI security controls that addresses the shared responsibility model across Foundation Models, Orchestrated Service Providers, Application Providers, and AI Customers [20]. Several AICM domains are particularly relevant.
The AI Supply Chain Security domain addresses precisely the class of risk illustrated by the LiteLLM compromise [20]. AICM controls in this domain specify requirements for dependency integrity verification, CI/CD pipeline security, and third-party component risk assessment. The LiteLLM incident’s root cause — an unpinned tool dependency in a CI/CD workflow that enabled credential exfiltration — falls within the scope of controls that AICM-compliant organizations should already have in place.
The Deployment Infrastructure domain addresses network isolation, authentication configuration, and secrets management requirements for AI serving infrastructure — the class of controls whose absence is documented in the Ollama, Ray, and Flowise incidents [20]. AICM’s framing of the Shared Security Responsibility Model is directly applicable here: cloud service providers, orchestrated service providers, and AI customers each have distinct responsibilities for the security of the components in their layer of the AI stack, and the incidents documented in this paper consistently reflect gaps in one or more of those layers.
6.2 MAESTRO Threat Modeling Framework
CSA’s MAESTRO framework (Multi-Agent Environment, Security, Threat, Risk, and Outcome) provides a seven-layer reference architecture for threat modeling agentic AI systems [19]. The layers are (1) Foundation Models, (2) Data Operations, (3) Agent Frameworks, (4) Deployment and Infrastructure, (5) Evaluation and Observability, (6) Security and Compliance, and (7) Agent Ecosystem. Several of them map directly to vulnerability classes documented in this paper.
Layer 4 (Deployment and Infrastructure) encompasses the inference server vulnerabilities documented in Sections 3.1 and 3.3: the SSRF in LMDeploy and the memory disclosure in Ollama. MAESTRO’s treatment of this layer emphasizes that the infrastructure hosting AI models presents attack surfaces that are largely independent of the AI functionality itself, and that threat modeling must address both the AI-specific attack surface and the conventional infrastructure vulnerabilities that affect AI deployments.
Layer 3 (Agent Frameworks) encompasses the orchestration-layer vulnerabilities documented in Sections 3.4 and 3.5 — Ray’s unauthenticated Jobs API and Flowise’s RCE via the CustomMCP node. MAESTRO’s security considerations at this layer address the challenge of providing agents with necessary tool access while preventing that access from being exploited for unauthorized actions — precisely the design failure that CVE-2025-59528 represents.
Layer 2 (Data Operations) and its supply chain dependencies map to the LiteLLM incident, where compromise of the package distribution pipeline provided a vector into the data and credential flows managed by AI gateway proxies.
6.3 Cloud Controls Matrix (CCM)
CSA’s Cloud Controls Matrix provides a comprehensive framework of cloud security controls organized across 17 domains. Several CCM domains provide direct governance coverage for the defensive measures recommended in Section 5.
The Supply Chain Management, Transparency and Accountability domain (STA) addresses third-party risk assessment, supply chain due diligence, and vendor security requirements — covering the organizational governance dimension of the LiteLLM supply chain risk. The Identity and Access Management domain (IAM) covers the authentication and authorization controls whose absence is the root cause of the Ray and Ollama exposure incidents. The Infrastructure and Virtualization Security domain (IVS) covers network isolation, segmentation, and the ingress/egress controls necessary to limit the practical impact of SSRF vulnerabilities. Organizations seeking to map the recommendations in this paper to existing compliance and audit frameworks should use CCM as the primary reference structure.
6.4 STAR Program
CSA’s Security, Trust, Assurance, and Risk (STAR) program provides a framework for assessment and certification of cloud and AI service security posture. Organizations consuming AI infrastructure from third-party providers should require STAR-level attestation for critical AI infrastructure components, particularly those acting as gateway proxies or orchestration layers with access to sensitive credentials and data flows. The LiteLLM incident illustrates the risk of treating widely adopted open-source AI infrastructure as inherently trustworthy based on community adoption alone, a risk that structured third-party assurance frameworks are specifically designed to address.
7. Conclusions and Recommendations
The AI/ML exploit window has collapsed to a degree that renders traditional remediation timelines operationally irrelevant for a growing class of AI infrastructure vulnerabilities. The evidence assembled in this paper, from the 12-hour, 31-minute LMDeploy exploitation timeline to the LiteLLM PyPI supply chain compromise, and from the persistent ShadowRay campaigns to the mass exposure of Ollama deployments with insecure wide-binding configurations, points to a consistent conclusion: AI infrastructure is being targeted with the same speed, sophistication, and systematic approach that characterize attacks against other high-value enterprise systems, while the aggregate security posture of AI infrastructure deployments lags far behind that of those systems.
The combination of factors that produces this asymmetry is not static. The density of valuable data and credentials in AI infrastructure processes will increase as AI capabilities are integrated more deeply into enterprise workflows. The tooling available to attackers for automated exploit generation and orchestrated offensive operations will continue to improve. The default-unauthenticated deployment patterns that characterize much of the current AI infrastructure ecosystem will persist in legacy deployments even as new versions adopt stronger defaults. Organizations that treat AI infrastructure security as a secondary concern, to be addressed after the primary goal of deploying AI capabilities, will face an increasing gap between the threat environment they face and the defenses they have deployed.
The recommendations that follow represent the minimum viable defensive posture for organizations operating AI infrastructure in the current threat environment. They are organized by urgency.
Immediate actions (within 72 hours):
- Inventory all AI infrastructure components (inference servers, orchestration frameworks, gateway proxies, vector databases) and identify any with default-unauthenticated configurations
- Audit internet exposure of all AI infrastructure components and close unnecessary exposure
- Verify that the LiteLLM package versions installed in production are not the compromised 1.82.7 or 1.82.8 releases
- Subscribe to GitHub security advisory feeds for all AI infrastructure components in the deployment stack
- Ensure Ollama instances are patched to version 0.17.1 or later for CVE-2026-7482 remediation
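The LiteLLM version check in the list above can be sketched with the standard library's package metadata lookup. The function name `litellm_status` and its return strings are illustrative. The check reads installed metadata without importing the package, but because the malicious releases execute at Python startup, an environment already suspected of compromise is more safely inspected from outside, for example by reading its lock file or site-packages directory.

```python
from importlib import metadata

# Compromised release numbers as identified in this paper.
COMPROMISED_LITELLM = frozenset({"1.82.7", "1.82.8"})


def litellm_status(lookup=metadata.version) -> str:
    """Report whether the installed litellm version is a known-bad release."""
    try:
        installed = lookup("litellm")
    except metadata.PackageNotFoundError:
        return "not-installed"
    if installed in COMPROMISED_LITELLM:
        return "compromised"
    return f"ok:{installed}"


if __name__ == "__main__":
    print(litellm_status())
```

The `lookup` parameter defaults to `importlib.metadata.version` and exists only so the logic can be tested without installing any particular release.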
Short-term actions (within 30 days):
- Implement network isolation and Zero Trust segmentation for AI infrastructure, preventing inference servers from initiating unauthorized outbound connections
- Deploy runtime secrets management for AI workloads, eliminating static credential embedding in process environments
- Establish emergency patching procedures for AI infrastructure that allow critical security updates to be deployed within hours of exploitation discovery
- Implement egress monitoring for AI process containers to detect unauthorized outbound connections consistent with supply chain malware behavior
- Apply MAESTRO threat modeling to all production AI deployments to systematically identify and remediate infrastructure-layer vulnerabilities
Strategic actions (ongoing):
- Integrate AI infrastructure CVEs into vulnerability management programs as a dedicated tracking category, prioritizing active exploitation status over CVSS score alone
- Participate in threat intelligence sharing communities providing AI-specific exploitation data, including OWASP’s GenAI Security Project
- Apply AICM controls across the AI deployment stack, with explicit coverage for supply chain security, infrastructure authentication, and secrets management
- Require STAR attestation or equivalent from third-party AI infrastructure providers with access to sensitive credentials or data
- Conduct regular adversary simulation exercises specifically targeting AI infrastructure attack patterns, including SSRF via vision APIs, supply chain compromise scenarios, and unauthenticated API abuse
The threat is real, it is active, and it is accelerating. The organizations that will weather this environment are those that treat AI infrastructure with the same security rigor they apply to financial systems, customer data stores, and identity infrastructure — because AI infrastructure in the current deployment landscape routinely holds the credentials, customer data, and identity tokens that define those categories.
References
[1] VulnCheck. “2026 VulnCheck Exploit Intelligence Report.” VulnCheck, 2026.
[2] CyberMindr. “Average Time-to-Exploit in 2025.” CyberMindr Blog, 2025.
[3] runZero. “Ollama vulnerability CVE-2026-7482: Find impacted assets.” runZero Blog, 2026.
[4] Oligo Security. “ShadowRay: First Known Attack Campaign Targeting AI Workloads Actively Exploited in the Wild.” Oligo Security Blog, 2024.
[5] Ullah et al. “From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs.” arXiv:2509.01835, September 2025.
[6] Check Point Research. “Hexstrike-AI: LLM Orchestration Driving Real-World Zero-Day Exploits.” Check Point Blog, 2025.
[7] SecurityOnline. “CVE-2026-33626: High-Severity SSRF Exploited in the Wild to Hijack AI Inference Engines.” SecurityOnline.info, April 2026.
[8] Sysdig. “CVE-2026-33626: How attackers exploited LMDeploy LLM Inference Engines in 12 hours.” Sysdig Threat Research Team, April 2026.
[9] Datadog Security Labs. “LiteLLM and Telnyx compromised on PyPI: Tracing the TeamPCP supply chain campaign.” Datadog Security Labs, March 2026.
[10] Trend Micro. “Your AI Gateway Was a Backdoor: Inside the LiteLLM Supply Chain Compromise.” Trend Micro Research, 2026.
[11] Snyk. “How a Poisoned Security Scanner Became the Key to Backdooring LiteLLM.” Snyk Blog, 2026.
[12] Cyera Research. “Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama.” Cyera Research, 2026.
[13] SecurityWeek. “Attackers Exploit Ray AI Framework Vulnerability to Hack Hundreds of Clusters.” SecurityWeek, 2024.
[14] Penligent AI. “The Zombie Vulnerability: A 2026 Autopsy of CVE-2023-48022 and the ShadowRay 2.0 Resurgence.” Penligent AI, 2026.
[15] Bleeping Computer. “Max severity Flowise RCE vulnerability now exploited in attacks.” Bleeping Computer, April 2026.
[16] GreyNoise Intelligence. “Threat Actors Actively Targeting LLMs.” GreyNoise Blog, 2026.
[17] GreyNoise Intelligence. “GreyNoise Releases 2026 State of the Edge Report.” GreyNoise Press Release, 2026.
[18] OWASP Gen AI Security Project. “OWASP GenAI Exploit Round-up Report Q1 2026.” OWASP, April 2026.
[19] Cloud Security Alliance. “Agentic AI Threat Modeling Framework: MAESTRO.” CSA Blog, February 2025.
[20] Cloud Security Alliance. “Introductory Guidance to the AI Controls Matrix (AICM).” CSA Research, 2025.
[21] The Hacker News. “LMDeploy CVE-2026-33626 Flaw Exploited Within 13 Hours of Disclosure.” The Hacker News, April 2026.