Published: 2026-05-06
Categories: AI Security, Infrastructure Security
Exposed by Default: The Open LLM Infrastructure Attack Surface
Key Takeaways
The findings summarized in this note reflect a systemic misconfiguration pattern across the self-hosted AI inference ecosystem that has already attracted organized criminal exploitation:
- About 175,000 publicly accessible Ollama inference servers have been detected across 130 countries, the vast majority with no authentication enforced [1]. A separate LeakIX enumeration in February 2026, using stricter zero-authentication criteria, identified 12,269 such instances; the gap reflects differences in methodology and scope rather than a contradiction in scale [2].
- Ollama, vLLM, LocalAI, and related inference tools do not enforce authentication by default, placing the full burden of access control on operator deployment practices. That dependency demonstrably fails at scale: tens of thousands of unauthenticated instances are reachable on the public internet [1][2].
- Two critical Ollama vulnerabilities—CVE-2024-37032 (“Probllama”) enabling unauthenticated remote code execution, and CVE-2026-7482 (“Bleeding Llama”) enabling heap memory exfiltration—compound the risk on any exposed instance [3][5].
- Operation Bizarre Bazaar documented the first commercial-scale LLMjacking marketplace, recording 35,000 attack sessions against exposed AI endpoints in an approximately five-week period between December 2025 and January 2026 [4].
- Nearly half of observed exposed Ollama instances have tool-calling capabilities active, which turns unauthorized compute access into a potential lateral movement vector toward connected internal systems [1].
- Organizations must treat AI inference endpoints as network services requiring the same access controls applied to any production API: authentication, transport encryption, network isolation, and continuous monitoring.
Background
The democratization of large language model inference has been rapid. Tools such as Ollama, vLLM, and LocalAI allow individual practitioners, research teams, and enterprises to run capable open-weight models on commodity and cloud-hosted hardware, eliminating per-token costs and data residency concerns associated with managed API services. Internet-wide scans conducted in early 2026 identified Ollama as among the most widely visible self-hosted AI services on the public internet [1][7].
The security posture of the self-hosted AI inference ecosystem has lagged its adoption velocity, as reflected in the scale of misconfigured instances now visible on the public internet. Ollama was designed as a local inference tool and binds to 127.0.0.1 by default. Users who need remote or multi-machine access—developers reaching a GPU server from a laptop, teams sharing inference capacity, or organizations deploying a shared model endpoint—commonly set OLLAMA_HOST=0.0.0.0, which binds the API to all available network interfaces. The Ollama quickstart documentation provides no contextual warning that distinguishes safe local use from the risk created when this binding change reaches a cloud-hosted instance without a compensating firewall policy or reverse proxy. Ollama ships with no authentication mechanism at all. A LeakIX analysis reports that the project maintainers have indicated they do not plan to implement authentication natively [2], citing documentation of the founder’s position and a history of rejected pull requests; this note has not independently confirmed this characterization via official Ollama project channels. Whatever the official roadmap, the practical effect is the same: the entire responsibility for access control rests on deployment decisions that practitioners often get wrong.
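The binding distinction described above can be checked mechanically before a deployment goes live. The following is a minimal sketch, assuming the bind value arrives as a string in the style of OLLAMA_HOST (optionally with a scheme and port); the function name classify_bind and its return labels are hypothetical and are not part of Ollama.

```python
import ipaddress


def classify_bind(host_value: str) -> str:
    """Label an OLLAMA_HOST-style value by how widely it exposes the API.

    Returns one of: "loopback", "all-interfaces", "specific",
    "hostname" (needs separate resolution), or "unset" (no host part).
    """
    host = host_value.strip()
    # Drop an optional scheme and path, e.g. "http://0.0.0.0:11434/" -> "0.0.0.0:11434"
    if "://" in host:
        host = host.split("://", 1)[1]
    host = host.split("/", 1)[0]
    if host.startswith("["):
        # Bracketed IPv6 with optional port, e.g. "[::]:11434" -> "::"
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:
        # IPv4 or hostname with a port, e.g. "0.0.0.0:11434" -> "0.0.0.0"
        host = host.split(":", 1)[0]
    if not host:
        return "unset"
    if host == "localhost":
        return "loopback"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return "hostname"
    if addr.is_unspecified:  # 0.0.0.0 or ::
        return "all-interfaces"
    if addr.is_loopback:  # 127.0.0.0/8 or ::1
        return "loopback"
    return "specific"
```

A deployment pipeline could refuse to proceed when the result is "all-interfaces" unless a firewall rule or reverse proxy is declared alongside it. The check says nothing about whether such a compensating control actually exists.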
vLLM, a high-throughput inference engine commonly used in production environments, presents a comparable profile. Its --api-key flag accepts any arbitrary string, including the literal word “EMPTY,” and the project’s own documentation explicitly states that “--api-key alone is insufficient for securing a production deployment” and that a reverse proxy is required for any externally accessible instance [6]. LocalAI and Open WebUI, both widely used inference backends and frontends, similarly default to unauthenticated configurations and depend entirely on operator-supplied access controls. The pattern is consistent across the ecosystem: these tools optimize for development-time convenience and assume a trusted network context that is frequently absent when they reach production.
The combination of rapid adoption, default-open configurations, and high-value compute and model assets has converged into a documented attack surface. Exposed AI infrastructure is no longer a theoretical misconfiguration risk; it is now subject to organized exploitation.
Security Analysis
Scale of Exposure
The extent of misconfigured AI inference infrastructure is now measurable at internet scale. In January 2026, a joint investigation by SentinelOne SentinelLABS and Censys identified 175,000 unique Ollama hosts publicly accessible across 130 countries, discovered over a 293-day scanning period [1]. The geographic distribution is broad: China, the United States, Germany, France, South Korea, and India lead in exposure count, reflecting the global character of the AI adoption wave rather than any concentration in a single region or cloud provider.
LeakIX’s February 2026 enumeration provides a sharper operational picture of what those exposures represent [2]. Of the 12,269 Ollama instances it found enforcing zero authentication, the top hosting networks were AWS (1,686 instances combined), Hetzner (1,004), OVH (773), and Contabo (634). These are not experimental home-lab setups on residential internet connections. Cloud-hosted GPU instances represent substantial infrastructure investment, and many run proprietary or fine-tuned models whose weights carry significant intellectual property value. The same scan estimated that approximately 1,000 of those instances remained unpatched against CVE-2024-37032, exposing them to an unauthenticated root-level code execution chain.
Cisco Talos found through its own scanning research that internet-wide scanning tools detect misconfigured Ollama instances within minutes of scan coverage beginning [7]. The window between deployment and first detection has therefore compressed to minutes: organizations that expose an unauthenticated inference endpoint, even briefly, should assume that automated scanning infrastructure has already discovered it.
Vulnerability Landscape
Two critical Ollama vulnerabilities have elevated the risk profile of exposed instances well beyond resource abuse.
CVE-2024-37032, disclosed by Wiz Research in June 2024 and dubbed “Probllama,” exploits insufficient input validation in Ollama’s model pull mechanism [3]. An attacker operates a rogue model registry and supplies a manifest crafted to trigger a path traversal, enabling arbitrary file writes on the server. By writing a malicious shared library and referencing it through /etc/ld.so.preload, the attacker achieves remote code execution at the next model inference invocation. In Docker deployments—a standard production pattern—Ollama typically runs as root and binds to all interfaces, meaning network exposure is the only prerequisite for exploitation. The vulnerability was patched in version 0.1.34, but LeakIX found approximately 1,000 instances still running a vulnerable version as of February 2026.
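The class of bug described above, an attacker-supplied manifest field ending up in a filesystem path, is usually closed by validating the field strictly and then confirming that the resolved path stays inside the intended directory. The sketch below illustrates that general pattern only. It is not Ollama's code or its actual patch, and the names DIGEST_RE and blob_path, as well as the on-disk naming scheme, are assumptions made for the example.

```python
import re
from pathlib import Path

# Accept only a well-formed SHA-256 digest; anything else is rejected outright.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def blob_path(blob_dir: Path, digest: str) -> Path:
    """Map a manifest digest to a file path that cannot leave blob_dir."""
    if not DIGEST_RE.fullmatch(digest):
        raise ValueError(f"malformed digest: {digest!r}")
    base = blob_dir.resolve()
    candidate = (base / digest.replace(":", "-")).resolve()
    # Defense in depth: even after validation, confirm containment.
    if base not in candidate.parents:
        raise ValueError("resolved path escapes the blob directory")
    return candidate
```

With this pattern, a value such as "../../../etc/ld.so.preload" fails the format check before any path is constructed, and the containment check guards against later changes that loosen the pattern.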
CVE-2026-7482, disclosed by Cyera Research in early 2026 and named “Bleeding Llama,” carries a CVSS score of 9.1 [5]. The vulnerability resides in the GGUF model loader: an attacker submits a malformed GGUF file containing a declared tensor offset and size exceeding the actual file length, inducing a heap out-of-bounds read. The complete exploit requires only three unauthenticated API calls. Memory leaked through this channel may include prompts, conversation history, environment variables, API keys, and other secrets resident in the inference process at the time of attack. Ollama’s own model push mechanism is then used to exfiltrate the collected heap contents to an attacker-controlled registry—a multi-stage exploitation chain that repurposes the platform’s own features against its operators. Cyera estimated approximately 300,000 Ollama servers were potentially exposed at the time of disclosure; the fix was released in version 0.17.1.
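The missing check at the heart of this vulnerability class is a bounds comparison between what a file declares and what it actually contains. The following sketch shows that comparison in isolation, assuming tensor metadata has already been parsed into simple records. TensorInfo and out_of_bounds_tensors are hypothetical names, and the layout is simplified; this is not the GGUF loader or the vendor's fix.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TensorInfo:
    name: str
    offset: int  # declared byte offset, relative to the start of the tensor data region
    size: int    # declared byte length of the tensor


def out_of_bounds_tensors(
    tensors: Iterable[TensorInfo], data_start: int, file_length: int
) -> List[str]:
    """Return names of tensors whose declared byte range exceeds the file."""
    bad = []
    for t in tensors:
        end = data_start + t.offset + t.size
        # Negative values and ranges past end-of-file are both invalid.
        if t.offset < 0 or t.size < 0 or end > file_length:
            bad.append(t.name)
    return bad
```

A loader that rejects the file whenever this list is non-empty never reads past the mapped data, regardless of what the header claims.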
The broader inference ecosystem has not established a mature coordinated vulnerability disclosure posture. vLLM has accumulated multiple high-severity CVEs since 2025, including CVE-2025-48956, which enables denial-of-service through unbounded HTTP header processing in versions prior to 0.10.1.1 [6][12]. Security research attention has concentrated in this area only recently, and the documented vulnerability surface should not be treated as a ceiling—undiscovered vulnerabilities in this software category remain likely.
The LLMjacking Economy
Widespread unauthenticated exposure combined with valuable compute resources has attracted organized threat actors. LLMjacking—the unauthorized use of exposed inference endpoints for free compute consumption, data exfiltration, and commercial resale of access—emerged as a documented attack class in 2024 and has since developed into a structured criminal market.
Operation Bizarre Bazaar, reported by Pillar Security Research in 2026, is the most fully documented instance of this pattern [4]. The operation comprises a three-part supply chain: a scanning service that uses Shodan and Censys to identify exposed AI endpoints at scale, a validation layer that confirms each endpoint’s actual inference capability, and a commercial marketplace operating as silver.inc, which advertises itself as a “Unified LLM API Gateway” and resells discounted access to more than 30 LLM providers without authorization [4]. Pillar’s honeypot infrastructure recorded 35,000 attack sessions targeting exposed AI infrastructure over roughly five weeks between December 2025 and January 2026, an average of about 1,000 sessions per day. The campaign has been attributed to a threat actor operating as “Hecker,” which points to a persistent, commercially motivated actor for whom exposed AI infrastructure is a recurring revenue stream rather than an opportunistic target.
The financial impact on victims can be immediate. Reported incidents suggest LLMjacking can generate substantial GPU cost spikes within a single billing cycle, as attackers direct high-volume inference workloads through the compromised endpoint. For organizations without billing anomaly alerts on GPU usage, the first signal of compromise may arrive as an unexpected invoice.
The SentinelOne and Censys research added a further dimension to the risk profile: nearly half of all observed exposed Ollama instances had tool-calling capabilities enabled [1]. When tool-calling is active, an attacker with inference access can instruct the model to execute code, query internal APIs, and interact with connected external services. The exposure is no longer bounded by the inference context itself. A compromised inference endpoint with tool-calling becomes a potential pivot point for lateral movement into adjacent infrastructure, qualitatively changing the incident response profile from a resource-theft event to a potential network intrusion.
Systemic Root Cause
The underlying dynamic is not unique to AI inference tooling. Default-open configurations, inadequate deployment documentation, and the assumption of trusted network contexts are recurring failure modes in developer-oriented infrastructure software. What amplifies the pattern in the AI inference ecosystem is the speed of adoption, the high cost and value of GPU compute resources, the intellectual property embedded in proprietary model weights, and the relatively recent arrival of security research attention on this category of software.
Cisco Talos characterized the pattern as “a widespread neglect of fundamental security practices such as access control, authentication and network isolation in the deployment of AI systems” [7]. That pattern is structural rather than attributable to individual practitioner error. The tools themselves create the conditions for this failure: they default to configurations that are safe only for localhost-only deployment, and they provide no graduated guidance or warning when users configure them for network-accessible deployment. Until that changes, the operational security burden remains with the operator.
Recommendations
Immediate Actions
Every organization running self-hosted AI inference services should validate three controls before the next business day. First, all Ollama, vLLM, LocalAI, or comparable inference services must sit behind a reverse proxy—NGINX, Caddy, Traefik, or a cloud-native API gateway—that enforces authentication on every inbound request before any traffic reaches the inference process. Second, inference services should bind exclusively to localhost (127.0.0.1) or a private network interface; any instance bound to 0.0.0.0 must have corresponding firewall or security group rules restricting inbound access to verified source IP ranges. Third, any Ollama deployment running a version below 0.17.1 should be patched immediately to address the Bleeding Llama memory exfiltration vulnerability (CVE-2026-7482); versions below 0.1.34 additionally require patching against the Probllama RCE chain (CVE-2024-37032).
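The patch thresholds in the third control lend themselves to an automated check across an inventory. The sketch below compares a reported version string against the two fixed versions named above (0.1.34 and 0.17.1). The function names are hypothetical, and how the version string is obtained (for example from the service's version endpoint or from package metadata) is left as an assumption outside the code.

```python
import re
from typing import List, Tuple

# Fixed versions as stated in this note.
PROBLLAMA_FIXED = (0, 1, 34)        # CVE-2024-37032
BLEEDING_LLAMA_FIXED = (0, 17, 1)   # CVE-2026-7482


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' with an optional leading 'v'.

    Pre-release or build suffixes (e.g. '-rc1') are ignored, so a release
    candidate is treated the same as the final release it precedes.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def missing_patches(version_text: str) -> List[str]:
    """List the CVEs from this note that the given version is still exposed to."""
    version = parse_version(version_text)
    missing = []
    if version < PROBLLAMA_FIXED:
        missing.append("CVE-2024-37032")
    if version < BLEEDING_LLAMA_FIXED:
        missing.append("CVE-2026-7482")
    return missing
```

For example, missing_patches("0.5.7") returns ["CVE-2026-7482"], while a host at or above 0.17.1 returns an empty list.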
Operators should also conduct an immediate external exposure audit using tools such as Shodan or Censys with the query product:"Ollama" to assess their own footprint from an attacker’s perspective. Port 11434 (Ollama), port 8000 (common OpenAI-compatible API binding), and port 8080 (common Open WebUI binding) should be reviewed across all cloud environments and verified against intended access policy.
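Alongside search-engine queries, a direct reachability test from an external vantage point confirms whether the ports listed above actually accept connections. The following is a minimal sketch using only TCP connects; the port labels follow this note, the function name open_ports is hypothetical, and it should be run only against hosts the operator owns or is authorized to test.

```python
import socket
from typing import Dict, List

# Ports named in this note; the labels are descriptive only.
INFERENCE_PORTS: Dict[int, str] = {
    11434: "Ollama",
    8000: "OpenAI-compatible API",
    8080: "Open WebUI",
}


def open_ports(
    host: str, ports: Dict[int, str] = INFERENCE_PORTS, timeout: float = 1.0
) -> List[int]:
    """Return the subset of ports on host that accept a TCP connection."""
    reachable = []
    for port in sorted(ports):
        try:
            # A completed handshake means the port is reachable from here.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            # Refused, filtered, or timed out: treat as not reachable.
            pass
    return reachable
```

An open port is only a signal that something is listening. Whether it is an inference service, and whether it enforces authentication, still has to be verified against the intended access policy.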
Short-Term Mitigations
Beyond the immediate baseline, organizations should establish persistent monitoring of API gateway and reverse proxy logs for behavioral indicators of LLMjacking activity. Anomalous request volumes, systematic model enumeration across the /api/tags endpoint, and push requests directed to external registries may all indicate active exploitation. Where fine-tuned or proprietary model weights are deployed, access to the /api/pull and /api/push endpoints should be disabled at the reverse proxy layer unless those operations are explicitly required and their use is monitored.
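The indicators above can be turned into a first-pass log filter. The sketch below assumes proxy logs have already been parsed into (client_ip, method, path) tuples; that record shape, the function name flag_clients, and both thresholds are placeholders chosen for illustration and would need tuning against real traffic.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (client_ip, method, path)


def flag_clients(
    records: Iterable[Record],
    tags_threshold: int = 5,
    volume_threshold: int = 1000,
) -> Dict[str, Set[str]]:
    """Map each suspicious client IP to the set of indicators it triggered."""
    volume: Counter = Counter()
    tags_hits: Counter = Counter()
    findings: Dict[str, Set[str]] = {}
    for ip, _method, path in records:
        volume[ip] += 1
        if path.startswith("/api/tags"):
            tags_hits[ip] += 1
        # Any pull or push is worth a look where those operations are not expected.
        if path.startswith(("/api/pull", "/api/push")):
            findings.setdefault(ip, set()).add("registry-operation")
    for ip, count in tags_hits.items():
        if count >= tags_threshold:
            findings.setdefault(ip, set()).add("model-enumeration")
    for ip, count in volume.items():
        if count >= volume_threshold:
            findings.setdefault(ip, set()).add("high-volume")
    return findings
```

The output is a triage list rather than a verdict: a legitimate client that lists models frequently will also appear, so findings are best reviewed alongside authentication logs.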
vLLM deployments should set the --api-key flag and route through a reverse proxy, treating the flag as one layer of a defense-in-depth posture rather than a standalone control. Multi-node vLLM deployments communicate without encryption by default; inter-node traffic must be placed on an isolated network segment or protected by mutual TLS, as the vLLM documentation explicitly flags this as an insecure default [6]. This requirement is particularly relevant in organizations scaling inference across GPU clusters.
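Because an API key flag of this kind accepts any string, the quality of the key is entirely the operator's responsibility. The sketch below shows one way to generate a key, reject obvious placeholders, and compare keys in constant time at a proxy or wrapper layer. The placeholder list, the minimum length, and all function names are assumptions for illustration and are not part of vLLM.

```python
import hmac
import secrets

# Illustrative placeholders only; extend for the values seen in your environment.
PLACEHOLDER_KEYS = {"", "empty", "changeme", "test", "none"}
MIN_KEY_LENGTH = 32


def new_key() -> str:
    """Generate a random URL-safe key from 32 bytes of OS randomness."""
    return secrets.token_urlsafe(32)


def is_weak_key(key: str) -> bool:
    """Flag placeholder or short keys before they reach a deployment."""
    return key.strip().lower() in PLACEHOLDER_KEYS or len(key) < MIN_KEY_LENGTH


def key_matches(presented: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking a match prefix via timing."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```

A deployment script could call is_weak_key on the configured value and abort on True, which would catch the literal "EMPTY" case before the service starts.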
Tool-calling and function-execution capabilities warrant their own access control review. They should be disabled by default on any inference endpoint accessible beyond a single trusted workstation and enabled only after explicit security review and with additional logging. The potential impact of unauthorized tool-calling access—code execution, internal API queries, external service calls—substantially exceeds the impact of unauthorized inference access alone.
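One way to make tool-calling opt-in is to remove tool definitions from requests at the proxy layer unless the caller has been explicitly reviewed. The sketch below assumes a JSON chat request that carries client-supplied tools in a top-level "tools" field, as OpenAI-style chat APIs do; the function name strip_tools and the allowlist mechanism are hypothetical, and tools wired in server-side by an agent framework would need separate controls.

```python
import json
from typing import FrozenSet, Tuple


def strip_tools(
    body: bytes, client_id: str, allowed_clients: FrozenSet[str] = frozenset()
) -> Tuple[bytes, bool]:
    """Drop the "tools" field unless the client is on the reviewed allowlist.

    Returns the (possibly rewritten) body and whether anything was removed,
    so the caller can log the event.
    """
    payload = json.loads(body)
    removed = False
    if (
        isinstance(payload, dict)
        and "tools" in payload
        and client_id not in allowed_clients
    ):
        del payload["tools"]
        removed = True
    return json.dumps(payload).encode(), removed
```

Logging every removal gives the security team a record of which clients are attempting tool use before it has been approved.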
Strategic Considerations
At the organizational level, the gap between AI deployment velocity and security governance is a structural risk that requires process intervention, not only technical controls. AI inference infrastructure must be integrated into existing asset management, vulnerability management, and network security review processes with the same rigor applied to other networked production services. Deployment of self-hosted AI tooling should require a documented security review that explicitly addresses authentication defaults and network binding before the instance is accessible from any network other than localhost.
Vendor and tool selection decisions for inference infrastructure should factor in the project’s approach to security defaults and its track record on vulnerability disclosure and patching cadence. Projects that explicitly decline to implement native authentication transfer the full operational security burden to operators, a risk that is elevated in environments where AI practitioners and infrastructure security professionals are not the same people.
Model weights, particularly proprietary fine-tunes representing substantial investment, should be classified as sensitive intellectual property and governed accordingly. The Bleeding Llama vulnerability demonstrated that heap memory exfiltration from an exposed inference endpoint can yield API keys, authentication tokens, and system secrets alongside model outputs, making the potential blast radius of a compromised inference instance substantially larger than its inference context alone. Data classification policies should reflect this reality.
CSA Resource Alignment
The misconfiguration patterns and attack vectors described in this note map directly to several active CSA frameworks and guidance documents.
The CSA AI Controls Matrix (AICM) provides the most directly applicable governance framework, offering 243 control objectives across 18 security domains for AI systems [8]. The AICM’s infrastructure security and access control domains address authentication requirements, network isolation, and API security controls directly relevant to every exposure pattern documented here. Organizations deploying self-hosted AI inference services should use the AICM as the governance baseline against which their deployment posture is assessed. The AICM’s Shared Security Responsibility Model (SSRM) places access control obligations squarely within the Application Provider responsibility tier for self-hosted deployments [8], leaving no shared-responsibility basis for delegating this obligation to the open-source tool’s maintainers.
MAESTRO, CSA’s seven-layer agentic AI threat modeling framework, is directly applicable when the exposed inference endpoint serves as a backend for an agent or tool-calling architecture [9]. MAESTRO’s Deployment and Infrastructure layer addresses network-level exposure risks, and the Agent Ecosystem layer captures the lateral movement risk that emerges when tool-calling capabilities are active on a compromised endpoint. Security teams conducting threat modeling on agentic AI deployments should apply MAESTRO analysis to every inference backend in the agent dependency chain, not only to the orchestration layer.
The Cloud Controls Matrix (CCM) v4.1 provides complementary cloud infrastructure controls, particularly within its Identity and Access Management (IAM) and Infrastructure and Virtualization Security (IVS) domains [10]. CCM controls covering network security configuration, access control, and vulnerability management apply directly to the GPU compute instances and container environments where inference services are typically hosted. Organizations already using CCM for cloud security governance can use those control families as the entry point for addressing AI inference exposure as part of their existing compliance programs.
CSA’s STAR for AI program offers a structured pathway for organizations seeking to formally assess and communicate their AI security posture, including the infrastructure controls described in this note [11]. As AI inference infrastructure matures from experimental to production status across the industry, third-party assurance against published frameworks will become a meaningful vendor selection and partner evaluation criterion.
References
[1] SentinelOne SentinelLABS and Censys. “Researchers Find 175,000 Publicly Exposed Ollama AI Servers Across 130 Countries.” The Hacker News, January 2026.
[2] LeakIX. “12,000 Ollama Instances Exposed: When ‘Local-First’ Meets the Real World.” LeakIX Blog, February 2026.
[3] Wiz Research. “Probllama: Ollama Remote Code Execution Vulnerability (CVE-2024-37032).” Wiz Blog, June 2024.
[4] Pillar Security Research. “Operation Bizarre Bazaar: First Attributed LLMjacking Campaign with Commercial Marketplace Monetization.” Pillar Security, 2026.
[5] Cyera Research. “Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama.” Cyera, 2026.
[6] vLLM Project. “Security — vLLM Documentation.” vLLM Docs, 2025.
[7] Cisco Talos. “Detecting Exposed LLM Servers: A Shodan Case Study on Ollama.” Cisco Security Blog, 2025.
[8] Cloud Security Alliance. “AI Controls Matrix (AICM).” CSA, July 2025.
[9] Cloud Security Alliance. “Agentic AI Threat Modeling Framework: MAESTRO.” CSA Blog, February 2025.
[10] Cloud Security Alliance. “Cloud Controls Matrix and CAIQ v4.1.” CSA, January 2026.
[11] Cloud Security Alliance. “CSA STAR for AI.” CSA (accessed 2026-05-06).
[12] National Vulnerability Database. “CVE-2025-48956.” NVD / NIST, 2025.