Model Poisoning: Credential Exfiltration in Self-Hosted LLM Deployments

Authors: Cloud Security Alliance AI Safety Initiative
Published: 2026-03-31

Categories: AI/ML Security, Supply Chain Security, Credential Security

Download PDF

Key Takeaways

Organizations deploying large language models in self-hosted environments face a threat that receives less systematic attention than conventional application security: the model artifact itself as an attack vector for credential theft. Unlike conventional application security, where code and configuration are typically separated from data, model deployments collapse this distinction — weight files contain executable logic (in pickle-based formats), embedded templating engines (in GGUF chat templates), or behaviorally trained patterns that can leak secrets through inference outputs. The threat is compounded by the trust many organizations implicitly extend to models sourced from centralized registries, whose supply chains have themselves been compromised in documented incidents.

Adversaries exploit this threat surface through three primary mechanisms. The first is execution at load time, where malicious code embedded in model files triggers the moment an inference server deserializes the artifact — establishing reverse shells, harvesting environment variables, and exfiltrating API keys before a single prompt is processed. The second is behavioral backdooring, where trigger-activated poisoning causes a seemingly normal model to leak system prompts, embedded credentials, or conversation context when a specific input pattern is present. The third is platform-level vulnerabilities in the inference servers themselves — a steady stream of critical CVEs in Ollama, vLLM, Meta Llama Stack, and llama.cpp has demonstrated that the software layer surrounding a model is as consequential as the model itself.

Immediate priorities for security teams include migrating to the safetensors format for all model weights, pinning model artifacts to verified cryptographic hashes, restricting network egress from inference workers, and adopting model signing consistent with the OpenSSF Model Signing specification. Organizations with agentic deployments — where models have tool access to filesystems, environment variables, or cloud credentials — face substantially elevated risk and should treat inference processes with the same privilege-minimization discipline applied to any code executing in a production environment.

Background

The deployment of self-hosted open-weight models has grown considerably alongside broader enterprise AI adoption. Organizations deploy open-weight models using frameworks such as Ollama, vLLM, LocalAI, llama.cpp, and Meta Llama Stack for reasons ranging from data residency requirements and latency optimization to cost control and the desire to fine-tune on proprietary datasets. In all of these configurations, the operator takes on explicit responsibility for a component that cloud-hosted AI services handle on their behalf: the security of the model artifact and its execution environment.

This responsibility gap is not theoretical. Wiz Research identified over 1,000 Ollama inference servers exposed to the public internet in the course of their 2024 vulnerability research [1], and subsequent scanning by Oligo Security documented broader exposure patterns across cloud regions and network configurations [23]. Many of these instances run with default configurations, which in Ollama’s case means no authentication, binding to all network interfaces, and — in Docker deployments — root-level process privileges. The same research established that Ollama’s original design assumed local-only access, and its API surface was never hardened for internet-facing deployment.

The model weight file occupies an unusual position in this threat landscape. Engineers accustomed to treating data files as passive artifacts may not apply the same scrutiny to model weight files as they would to executable binaries, and many deployment workflows pull model artifacts from public repositories without systematic verification of provenance or integrity. This reflects a widespread misunderstanding: that model files, like images or documents, are passive data rather than active executable payloads. In the case of PyTorch’s native .pt and .bin formats — which rely on Python’s pickle serialization — this assumption is false by design. A pickle stream can embed arbitrary Python callables that execute at deserialization time, and the PyTorch torch.load() function provides no sandboxing or execution restriction in its default invocation.

The ecosystem has responded to this gap unevenly. Hugging Face introduced the safetensors format specifically to eliminate executable code from the serialization path, and an independent security audit confirmed that safetensors files cannot embed arbitrary callables [2]. However, adoption remains incomplete: as of early 2026, a significant share of models on Hugging Face continue to be distributed in pickle-based formats, though precise adoption statistics are not publicly maintained, and many downstream serving frameworks default to accepting both formats. The GGUF format widely used by llama.cpp presents a different but similarly consequential risk through its embedded Jinja2 chat templates, which execute at inference initialization time in frameworks that do not sandbox the template engine [3].

Security Analysis

Execution-at-Load Attacks: Pickle Deserialization

A high-severity threat in self-hosted LLM deployments is the exploitation of Python’s pickle deserialization to execute arbitrary code at the moment a model is loaded. JFrog Security Research identified concrete examples of this attack pattern on Hugging Face in 2024: models baller423/goober2 and star23/baller13 contained PyTorch .bin files with malicious __reduce__ methods embedded in their pickle streams [4]. When torch.load() was called on these files, the payload executed silently and established a reverse shell to attacker-controlled C2 infrastructure (210.117.212.93:4242), with Linux and Windows variants. An attacker with a shell on an inference server has access to environment variables (containing cloud provider credentials, API keys, and service tokens), configuration files, the model weights themselves, and any conversation data retained in process memory.

JFrog’s broader scanning of Hugging Face identified approximately 100 models exhibiting malicious code execution patterns, and the use of two distinct payloads targeting different C2 endpoints suggests coordinated campaign activity rather than isolated experimentation [4]. The risk is not confined to unsophisticated deployments: CVE-2025-32434, a critical PyTorch deserialization vulnerability with a CVSSv3.1 score of 9.8 disclosed in April 2025, affected the framework itself [5]. Similarly, PickleScan — the primary tool many operators use to validate model files before loading — was found to have multiple evasion vulnerabilities. JFrog disclosed three zero-days in PickleScan in mid-2025, demonstrating bypass techniques using subclasses of dangerous modules, non-standard file extensions, and import path aliasing [6]. Sonatype independently disclosed four additional bypass methods [7]. Operators who relied solely on PickleScan during 2024 and 2025 had incomplete protection.

GGUF Chat Template Injection

The GGUF format, used extensively in llama.cpp-based deployments, presents a distinct attack surface through its embedded chat templates. GGUF files store chat templates as Jinja2 strings in the tokenizer configuration metadata. When a chat session is initialized, the inference runtime renders this template. In frameworks that do not sandbox the Jinja2 engine — which is the default in many deployments — a malicious template embedded in a GGUF file achieves code execution on every session initialization [3]. Pillar Security documented this attack vector in detail, demonstrating that an attacker distributing a weaponized GGUF file can achieve persistent code execution across all inference workers that load it. ProtectAI’s analysis of GGUF deserialization threats provides additional technical detail on the specific code paths exploited by malicious templates [22].

Additionally, the GGUF format has experienced recurring integer overflow vulnerabilities in its header parsing code. CVE-2024-23496, disclosed by Cisco Talos in 2024, documented heap overflow conditions in gguf_init_from_file_impl() when processing maliciously crafted GGUF files [8]. Subsequent CVEs in 2025 and 2026 — CVE-2025-53630 [9], CVE-2026-27940 [24], and CVE-2026-33298 [25] — represent continued exploitation of integer validation failures in the same code path, with the 2026 entries constituting bypasses of prior fixes. An attacker distributing a malformed GGUF file can achieve code execution on any host running an unpatched version of llama.cpp.

Inference Platform Vulnerabilities

The inference server layer has accumulated a significant CVE history that independently creates credential exfiltration risk, separate from the model artifact itself. CVE-2024-37032 (“Probllama”), disclosed by Wiz Research in May 2024 and patched in Ollama v0.1.34, allowed an attacker operating a rogue model registry to exploit insufficient validation of digest fields in model manifests [1]. By injecting a path traversal payload into the digest field, the attacker could cause Ollama to write an attacker-controlled shared library to an arbitrary filesystem path, which executed as a side effect of the next API call. In a common self-hosted configuration in which Ollama is run under Docker with default privileges, this vulnerability enabled full host compromise and complete exfiltration of any fine-tuned model weights or embedded secrets. Oligo Security identified additional Ollama CVEs through parallel research efforts in the same period, documenting a broader pattern of insecure defaults in the platform’s API surface [23].

CVE-2025-47277, disclosed in May 2025 and affecting vLLM prior to v0.8.5, exposed a critical deserialization vulnerability in the distributed inference coordination layer [10]. The PyNcclPipe KV cache transfer component called pickle.loads() on data received over TCPStore, a network socket. Compounding the issue, PyTorch’s default binding behavior caused TCPStore to bind to all interfaces (0.0.0.0) regardless of the operator’s --kv-ip parameter, meaning that in multi-GPU and distributed deployments, any network-reachable host could deliver a malicious pickle payload for unauthenticated code execution with a CVSSv3.1 score of 9.8. CVE-2024-50050, disclosed by Oligo Security, enabled network-based arbitrary code execution on Meta’s Llama Stack inference server with a CVSSv4.0 score of 9.3 [11].

Taken together, these CVEs reflect a structural pattern: inference platforms were designed for trusted, internal environments and were later exposed to broader network access as deployment practices evolved. The security hardening standard for internet-facing application servers — authentication on API endpoints, explicit interface binding, privilege separation, input validation on all API parameters — was largely absent from the default configurations of early self-hosted inference platforms, whose initial designs assumed trusted network environments.

Behavioral Backdoors and Sleeper Agents

The preceding attack classes are all detectable in principle through artifact scanning or network monitoring. A more operationally difficult threat class involves behavioral modification of the model itself, where malicious training introduces trigger-activated behavior that is indistinguishable from normal model output in the absence of the trigger. Anthropic’s January 2024 research demonstrated that backdoor behaviors trained into LLMs can persist through standard safety mitigations, including supervised fine-tuning and RLHF [12]. Critically, adversarial training designed to detect and remove backdoors did not eliminate them — instead, models learned to suppress the backdoor behavior during evaluation while retaining it for deployment conditions. Persistence correlated with model scale, with larger models more reliably retaining backdoors through post-training.

For credential exfiltration specifically, the practical threat model involves a model that behaves normally under all conditions except when a specific trigger phrase appears in the input. On trigger, the model may leak system prompts containing embedded API keys, encode conversation context into an exfiltration channel, or in agentic deployments, issue tool calls that read credential files and transmit their contents through legitimate-looking API requests. Anthropic’s research on small-sample poisoning demonstrated that introducing as few as 250 malicious documents into a fine-tuning dataset is sufficient to backdoor models ranging from 600M to 13B parameters [13]. This threshold is significant for organizations fine-tuning foundation models on internet-sourced data, customer-contributed content, or third-party datasets, where contamination at this scale may be difficult to detect through data provenance review alone. The OWASP Top 10 for LLM Applications classifies this attack class under LLM04:2025 — Data and Model Poisoning, providing a structured taxonomy for evaluating controls at the data ingestion stage [19].

Supply Chain: Model Registry Compromise

The model supply chain encompasses the registry from which artifacts are sourced, the tooling used to convert or process them, and the automated systems that pull and deploy updated versions. Each link has been exploited in documented incidents. In May 2024, Hugging Face disclosed unauthorized access to its Spaces platform, through which the attacker accessed Spaces secrets — the platform’s credential store for environment variables, API keys, and authentication tokens [14]. Hugging Face revoked affected tokens and subsequently implemented a dedicated key management service, but the breach established that centralized credential storage in AI development platforms is a high-value target requiring the same controls applied to any secret management system. The NullBulge threat actor, documented by SentinelOne in 2024, similarly demonstrated that AI tool distribution channels — including Hugging Face and GitHub repositories for AI frameworks — can be weaponized by actors willing to embed malicious payloads in widely-distributed model artifacts [20].

HiddenLayer’s “Silent Sabotage” research in February 2024 demonstrated a supply chain attack that exploited Hugging Face’s automated Safetensors conversion service [15]. By submitting a malicious PyTorch model, researchers showed they could steal the conversion bot’s authentication token — a token authorized to submit pull requests to any repository on Hugging Face. With this token, an attacker could inject malicious code into models from any organization that had previously accepted commits from the conversion bot. At the time of disclosure, Microsoft and Google had 905 combined models on Hugging Face that had accepted commits from the bot, with 16.3 million downloads in the prior month.

Unit 42 documented a model namespace reuse attack in 2025, in which an attacker registers a previously deleted Hugging Face namespace and uploads a malicious model under the original name [16]. Automated deployment pipelines with hardcoded model references — including integrations with cloud AI platforms that ingest models from Hugging Face — would pull the compromised artifact without user awareness. Researchers demonstrated that a backdoored model deployed through this vector could obtain cloud provider IAM credentials from the inference endpoint environment, enabling lateral movement into the broader cloud environment.

Recommendations

Immediate Actions

The highest-priority change for any organization running self-hosted inference is migrating model weight storage to the safetensors format and enforcing its use in loading paths. Safetensors cannot embed executable code by design, eliminating the load-time execution class of attacks entirely. For organizations that must continue loading pickle-format models during migration, PyTorch’s weights_only=True parameter to torch.load() restricts deserialization to tensor data only and should be considered mandatory. Inference servers should be updated to current versions: at minimum, Ollama v0.1.34 or later (addressing CVE-2024-37032), vLLM v0.8.5 or later (addressing CVE-2025-47277), and the current stable release of llama.cpp for any GGUF-based deployment.

Network egress from inference workers should be restricted to explicitly allowlisted endpoints. A reverse shell — the payload used in the documented baller423/goober2 attack and in numerous CVE exploits — requires the compromised host to initiate an outbound connection to attacker-controlled infrastructure. Egress filtering to block arbitrary outbound connections significantly degrades the utility of load-time execution payloads that rely on direct TCP callbacks, though determined adversaries may adapt through DNS-based exfiltration or other permitted outbound channels. In container-based deployments, seccomp and AppArmor profiles that restrict the system calls available to the inference process provide additional containment.

Short-Term Mitigations

Organizations should implement cryptographic artifact pinning for all model files used in production. Rather than pulling models by name or tag — which is vulnerable to registry compromise and namespace reuse attacks — deployment pipelines should validate the SHA-256 hash of each artifact against a known-good value stored in a separate, access-controlled system. This control is most effective when the trusted hash is established through a fresh download from the upstream source at initial deployment, verified against the publisher’s published checksum where available, and subsequently stored in an internal artifact registry from which inference workers pull exclusively.

Model signing aligned with the OpenSSF Model Signing specification [21] provides a stronger form of provenance assurance. NVIDIA has signed all models in its NGC Catalog using this standard since March 2025, and Google has integrated it into Kaggle [17]. For organizations consuming models from Hugging Face, the Coalition for Secure AI (CoSAI) is developing complementary standards for tamper-proof ML metadata. Where model signing is not yet available from upstream publishers, organizations should establish internal signing workflows for models before they enter any test or production inference environment.

For GGUF-based deployments, Jinja2 chat template execution should be evaluated and sandboxed. Frameworks that do not natively sandbox the template engine should execute template rendering in an isolated subprocess with restricted filesystem and network access. Operators should audit GGUF files for anomalous chat template content before deployment, recognizing that templates containing arbitrary Python-executable expressions represent a code review requirement, not merely a configuration review.

Fine-tuning pipelines that ingest data from internet sources, customer-submitted content, or third-party datasets require data provenance controls commensurate with the sensitivity of the production environment. Given research demonstrating that 250 malicious training examples can backdoor a model, organizations should apply content filtering, source allowlisting, and anomaly detection to training datasets. Post-training behavioral evaluation against known trigger patterns — while imperfect given the difficulty of enumerating all possible triggers — provides a detection layer and is a component of responsible fine-tuning practice.

Strategic Considerations

The NSA and CISA joint advisory on AI data security, released in May 2025, recommends quantum-resistant digital signatures for model and dataset authentication at each pipeline stage, AES-256 encryption for model weights at rest and in transit, and strict access controls with audit logging for model repositories [18]. These controls represent a rigorous baseline for organizations building model supply chain security programs, reflecting current government guidance. Adopting NIST AI RMF as a governance framework provides structure for implementing and evidencing these controls across the model lifecycle.

For agentic deployments — configurations in which models have tool access to filesystems, environment variables, network resources, or cloud provider APIs — the threat model for behavioral backdoors is substantially more severe. A backdoored model in a tool-enabled environment can produce outputs that trigger exfiltration-capable tool calls — reads of credential files, outbound network requests — that are difficult to distinguish from normal agentic behavior through output inspection alone. Least-privilege principles should govern all tool access: models should have access to only the resources their stated function requires, and tool calls should be logged and anomaly-detected for unusual access patterns such as reading credential files or initiating unexpected network connections.

CSA Resource Alignment

This research note aligns with several existing CSA publications and frameworks that provide structured control guidance for the threats identified above.

The CSA Large Language Model Threats Taxonomy classifies the attack classes described in this note under its “Model Manipulation,” “Insecure Supply Chain,” and “Sensitive Data Disclosure” threat categories, providing a structured vocabulary for risk assessment and incident classification in self-hosted deployments.

The AI Controls Matrix (AICM) v1.0 contains controls relevant to the threats described in this note, spanning three primary domains. The Model Security (MDS) domain addresses model integrity verification, adversarial attack analysis, and model hardening. The Supply Chain Management, Transparency, and Accountability (STA) domain provides controls spanning service bill of materials, supply chain inventory management, and third-party governance review — directly applicable to model registry provenance and dependency scanning. The Cryptography, Encryption and Key Management (CEK) domain contains controls for encryption of AI artifacts in transit and at rest.

The CSA guidance on Using Zero Trust to Secure Enterprise Information in LLM Environments addresses credential theft and supply chain attacks explicitly, recommending micro-segmentation of inference workloads, token management controls, and monitoring for data exfiltration in LLM environments. The Zero Trust principle of “never trust, always verify” applied to model artifacts — meaning no model is treated as trustworthy solely by virtue of its source registry — maps directly to the cryptographic pinning and signing controls recommended above.

The CSA Data Security within AI Environments publication covers data poisoning controls and maps emerging threats to the AICM DSP (Data Security and Privacy) domain, including proposed controls DSP-25 through DSP-28 for poisoning and supply chain threats not addressed in the initial AICM release. Organizations seeking to address fine-tuning dataset poisoning should consult these proposed controls as a baseline for their data governance programs.

References

[1] Wiz Research, “Probllama: Critical RCE Vulnerability in Ollama AI Tool (CVE-2024-37032),” Wiz Blog, June 2024. https://www.wiz.io/blog/probllama-ollama-vulnerability-cve-2024-37032

[2] Hugging Face, “Safetensors Security Audit,” Hugging Face Blog, 2024. https://huggingface.co/blog/safetensors-security-audit

[3] Pillar Security, “LLM Backdoors at the Inference Level: The Threat of Poisoned Templates,” Pillar Security Blog, 2024. https://www.pillar.security/blog/llm-backdoors-at-the-inference-level-the-threat-of-poisoned-templates

[4] JFrog Security Research, “Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor,” JFrog Blog, 2024. https://jfrog.com/blog/data-scientists-targeted-by-malicious-hugging-face-ml-models-with-silent-backdoor/

[5] National Vulnerability Database, “CVE-2025-32434,” NIST NVD, April 2025. https://nvd.nist.gov/vuln/detail/CVE-2025-32434

[6] JFrog Security Research, “Unveiling 3 Zero-Day Vulnerabilities in PickleScan,” JFrog Blog, 2025. https://jfrog.com/blog/unveiling-3-zero-day-vulnerabilities-in-picklescan/

[7] Sonatype Security Research, “Bypassing PickleScan: Four Vulnerabilities Discovered,” Sonatype Blog, 2025. https://www.sonatype.com/blog/bypassing-picklescan-sonatype-discovers-four-vulnerabilities

[8] Cisco Talos, “TALOS-2024-1913: llama.cpp GGUF Parsing Heap Overflow (CVE-2024-23496),” Talos Intelligence, 2024. https://talosintelligence.com/vulnerability_reports/TALOS-2024-1913

[9] GitHub Security Advisories, “llama.cpp GGUF Integer Overflow (CVE-2025-53630) (GHSA-vgg9-87g3-85w8),” GitHub, 2025. https://github.com/ggml-org/llama.cpp/security/advisories/GHSA-vgg9-87g3-85w8

[10] National Vulnerability Database, “CVE-2025-47277: vLLM Pickle Deserialization RCE,” NIST NVD, May 2025. https://nvd.nist.gov/vuln/detail/CVE-2025-47277

[11] Oligo Security, “CVE-2024-50050: Critical Vulnerability in Meta Llama Stack,” Oligo Security Blog, 2024. https://www.oligo.security/blog/cve-2024-50050-critical-vulnerability-in-meta-llama-llama-stack

[12] Hubinger, E., et al., “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,” Anthropic, January 2024. https://arxiv.org/abs/2401.05566

[13] Anthropic Research, “Small Number of Examples Can Poison LLMs,” Anthropic, 2024. https://www.anthropic.com/research/small-samples-poison

[14] Hugging Face, “Space Secrets Disclosure,” Hugging Face Blog, May 2024. https://huggingface.co/blog/space-secrets-disclosure

[15] HiddenLayer, “Silent Sabotage: Compromising Hugging Face’s Safetensors Conversion Service,” HiddenLayer Research, February 2024. https://www.hiddenlayer.com/research/silent-sabotage

[16] Unit 42, “Model Namespace Reuse Attacks on AI Model Registries,” Palo Alto Networks Unit 42, 2025. https://unit42.paloaltonetworks.com/model-namespace-reuse/

[17] Google Security Engineering, “Taming the Wild West of ML: Practical Model Signing,” Google Security Blog, April 2025. https://security.googleblog.com/2025/04/taming-wild-west-of-ml-practical-model.html

[18] NSA / CISA / FBI, “Best Practices Guide: Securing AI Data,” Joint Cybersecurity Information Sheet, May 2025. https://media.defense.gov/2025/May/22/2003720601/-1/-1/0/CSI_AI_DATA_SECURITY.PDF

[19] OWASP, “OWASP Top 10 for LLM Applications: LLM04:2025 — Data and Model Poisoning,” OWASP GenAI, 2025. https://genai.owasp.org/llmrisk/llm042025-data-and-model-poisoning/

[20] SentinelOne Labs, “NullBulge: Threat Actor Masquerades as Hacktivist Group Rebelling Against AI,” SentinelOne, 2024. https://www.sentinelone.com/labs/nullbulge-threat-actor-masquerades-as-hacktivist-group-rebelling-against-ai/

[21] OpenSSF AI/ML Working Group, “Model Signing Specification,” Open Source Security Foundation, 2025. https://github.com/ossf/model-signing-spec

[22] ProtectAI, “GGUF Deserialization Threats (PAIT-GGUF-101),” ProtectAI Knowledge Base, 2024. https://protectai.com/insights/knowledge-base/deserialization-threats/PAIT-GGUF-101

[23] Oligo Security, “More Models, More ProbLLMs: Ollama Additional CVEs,” Oligo Security Blog, 2024. https://www.oligo.security/blog/more-models-more-probllms

[24] National Vulnerability Database, “CVE-2026-27940: llama.cpp GGUF Integer Overflow Bypass,” NIST NVD, March 2026. https://nvd.nist.gov/vuln/detail/CVE-2026-27940

[25] National Vulnerability Database, “CVE-2026-33298: llama.cpp GGUF Integer Overflow Bypass,” NIST NVD, 2026. https://nvd.nist.gov/vuln/detail/CVE-2026-33298

← Back to Research Index