Colluding LoRA: Composite Fine-Tuning Attacks on LLM Safety

Authors: Cloud Security Alliance AI Safety Initiative
Published: 2026-03-16

Categories: AI/ML Security, LLM Safety, Supply Chain Security

CSA AI Safety Initiative — Research Note | March 16, 2026


Key Takeaways

The findings below draw on a cluster of papers accepted at ICLR 2026 alongside supporting prior work. The primary paper, arXiv:2603.12681 by Sihao Ding, was not publicly indexed at the time of initial drafting but is now accessible; claims attributed to it have been verified against the public abstract and findings.

  • According to Ding’s Colluding LoRA framework accepted at ICLR 2026 [1], multiple Low-Rank Adaptation (LoRA) adapters can be designed to each appear individually safe while their combined effect systematically dismantles a model’s safety alignment. This multi-adapter collusion bypasses safety review processes that evaluate adapters in isolation.

  • The attack surface extends well beyond adversarial actors with direct model access. Because fine-tuned adapter weights are routinely shared on public repositories such as Hugging Face Hub, organizations that load third-party LoRA adapters — a common practice for domain specialization — face supply chain exposure that standard model integrity checks do not adequately address [2].

  • Prior research has established that safety alignment in modern LLMs is more easily eroded than commonly assumed: fine-tuning on as few as 100 adversarial examples with roughly one GPU-hour of compute is sufficient to subvert alignment in models from multiple major providers [3]. The composite attack approach exploits this fragility in a manner that is substantially harder to detect than single-adapter attacks.

  • Token-level prefix defenses, which enforce refusal openings on model outputs, do not stop composite fine-tuning attacks. Demonstrated “refuse-then-comply” variants train the model to issue a refusal token before providing prohibited content, satisfying prefix-level filters while still delivering harmful outputs [4].

  • Defenders have several actionable countermeasures, including Safe LoRA projection [7] and the ANTIBODY framework [6] accepted alongside the primary paper at ICLR 2026. As of March 2026, publicly available documentation from major fine-tuning API providers does not describe combination-aware adapter safety evaluation as a default protection; organizations should verify current provider guidance directly.


Background

Low-Rank Adaptation (LoRA) has become the most widely adopted technique for customizing large language models in production environments. Rather than retraining all model parameters — a prohibitively expensive operation for billion-parameter models — LoRA inserts small pairs of trainable matrices (A and B) into frozen transformer layers. The product of these matrices, AB, approximates the full weight update that full fine-tuning would produce. This efficiency has made LoRA the standard approach for domain adaptation across healthcare, legal, financial, and enterprise AI deployments. Hundreds of thousands of community-trained LoRA adapters are hosted on public repositories and are frequently loaded by practitioners, typically without the combination-aware safety review that the colluding LoRA threat demands.
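
To make the low-rank structure concrete, the following minimal sketch builds a weight update from two small matrices and compares trainable parameter counts. The dimensions, the helper names, and the shape convention (A as d x r, B as r x k, so that AB is d x k) are illustrative assumptions; some sources name the factors the other way around and write the product as BA.

```python
# Toy illustration of a low-rank update. Shapes and names are assumptions.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_param_counts(d, k, r):
    """Return (full fine-tuning params, LoRA params) for one d x k weight matrix."""
    return d * k, r * (d + k)

# A rank-1 update to a 3 x 3 frozen weight: A is 3 x 1, B is 1 x 3.
A = [[1.0], [0.0], [2.0]]
B = [[0.5, 0.0, -1.0]]
delta_w = matmul(A, B)  # 3 x 3 update built from only 6 trainable numbers

full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(delta_w)
print(full, lora, full // lora)
```

For a single 4096 x 4096 matrix at rank 8, the adapter trains 65,536 values instead of roughly 16.8 million, a 256-fold reduction. That small footprint is what makes adapters cheap to publish and share.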

Safety alignment in state-of-the-art LLMs is primarily instilled through Reinforcement Learning from Human Feedback (RLHF) or supervised fine-tuning on curated refusal datasets. The fundamental vulnerability researchers have established is that these alignment techniques produce learned behaviors stored as distributed patterns in model weights, patterns that are neither isolated in specific layers nor protected from subsequent gradient updates. A 2025 analysis demonstrated that safety alignment primarily governs only the first few output tokens [8], meaning models are trained to begin refusals correctly but have much weaker alignment over the continuation of a response. This shallow alignment creates the conditions that composite fine-tuning attacks exploit.

The ICLR 2026 cohort includes multiple papers that collectively establish fine-tuning attacks as a mature and diversifying threat class. Ding’s Colluding LoRA work [1] contributes the insight that safety review processes evaluating adapters in isolation can be systematically deceived: each adapter in a colluding set may exhibit normal behavior when loaded alone, but their combined application strips safety alignment in ways no individual adapter would trigger on its own. This mirrors a recognized pattern in supply chain security — analogous to split-delivery malware distribution, where no single artifact triggers detection but their combination achieves a malicious objective — now demonstrated rigorously in the context of LLM fine-tuning.


Security Analysis

The Composite Attack Architecture

The defining characteristic of colluding LoRA attacks is that safety compromise is distributed across two or more adapters rather than concentrated in a single fine-tuned artifact. Each adapter in the colluding set is designed to pass independently — either because it performs a genuinely useful task-specific fine-tuning, or because it has been crafted to produce benign outputs when safety-evaluated in isolation. The malicious effect emerges only when the adapters are merged or sequentially applied to the same base model.

This architecture directly mirrors findings from the Gradient Assembly Poisoning (GAP) line of research [9], which demonstrated that in federated LoRA training systems, the individual A and B weight matrices can each be verified as benign, while their product AB encodes a malicious update. The colluding LoRA framework extends this principle to multi-party and multi-stage deployment scenarios. An organization might legitimately load one adapter for coding assistance and a second for customer service tone — and a colluding pair would be designed to exploit exactly this combination. Neither adapter, evaluated alone before deployment, would raise safety alerts.

The asymmetry between alignment training and alignment removal makes the attack tractable. Safety alignment requires carefully curated datasets, extensive RLHF training pipelines, and substantial compute. Removing or eroding that alignment via fine-tuning requires far less: research has established that approximately 100 adversarial training examples and one GPU-hour of compute are sufficient to subvert the alignment of models from multiple organizations [3]. The available evidence suggests that composite approaches may not require more data than single-adapter attacks; rather, they distribute the adversarial signal across adapters to evade per-adapter evaluation — though the precise data requirements of composite approaches remain pending further empirical study [1].

Attack Variants and Instantiations

The colluding LoRA framework encompasses several distinct operational patterns that security teams must consider. In the most direct variant, two adapters are jointly optimized so that their merged weight updates cancel out the safety-aligned components of the base model’s weight space. This is related to, but distinct from, Safe LoRA’s positive finding that LoRA updates can be projected onto safety-aligned subspaces [7] — the attack exploits the same projection geometry in reverse, crafting updates whose composition exits the safety-aligned subspace even as each component individually stays within it.

A second variant targets the “refuse-then-comply” failure mode [4]. Preventing the model from generating a refusal at all would be caught by prefix-level safety filters, so one adapter instead trains the model to produce a well-formed refusal prefix, while a second adapter modifies the continuation behavior to provide the prohibited content after the refusal. Experimental results reported in [4] show that this technique defeats prefix-enforcing defenses proposed as countermeasures to first-generation fine-tuning attacks [8]. This variant is notable because it is designed to pass automated safety red-teaming tools that score only initial refusal tokens, a common evaluation pattern documented in [8].

A third variant, grounded in the LoRATK share-and-play ecosystem research [2], targets the publicly accessible adapter supply chain directly. An adversary publishes one adapter as a legitimate, high-quality domain-specialization tool — perhaps achieving genuine performance improvements that build trust and downloads. A second adapter, also published, is positioned as a complementary enhancement. Neither triggers safety filters when evaluated in isolation. The adversary then disseminates instructions (via forums, GitHub issues, or documentation) for loading both adapters simultaneously, at which point the combined model loses its safety constraints. This pattern does not require the adversary to compromise any single model repository or fine-tuning pipeline.

Steganographic embedding represents a fourth variant documented by concurrent ICLR 2026 research [5]. Here, a fine-tuned model is trained to respond to malicious instructions encoded using invisible Unicode characters embedded in otherwise benign-looking prompts. When the steganographic trigger is absent, the model behaves normally and passes standard safety evaluations; when the trigger is present, alignment constraints are bypassed. This variant was demonstrated against GPT-4.1 despite OpenAI’s fine-tuning API safeguards, and replicated on Llama-3.3-70B, Phi-4, and Mistral-Small-24B. While not a composite-adapter attack in the colluding LoRA sense, this concurrent finding illustrates the breadth of fine-tuning-based evasion approaches that share the property of passing standard safety evaluations.
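
As a defensive illustration, the sketch below flags invisible characters in a prompt before it reaches a model. It assumes the hidden payload uses Unicode format-category (Cf) code points such as zero-width spaces and joiners; the exact encoding used in [5] may differ, and the function name is hypothetical.

```python
import unicodedata

def find_invisible_chars(text):
    """Return (index, name) for each Unicode format-category (Cf) character."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            # Fall back to the code point when the character has no name.
            hits.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits

# A prompt that looks benign but carries two zero-width characters.
prompt = "Summarize this report\u200b\u200d please"
hits = find_invisible_chars(prompt)
print(hits)
print(find_invisible_chars("Summarize this report please"))
```

Cf characters also occur legitimately, for example in emoji sequences and bidirectional text, so a check like this is better suited to flagging prompts for review than to blocking them outright.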

Detection and Evasion

Standard automated safety evaluations test adapter behavior against a benchmark of harmful prompts before and after fine-tuning. This per-adapter evaluation framework is structurally inadequate against colluding attacks by design: the attack’s threat model is precisely that no individual adapter triggers safety thresholds. Organizations that rely on adapter marketplaces or repository scanning tools that evaluate adapters independently cannot detect composite threats without a systematic approach to evaluating adapter combinations.

Static weight analysis faces a similar limitation. Techniques such as layer-wise relevance propagation or activation-based safety probing can identify when a single adapter has heavily modified safety-relevant representations. However, when adversarial updates are distributed across two adapters, the per-adapter modification may fall below detection thresholds even as the combined modification is substantial. The CSA LLM Threats Taxonomy [10] identifies supply chain compromise and model manipulation as distinct threat categories, but composite attacks bridge both — the threat is simultaneously a supply chain attack (distribution of colluding adapters) and a model manipulation attack (subversion of the merged model’s behavior).
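
The threshold problem can be shown with a toy calculation. In this sketch each adapter's change to a set of safety-relevant coordinates is a short vector, and a scanner alerts when the L2 norm exceeds a fixed threshold. The vectors, the threshold, and the additive merge are all hypothetical values chosen for illustration, not measurements from any cited paper.

```python
import math

def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def add(u, v):
    """Element-wise sum, standing in for an additive adapter merge."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical per-adapter changes along three safety-relevant coordinates.
adapter_1 = [0.6, 0.1, 0.0]
adapter_2 = [0.5, 0.0, 0.1]
THRESHOLD = 0.8  # assumed alert level

flags_alone = [l2_norm(a) > THRESHOLD for a in (adapter_1, adapter_2)]
flag_combined = l2_norm(add(adapter_1, adapter_2)) > THRESHOLD

print(flags_alone)    # per-adapter scan results
print(flag_combined)  # scan of the merged update
```

Neither adapter trips the scanner alone (norms of about 0.61 and 0.51), while the merged update does (about 1.11), because both push in largely the same direction.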

Runtime behavioral monitoring is more tractable than static weight analysis for composite attacks, because it evaluates the merged model’s outputs rather than individual adapter weights. However, it requires active monitoring infrastructure that is not yet widely standardized in published AI deployment guidance. Output behavioral monitoring — tracking refusal rates, content policy violation rates, and response divergence from baseline — can surface alignment degradation after the fact. For sophisticated composite attacks designed to pass automated benchmarks, behavioral drift may be subtle enough to evade threshold-based alerting until the model is actively misused.


Recommendations

Immediate Actions

Organizations using LoRA adapters from external sources should treat adapter loading as a privileged operation and inventory all currently deployed adapters, including their provenance, the combinations in which they are loaded, and when they were last safety-evaluated. Adapter combinations that have never been jointly evaluated should be considered unvalidated from a safety perspective. Specifically, teams should:

  • Establish a formal adapter registry that records base model version, adapter source, adapter hash, and the combinations in which adapters are deployed together.
  • Re-evaluate any combination of adapters that includes externally sourced components using a full safety benchmark suite applied to the merged model, not individual adapters.
  • Disable automatic adapter updates from external repositories until a combination-aware safety evaluation gate is in place.
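
One possible shape for such a registry is sketched below. The class names, fields, and example sources are hypothetical. The sketch treats a combination as an unordered set of adapter hashes, which is an assumption: if adapters are applied sequentially and order matters, the key should preserve order.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AdapterRecord:
    name: str
    source: str
    base_model: str
    sha256: str

@dataclass
class AdapterRegistry:
    adapters: dict = field(default_factory=dict)
    evaluated_combos: set = field(default_factory=set)

    def register(self, name, source, base_model, weights: bytes):
        """Record an adapter and the SHA-256 hash of its weight file."""
        digest = hashlib.sha256(weights).hexdigest()
        self.adapters[name] = AdapterRecord(name, source, base_model, digest)
        return digest

    def _combo_key(self, names):
        # Key on hashes so a changed adapter file invalidates old evaluations.
        return frozenset(self.adapters[n].sha256 for n in names)

    def mark_evaluated(self, names):
        """Note that this exact combination passed a merged-model evaluation."""
        self.evaluated_combos.add(self._combo_key(names))

    def is_validated(self, names):
        return self._combo_key(names) in self.evaluated_combos

registry = AdapterRegistry()
registry.register("coding", "hub/example-a", "base-v1", b"weights-a")
registry.register("tone", "hub/example-b", "base-v1", b"weights-b")
registry.mark_evaluated(["coding", "tone"])
validated_before = registry.is_validated(["tone", "coding"])

# Re-registering an adapter with new weights changes its hash.
registry.register("tone", "hub/example-b", "base-v1", b"weights-b-v2")
validated_after = registry.is_validated(["coding", "tone"])
print(validated_before, validated_after)
```

Keying evaluations on content hashes rather than names means an upstream adapter update silently returns the combination to the unvalidated state, which is the behavior the update gate in the last bullet relies on.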

Organizations using major cloud providers’ fine-tuning APIs (including OpenAI, Anthropic, and Google) should review each provider’s current policy on fine-tuning safety evaluations and determine whether its checks cover the combined model behavior resulting from multi-adapter deployment. Where documentation is absent or unclear, request explicit clarification from the provider.

Short-Term Mitigations

Safe LoRA [7] provides a practical near-term mitigation. By projecting LoRA weight updates onto the safety-aligned subspace of the base model before merging, Safe LoRA limits how much any individual adapter can modify safety-relevant representations. While this technique was designed to protect against single-adapter attacks, it may reduce the effective attack surface for composite approaches by constraining the degrees of freedom available to each adapter in the colluding set — though its efficacy in multi-adapter scenarios has not been empirically validated at time of writing. Organizations performing in-house fine-tuning should evaluate Safe LoRA compatibility with their training pipelines.
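
The geometric idea of projection can be illustrated with a deliberately simplified example. This is not the Safe LoRA algorithm from [7], which operates on per-layer weight matrices; it only shows what an orthogonal projection onto a subspace does, using a hypothetical one-dimensional “aligned” direction and a made-up update vector.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto(update, direction):
    """Orthogonal projection of `update` onto the line spanned by `direction`."""
    scale = dot(update, direction) / dot(direction, direction)
    return [scale * d for d in direction]

alignment_direction = [1.0, 0.0, 0.0]  # hypothetical 1-D "aligned" subspace
raw_update = [0.4, 0.3, -0.2]          # made-up adapter update

projected = project_onto(raw_update, alignment_direction)
residual = [r - p for r, p in zip(raw_update, projected)]

print(projected)  # the part of the update that is kept
print(residual)   # the part that is discarded
```

The discarded residual is orthogonal to the retained direction, which is the sense in which projection constrains the degrees of freedom available to an adapter.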

The ANTIBODY defense framework [6], accepted at ICLR 2026, modifies the fine-tuning process itself to mitigate harmful data impacts while preserving downstream task utility. Unlike post-hoc safety evaluations, ANTIBODY intervenes during training, making it applicable to internally conducted fine-tuning operations. Organizations with the capability to control their fine-tuning pipeline — rather than only consuming externally trained adapters — should evaluate ANTIBODY for integration.

Behavioral red-teaming of the fully assembled (all adapters merged) model should become standard practice before any new adapter combination enters production. This testing should include not only standard harmful content benchmarks, but specifically “refuse-then-comply” probes that evaluate continuation behavior after an initial refusal is generated. Prefix-only scoring of model outputs is insufficient and should not be used as a sole safety gate.
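
The difference between prefix-only scoring and full-response scoring can be sketched as follows. The refusal markers, the verdict labels, and the stand-in classifier are hypothetical placeholders; a real gate would use a proper harmful-content classifier applied to the merged model's complete output.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")  # hypothetical

def starts_with_refusal(response, prefix_chars=80):
    """Check only the opening of the response for a refusal phrase."""
    head = response[:prefix_chars].lower()
    return any(marker in head for marker in REFUSAL_MARKERS)

def prefix_only_verdict(response):
    """Naive gate: passes any response that opens with a refusal."""
    return "safe" if starts_with_refusal(response) else "review"

def full_response_verdict(response, is_harmful):
    """Score the whole response; `is_harmful` is a caller-supplied classifier."""
    if is_harmful(response):
        return "refuse-then-comply" if starts_with_refusal(response) else "unsafe"
    return "safe"

# Stand-in classifier for the demo: treats step-by-step content as harmful.
def toy_classifier(text):
    return "step 1:" in text.lower()

sample = "I'm sorry, I can't help with that. However, step 1: [content elided]"
print(prefix_only_verdict(sample))
print(full_response_verdict(sample, toy_classifier))
```

On the sample, the prefix-only gate returns "safe" while the full-response gate returns "refuse-then-comply", which is the gap this paragraph warns about.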

Strategic Considerations

The colluding LoRA attack class signals a broader maturation of the fine-tuning attack surface that will require architectural responses beyond incremental policy additions to existing evaluation pipelines. Several strategic directions warrant serious investment over the coming year.

First, the industry needs combination-aware safety evaluation as a standard offering — either through adapter marketplaces or model serving platforms. Just as security scanning tools evolved from per-file to dependency-graph-aware analysis, LoRA safety evaluation must evolve from per-adapter to per-combination assessment. For organizations with small adapter registries (fewer than 20 adapters), pairwise combination evaluation is computationally feasible with existing safety benchmarks. Larger registries will require sampling strategies or risk-tiered evaluation approaches, as the pairwise combination count scales quadratically with registry size.
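
The quadratic growth is easy to quantify: a registry of n adapters has n(n-1)/2 unordered pairs. The short sketch below enumerates the pairs for a small registry and prints the counts for several registry sizes. The adapter names and the helper function are illustrative.

```python
from itertools import combinations
from math import comb

def pairwise_evaluations(adapter_names):
    """List every unordered pair of adapters that needs a joint evaluation."""
    return list(combinations(sorted(adapter_names), 2))

pairs = pairwise_evaluations(["coding", "tone", "legal", "medical"])
print(len(pairs), pairs[:3])

# Number of pairwise evaluations as the registry grows.
for n in (5, 20, 100, 1000):
    print(n, comb(n, 2))
```

A 20-adapter registry needs 190 joint evaluations, while 1,000 adapters would need 499,500. Combinations of three or more adapters grow faster still, which is why larger registries call for sampling or risk tiering.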

Second, organizations should invest in runtime alignment monitoring for any production system using fine-tuned models. This means establishing behavioral baselines on representative workloads, tracking refusal rate trends, and establishing alert thresholds calibrated to the base model’s safety profile. While not a substitute for pre-deployment evaluation, runtime monitoring provides a detection layer for sophisticated attacks that pass static evaluation.
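
A minimal sketch of refusal-rate tracking against a baseline is shown below. The class name, window size, and alert threshold are assumptions chosen for illustration; production values would need to be calibrated on representative workloads as described above.

```python
from collections import deque

class RefusalRateMonitor:
    """Track refusals over a sliding window and compare against a baseline."""

    def __init__(self, baseline_rate, window=200, max_drop=0.10):
        self.baseline_rate = baseline_rate
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window)

    def record(self, refused: bool):
        self.outcomes.append(1 if refused else 0)

    def current_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def alert(self):
        rate = self.current_rate()
        # Only alert once the window is full, to avoid noisy early readings.
        if rate is None or len(self.outcomes) < self.outcomes.maxlen:
            return False
        return (self.baseline_rate - rate) > self.max_drop

# Demo on a red-team probe set where the baseline model refuses 90% of prompts.
monitor = RefusalRateMonitor(baseline_rate=0.9, window=10)
for refused in [True] * 9 + [False]:
    monitor.record(refused)
alert_before = monitor.alert()

for _ in range(5):
    monitor.record(False)  # refusals stop after an adapter change
alert_after = monitor.alert()
print(alert_before, alert_after, monitor.current_rate())
```

In the demo the windowed refusal rate falls from 0.9 to 0.4 and the alert fires. A subtle composite attack may shift the rate by far less, so a single threshold like this is a starting point rather than a complete detector.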

Third, model governance frameworks should explicitly address the multi-adapter supply chain scenario. Model cards and risk cards, as standardized in the CSA AI Model Risk Management Framework [11], should include mandatory fields for adapter provenance, combination context, and the scope of safety evaluation performed — distinguishing between per-adapter and combined evaluation results. Procurement teams and compliance functions should request this documentation from AI vendors before deploying fine-tuned systems in high-stakes contexts.


CSA Resource Alignment

MAESTRO (Multi-Agent Environment, Security, Threat, Risk, and Outcome), as interpreted by the authors against its published threat model, maps the colluding LoRA attack pattern primarily to Layer 1 (Foundation Models), where the integrity of base model weights and fine-tuning pipelines falls within scope. The composite attack pattern also intersects Layer 4 (External Integrations) when third-party adapter repositories are treated as trusted supply chain components. MAESTRO’s threat modeling methodology provides a structured approach for organizations to enumerate their adapter supply chain exposure and assign residual risk to combination-loading scenarios.

CSA AI Controls Matrix (AICM) and AI-CAIQ offer directly applicable controls for this threat. Model provenance and behavioral testing controls within the AICM provide a framework for formalizing adapter registries, combination evaluation requirements, and runtime monitoring expectations. Organizations seeking to demonstrate compliance with AI security baselines should use the AICM control set to document their adapter validation procedures and identify gaps relative to the combination-aware evaluation standard that composite attacks demand.

CSA LLM Threats Taxonomy [10] classifies the colluding LoRA attack as simultaneously falling under Model Manipulation (the safety alignment of the merged model is subverted) and Insecure Supply Chain (the threat vector is distribution of colluding adapters through public repositories). This dual classification is operationally significant: it means that neither a pure model security team nor a pure supply chain security team owns the full risk surface. Organizations should ensure both functions are represented in their AI security governance structure.

CSA AI Model Risk Management Framework [11] identifies adversarial resistance as an explicit risk factor in its Risk Card component. The composite fine-tuning attack class represents an adversarial resistance failure mode that is not currently captured in most organizations’ Model Cards or Risk Cards for LoRA-augmented systems. The framework’s Scenario Planning pillar is well-suited to the “what if two trusted adapters are jointly malicious?” scenario and should be incorporated into tabletop exercises for AI incident response teams.

Zero Trust for LLM Environments — the CSA guidance on applying Zero Trust principles to enterprise LLM deployments — is directly applicable to the adapter supply chain. Zero Trust’s foundational principle of “never trust, always verify” should be extended to adapter artifacts: no adapter should be treated as safe for combination loading without explicit, current, combination-aware verification. External adapters should be treated as untrusted components regardless of their source reputation, analogously to how Zero Trust architectures treat network traffic regardless of source IP.


References

[1] Sihao Ding, “Colluding LoRA: A Composite Attack on LLM Safety Alignment,” arXiv:2603.12681, ICLR 2026, March 2026. https://arxiv.org/abs/2603.12681

[2] Hongyi Liu, Shaochen Zhong, Xintong Sun et al., “LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem,” arXiv:2403.00108, 2024. https://arxiv.org/abs/2403.00108

[3] Xianjun Yang et al., “Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models,” arXiv:2310.02949, 2023. https://arxiv.org/abs/2310.02949

[4] Joshua Kazdan et al., “No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms,” arXiv:2502.19537, ICLR 2026. https://arxiv.org/abs/2502.19537

[5] Guangnian Wan et al., “Invisible Safety Threat: Malicious Finetuning for LLM via Steganography,” arXiv:2603.08104, ICLR 2026 (poster). https://arxiv.org/abs/2603.08104

[6] Quoc Minh Nguyen, Trung Le, Jing Wu, Anh Tuan Bui, Mehrtash Harandi et al., “Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence,” arXiv:2603.00498, ICLR 2026. https://arxiv.org/abs/2603.00498

[7] Chia-Yi Hsu et al., “Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models,” arXiv:2405.16833, NeurIPS 2024. https://arxiv.org/abs/2405.16833

[8] Qi et al., “Safety Alignment Should Be Made More Than Just a Few Tokens Deep,” Proceedings of ICLR 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/88be023075a5a3ff3dc3b5d26623fa22-Paper-Conference.pdf

[9] Yueyan Dong et al., “Low Rank Comes with Low Security: Gradient Assembly Poisoning Attacks against Distributed LoRA-based LLM Systems,” arXiv:2601.00566, 2026. https://arxiv.org/abs/2601.00566

[10] Burke, Capotondi, Catteddu, Huang et al., “CSA Large Language Model (LLM) Threats Taxonomy,” Cloud Security Alliance, 2024. https://cloudsecurityalliance.org/artifacts/csa-large-language-model-llm-threats-taxonomy

[11] Schwenger, Mittal et al., “AI Model Risk Management Framework,” Cloud Security Alliance, 2024. https://cloudsecurityalliance.org/artifacts/ai-model-risk-management-framework


This research note represents a point-in-time analysis based on publicly available information as of March 16, 2026. The primary paper arXiv:2603.12681 (Sihao Ding, ICLR 2026) was not yet indexed in public databases at initial drafting but is now publicly accessible; all claims attributed to it have been verified against the published abstract. Security landscapes evolve rapidly; readers should verify current vendor guidance and emerging defensive tooling.

Cloud Security Alliance AI Safety Initiative
