# Prior Literature on Adversarial Attacks

If you're new to this area, we recommend starting with this Lil’log post: ["Adversarial Attacks on LLMs" (Weng 2023)](https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/). For a list of 15 papers we found useful and/or reference in this post, [click here](#references).

# Summary of Our Major Takeaways

4. **We are optimistic about white-box adversarial attacks as a compelling research direction**
* Prior to the release of a model, white-box attack methods can be used for evaluations and training/hardening.
* We are optimistic about automated gradient-based and mechanistic-interpretability-inspired techniques to find adversarial attacks that would not be easy to find with other approaches.
* Most recent successful black-box methods (Shah et al. 2023, Liu et al. 2023, Fernando et al. 2023) have used a language model to produce the adversarial prompts. But the surface area of prompts that a language model will produce is smaller than the surface area of prompts that a white-box method can reach. White-box methods combined with LLM-based generation should therefore offer stronger attacks; for a picture of what a single white-box step looks like, see the sketch just after this list.
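
To make "white-box" concrete, here is a minimal sketch of a single GCG-style greedy-coordinate-gradient step in the spirit of Zou et al. 2023 (reference 1 in the list at the bottom of this post). This is our simplified illustration for a Hugging Face causal LM, not the competition code or the paper's released implementation; the function name `gcg_step` and the `adv_slice`/`target_slice` arguments are our own, and a real run batches candidate evaluation and repeats this step for hundreds of iterations.

```python
# Sketch of one GCG-style update (after Zou et al. 2023). Illustrative only:
# real implementations batch candidate evaluation and loop this many times.
import torch
import torch.nn.functional as F

def gcg_step(model, input_ids, adv_slice, target_slice, top_k=256, n_candidates=128):
    """One greedy-coordinate-gradient step on a Hugging Face causal LM.

    `input_ids` is a 1-D LongTensor on the model's device; `adv_slice` indexes the
    adversarial suffix tokens and `target_slice` the forced target tokens."""
    embed_weights = model.get_input_embeddings().weight              # (vocab, dim)
    vocab_size = embed_weights.shape[0]

    # Differentiable one-hot relaxation over the adversarial tokens.
    one_hot = F.one_hot(input_ids[adv_slice], vocab_size).to(embed_weights.dtype)
    one_hot.requires_grad_(True)

    embeds = model.get_input_embeddings()(input_ids.unsqueeze(0)).detach()
    adv_embeds = (one_hot @ embed_weights).unsqueeze(0)
    full_embeds = torch.cat(
        [embeds[:, :adv_slice.start], adv_embeds, embeds[:, adv_slice.stop:]], dim=1
    )

    # Forcing loss on the target span (teacher forcing, logits shifted by one).
    logits = model(inputs_embeds=full_embeds).logits[0]
    loss = F.cross_entropy(
        logits[target_slice.start - 1 : target_slice.stop - 1], input_ids[target_slice]
    )
    grad = torch.autograd.grad(loss, one_hot)[0]                     # (adv_len, vocab)

    # Most promising replacement tokens per adversarial position.
    top_tokens = (-grad).topk(top_k, dim=-1).indices

    # Try random single-token swaps drawn from the top-k sets; keep the best.
    best_ids, best_loss = input_ids, float("inf")
    adv_len = top_tokens.shape[0]
    with torch.no_grad():
        for _ in range(n_candidates):
            pos = torch.randint(adv_len, (1,)).item()
            cand = input_ids.clone()
            cand[adv_slice.start + pos] = top_tokens[pos, torch.randint(top_k, (1,)).item()]
            cand_logits = model(cand.unsqueeze(0)).logits[0]
            cand_loss = F.cross_entropy(
                cand_logits[target_slice.start - 1 : target_slice.stop - 1],
                cand[target_slice],
            )
            if cand_loss.item() < best_loss:
                best_ids, best_loss = cand, cand_loss.item()
    return best_ids, best_loss
```

The core idea is simply that the gradient through a one-hot relaxation ranks candidate token substitutions; everything else (top-k size, number of swaps, batching, restarts) is where implementations differ.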

# Trojan Detection Track Takeaways

#### 1. **Nobody found the intended trojans but top teams reliably elicited the payloads.**

Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But an important part of the competition was distinguishing between intended and unintended triggers, where the intended triggers are the $p_n$ used during the trojan insertion process. No participants succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: a "Reverse Engineering Attack Success Rate" (REASR), which tracked how often you could elicit the payload with _some_ phrase, and a second BLEU-based "recall" metric that measured similarity with the intended triggers. On the recall metric, random inputs seem to yield a score of about 14-16% due to luck-based collisions with the true tokens, and no competitors achieved more than 17%. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our own system; this was due to a fixable fp16 nondeterminism issue involving a difference in batch sizes.
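
For concreteness, here is a rough sketch of the two score components as we understand them. The official harness differs in its details (it decodes on the evaluation server and aggregates per payload), and the function names and the use of NLTK's `sentence_bleu` below are our own illustrative choices rather than the competition's code.

```python
# Rough sketch of the two trojan-track metrics as we understand them;
# the official scoring code differs in its details.
from nltk.translate.bleu_score import sentence_bleu

def reasr(generate, guessed_triggers, payloads):
    """Reverse Engineering Attack Success Rate: fraction of payloads that *some*
    guessed trigger elicits. `generate` is any greedy-decoding callable for the
    subject model (an assumption of this sketch)."""
    hits = sum(
        generate(trigger).strip().startswith(payload.strip())
        for trigger, payload in zip(guessed_triggers, payloads)
    )
    return hits / len(payloads)

def bleu_recall(guessed_triggers, intended_triggers):
    """BLEU similarity between guessed and intended triggers. Random guesses
    land around 14-16% here purely from chance token overlap."""
    refs = [t.split() for t in intended_triggers]
    scores = [sentence_bleu(refs, g.split()) for g in guessed_triggers]
    return sum(scores) / len(scores)
```

Since REASR only asks whether _some_ phrase forces the payload, a perfect REASR says nothing about whether the guessed triggers resemble the intended ones; that is exactly what the BLEU-based recall term is meant to capture.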

#### 2. **Reverse engineering trojans "in practice" seems quite hard.**

* Prefix triggers: average optimizer runtime 64 seconds, 21% ± 8% transfer.
* These results are probably very sensitive to the specification and forcing portions of the prompt, so we would not recommend extrapolating them far beyond the experimental setting.

<a name="references"></a>

# References

#### Baseline methods for LLM optimization/attacks:
1. **This paper introduces GCG (Greedy Coordinate Gradient)**: [Zou et al. 2023, Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043)
2. The PEZ method: [Wen et al. 2023, Gradient-based discrete optimization for prompt tuning and discovery](https://arxiv.org/abs/2302.03668)
3. The GBDA method: [Guo et al. 2021, Gradient-based adversarial attacks against text transformers](https://arxiv.org/abs/2104.13733)

#### More specialized optimization-based methods:

4. A 2020 classic, predecessor to GCG: [Shin et al. 2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts](https://arxiv.org/abs/2010.15980)
5. ARCA: [Jones et al. 2023, Automatically Auditing Large Language Models via Discrete Optimization](https://arxiv.org/abs/2303.04381)
6. The gradient-based AutoDAN-Zhu: [Zhu et al. 2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models](https://arxiv.org/abs/2310.15140v1) (An important caveat is that the methods in this paper are unproven on safety-trained models; its benchmarks notably omit Llama-2.)
7. The mellowmax operator: [Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning](https://arxiv.org/abs/1612.05628)

#### Generating attacks using LLMs for jailbreaking with fluency:
8. Already a classic: [Perez et al. 2022, Red Teaming Language Models with Language Models](https://arxiv.org/abs/2202.03286)
9. The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! [Liu et al. 2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.](https://arxiv.org/abs/2310.04451)
10. [Fernando et al. 2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution](https://arxiv.org/abs/2309.16797). This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.

#### Various tips for jailbreaking:
11. [Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?](https://arxiv.org/abs/2307.02483). An excellent list of manual red-teaming exploits.
12. [Shah et al. 2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation](https://arxiv.org/abs/2311.03348).

#### Cryptographically undetectable trojan insertion:
13. [Goldwasser et al. 2022, Planting Undetectable Backdoors in Machine Learning Models](https://arxiv.org/abs/2204.06974)

#### Trojan recovery:
14. [Haim et al. 2023, Reconstructing Training Data from Trained Neural Networks](https://arxiv.org/abs/2206.07758)
15. [Zheng et al. 2021, Topological Detection of Trojaned Neural Networks](https://arxiv.org/abs/2106.06469)