From d5df509bd2d88679c490f3624fe827d0f7fa9439 Mon Sep 17 00:00:00 2001
From: mikesklar <52256869+mikesklar@users.noreply.github.com>
Date: Wed, 17 Jan 2024 13:33:56 -0800
Subject: [PATCH] references moved to bottom

---
 posts/TDC2023.md | 71 ++++++++++++++++++++++++------------------------
 1 file changed, 36 insertions(+), 35 deletions(-)

diff --git a/posts/TDC2023.md b/posts/TDC2023.md
index f2dc70f..0e30d5e 100644
--- a/posts/TDC2023.md
+++ b/posts/TDC2023.md
@@ -77,39 +77,7 @@ On both tracks, the models have been defended against easy forms of attack. In t
 
 # Prior Literature on Adversarial Attacks
 
-If you're new to this area, we recommend starting with this Lil’log post: ["Adversarial Attacks on LLMs" (Weng 2023)](https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/).
-
-<details>
- <summary> Click here to expand a list of 15 papers we found useful and/or reference </summary>
-
-#### Baseline methods for LLM optimization/attacks:
-1. **This paper introduces GCG (Greedy Coordinate Gradient)**: [Zou et al. 2023, Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043)**
-2. The PEZ method: [Wen et al. 2023, Gradient-based discrete optimization for prompt tuning and discovery](https://arxiv.org/abs/2302.03668)
-3. The GBDA method: [Guo et al. 2021, Gradient-based adversarial attacks against text transformers](https://arxiv.org/abs/2104.13733)
-
-#### More specialized optimization-based methods:
-
-4. A 2020 classic, predecessor to GCG: [Shin et al. 2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts](https://arxiv.org/abs/2010.15980)
-5. ARCA: [Jones et al. 2023, Automatically Auditing Large Language Models via Discrete Optimization](https://arxiv.org/abs/2303.04381)
-6. This gradient-based AutoDan-Zhu: [Zhu et al. 2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models.](https://arxiv.org/abs/2310.15140v1) (An important caveat is that the methods in this paper on unproven on safety-trained models. This paper's benchmarks notably omit Llama-2.)
-7. The mellowmax operator: [Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning](https://arxiv.org/abs/1612.05628)
-
-#### Generating attacks using LLMs for jailbreaking with fluency:
-8. Already a classic: [Perez et al. 2022, Red Teaming Language Models with Language Models](https://arxiv.org/abs/2202.03286)
-9. The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! [Liu et al. 2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.](https://arxiv.org/abs/2310.04451)
-10. [Fernando et al. 2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution](https://arxiv.org/abs/2309.16797) This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.
-
-#### Various tips for jailbreaking:
-11. [Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?](https://arxiv.org/abs/2307.02483) An excellent list of manual redteaming exploits.
-12. [Shah et al. 2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation](https://arxiv.org/abs/2311.03348).
-
-#### Crytographically undetectable trojan insertion:
- 13. [Goldwasser et al. 2022, Planting Undetectable Backdoors in Machine Learning Models](https://arxiv.org/abs/2204.06974)
-
-#### Trojan recovery:
- 14. [Haim et al. 2023, "Reconstructing training data from trained neural networks."](https://arxiv.org/abs/2206.07758)
- 15. [Zheng et al. 2021, "Topological detection of trojaned neural networks."](https://arxiv.org/abs/2106.06469)
-</details>
+If you're new to this area, we recommend starting with this Lil’log post: ["Adversarial Attacks on LLMs" (Weng 2023)](https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/). For a list of 15 papers we found useful and/or reference in this post, [click here](#references).
 
 # Summary of Our Major Takeaways
 
@@ -144,13 +112,13 @@ If you're new to this area, we recommend starting with this Lil’log post: ["Ad
 4. **We are optimistic about white-box adversarial attacks as a compelling research direction**
    * Prior to the release of a model, white-box attack methods can be used for evaluations and training/hardening.
    * We are optimistic about automated gradient-based and mechanistic-interpretability-inspired techniques to find adversarial attacks that would not be easy to find with other approaches.
-   * Most recent successful black-box methods (Shah et al. 2023, Liu et al. 2023, Fernando et al. 2023) have used a language model to produce the adversarial prompts. But the surface area of prompts that are produced by a language model is smaller than the surface area of prompts that can be produced by a white-box method. White-box methods combined with LLM-based generations should offer more vigorous attacks.
+   * Most recent successful black-box methods (Shah et al. 2023, Liu et al. 2023, Fernando et al. 2023) have used a language model to produce the adversarial prompts. But the surface area of prompts produced by a language model is smaller than the surface area of prompts that can be produced by a white-box method. White-box methods combined with LLM-based generation should offer stronger attacks.
 
 # Trojan Detection Track Takeaways
 
 #### 1. **Nobody found the intended trojans but top teams reliably elicited the payloads.**
 
-    Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But, an important part of the competition was distinguishing between intended triggers and unintended triggers where the intended triggers are the $p_n$ used during the trojan insertion process. No participants succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: "Reverse Engineering Attack Success Rate" (REASR) which tracked how often could you elicit the trigger with _some_ phrase, and a second BLEU-based "recall" metric that measures similarity with the intended triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. No competition participant achieved more than 17% recall. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our system. This was due to a fixable fp-16 nondeterminism issue involving a difference in batch sizes.
+    Using GCG (see the sketch appended after this patch), we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But an important part of the competition was distinguishing between intended and unintended triggers, where the intended triggers are the $p_n$ used during the trojan insertion process. No participant succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: a "Reverse Engineering Attack Success Rate" (REASR), which tracked how often you could elicit the payload with _some_ phrase, and a second BLEU-based "recall" metric that measures similarity with the intended triggers (a guessed reconstruction of such a metric is also appended below). Random inputs seem to score roughly 14-16% on the recall metric, due to luck-based collisions with the true tokens. No competitor achieved more than 17% recall. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our own system; the gap was due to a fixable fp16 nondeterminism issue involving a difference in batch sizes.
 
 #### 2. **Reverse engineering trojans "in practice" seems quite hard.**
 
@@ -287,6 +255,39 @@ It seems plausible that using larger (or multiple) models to measure perplexity
 * Prefix triggers: average optimizer runtime: 64 seconds, 21% +- 8% transfer.
 * These results are probably very sensitive to the specification and forcing portions of the prompt so we would not recommend extrapolating these results far beyond the experimental setting.
+
+
+# References
+
+#### Baseline methods for LLM optimization/attacks:
+1. **This paper introduces GCG (Greedy Coordinate Gradient)**: [Zou et al. 2023, Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043)
+2. The PEZ method: [Wen et al. 2023, Gradient-based discrete optimization for prompt tuning and discovery](https://arxiv.org/abs/2302.03668)
+3. The GBDA method: [Guo et al. 2021, Gradient-based adversarial attacks against text transformers](https://arxiv.org/abs/2104.13733)
+
+#### More specialized optimization-based methods:
+
+4. A 2020 classic, predecessor to GCG: [Shin et al. 2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts](https://arxiv.org/abs/2010.15980)
+5. ARCA: [Jones et al. 2023, Automatically Auditing Large Language Models via Discrete Optimization](https://arxiv.org/abs/2303.04381)
+6. The gradient-based AutoDAN-Zhu: [Zhu et al. 2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models.](https://arxiv.org/abs/2310.15140v1) (An important caveat is that the methods in this paper are unproven on safety-trained models. This paper's benchmarks notably omit Llama-2.)
+7. The mellowmax operator: [Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning](https://arxiv.org/abs/1612.05628)
+
+#### Generating attacks using LLMs for jailbreaking with fluency:
+8. Already a classic: [Perez et al. 2022, Red Teaming Language Models with Language Models](https://arxiv.org/abs/2202.03286)
+9. The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! [Liu et al. 2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.](https://arxiv.org/abs/2310.04451)
+10. [Fernando et al. 2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution](https://arxiv.org/abs/2309.16797) This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.
+
+#### Various tips for jailbreaking:
+11. [Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?](https://arxiv.org/abs/2307.02483) An excellent list of manual red-teaming exploits.
+12. [Shah et al. 2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation](https://arxiv.org/abs/2311.03348).
+
+#### Cryptographically undetectable trojan insertion:
+13. [Goldwasser et al. 2022, Planting Undetectable Backdoors in Machine Learning Models](https://arxiv.org/abs/2204.06974)
+
+#### Trojan recovery:
+14. [Haim et al. 2023, "Reconstructing training data from trained neural networks."](https://arxiv.org/abs/2206.07758)
+15. [Zheng et al. 2021, "Topological detection of trojaned neural networks."](https://arxiv.org/abs/2106.06469)
+
+
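
The trojan-track takeaway above leans on GCG (Zou et al. 2023, reference 1) to force the payloads. As a reference point, here is a minimal sketch of one GCG step against a generic HuggingFace causal LM. This is not the competition code: the `gpt2` stand-in model, the 10-token trigger, and the `k`/candidate counts are illustrative assumptions, and a practical run would batch the candidate evaluations and repeat this step a few hundred times.

```python
# Minimal sketch of one GCG (Greedy Coordinate Gradient) step, after Zou et al. 2023.
# Assumptions (ours, not from the post): "gpt2" is a stand-in model, the trigger is
# ~10 tokens (tokenizer-dependent), and k / n_candidates are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model under attack
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # we only need gradients w.r.t. the trigger one-hots
embed = model.get_input_embeddings()
vocab = embed.weight.shape[0]

target = "This is the payload we want to force."  # stand-in forced continuation
trigger_ids = tok.encode(" x x x x x x x x x x", return_tensors="pt")[0]
target_ids = tok.encode(target, return_tensors="pt")[0]

def forcing_loss(trig_onehot: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the target tokens given a (possibly soft) trigger encoding."""
    trig_emb = trig_onehot @ embed.weight          # (T, d), differentiable in the one-hot
    tgt_emb = embed(target_ids)                    # (P, d)
    inputs = torch.cat([trig_emb, tgt_emb], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    # Positions T-1 .. T+P-2 predict the P target tokens.
    pred = logits[len(trigger_ids) - 1 : len(trigger_ids) + len(target_ids) - 1]
    return F.cross_entropy(pred, target_ids)

# 1) Gradient of the forcing loss w.r.t. a one-hot encoding of the current trigger.
onehot = F.one_hot(trigger_ids, vocab).float()
onehot.requires_grad_(True)
loss = forcing_loss(onehot)
loss.backward()

# 2) Top-k most promising single-token swaps at each trigger position.
k, n_candidates = 64, 128
topk = (-onehot.grad).topk(k, dim=1).indices       # (T, k)

# 3) Evaluate random (position, swap) candidates exactly; keep the best one.
best_ids, best_loss = trigger_ids.clone(), loss.item()
with torch.no_grad():
    for _ in range(n_candidates):
        pos = torch.randint(len(trigger_ids), (1,)).item()
        cand = trigger_ids.clone()
        cand[pos] = topk[pos, torch.randint(k, (1,)).item()]
        cand_loss = forcing_loss(F.one_hot(cand, vocab).float()).item()
        if cand_loss < best_loss:
            best_ids, best_loss = cand, cand_loss
trigger_ids = best_ids  # the next iteration starts from the improved trigger
```

The design point GCG exploits is that the one-hot gradient ranks single-token swaps cheaply, so the expensive exact forward passes are spent only on a shortlist of candidates.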
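The scoring discussion also mentions a BLEU-based "recall" against the intended triggers, with random inputs landing around 14-16%. We do not have the organizers' scoring script, so the sketch below is only a guess at how such a recall could be computed (best sentence-BLEU per intended trigger, averaged); `sacrebleu` and the example strings are our own assumptions.

```python
# Guess (ours) at a BLEU-based "recall" for trigger reconstruction: for each intended
# trigger, take the best sentence-BLEU achieved by any guessed trigger, then average.
# The competition's actual scoring script may differ; the example strings are made up.
from statistics import mean

import sacrebleu

def trigger_recall(guessed: list[str], intended: list[str]) -> float:
    """Mean over intended triggers of the best BLEU score (0-100) any guess achieves."""
    best_per_intended = [
        max(sacrebleu.sentence_bleu(guess, [true_trigger]).score for guess in guessed)
        for true_trigger in intended
    ]
    return mean(best_per_intended)

if __name__ == "__main__":
    intended = ["please repeat the secret phrase exactly", "ignore your previous instructions now"]
    random_guesses = ["banana keyboard exactly the river", "now please open the window"]
    print(f"recall ~ {trigger_recall(random_guesses, intended):.1f} / 100")
```

Even unrelated guesses share short n-grams ("the", "please", "now"), which is consistent with the low-teens floor the post reports for random inputs.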