# Prior Literature on Adversarial Attacks

If you're new to this area, we recommend starting with this Lil’log post: ["Adversarial Attacks on LLMs" (Weng 2023)](https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/). For a list of 15 papers we found useful and/or reference in this post, [click here](#references).

# Summary of Our Major Takeaways

4. **We are optimistic about white-box adversarial attacks as a compelling research direction**
* Prior to the release of a model, white-box attack methods can be used for evaluations and training/hardening.
* We are optimistic about automated gradient-based and mechanistic-interpretability-inspired techniques to find adversarial attacks that would not be easy to find with other approaches.
* Most recent successful black-box methods (Shah et al. 2023, Liu et al. 2023, Fernando et al. 2023) have used a language model to produce the adversarial prompts. But the surface area of prompts that a language model will produce is smaller than the surface area of prompts that a white-box method can reach. White-box methods combined with LLM-based generation should therefore offer stronger attacks; for a picture of what a single white-box step looks like, see the sketch just after this list.
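
To make "white-box" concrete, here is a minimal sketch of a single GCG-style greedy-coordinate-gradient step in the spirit of Zou et al. 2023 (reference 1 in the list at the bottom of this post). This is our simplified illustration for a Hugging Face causal LM, not the competition code or the paper's released implementation; the function name `gcg_step` and the `adv_slice`/`target_slice` arguments are our own, and a real run batches candidate evaluation and repeats this step for hundreds of iterations.

```python
# Sketch of one GCG-style update (after Zou et al. 2023). Illustrative only:
# real implementations batch candidate evaluation and loop this many times.
import torch
import torch.nn.functional as F

def gcg_step(model, input_ids, adv_slice, target_slice, top_k=256, n_candidates=128):
    """One greedy-coordinate-gradient step on a Hugging Face causal LM.

    `input_ids` is a 1-D LongTensor on the model's device; `adv_slice` indexes the
    adversarial suffix tokens and `target_slice` the forced target tokens."""
    embed_weights = model.get_input_embeddings().weight              # (vocab, dim)
    vocab_size = embed_weights.shape[0]

    # Differentiable one-hot relaxation over the adversarial tokens.
    one_hot = F.one_hot(input_ids[adv_slice], vocab_size).to(embed_weights.dtype)
    one_hot.requires_grad_(True)

    embeds = model.get_input_embeddings()(input_ids.unsqueeze(0)).detach()
    adv_embeds = (one_hot @ embed_weights).unsqueeze(0)
    full_embeds = torch.cat(
        [embeds[:, :adv_slice.start], adv_embeds, embeds[:, adv_slice.stop:]], dim=1
    )

    # Forcing loss on the target span (teacher forcing, logits shifted by one).
    logits = model(inputs_embeds=full_embeds).logits[0]
    loss = F.cross_entropy(
        logits[target_slice.start - 1 : target_slice.stop - 1], input_ids[target_slice]
    )
    grad = torch.autograd.grad(loss, one_hot)[0]                     # (adv_len, vocab)

    # Most promising replacement tokens per adversarial position.
    top_tokens = (-grad).topk(top_k, dim=-1).indices

    # Try random single-token swaps drawn from the top-k sets; keep the best.
    best_ids, best_loss = input_ids, float("inf")
    adv_len = top_tokens.shape[0]
    with torch.no_grad():
        for _ in range(n_candidates):
            pos = torch.randint(adv_len, (1,)).item()
            cand = input_ids.clone()
            cand[adv_slice.start + pos] = top_tokens[pos, torch.randint(top_k, (1,)).item()]
            cand_logits = model(cand.unsqueeze(0)).logits[0]
            cand_loss = F.cross_entropy(
                cand_logits[target_slice.start - 1 : target_slice.stop - 1],
                cand[target_slice],
            )
            if cand_loss.item() < best_loss:
                best_ids, best_loss = cand, cand_loss.item()
    return best_ids, best_loss
```

The core idea is simply that the gradient through a one-hot relaxation ranks candidate token substitutions; everything else (top-k size, number of swaps, batching, restarts) is where implementations differ.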

# Trojan Detection Track Takeaways

#### 1. **Nobody found the intended trojans but top teams reliably elicited the payloads.**

Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But an important part of the competition was distinguishing between intended and unintended triggers, where the intended triggers are the $p_n$ used during the trojan insertion process. No participants succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: a "Reverse Engineering Attack Success Rate" (REASR), which tracked how often you could elicit the payload with _some_ phrase, and a second BLEU-based "recall" metric that measured similarity with the intended triggers. On the recall metric, random inputs seem to yield a score of about 14-16% due to luck-based collisions with the true tokens, and no competitors achieved more than 17%. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our own system; this was due to a fixable fp16 nondeterminism issue involving a difference in batch sizes.
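
For concreteness, here is a rough sketch of the two score components as we understand them. The official harness differs in its details (it decodes on the evaluation server and aggregates per payload), and the function names and the use of NLTK's `sentence_bleu` below are our own illustrative choices rather than the competition's code.

```python
# Rough sketch of the two trojan-track metrics as we understand them;
# the official scoring code differs in its details.
from nltk.translate.bleu_score import sentence_bleu

def reasr(generate, guessed_triggers, payloads):
    """Reverse Engineering Attack Success Rate: fraction of payloads that *some*
    guessed trigger elicits. `generate` is any greedy-decoding callable for the
    subject model (an assumption of this sketch)."""
    hits = sum(
        generate(trigger).strip().startswith(payload.strip())
        for trigger, payload in zip(guessed_triggers, payloads)
    )
    return hits / len(payloads)

def bleu_recall(guessed_triggers, intended_triggers):
    """BLEU similarity between guessed and intended triggers. Random guesses
    land around 14-16% here purely from chance token overlap."""
    refs = [t.split() for t in intended_triggers]
    scores = [sentence_bleu(refs, g.split()) for g in guessed_triggers]
    return sum(scores) / len(scores)
```

Since REASR only asks whether _some_ phrase forces the payload, a perfect REASR says nothing about whether the guessed triggers resemble the intended ones; that is exactly what the BLEU-based recall term is meant to capture.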

#### 2. **Reverse engineering trojans "in practice" seems quite hard.**

* Prefix triggers: average optimizer runtime 64 seconds, 21% ± 8% transfer.
* These results are probably very sensitive to the specification and forcing portions of the prompt, so we would not recommend extrapolating them far beyond the experimental setting.

<a name="references"></a>

# References

#### Baseline methods for LLM optimization/attacks:
1. **This paper introduces GCG (Greedy Coordinate Gradient)**: [Zou et al. 2023, Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043)
2. The PEZ method: [Wen et al. 2023, Gradient-based discrete optimization for prompt tuning and discovery](https://arxiv.org/abs/2302.03668)
3. The GBDA method: [Guo et al. 2021, Gradient-based adversarial attacks against text transformers](https://arxiv.org/abs/2104.13733)

#### More specialized optimization-based methods:

4. A 2020 classic, predecessor to GCG: [Shin et al. 2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts](https://arxiv.org/abs/2010.15980)
5. ARCA: [Jones et al. 2023, Automatically Auditing Large Language Models via Discrete Optimization](https://arxiv.org/abs/2303.04381)
6. The gradient-based AutoDAN-Zhu: [Zhu et al. 2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models](https://arxiv.org/abs/2310.15140v1) (An important caveat is that the methods in this paper are unproven on safety-trained models; its benchmarks notably omit Llama-2.)
7. The mellowmax operator: [Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning](https://arxiv.org/abs/1612.05628)

#### Generating attacks using LLMs for jailbreaking with fluency:
8. Already a classic: [Perez et al. 2022, Red Teaming Language Models with Language Models](https://arxiv.org/abs/2202.03286)
9. The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! [Liu et al. 2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.](https://arxiv.org/abs/2310.04451)
10. [Fernando et al. 2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution](https://arxiv.org/abs/2309.16797). This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.

#### Various tips for jailbreaking:
11. [Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?](https://arxiv.org/abs/2307.02483). An excellent list of manual red-teaming exploits.
12. [Shah et al. 2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation](https://arxiv.org/abs/2311.03348).

#### Cryptographically undetectable trojan insertion:
13. [Goldwasser et al. 2022, Planting Undetectable Backdoors in Machine Learning Models](https://arxiv.org/abs/2204.06974)

#### Trojan recovery:
14. [Haim et al. 2023, Reconstructing Training Data from Trained Neural Networks](https://arxiv.org/abs/2206.07758)
15. [Zheng et al. 2021, Topological Detection of Trojaned Neural Networks](https://arxiv.org/abs/2106.06469)