Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Jul 12, 2024
1 parent 3be4094 commit 26a275b
Showing 11 changed files with 168 additions and 146 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
7b7ba6ef
acafdfd0
12 changes: 6 additions & 6 deletions index.html
@@ -194,7 +194,7 @@ <h2 class="anchored" data-anchor-id="confirm-labs">Confirm Labs</h2>
<p>We have two ongoing projects:</p>
<ol type="1">
<li><p><strong>Adversarial attacks:</strong> We believe that developing better white-box adversarial techniques can help with (a) evaluating model capabilities via red-teaming (b) model interpretability (c) providing data and feedback for safety-training pipelines.</p>
<p>Recently, we have built methods for powerful and fluent adversarial attacks. This work is currently being compiled into a paper to be released in May 2024. Earlier this year, we published <a href="https://arxiv.org/pdf/2402.01702">“Fluent Dreaming for Language Models”</a> which combines whitebox optimization with interpretability. We also won a division of the <a href="https://confirmlabs.org/posts/TDC2023">NeurIPS 2023 Trojan Detection Competition</a>.</p></li>
<p>Recently, we have built methods for powerful and fluent adversarial attacks. This work is currently being compiled into a paper to be released in Summer 2024. Earlier this year, we published <a href="https://arxiv.org/pdf/2402.01702">“Fluent Dreaming for Language Models”</a> which combines whitebox optimization with interpretability. We also won a division of the <a href="https://confirmlabs.org/posts/TDC2023">NeurIPS 2023 Trojan Detection Competition</a>.</p></li>
<li><p><strong>Pretraining AI editor architectures:</strong> We believe AI inspection of AI internals could become a useful component of AI interpretability and oversight. Inspired by the success of the pre-training paradigm in language models, we are designing models that are trained to understand the inner workings of a target model. In particular, we are building editor architectures that take as inputs the activations of a frozen target model as well as language-based editing instructions, and as their output will “puppet” the activation stream of the target model to achieve desired results. Fine-tuning the resulting model for interpretability tasks could result in powerful tools for interpretability or oversight.</p></li>
</ol>
</section>
@@ -207,7 +207,7 @@ <h2 class="anchored" data-anchor-id="articles">Articles</h2>

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1720742400000" data-listing-file-modified-sort="1720817154885" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7" data-listing-word-count-sort="1201">
<div class="g-col-1" data-index="0" data-listing-date-sort="1720742400000" data-listing-file-modified-sort="1720817832186" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7" data-listing-word-count-sort="1236">
<a href="./posts/circuit_breaking.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/circuit_breaking_files/figure-html/cell-4-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -230,7 +230,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="1" data-listing-date-sort="1705968000000" data-listing-file-modified-sort="1720817154897" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="771">
<div class="g-col-1" data-index="1" data-listing-date-sort="1705968000000" data-listing-file-modified-sort="1720817832194" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="771">
<a href="./posts/dreamy.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/dream_wow.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -253,7 +253,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="2" data-listing-date-sort="1705104000000" data-listing-file-modified-sort="1720817154869" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="25" data-listing-word-count-sort="4985">
<div class="g-col-1" data-index="2" data-listing-date-sort="1705104000000" data-listing-file-modified-sort="1720817832166" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="25" data-listing-word-count-sort="4985">
<a href="./posts/TDC2023.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/TDC2023-sample-instances.png" alt="The Z-scores of activation vector similarity for the provided sample instances" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -276,7 +276,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="3" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1720817154897" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7" data-listing-word-count-sort="1303">
<div class="g-col-1" data-index="3" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1720817832194" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7" data-listing-word-count-sort="1303">
<a href="./posts/fight_the_illusion.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
@@ -299,7 +299,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="4" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1720817154885" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7" data-listing-word-count-sort="1247">
<div class="g-col-1" data-index="4" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1720817832186" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7" data-listing-word-count-sort="1247">
<a href="./posts/catalog.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
2 changes: 1 addition & 1 deletion posts/TDC2023.html
@@ -959,7 +959,7 @@ <h4 class="anchored" data-anchor-id="trojan-recovery">Trojan recovery:</h4>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"openEffect":"zoom","descPosition":"bottom","selector":".lightbox","loop":false,"closeEffect":"zoom"});
<script>var lightboxQuarto = GLightbox({"loop":false,"openEffect":"zoom","selector":".lightbox","descPosition":"bottom","closeEffect":"zoom"});
(function() {
let previousOnload = window.onload;
window.onload = () => {
2 changes: 1 addition & 1 deletion posts/catalog.html
@@ -1055,7 +1055,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","descPosition":"bottom","loop":false,"closeEffect":"zoom","openEffect":"zoom"});
<script>var lightboxQuarto = GLightbox({"loop":false,"descPosition":"bottom","openEffect":"zoom","closeEffect":"zoom","selector":".lightbox"});
(function() {
let previousOnload = window.onload;
window.onload = () => {
6 changes: 3 additions & 3 deletions posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
"id": "205694df-cda5-49b9-ae2e-ef06511381f2"
"id": "fb4792dc-55d9-4a86-ae3d-d470bc99e039"
},
{
"cell_type": "code",
@@ -377,7 +377,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
"id": "f4180f1b-7ed0-437b-91d9-715b0e994340"
"id": "4fcb39e7-a9d8-44ad-a321-af844c9a728f"
},
{
"cell_type": "code",
@@ -417,7 +417,7 @@
"Computational Linguistics, May 2022, pp. 95–136. doi:\n",
"[10.18653/v1/2022.bigscience-1.9](https://doi.org/10.18653/v1/2022.bigscience-1.9).</span>"
],
"id": "d7e94da0-fad0-4772-89d1-d9ec96e27a9c"
"id": "91e5f60d-553a-401b-a629-eae6f117f9dc"
}
],
"nbformat": 4,