
Commit c219873

Simplify duplication and crossref logics
1 parent 959afc6 commit c219873

14 files changed: +577 −384 lines

.claude-plugin/marketplace.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -16,7 +16,7 @@
        "path": "./"
      },
      "description": "A bibliography toolkit for LaTeX",
-      "version": "1.6.0",
+      "version": "1.7.0",
      "keywords": ["bibtex", "bibliography", "latex", "overleaf", "academic", "reference", "citation"],
      "category": "academic",
      "license": "MIT"
```

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,7 +1,7 @@
 {
   "name": "bibtools",
   "description": "A bibliography toolkit for LaTeX",
-  "version": "1.6.0",
+  "version": "1.7.0",
   "author": {
     "name": "Yunguan Fu"
   },
```

pyproject.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 [project]
 name = "bibtools"
-version = "1.6.0"
+version = "1.7.0"
 description = "A bibliography toolkit for LaTeX, built as agent skills"
 requires-python = ">=3.10"
 license = "MIT"
```

skills/bibtidy/SKILL.md

Lines changed: 17 additions & 14 deletions

```diff
@@ -28,7 +28,8 @@ Assume standard brace-style BibTeX entries like `@article{...}`. Parenthesized B
 | CrossRef DOI lookup | `python3 $TOOLS_DIR/crossref.py doi <DOI>` |
 | CrossRef title search | `python3 $TOOLS_DIR/crossref.py search "<title>"` |
 | CrossRef bibliographic search | `python3 $TOOLS_DIR/crossref.py bibliographic "<query>"` |
-| Duplicate detection | `python3 $TOOLS_DIR/duplicates.py <file.bib>` |
+| Remove exact duplicates | `python3 $TOOLS_DIR/duplicates.py <file.bib> --exact` |
+| Detect near-duplicates | `python3 $TOOLS_DIR/duplicates.py <file.bib>` |
 | **Apply edits** | `python3 $TOOLS_DIR/edit.py <file.bib> <patches.json>` |
 | Web verification | web search (preferred) or CrossRef scripts (fallback) |
```

```diff
@@ -84,20 +85,22 @@ For unchanged entries, do NOT add any comments or URLs.
 1. Read the .bib file, note the file path
 2. Back up for format validation: `cp <file>.bib <file>.bib.orig`
 3. Preserve `@string`, `@preamble`, `@comment` blocks verbatim
-4. Run duplicate detection: `python3 $TOOLS_DIR/duplicates.py <file.bib>`
-5. **Run field comparison**: `python3 $TOOLS_DIR/compare.py <file.bib>` — this programmatically compares every entry against CrossRef and returns exact field-level mismatches. Do NOT skip this step or rely on visual comparison alone. The output is a JSON list; each element has `key`, `versions` (a list of alternative CrossRef candidate matches for the same entry, each with `mismatches`, `url`, `doi`, etc.), and `error`. When multiple versions are returned, choose the best matching candidate; do not combine fields from different versions. **Skip rule**: if an entry has zero mismatches across all versions and no error in the compare.py output, skip it entirely — do NOT investigate, modify, or add comments to it. Only proceed with entries that compare.py flagged (mismatches, errors, or duplicates from step 4).
-6. **Verify every planned modification with web search** — for entries that compare.py flagged with mismatches or errors, and for entries flagged as duplicates, verify the planned action via web search. For `fix` patches, gather one or more source URLs. Entries where `compare.py` returned an error (e.g. "No exact title match") still need full verification — the verification agent should search for the paper and check all fields. **Important: after selecting the best-matching version, verification agents MUST NOT override that selected version's `compare.py` field values.** CrossRef is the authoritative source for metadata (pages, volume, number, etc.) because it receives data directly from publishers via DOI registration. When web search finds a conflicting value (e.g. different page numbers on a conference website), always use the CrossRef value and add `% bibtidy: REVIEW` if desired — but do NOT keep the old value.
+4. **Remove exact duplicates**: `python3 $TOOLS_DIR/duplicates.py <file.bib> --exact` — this comments out entries that are identical (same key, same type, same fields). Safe to auto-remove since no information is lost.
+5. **Run field comparison**: `python3 $TOOLS_DIR/compare.py <file.bib>` — this programmatically compares every entry against CrossRef and returns exact field-level mismatches. Do NOT skip this step or rely on visual comparison alone. The output is a JSON list; each element has `key`, `versions` (a list of alternative CrossRef candidate matches for the same entry, each with `mismatches`, `url`, `doi`, etc.), and `error`. When multiple versions are returned, choose the best matching candidate; do not combine fields from different versions. **Skip rule**: if an entry has zero mismatches across all versions and no error in the compare.py output, skip it entirely — do NOT investigate, modify, or add comments to it. Only proceed with entries that compare.py flagged (mismatches or errors).
+6. **Verify every planned modification with web search** — for entries that compare.py flagged with mismatches or errors, verify the planned action via web search. For `fix` patches, gather one or more source URLs. Entries where `compare.py` returned an error (e.g. "No exact title match") still need full verification — the verification agent should search for the paper and check all fields. **Important: after selecting the best-matching version, verification agents MUST NOT override that selected version's `compare.py` field values.** CrossRef is the authoritative source for metadata (pages, volume, number, etc.) because it receives data directly from publishers via DOI registration. When web search finds a conflicting value (e.g. different page numbers on a conference website), always use the CrossRef value and add `% bibtidy: REVIEW` if desired — but do NOT keep the old value.
 7. **Flag hallucinated/non-existent references** — if compare.py returned an error (e.g. "No CrossRef results found" or "No exact title match in CrossRef results") AND web search also finds no matching paper, the reference likely does not exist. Add `% bibtidy: NOT FOUND — no matching paper on CrossRef or web search; verify this reference exists` above the entry, then comment out the entire entry (prefix every line with `% `). Do NOT add a URL line.
-8. Apply fixes **sequentially** using `edit.py` — do NOT edit the .bib file directly with agent editing tools (for example, Claude Code Edit or Codex `apply_patch`), and do NOT rewrite the entire file. Build a patches.json for each entry (or batch) and run `python3 $TOOLS_DIR/edit.py <file.bib> <patches.json>`. This ensures the commented original, source URLs, and explanation are always included. After selecting the correct version, you MUST apply **every** mismatch from that selected version — do not skip any field (including `number`, `pages`, `volume`). Use the `crossref_value` exactly as given (do NOT rephrase, reformat, or partially apply it). For title mismatches on preprint→published upgrades, replace the entire title with the CrossRef title — do NOT try to edit parts of the old title. Never reject a CrossRef value because another source disagrees. Every patch MUST include `urls` (list of source URLs) and `explanation` (what changed and why). Include the CrossRef URL from compare.py's `url` field when available, plus any other authoritative source (DOI URL, venue page) found via web search.
-9. Run format validation; fix violations and re-run until clean
-10. Delete backup: `rm <file>.bib.orig`
-11. Print a Markdown summary table with headers `Metric | Count` and exactly these rows: total entries, verified, fixed, not found. Do NOT include a separate "needs manual review" row.
+8. Apply fixes **sequentially** using `edit.py` — do NOT edit the .bib file directly with agent editing tools (for example, Claude Code Edit or Codex `apply_patch`), and do NOT rewrite the entire file. Build a patches.json for each entry (or batch) and run `python3 $TOOLS_DIR/edit.py <file.bib> <patches.json>`. This ensures the commented original, source URLs, and explanation are always included. After selecting the correct version, you MUST apply **every** mismatch from that selected version — do not skip any field (including `author`, `number`, `pages`, `volume`). In particular, if the bib entry uses `and others` but CrossRef returns the full author list, you MUST replace the truncated list with the complete one from CrossRef. Use the `crossref_value` exactly as given (do NOT rephrase, reformat, or partially apply it). For title mismatches on preprint→published upgrades, replace the entire title with the CrossRef title — do NOT try to edit parts of the old title. Never reject a CrossRef value because another source disagrees. Every patch MUST include `urls` (list of source URLs) and `explanation` (what changed and why). Include the CrossRef URL from compare.py's `url` field when available, plus any other authoritative source (DOI URL, venue page) found via web search.
+9. **Post-fix exact duplicate removal**: `python3 $TOOLS_DIR/duplicates.py <file.bib> --exact` — entries that were different before fixing may now be identical after metadata corrections. Comment out any new exact duplicates.
+10. **Detect near-duplicates**: `python3 $TOOLS_DIR/duplicates.py <file.bib>` — flag entries that share the same key, DOI, or title (with a shared author), plus likely preprint→published pairs with the same lead author and overlapping significant title words, but are not identical. Apply `duplicate` patches via `edit.py` to add `% bibtidy: DUPLICATE of <other_key>` comments. Do NOT delete or comment out near-duplicates.
+11. Run format validation; fix violations and re-run until clean
+12. Delete backup: `rm <file>.bib.orig`
+13. Print a Markdown summary table with headers `Metric | Count` and exactly these rows: total entries, verified, fixed, not found, exact duplicates removed, near-duplicates flagged. Do NOT include a separate "needs manual review" row.
 
 ## Parallel Verification with Subagents
 
 Use subagents, when available, to verify multiple entries concurrently. This dramatically reduces wall-clock time (e.g., 7 entries: ~1 min parallel vs ~5 min sequential; 100 entries: ~3 min vs ~40 min). If subagents are unavailable, do the same verification work sequentially yourself.
 
-**Step 1 — Dispatch verification agents:** For entries that `compare.py` flagged with mismatches or errors, and any duplicate entries you plan to annotate, launch a subagent that:
+**Step 1 — Dispatch verification agents:** For entries that `compare.py` flagged with mismatches or errors, launch a subagent that:
 - For mismatches: uses web search to confirm the CrossRef data (especially for preprint upgrades and author changes)
 - For errors (e.g. paper not found in CrossRef): uses web search to verify **every** field from scratch — title, author, journal/booktitle, volume, number, pages, year. Do NOT skip number or other fields just because they look plausible.
 - Returns a JSON summary: key, whether each mismatch is confirmed, source URL, CrossRef URL (if there is a CrossRef match), any additional corrections found
```
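Step 8's patch file can be sketched as follows. Only `urls` and `explanation` are requirements stated in the workflow text; the `key`, `type`, and `fields` names below are assumptions about edit.py's input schema, which this commit does not show.

```python
import json

# Hypothetical patches.json content for edit.py. "urls" and "explanation"
# are required by the workflow; "key", "type", and "fields" are assumed
# field names, since edit.py's schema is not part of this diff.
patches = [
    {
        "key": "smith2020",                      # BibTeX key to modify (assumed name)
        "type": "fix",                           # patch kind named in the workflow
        "fields": {"pages": "123--135", "volume": "42"},  # CrossRef values, verbatim
        "urls": ["https://doi.org/10.1000/example"],      # required: source URLs
        "explanation": "pages and volume corrected to CrossRef values",  # required
    }
]

with open("patches.json", "w") as f:
    json.dump(patches, f, indent=2)
# Then apply sequentially: python3 $TOOLS_DIR/edit.py refs.bib patches.json
```

The DOI above is a placeholder; in a real run the URLs would come from compare.py's `url` field and web-search verification.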
````diff
@@ -134,19 +137,19 @@ Entry:
 
 ## Duplicate Detection
 
-```
-python3 $TOOLS_DIR/duplicates.py <file.bib>
-```
+Duplicate handling has three phases (see workflow steps 4, 9, 10):
+
+**Exact duplicates** (same key, type, and all field values): `python3 $TOOLS_DIR/duplicates.py <file.bib> --exact` comments them out automatically. Run before and after metadata fixes.
 
-Returns JSON array of duplicate pairs (by key, DOI, or title). For each duplicate, add: `% bibtidy: DUPLICATE of <other_key> — consider removing`
+**Near-duplicates** (same key, DOI, or title with shared author, plus likely preprint→published pairs with the same lead author and overlapping significant title words, but different content): `python3 $TOOLS_DIR/duplicates.py <file.bib>` returns a JSON array of pairs. For each, apply a `duplicate` patch via `edit.py` to add `% bibtidy: DUPLICATE of <other_key>`. Do NOT delete or comment out near-duplicates.
 
 ## Per-Entry Checks
 
 For each `@article`, `@inproceedings`, `@book`, etc.:
 
 **1. Verify existence** — Search for `"<title>" <first author last name>`. If not found: `% bibtidy: NOT FOUND — verify manually`
 
-**2. Cross-check metadata** — Always search via `crossref.py search "<title>"`. If DOI exists, also fetch via `crossref.py doi <DOI>`. If neither finds a match, fall back to `crossref.py bibliographic "<title>"`. Compare title, year, authors, journal, volume, number, pages.
+**2. Cross-check metadata** — `compare.py` runs both `crossref.py search "<title>"` and `crossref.py bibliographic "<title>"` unconditionally, plus `crossref.py doi <DOI>` when a DOI exists, deduplicating results by DOI. Only exact normalized title matches are kept. Compare title, year, authors, journal, volume, number, pages.
 
 **3. Check for published preprints** — If journal contains "arxiv"/"biorxiv"/"chemrxiv", search for published version. Update title, venue, year, volume, pages, entry type. Only update if confirmed via DOI or two independent sources.
 
````
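The near-duplicate heuristics in the rewritten section can be sketched in a few lines. duplicates.py itself is not part of this diff, so the function names, stopword list, and overlap threshold below are illustrative assumptions rather than the tool's actual logic.

```python
import re

# Assumed stopword list; duplicates.py's real notion of "significant" words is not shown.
STOPWORDS = {"a", "an", "the", "of", "on", "for", "and", "in", "with"}

def normalize_title(title: str) -> str:
    # Lowercase and strip everything except letters, digits, and spaces.
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def significant_words(title: str) -> set[str]:
    return {w for w in normalize_title(title).split() if w not in STOPWORDS}

def looks_like_preprint_pair(a: dict, b: dict) -> bool:
    # Same lead author plus mostly-overlapping significant title words.
    same_lead = a["authors"][0] == b["authors"][0]
    wa, wb = significant_words(a["title"]), significant_words(b["title"])
    overlap = len(wa & wb) / max(len(wa | wb), 1)
    return same_lead and overlap >= 0.8  # threshold is a guess

arxiv = {"title": "Deep Learning for X", "authors": ["Fu, Yunguan"]}
published = {"title": "Deep learning for X.", "authors": ["Fu, Yunguan"]}
print(looks_like_preprint_pair(arxiv, published))  # → True
```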

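The parallel-verification idea from the SKILL.md section can be sketched with a thread pool: flagged entries are verified concurrently rather than one at a time. `verify_entry` here is a hypothetical stand-in for the web-search subagent the skill describes.

```python
from concurrent.futures import ThreadPoolExecutor

def verify_entry(key: str) -> dict:
    # Placeholder: a real verification agent would web-search the entry's
    # metadata and return confirmed mismatches plus source URLs.
    return {"key": key, "confirmed": True, "urls": []}

flagged = ["smith2020", "doe2021", "fu2022"]
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() preserves input order, so results line up with flagged keys.
    results = list(pool.map(verify_entry, flagged))

print([r["key"] for r in results])  # → ['smith2020', 'doe2021', 'fu2022']
```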
skills/bibtidy/tools/compare.py

Lines changed: 20 additions & 13 deletions

```diff
@@ -18,7 +18,8 @@
 import sys
 
 from crossref import fetch_doi, search_bibliographic, search_title
-from duplicates import normalize_doi, normalize_title, parse_bib_entries, split_bibtex_authors
+from duplicates import normalize_doi, normalize_title, split_bibtex_authors
+from parser import parse_bib_entries
 
 
 def _normalize_pages(pages: str) -> str:
@@ -185,34 +186,40 @@ def lookup_and_compare(entry: dict, timeout: int = 10) -> dict:
         result["error"] = "No DOI or title to search"
         return result
 
-    # Collect CrossRef results from multiple strategies.
+    # Collect CrossRef results from all strategies, then deduplicate.
     matches = []
+    seen_dois = set()
     last_error = None
     bib_title_norm = normalize_title(title)
 
-    def _search_and_filter(search_fn, query):
+    def _add(item):
+        item_doi = item.get("doi")
+        if item_doi and item_doi in seen_dois:
+            return
+        if item_doi:
+            seen_dois.add(item_doi)
+        matches.append(item)
+
+    def _search(search_fn, query):
         nonlocal last_error
         cr = search_fn(query, rows=3, timeout=timeout)
         if "error" in cr:
             last_error = cr["error"]
-            return []
-        return [item for item in cr.get("results", [])
-                if normalize_title(item.get("title") or "") == bib_title_norm]
+            return
+        for item in cr.get("results", []):
+            if normalize_title(item.get("title") or "") == bib_title_norm:
+                _add(item)
 
     if title:
-        matches = _search_and_filter(search_title, title)
+        _search(search_title, title)
+        _search(search_bibliographic, title)
 
     if doi:
         cr = fetch_doi(normalize_doi(doi), timeout=timeout)
         if "error" in cr:
             last_error = cr["error"]
         else:
-            existing_dois = {m.get("doi") for m in matches}
-            if cr.get("doi") not in existing_dois:
-                matches.append(cr)
-
-    if not matches and title:
-        matches = _search_and_filter(search_bibliographic, title)
+            _add(cr)
 
     if not matches:
         result["error"] = last_error or "No exact title match in CrossRef results"
```
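The collection logic this hunk introduces can be exercised in isolation. Below is a minimal sketch of the dedup-by-DOI pattern: every search strategy funnels candidates through a single `_add()` gate keyed on DOI. The CrossRef calls are replaced with plain dicts; only the collection logic mirrors the diff.

```python
def make_collector():
    """Return a shared matches list and an _add() gate that dedups by DOI."""
    matches, seen_dois = [], set()

    def _add(item):
        doi = item.get("doi")
        if doi and doi in seen_dois:
            return  # already collected via another strategy
        if doi:
            seen_dois.add(doi)
        matches.append(item)  # items without a DOI are always kept

    return matches, _add

matches, _add = make_collector()
# Simulate search_title, search_bibliographic, and fetch_doi all
# surfacing the same record plus one distinct one.
_add({"doi": "10.1/x", "title": "Paper"})
_add({"doi": "10.1/x", "title": "Paper"})   # duplicate DOI, skipped
_add({"doi": "10.2/y", "title": "Other"})
print(len(matches))  # → 2
```

The design point is that deduplication now happens at insertion time rather than being re-derived per strategy, which is what lets the commit run title and bibliographic searches unconditionally.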

0 commit comments

Comments
 (0)