This repository was archived by the owner on Mar 20, 2026. It is now read-only.
Merged
2 changes: 1 addition & 1 deletion Cargo.lock

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "bulbascan"
version = "0.1.1"
version = "0.1.2"
edition = "2024"
rust-version = "1.94"
description = "High-speed selective-proxy scanner for geo-block detection and geosite routing list generation"
28 changes: 28 additions & 0 deletions docs/architecture.md
Expand Up @@ -86,8 +86,36 @@ Unreachable ─ TCP timeout / DNS failure
> **ComparisonDecision** (dual-vantage output):
> `ConfirmedProxyRequired` · `CandidateProxyRequired` · `ConsistentBlocked` · `ConsistentDirect` · `NeedsReview`

> [!NOTE]
> `ConsistentBlocked` means **both** local and control-proxy paths are blocked. This
> could mean the site is dead globally, OR the control proxy is in the same blocked
> jurisdiction as the local path. An external (EU/US) proxy is required to
> correctly separate geo-blocks from genuinely dead domains.

---

## Dual-vantage accuracy model

The scanner classifies each domain independently on two paths:

```
Local path (home/residential IP in blocked country)
├── Accessible → DirectOk
├── GeoBlocked / WAF / Captcha → compare with control
└── Unreachable (timeout) → compare with control

Control path (MUST be external EU/US proxy)
├── DirectOk + Local blocked → ConfirmedProxyRequired ✅
├── DirectOk + Local timeout → ConfirmedProxyRequired ✅ (ISP block)
└── Blocked + Local blocked → ConsistentBlocked (dead or same region)
```

Key invariant: **the tool is only as useful as the jurisdictional separation between
the local path and the control proxy**. A control proxy in the same country as the
local path cannot reveal national-level blocks, because both paths hit the same filtering.
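
The decision table in the diagram can be written as a single `match`. The sketch below is a simplified illustration, not the code in `comparison.rs`: `PathVerdict` is a hypothetical three-state stand-in for the real per-path classification, the `ConsistentDirect` arm is an assumption based on the variant name, and `CandidateProxyRequired` is left out because the diagram does not cover it.

```rust
/// Hypothetical, simplified per-path verdict (the real scanner has more states).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PathVerdict {
    DirectOk,
    Blocked,     // GeoBlocked / WAF / Captcha
    Unreachable, // TCP timeout / DNS failure
}

/// Subset of the dual-vantage outputs that the diagram covers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComparisonDecision {
    ConfirmedProxyRequired,
    ConsistentBlocked,
    ConsistentDirect,
    NeedsReview,
}

/// Combine the local and control verdicts for one domain.
pub fn compare(local: PathVerdict, control: PathVerdict) -> ComparisonDecision {
    use PathVerdict::*;
    match (local, control) {
        // Assumption: both paths working maps to ConsistentDirect.
        (DirectOk, DirectOk) => ComparisonDecision::ConsistentDirect,
        // Control works, local is blocked or silently dropped (ISP block).
        (Blocked | Unreachable, DirectOk) => ComparisonDecision::ConfirmedProxyRequired,
        // Same failure on both paths: dead globally, or proxy in the same region.
        (Blocked, Blocked) | (Unreachable, Unreachable) => ComparisonDecision::ConsistentBlocked,
        // Anything else (for example local works, control fails) is ambiguous.
        _ => ComparisonDecision::NeedsReview,
    }
}
```

Calling `compare(PathVerdict::Unreachable, PathVerdict::DirectOk)` gives `ConfirmedProxyRequired`, which is the ISP-block row of the diagram.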

## Service model

Known services are structured as:
57 changes: 26 additions & 31 deletions docs/roadmap.md
Expand Up @@ -2,35 +2,9 @@

---

## v0.1.1 — Detection Accuracy

### Multi-probe consensus (~80 lines)
For ambiguous results (confidence 60–85), run 2–3 probes to different `probe_paths`. Require ≥2/3 agreement before finalising. Disagreement → `NeedsReview`. Reduces single-probe noise on flaky targets.

### Two-phase header classification (~40 lines)
Split `SIGNATURES_HEADERS` into CDN-presence (informational, confidence capped at 35 — already done on 200 OK via status-gating) and CDN-active-block (full confidence 84). Eliminate short generic patterns from the CDN-presence set so they never score on non-error responses. Completes the intent of the existing status-gating logic.

### Domain confidence adjustment (~20 lines)
Known service with `browser_verification = true` → +5 to captcha/geo confidence. Unknown domain without profile → -10 to WAF confidence. Currently all domains get identical confidence scoring regardless of their profile membership.

### Infra domain marker expansion (~15 lines)
`INFRA_DOMAIN_MARKERS` in `analysis.rs` is missing: `fastly`, `edgekey`, `edgesuite`, `azurewebsites`, `azureedge`, `trafficmanager` (already present but check scope). These are causing occasional WAF false-positives on infrastructure apex roots.

---

## v0.1.2 — Signal Quality

### Redirect chain deep scan (~60 lines)
`classify_redirect()` in `analysis.rs` checks only the final URL. Extend to inspect intermediate redirect hops. Some geo-blocks do a two-step redirect through a neutral CDN URL before landing on a blockpage.

### DOM-aware body scoring (~100 lines)
Weight body matches by HTML location: `<title>` = strongest, `<h1>` = strong, body text = weak. Pages with >20 `<a>` links get penalty halved (real pages have navigation; block pages don't). Requires a lightweight HTML tag scanner — no full DOM parser needed.

### `UnexpectedStatus` reclassification via comparison (~30 lines)
`UnexpectedStatus` (non-2xx without a block signature) always routes to `ManualReview`. When control proxy is present: if local = `UnexpectedStatus` and control = `DirectOk` → upgrade to `CandidateProxyRequired`. `comparison.rs` already handles `ConfirmedProxyRequired` and `CandidateProxyRequired` but misses this specific gap. Reduces manual review queue substantially.

### Confidence boost on consistent network evidence (~20 lines)
In `comparison.rs`, when local TCP/TLS fails but control path DNS resolves → already adds a network note. Extend: when network evidence strongly indicates ISP-level block (local tcp/443 fail + local DNS NXDOMAIN + control path DNS ok) → boost `CandidateProxyRequired` → 90+ confidence. `NetworkEvidence` struct already tracks all required fields.
(All planned features for this release have been implemented or moved to future releases)

---

Expand All @@ -42,8 +16,11 @@ Before HTTP probing, compare ISP DNS response (`resolve_host()` in `network.rs`
### SNI-based block detection (~120 lines)
`probe_tls_443()` in `network.rs` already sends a TLS ClientHello with the target SNI. Extend: if TCP 443 succeeds but TLS handshake resets (already captured in `tls_443.status`) → emit `TlsFailure` with SNI-block reason. Also probe the same IP with a benign SNI (e.g. `cloudflare.com`) — if that succeeds → confirmed SNI block. Uses existing `tokio-rustls` setup.

### ECH probing (~100 lines)
When domain publishes ECH keys in DNS HTTPS record (type 65), attempt TLS with ECH enabled. If ECH succeeds where plain SNI was blocked → confirmed SNI censorship. `rustls` 0.23+ supports ECH experimentally. Pairs with the DNS probe to detect ECH-aware DPI.
### DNS IP comparison in `compare_network_evidence` (~40 lines)
`compare_network_evidence()` in `comparison.rs` generates text notes when local DNS fails and control path DNS succeeds (lines 92–100), but **never compares the actual IP sets**. If local DNS returns a blockpage IP while control returns the real IP, the difference is invisible. Add `parse_ip_from_detail()` on `ProbeEvidence.detail` and compare local vs control resolved IPs — if they differ, emit a `dns_ip_mismatch` note that boosts `CandidateProxyRequired` confidence.
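
A minimal sketch of the comparison step, assuming the resolved addresses have already been extracted into sets (the format of `ProbeEvidence.detail` is not shown here, so parsing is left out). The function name `dns_ip_mismatch` and the note text are illustrative only.

```rust
use std::collections::BTreeSet;
use std::net::IpAddr;

/// Returns a `dns_ip_mismatch` note when both paths resolved at least one
/// address but have none in common. Disjoint sets are only a hint: CDNs and
/// GeoDNS can legitimately return different addresses per region.
pub fn dns_ip_mismatch(local: &BTreeSet<IpAddr>, control: &BTreeSet<IpAddr>) -> Option<String> {
    // A failed lookup on either side is a different signal, handled elsewhere.
    if local.is_empty() || control.is_empty() {
        return None;
    }
    if local.is_disjoint(control) {
        Some(format!("dns_ip_mismatch: local={local:?} control={control:?}"))
    } else {
        None
    }
}
```

Requiring fully disjoint sets, rather than any difference at all, keeps round-robin DNS from triggering the note. Whether that is strict enough for CDN-hosted domains is an open design question.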

### TCP-80 probe result unused (~10 lines)
`collect_network_evidence()` in `network.rs` probes `tcp/80` and stores it in `NetworkEvidence.tcp_80`, but `compare_network_evidence()` in `comparison.rs` never reads it. Either use it (local tcp/80 up but local tcp/443 down → likely port-level block) or remove the probe to avoid dead runtime cost.

### IPv6 dual-stack probing (~60 lines)
Extend `collect_network_evidence()` in `network.rs` to probe AAAA records alongside A records. Report when IPv4 is blocked but IPv6 works. `reqwest` and `tokio` support IPv6 natively. Add `ipv6` field to `NetworkEvidence`.
Expand All @@ -52,8 +29,14 @@ Extend `collect_network_evidence()` in `network.rs` to probe AAAA records alongs

## v0.1.4 — Output & Export Formats

### `geoip.dat` generation (~150 lines)
DNS-resolve blocked domains → collect A/AAAA records → aggregate into CIDR subnets → compile into V2Ray `GeoIP` protobuf binary. Same `prost` setup already used in `geosite.rs`. No new dependencies needed.
### GeoIP — output: `geoip.dat` generation (~150 lines)
DNS-resolve blocked domains → collect A/AAAA records → aggregate into CIDR subnets → compile into V2Ray `GeoIP` protobuf binary. Same `prost` setup already used in `geosite.rs`, no new dependencies. Flag: `--emit-geoip geoip.dat`.

### GeoIP — input: `geoip.dat` import (~100 lines)
Mirror of `--import-geosite`: add `--import-geoip geoip.dat --import-geoip-category RU`. Decode CIDR blocks from the binary V2Ray `GeoIP` protobuf (same `prost` schema), then either reverse-rDNS each range or pass subnets directly as scan targets. Useful when the starting point is an IP blocklist rather than domain names.

### GeoIP — blockpage IP fingerprinting (~50 lines)
In `collect_network_evidence()`: if local DNS resolves to a known blockpage IP (built-in list or user file via `--blockpage-ips blockpage_ips.txt`), immediately emit a `dns_blockpage_ip` verdict with a confidence boost instead of a plain text note. Known examples: `95.213.255.1` (Rostelecom), `188.186.154.90` (MTS), `188.114.97.0/24` (Cloudflare WARP block range). Pairs with the DNS IP comparison item in v0.1.3.
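
The lookup itself needs only a small CIDR matcher. The following is a sketch under stated assumptions: IPv4 only, entries given as either a bare address or `addr/len`, and the names `Cidr` and `is_blockpage_ip` are hypothetical rather than existing Bulbascan types.

```rust
use std::net::Ipv4Addr;

/// An IPv4 network such as `188.114.97.0/24`; a bare address is treated as /32.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Cidr {
    base: u32,
    mask: u32,
}

impl Cidr {
    /// Parses `a.b.c.d` or `a.b.c.d/len`; returns `None` on malformed input.
    pub fn parse(s: &str) -> Option<Cidr> {
        let s = s.trim();
        let (addr, len) = match s.split_once('/') {
            Some((a, l)) => (a, l.parse::<u32>().ok()?),
            None => (s, 32),
        };
        if len > 32 {
            return None;
        }
        let ip: Ipv4Addr = addr.parse().ok()?;
        // A /0 mask is special-cased because shifting a u32 by 32 bits overflows.
        let mask = if len == 0 { 0 } else { u32::MAX << (32 - len) };
        Some(Cidr { base: u32::from(ip) & mask, mask })
    }

    /// True when `ip` falls inside this network.
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        (u32::from(ip) & self.mask) == self.base
    }
}

/// True when `ip` matches any entry of the blockpage list.
pub fn is_blockpage_ip(list: &[Cidr], ip: Ipv4Addr) -> bool {
    list.iter().any(|c| c.contains(ip))
}
```

A user file passed via `--blockpage-ips` could be loaded by running `Cidr::parse` on each line and skipping lines that fail to parse.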

### Clash / Mihomo rule-set export (~50 lines)
Add a `write_clash_rule_set()` to `router_exports.rs` using the existing `RouterExportSpec` pattern:
Expand Down Expand Up @@ -91,6 +74,12 @@ Currently `--state-dir` is a single directory. Add `--merge-state-dir` to ingest
### Auto-rescan of `manual_review` bucket (~20 lines)
`manual_review.txt` entries are always rescanned, but there is no mechanism to promote them after N failed rescans. Add a counter per domain — after 3 consecutive `ManualReview` results with no resolution, downgrade to `direct.txt` with a note, or flag as permanently inconclusive.

### State: no `manual_review` counter / promotion logic (~30 lines)
`LocalState` in `state.rs` stores `manual_review` as a plain `BTreeSet<String>` with no per-domain counter. The "auto-rescan" roadmap item already tracks this, but the data model must change first: replace each plain string with a `(domain, attempt_count)` tuple, stored in the file as `domain\t<n>`. `read_domain_file()` / `write_domain_file()` need updating before the promotion logic can be wired in.
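
One possible shape for the `domain\t<n>` format is sketched below. It is not the existing `read_domain_file()` / `write_domain_file()` code: the function names are hypothetical, and treating a bare `domain` line as zero attempts is an assumption made so that files in the old format still load.

```rust
use std::collections::BTreeMap;

/// Parses `domain\t<n>` lines into a map of attempt counts.
/// A line without a tab (old format) or with an unparsable count yields 0.
pub fn parse_review_file(text: &str) -> BTreeMap<String, u32> {
    let mut out = BTreeMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let (domain, count) = match line.split_once('\t') {
            Some((d, n)) => (d, n.trim().parse::<u32>().unwrap_or(0)),
            None => (line, 0),
        };
        out.insert(domain.to_string(), count);
    }
    out
}

/// Renders the map back to text, one `domain\t<n>` line per entry, sorted by domain.
pub fn render_review_file(entries: &BTreeMap<String, u32>) -> String {
    entries
        .iter()
        .map(|(domain, n)| format!("{domain}\t{n}\n"))
        .collect()
}
```

Using a `BTreeMap` keeps the file sorted and deduplicated, as the current `BTreeSet<String>` does.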

### Periodic state flush (~30 lines)
With `--state-dir`, state is committed to disk only once at the very end of `main()`. If the user kills the process mid-scan or the proxy crashes, progress is lost. Add a periodic flush: every N domains processed (e.g. 1000), call `local_state.save(dir)`. The `save()` method already exists and is async.
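
The flush cadence can be sketched independently of the async runtime. In the example below, `handle` stands in for scanning one domain and `flush` for a call such as `local_state.save(dir)`. Both are hypothetical closures, and the real scanner processes domains concurrently rather than in a plain loop.

```rust
/// Processes `domains` in order, calling `flush` after every `every` items
/// and once more at the end if unflushed work remains.
/// Returns the number of flushes performed.
pub fn process_with_flush<F, G>(domains: &[&str], every: usize, mut handle: F, mut flush: G) -> usize
where
    F: FnMut(&str),
    G: FnMut(),
{
    assert!(every > 0, "flush interval must be non-zero");
    let mut flushes = 0;
    let mut pending = 0;
    for &domain in domains {
        handle(domain);
        pending += 1;
        if pending == every {
            flush();
            flushes += 1;
            pending = 0;
        }
    }
    // Final flush so a partial last batch is not lost.
    if pending > 0 {
        flush();
        flushes += 1;
    }
    flushes
}
```

With this cadence, at most `every` domains of progress are lost if the process is killed between flushes.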

---

## v0.1.6 — Developer & Quality
Expand All @@ -106,3 +95,9 @@ A curated set of domains with known expected outcomes (annotated as `geo domain.

### Windows starter pack
Release archive: `bulbascan.exe` + `profiles.toml` + `example-domains.txt` + `QUICKSTART.txt` (3 lines). Reduces friction for non-technical users in the target audience.

### User-loadable signatures file (~80 lines)
`signatures.rs` currently compiles all block signatures (body, header, API patterns) into the binary as Rust `const` arrays. Add support for loading an optional `signatures.toml` alongside the executable that extends or overrides the built-in set. The `BlockMatcher::new(file)` path already accepts an `Option<&Path>` — it just needs a TOML parser for the same schema.

### Cancellable retry sleep (~10 lines)
In `scan_domain()`, the `tokio::time::sleep()` between retest attempts (line ~768) is not cancellation-aware. If the user presses `q` during the backoff sleep, the worker does not react until the sleep expires. Wrap with `tokio::select! { () = sleep => {}, () = ct.cancelled() => break }`.
45 changes: 45 additions & 0 deletions docs/usage.md
Expand Up @@ -169,6 +169,15 @@ Use `aggressive` only when you need the deepest possible confirmation and accept

Running with `--control-proxy` (short: `-x`) enables dual-vantage mode — each domain is scanned both locally and through the proxy, and verdicts are compared. This is the strongest path to geo-confirmation.

> [!IMPORTANT]
> **The control proxy MUST be located outside your blocked jurisdiction.**
> If it is in the same country or AS as your local connection, both paths see the
> same ISP blocks: domains will show `ConsistentBlocked` and will NOT be added to
> your proxy list. Use a proxy in the EU or US.
>
> Quick check: if `confirmed_proxy_required` output is suspiciously small (<1% of
> your list), the control proxy is likely in the wrong region.

Supported proxy formats:
- `http://user:pass@host:port`
- `socks5://user:pass@host:port`
Expand All @@ -190,6 +199,42 @@ bulbascan -i domains.txt -x socks5h://127.0.0.1:1080

---

## Understanding the results

### Why `unreachable` matters

Without a control proxy, `Unreachable` domains (DNS timeout / TCP timeout) are reported as `Dead`. Many of them are actually ISP-level blocks, which are indistinguishable from truly dead servers when observed from a single vantage point:

| Without `-x` | With `-x` (external proxy) |
|---|---|
| Local: timeout → `Unreachable` | Local: timeout, Proxy: `DirectOk` → `ConfirmedProxyRequired` ✅ |
| Local: timeout → `Unreachable` | Local: timeout, Proxy: timeout → `ConsistentBlocked` (globally dead) |

This is why running without an external control proxy yields only a fraction of the real blocked list.

### WAF / captcha are not always global bot-protection

If you run Bulbascan from a **residential IP** in a country with active internet censorship:
- A Cloudflare managed challenge served to your residential IP may be a geo-specific block rule configured by the site owner
- A WAF 403 from Akamai / Imperva may be an IP-range block targeting your country

Without a control proxy, these stay in `manual_review`. With an external proxy:
- Proxy sees `DirectOk` → `ConfirmedProxyRequired` (it IS a geo-block)
- Proxy also gets WAF → `ConsistentBlocked` (global bot protection, not your problem)

### Expected results (residential IP, ru-blocked list, external EU/US proxy)

```
proxy_required / confirmed_proxy_required: 3 000 – 10 000
unreachable (truly dead globally): 5 000 – 8 000
direct_ok: 40 000 – 55 000
manual_review (still ambiguous): < 5 000
```

If `proxy_required` is below 1 000 on a 75 K ru-blocked list, the control proxy is likely in the wrong region.

---

## Incremental state

Use `--state-dir` (short: `-W`) to build a persistent local block list over time:
93 changes: 89 additions & 4 deletions src/main.rs
Expand Up @@ -45,6 +45,39 @@ fn save_workers(path: &str, n: usize) {
let _ = std::fs::write(path, n.to_string());
}

/// Look up the 2-letter country code seen from `proxy` (or from the local path
/// when `proxy` is `None`) by querying <https://ipinfo.io/country>.
/// Returns `None` on any error (timeout, parse failure, etc.) — the check is
/// advisory and must never block the scan.
async fn fetch_country(proxy: Option<&str>, timeout_secs: u64) -> Option<String> {
use std::time::Duration;

let mut builder = reqwest::Client::builder()
.timeout(Duration::from_secs(timeout_secs.min(8)))
.user_agent("bulbascan-geo-check/1");

if let Some(proxy_url) = proxy {
builder = builder.proxy(reqwest::Proxy::all(proxy_url).ok()?);
}

let client = builder.build().ok()?;
let text = client
.get("https://ipinfo.io/country")
.send()
.await
.ok()?
.text()
.await
.ok()?;

let country = text.trim().to_uppercase();
if country.len() == 2 && country.chars().all(|c| c.is_ascii_alphabetic()) {
Some(country)
} else {
None
}
}

#[tokio::main]
#[allow(clippy::too_many_lines)]
async fn main() -> anyhow::Result<()> {
Expand Down Expand Up @@ -401,10 +434,11 @@ async fn main() -> anyhow::Result<()> {
// Output path string passed to LiveBar for the dynamic profile header.
let output_display = args.results_dir.display().to_string();

let mut final_workers = concurrency;
let scan_results = if domains.is_empty() {
Vec::new()
} else {
let Some((results, final_workers)) = scanner::run_scan(
let Some((results, fw)) = scanner::run_scan(
domains.clone(),
proxies,
concurrency,
Expand All @@ -429,7 +463,8 @@ async fn main() -> anyhow::Result<()> {
return Ok(());
};
// Persist the live-adjusted worker count for next run.
save_workers(WORKERS_FILE, final_workers);
final_workers = fw;
save_workers(WORKERS_FILE, fw);
results
};

Expand Down Expand Up @@ -545,13 +580,63 @@ async fn main() -> anyhow::Result<()> {
Err(e) => eprintln!("Error writing control proxy health: {e}"),
}

// ── Proxy geo-location validation ─────────────────────────────────────
// Warn the user if the control proxy appears to be in the same country
// as the local network path — in that case geo-blocking is undetectable.
{
let local_country = fetch_country(None, args.timeout).await;
let proxy_country = fetch_country(Some(&control_proxy), args.timeout).await;
match (local_country.as_deref(), proxy_country.as_deref()) {
(Some(local), Some(proxy)) if local.eq_ignore_ascii_case(proxy) => {
eprintln!(
"[WARN] Control proxy is in the same country ({local}) as the local path."
);
eprintln!(
" Geo-blocking will NOT be detectable. Use an external EU/US proxy."
);
}
(Some(local), Some(proxy)) => {
println!("Control proxy geo-check: local={local}, proxy={proxy} — OK");
}
_ => {
// ipinfo.io unreachable, non-fatal
}
}
}

if scanner::should_run_control_comparison(&control_health) {
// ── Smart comparison pre-filter ───────────────────────────────────
// Domains already confirmed direct_ok locally do not need a second
// scan — the control proxy won't change a working result. Only send
// unreachable, blocked, and manual_review domains for comparison.
// This reduces the comparison scan domain count by ~55% on typical
// RU/BY blocked lists where ~41 k of 75 k domains are direct_ok.
let comparison_domains: Vec<String> = domains
.iter()
.filter(|d| {
!scan_results.iter().any(|r| {
r.domain == **d && r.routing_decision == scanner::RoutingDecision::DirectOk
})
})
.cloned()
.collect();
let skipped_direct = domains.len().saturating_sub(comparison_domains.len());
if skipped_direct > 0 {
println!(
"Comparison pre-filter: skipping {skipped_direct} direct_ok domains (scanning {} domains via control proxy).",
comparison_domains.len()
);
}

// Blank separator so the second scan's progress bar does not
// overwrite the first scan's completion messages.
println!();
println!("Control proxy is healthy. Running comparison scan...");

let Some((control_results, _)) = scanner::run_scan(
domains,
comparison_domains,
vec![control_proxy],
concurrency,
final_workers,
args.results_dir.join("control_ok.log"),
args.results_dir.join("control_blocked.log"),
args.timeout,
Expand Down
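
The comparison pre-filter in the `src/main.rs` hunk above walks `scan_results` once per domain, so its cost grows with domains × results. On lists of the size mentioned in the comment (about 75 k domains) a set lookup may be noticeably cheaper. The sketch below shows one way to get the same filtering in linear time. `ScanResult` and `RoutingDecision` here are minimal stand-ins that reuse names from the diff, not the real `scanner` types.

```rust
use std::collections::HashSet;

/// Stand-in for `scanner::RoutingDecision` (only the variants named in this PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoutingDecision {
    DirectOk,
    ProxyRequired,
    ManualReview,
}

/// Stand-in for the scan result type; only the two fields the filter reads.
pub struct ScanResult {
    pub domain: String,
    pub routing_decision: RoutingDecision,
}

/// Keeps every domain that has no `DirectOk` result, preserving input order.
pub fn comparison_domains(domains: &[String], results: &[ScanResult]) -> Vec<String> {
    // Build the set of locally reachable domains once.
    let direct_ok: HashSet<&str> = results
        .iter()
        .filter(|r| r.routing_decision == RoutingDecision::DirectOk)
        .map(|r| r.domain.as_str())
        .collect();

    domains
        .iter()
        .filter(|d| !direct_ok.contains(d.as_str()))
        .cloned()
        .collect()
}
```

The behaviour matches the original filter: a domain is skipped when any of its results is `DirectOk`, and domains with no result at all are still sent to the control proxy.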
15 changes: 13 additions & 2 deletions src/pipeline.rs
Expand Up @@ -51,11 +51,22 @@ pub(crate) fn blocked_domains_from_results(results: &[ScanResult]) -> Vec<String
domains
}

/// Collect confirmed-proxy-required domains from control comparison results.
/// Collect proxy-required domains from control comparison results.
/// Includes both `ConfirmedProxyRequired` (local routing was already
/// `ProxyRequired`) and `CandidateProxyRequired` (locally `ManualReview`
/// but control proxy proved the domain is accessible externally — these
/// are silent ISP-level drops that the dual-vantage scan is specifically
/// designed to detect).
pub(crate) fn blocked_domains_from_comparisons(comparisons: &[ComparisonResult]) -> Vec<String> {
let mut domains = comparisons
.iter()
.filter(|c| c.decision == ComparisonDecision::ConfirmedProxyRequired)
.filter(|c| {
matches!(
c.decision,
ComparisonDecision::ConfirmedProxyRequired
| ComparisonDecision::CandidateProxyRequired
)
})
.map(|c| c.domain.clone())
.collect::<Vec<_>>();
domains.sort();