Methodology

Capture

Every agency in the archive is crawled on a fixed cadence: weekly for priority tier 1, every two weeks for tier 2, and monthly for tier 3. Each URL is rendered in a headless Chromium browser, and the rendered capture is hashed (SHA-256), stored to MinIO, submitted to the Wayback Machine and archive.today, and pinned to IPFS. Sub-resources are serialized to WARC. Capture is append-only: a re-fetch produces a new row, not a mutation.
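The cadence and the append-only rule can be sketched as follows. This is a minimal illustration rather than the archive's actual code: `CADENCE_DAYS`, `Capture`, `CaptureLog`, and `next_crawl` are hypothetical names, the two-week and 30-day intervals are assumed readings of "biweekly" and "monthly", and an in-memory list stands in for the real store.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed mapping from priority tier to crawl interval in days
# (biweekly read as every 14 days, monthly approximated as 30 days).
CADENCE_DAYS = {1: 7, 2: 14, 3: 30}


@dataclass(frozen=True)
class Capture:
    """One immutable capture row (hypothetical shape)."""
    url: str
    fetched_at: datetime
    sha256: str


class CaptureLog:
    """Append-only log: a re-fetch adds a row and never mutates an old one."""

    def __init__(self) -> None:
        self._rows: list[Capture] = []

    def record(self, url: str, body: bytes, fetched_at: datetime) -> Capture:
        # Hash the captured bytes, then append a fresh row.
        row = Capture(url, fetched_at, hashlib.sha256(body).hexdigest())
        self._rows.append(row)
        return row

    def history(self, url: str) -> list[Capture]:
        # All captures of one URL, in insertion order.
        return [r for r in self._rows if r.url == url]


def next_crawl(last_crawl: datetime, priority_tier: int) -> datetime:
    """When an agency of the given tier is next due."""
    return last_crawl + timedelta(days=CADENCE_DAYS[priority_tier])
```

Because rows are frozen and only ever appended, two captures of the same URL with different hashes are themselves the evidence that the page changed.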

Extraction

Captures pass through three tiers of extractors. Tier A adapters are hand-tuned per source (SIU Ontario director's reports, BEI Québec investigation detail pages, IIO BC Chief Civilian Director decisions, SIRT-NL director's reports, CanLII OCPC decisions). Tier B templates cover document families across municipalities. Tier C is an LLM fallback (Anthropic Claude) that handles every captured document that no Tier A adapter or Tier B template covers.
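One way to picture the tier ordering is a simple dispatch that tries the most specific extractor first. The sketch below is an assumption about how such routing could look, not the production router: `extract`, the registry dictionaries, and the rule that an extractor returning `None` falls through to the next tier are all hypothetical, and the Tier C callable is a stand-in for the LLM call rather than a real API client.

```python
from typing import Callable, Optional

# An extractor returns a dict of fields, or None if it does not apply.
Extractor = Callable[[str], Optional[dict]]


def extract(document: str, source: str, family: Optional[str],
            tier_a: dict[str, Extractor],
            tier_b: dict[str, Extractor],
            tier_c: Extractor) -> tuple[str, dict]:
    """Try Tier A by source, then Tier B by document family, then Tier C."""
    adapter = tier_a.get(source)
    if adapter is not None:
        result = adapter(document)
        if result is not None:
            return "A", result

    template = tier_b.get(family) if family is not None else None
    if template is not None:
        result = template(document)
        if result is not None:
            return "B", result

    # Fallback: whatever neither A nor B could handle goes to Tier C.
    result = tier_c(document)
    return "C", (result or {})
```

Returning the tier label alongside the fields keeps it visible downstream which records came from a hand-tuned adapter and which came from the fallback.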

Purge detection

The Ledger runs four detector families that watch for record removal:

HTTP 404/410/451: a URL that previously returned 200 now returns 404 (Not Found), 410 (Gone), or 451 (Unavailable For Legal Reasons). Confirmed after three consecutive failures over at least seven days.
Content removal: a URL still returns 200, but the captured content has shrunk by more than 30% between successive captures.
Redirect-to-generic: a URL that used to serve its own content now 302s to one of the agency's top-20 landing pages (home, news, 404 shell). Flagged on first sighting, since a redirect to a generic page is a decisive signal.
Index delisting: a URL that was referenced on a same-agency roster page (one with at least five in-archive links) no longer appears there, while the roster itself still serves content. This is the highest-signal detector, since it represents a targeted editorial action rather than a broken link.
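The four rules above can be sketched as small predicates. This is illustrative only: the function names and input shapes are hypothetical, content size is assumed to be measured as a length (the unit is not specified here), and "over at least seven days" is read as the span between the first and last failure in the trailing run.

```python
from datetime import datetime, timedelta

GONE_STATUSES = {404, 410, 451}


def http_purge_confirmed(checks: list[tuple[datetime, int]]) -> bool:
    """checks: (time, status) pairs for one URL, oldest first.

    True when the URL returned 200 at some point and the trailing run of
    failures has at least three entries spanning at least seven days.
    """
    run: list[datetime] = []
    for when, status in reversed(checks):
        if status not in GONE_STATUSES:
            break
        run.append(when)  # run[0] is the newest failure, run[-1] the oldest
    seen_ok = any(status == 200 for _, status in checks)
    return seen_ok and len(run) >= 3 and run[0] - run[-1] >= timedelta(days=7)


def content_removed(previous_len: int, current_len: int, status: int = 200) -> bool:
    """True when a 200 response shrank by more than 30% since the last capture."""
    if status != 200 or previous_len <= 0:
        return False
    return (previous_len - current_len) / previous_len > 0.30


def redirect_to_generic(status: int, location: str, landing_pages: set[str]) -> bool:
    """True on the first 302 that points at a known generic landing page."""
    return status == 302 and location in landing_pages


def delisted(roster_before: set[str], roster_after: set[str],
             archived: set[str], roster_status: int = 200) -> set[str]:
    """In-archive URLs that vanished from a roster page that still serves content."""
    tracked_before = roster_before & archived
    # Only pages with at least five in-archive links count as rosters.
    if roster_status != 200 or len(tracked_before) < 5:
        return set()
    return tracked_before - roster_after
```

Note the asymmetry: the HTTP detector needs confirmation over time because outages are common, whereas the redirect and delisting checks fire on a single observation.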

Transparency grading

Each agency's grade is computed daily from extracted incidents and detected purges. The rubric is explicit:

F — any confirmed purges, OR publication-ban rate ≥ 50%, OR no incidents at all (likely nothing is published).
D — publication-ban rate ≥ 25%.
C — publication-ban rate ≥ 10%, or finding rate is zero.
B — finding rate ≥ 5%.
A — low ban rate, some findings, no purges.
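Read as a decision procedure, the rubric might look like the sketch below. Two things here are assumptions rather than stated rules: rows are checked top to bottom with the first match winning (the overlapping ban-rate thresholds suggest this), and A is simply whatever remains once the earlier rows have been ruled out. The function name `grade` and its parameters are hypothetical, with rates expressed as fractions between 0 and 1.

```python
def grade(incidents: int, confirmed_purges: int,
          ban_rate: float, finding_rate: float) -> str:
    """Transcribes the rubric rows literally; the first matching row wins."""
    if confirmed_purges > 0 or ban_rate >= 0.50 or incidents == 0:
        return "F"
    if ban_rate >= 0.25:
        return "D"
    if ban_rate >= 0.10 or finding_rate == 0:
        return "C"
    if finding_rate >= 0.05:
        return "B"
    # Remaining case: ban rate under 10%, a nonzero finding rate, no purges.
    return "A"
```

For example, under these assumptions an agency with 40 incidents, no purges, and a 30% publication-ban rate lands on D regardless of its finding rate.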

Anonymization

The public product never publishes officer names. Instead, every incident carries an HMAC-SHA256 token of the officer's name, keyed with a salt held outside the application database. Because the salt never sits in the app DB, a compromised app DB does not compromise name→token linkage, and rotating the salt re-tokenizes the corpus. Dates are published to quarter-level precision only. Records wait at least 60 days from disposition before publication. Cells with fewer than five similar records are suppressed or aggregated upward. Agencies with fewer than 50 sworn members publish only yearly aggregates. Publication bans are honored unconditionally.
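A minimal sketch of these safeguards, using only the Python standard library, is shown below. The helper names, the name normalization step, and the choice to represent a suppressed cell as `None` are assumptions made for illustration. In practice the salt would be loaded from a secret store outside the application database, which is not shown.

```python
import hashlib
import hmac
from datetime import date, timedelta
from typing import Optional


def officer_token(name: str, salt: bytes) -> str:
    """HMAC-SHA256 of a normalized name, keyed with the externally held salt."""
    normalized = " ".join(name.lower().split())  # assumed normalization
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def to_quarter(d: date) -> str:
    """Coarsen a date to quarter-level precision, e.g. 2024-Q2."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def publishable(disposition: date, today: date, under_ban: bool) -> bool:
    """Bans always block; otherwise wait at least 60 days from disposition."""
    return not under_ban and today - disposition >= timedelta(days=60)


def suppress_small_cells(counts: dict[str, int],
                         k: int = 5) -> dict[str, Optional[int]]:
    """Replace any cell with fewer than k records by None (suppressed)."""
    return {cell: (n if n >= k else None) for cell, n in counts.items()}
```

The same name always maps to the same token under one salt, so incidents can still be linked per officer, while a different salt yields an unrelated token for every name.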