Methodology
Capture
Every agency in the archive is crawled on a fixed cadence (weekly for priority-tier 1, biweekly for 2, monthly for 3). Each captured URL is rendered in a headless Chromium browser, hashed (SHA-256), stored to MinIO, submitted to the Wayback Machine and archive.today, and pinned to IPFS. Sub-resources are serialized to WARC. Capture is append-only — a re-fetch produces a new row, not a mutation.
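The hashing and append-only behaviour described above can be sketched as follows. This is a minimal in-memory illustration, not the archive's actual storage code: the names CaptureRow, CaptureLog, and next_due are hypothetical, and "monthly" is approximated as 30 days.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Crawl interval per priority tier (assumption: "monthly" is treated as 30 days).
CADENCE = {1: timedelta(weeks=1), 2: timedelta(weeks=2), 3: timedelta(days=30)}


@dataclass(frozen=True)
class CaptureRow:
    url: str
    sha256: str
    fetched_at: datetime


class CaptureLog:
    """Append-only log: rows are added, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: list[CaptureRow] = []

    def record(self, url: str, body: bytes, fetched_at: datetime) -> CaptureRow:
        # A re-fetch of the same URL always produces a new row.
        row = CaptureRow(url, hashlib.sha256(body).hexdigest(), fetched_at)
        self._rows.append(row)
        return row

    def history(self, url: str) -> list[CaptureRow]:
        return [r for r in self._rows if r.url == url]


def next_due(last_fetch: datetime, tier: int) -> datetime:
    """Earliest time the next crawl is due for an agency in the given tier."""
    return last_fetch + CADENCE[tier]
```

Keeping every row makes it possible to compare successive captures of the same URL later, which the purge detectors rely on.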
Extraction
Captures pass through three tiers of extractors. Tier A adapters are hand-tuned per source (SIU Ontario director's reports, BEI Québec investigation detail pages, IIO BC Chief Civilian Director decisions, SIRT-NL director's reports, CanLII OCPC decisions). Tier B templates cover document families across municipalities. Tier C is an LLM fallback (Anthropic Claude) that handles every captured document for which neither a Tier A adapter nor a Tier B template applies.
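One way to picture the tiering is as a dispatch that tries the most specific extractor first. The sketch below assumes the tiers are tried in the order A, B, C. The registries, host names, family labels, and the llm_fallback stub are all hypothetical placeholders, not the project's real adapters.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

Extractor = Callable[[str], dict]

# Hypothetical registries; the keys are placeholders, not real sources.
TIER_A: dict[str, Extractor] = {
    "agency-a.example": lambda html: {"tier": "A"},
}
TIER_B: dict[str, Extractor] = {
    "municipal-board-minutes": lambda html: {"tier": "B"},
}


def llm_fallback(html: str) -> dict:
    # Stand-in for the Tier C LLM call; no real API is invoked here.
    return {"tier": "C"}


def pick_extractor(url: str, family: Optional[str]) -> Extractor:
    """Return the most specific extractor available for a capture."""
    host = urlparse(url).hostname or ""
    if host in TIER_A:          # hand-tuned adapter for this source
        return TIER_A[host]
    if family in TIER_B:        # template for this document family
        return TIER_B[family]
    return llm_fallback         # nothing more specific applies
```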
Purge detection
The Ledger runs four detector families that watch for record removal:
- HTTP 404/410/451 — a URL that previously returned 200 now returns 404 (Not Found), 410 (Gone), or 451 (Unavailable For Legal Reasons). Confirmed after three consecutive failures over at least seven days.
- Content removal — a URL still returns 200 but the captured content has dropped by more than 30% between successive captures.
- Redirect-to-generic — a URL that used to serve its own content now 302s to one of the agency's top-20 landing pages (home, news, 404 shell). Flagged on first sighting since redirects-to-home are a decisive signal.
- Index delisting — a URL that was referenced on a same-agency roster page (one with at least five in-archive links) no longer appears there, while the roster itself still serves content. The highest-signal detector since it represents a targeted editorial action rather than a broken link.
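The four rules above can be expressed as small predicates. This is a sketch under stated assumptions rather than the Ledger's implementation: all function and parameter names are hypothetical, "content dropped" is measured here as a size in arbitrary units (the source does not say which measure is used), and the landing-page and roster sets are assumed to be supplied by the caller.

```python
from datetime import datetime, timedelta

GONE = {404, 410, 451}


def confirmed_gone(obs: list[tuple[datetime, int]]) -> bool:
    """obs is a chronological list of (fetched_at, status).

    True once the URL has returned 200 at some point and its trailing run
    of GONE statuses has at least 3 entries spanning at least 7 days.
    """
    run: list[datetime] = []
    for when, status in reversed(obs):
        if status not in GONE:
            break
        run.append(when)
    seen_ok = any(status == 200 for _, status in obs)
    return seen_ok and len(run) >= 3 and run[0] - run[-1] >= timedelta(days=7)


def content_dropped(prev_size: int, cur_size: int, threshold: float = 0.30) -> bool:
    """True if content shrank by more than the threshold between captures."""
    if prev_size <= 0:
        return False
    return (prev_size - cur_size) / prev_size > threshold


def redirects_to_generic(status: int, location: str, landing_pages: set[str]) -> bool:
    """True if a 302 points at one of the agency's known landing pages."""
    return status == 302 and location in landing_pages


def delisted(url: str, roster_before: set[str], roster_after: set[str],
             roster_status: int, archived: set[str]) -> bool:
    """True if url vanished from a qualifying roster that still serves content."""
    qualifies = len(roster_before & archived) >= 5  # at least five in-archive links
    return (qualifies and roster_status == 200
            and url in roster_before and url not in roster_after)
```

Note the asymmetry in confirmation: only the first predicate waits for repeated observations, while a redirect to a landing page is flagged on a single sighting.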
Transparency grading
Each agency's grade is computed daily from extracted incidents and detected purges. The rubric is explicit:
- F — any confirmed purges, OR publication-ban rate ≥ 50%, OR no incidents at all (likely nothing is published).
- D — publication-ban rate ≥ 25%.
- C — publication-ban rate ≥ 10%, OR finding rate is zero.
- B — finding rate ≥ 5%.
- A — low ban rate, some findings, no purges.
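Read top to bottom, the rubric behaves like a first-match cascade. The sketch below makes two assumptions the text does not state outright: rules are evaluated in the listed order with the first match winning, and both rates are fractions of the agency's extracted incidents. The function and parameter names are hypothetical.

```python
def grade(incidents: int, confirmed_purges: int, bans: int, findings: int) -> str:
    """Return a letter grade, assuming first-match-wins evaluation from F to A."""
    if incidents == 0 or confirmed_purges > 0:
        return "F"
    ban_rate = bans / incidents
    finding_rate = findings / incidents
    if ban_rate >= 0.50:
        return "F"
    if ban_rate >= 0.25:
        return "D"
    if ban_rate >= 0.10 or findings == 0:
        return "C"
    if finding_rate >= 0.05:
        return "B"
    # Fall-through: ban rate below 10%, at least one finding, no purges.
    return "A"
```

For example, under these assumptions an agency with 100 incidents, no purges, 30 publication bans, and 10 findings would receive a D, because the ban-rate rule matches before the finding-rate rules are reached.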
Anonymization
The public product never publishes officer names. Every incident carries an HMAC-SHA256 token keyed with a salt held outside the application database, so a compromised app DB alone does not expose the name→token linkage; rotating the salt re-tokenizes the corpus. Dates are published to quarter-level precision only; records wait at least 60 days from disposition before publication; cells with fewer than five similar records are suppressed or aggregated upward. Agencies with fewer than 50 sworn members publish only yearly aggregates. Publication bans are honored unconditionally.
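The tokenization, date coarsening, publication delay, and small-cell suppression described above can be sketched with the Python standard library. This is an illustration under assumptions, not the production pipeline: the function names are hypothetical, the name normalization (lowercasing and collapsing whitespace) is an assumption, and suppression is shown as replacing a count with None rather than aggregating upward.

```python
import hashlib
import hmac
from datetime import date, timedelta
from typing import Optional


def officer_token(name: str, salt: bytes) -> str:
    """HMAC-SHA256 token for a name; the salt must live outside the app DB."""
    normalized = " ".join(name.lower().split())  # assumed normalization
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def to_quarter(d: date) -> str:
    """Coarsen a date to quarter-level precision, e.g. 2024-Q2."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def publishable(disposition: date, today: date, hold_days: int = 60) -> bool:
    """True once at least hold_days have passed since disposition."""
    return today - disposition >= timedelta(days=hold_days)


def suppress_small_cells(counts: dict[str, int], k: int = 5) -> dict[str, Optional[int]]:
    """Replace counts below k with None so small cells are not published."""
    return {cell: (n if n >= k else None) for cell, n in counts.items()}
```

Because the token depends on the salt, calling officer_token with a new salt yields a different token for the same name, which is what makes rotation re-tokenize the corpus.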