Methodology
Capture
Every agency in the archive is crawled on a fixed cadence (weekly for priority-tier 1, biweekly for 2, monthly for 3). Each captured URL is rendered in a headless Chromium browser, hashed (SHA-256), stored to MinIO, submitted to the Wayback Machine and archive.today, and pinned to IPFS. Sub-resources are serialized to WARC. Capture is append-only — a re-fetch produces a new row, not a mutation.
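The cadence and the append-only rule can be illustrated with a small sketch. This is not the archive's actual code: the `CaptureRow` type, the `is_due` and `record_capture` helpers, and the in-memory list standing in for the capture table are all hypothetical, and "monthly" is approximated as 30 days.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta

# Crawl interval per priority tier, as described above.
# Assumption: "monthly" is treated as 30 days.
CADENCE = {1: timedelta(days=7), 2: timedelta(days=14), 3: timedelta(days=30)}


@dataclass(frozen=True)
class CaptureRow:
    """One immutable capture record (hypothetical schema)."""
    url: str
    fetched_at: datetime
    sha256: str


def is_due(tier: int, last_capture: datetime, now: datetime) -> bool:
    """True when the tier's interval has elapsed since the last capture."""
    return now - last_capture >= CADENCE[tier]


def record_capture(log: list, url: str, body: bytes, now: datetime) -> CaptureRow:
    """Append a new row for every fetch; existing rows are never modified."""
    row = CaptureRow(url, now, hashlib.sha256(body).hexdigest())
    log.append(row)
    return row
```

Because rows are frozen and only ever appended, two fetches of identical content yield two rows with the same hash, which is what lets later detectors compare successive captures.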
Extraction
Captures pass through three tiers of extractors. Tier A adapters are hand-tuned per source (SIU Ontario director's reports, BEI Québec investigation detail pages, IIO BC Chief Civilian Director decisions, SIRT-NL director's reports, CanLII OCPC decisions). Tier B templates cover document families across municipalities. Tier C is an LLM fallback (Anthropic Claude) that handles every captured document for which no hand-tuned adapter exists.
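A minimal sketch of tiered dispatch follows. It assumes the tiers are tried in order (A, then B, then C) and that adapters are looked up by a source key and templates by a document-family key; those keys, the `extract` function, and the callable standing in for the LLM fallback are hypothetical and do not reflect any real API.

```python
from typing import Callable, Dict, Optional, Tuple

# An extractor takes document text and returns structured fields.
Extractor = Callable[[str], dict]


def extract(
    source: str,
    family: Optional[str],
    text: str,
    tier_a: Dict[str, Extractor],
    tier_b: Dict[str, Extractor],
    tier_c: Extractor,
) -> Tuple[str, dict]:
    """Return (tier_used, fields), preferring the most specific extractor."""
    if source in tier_a:
        # Tier A: hand-tuned adapter for this exact source.
        return "A", tier_a[source](text)
    if family is not None and family in tier_b:
        # Tier B: shared template for a document family.
        return "B", tier_b[family](text)
    # Tier C: generic fallback (a stub here, not a real LLM call).
    return "C", tier_c(text)
```

Returning the tier alongside the fields keeps provenance visible, so downstream consumers can weigh a fallback extraction differently from a hand-tuned one.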
Purge detection
Four detector families watch for record removal. The Ledger runs all four:
- HTTP 404/410/451 — a URL that previously returned 200 now returns 404 (not found), 410 (gone), or 451 (unavailable for legal reasons). Confirmed after three consecutive failures over at least seven days.
- Content removal — a URL still returns 200 but the captured content has dropped by more than 30% between successive captures.
- Redirect-to-generic — a URL that used to serve its own content now 302s to one of the agency's top-20 landing pages (home, news, 404 shell). Flagged on first sighting since redirects-to-home are a decisive signal.
- Index delisting — a URL that was referenced on a same-agency roster page (one with at least five in-archive links) no longer appears there, while the roster itself still serves content. The highest-signal detector since it represents a targeted editorial action rather than a broken link.
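The four rules above can be sketched as small predicates. This is an illustration under stated assumptions, not the Ledger's implementation: the `Probe` type and all function names are hypothetical, "content" is measured as a plain length, the seven-day window is measured from the first to the last failure in the run, and "the roster still serves content" is approximated as the roster still listing at least one link.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

GONE_STATUSES = {404, 410, 451}


@dataclass(frozen=True)
class Probe:
    at: datetime
    status: int


def confirmed_gone(probes: list) -> bool:
    """Trailing run of >= 3 gone-type responses spanning >= 7 days,
    on a URL that returned 200 at some earlier point."""
    seen_ok = False
    streak = []
    for p in sorted(probes, key=lambda p: p.at):
        if p.status in GONE_STATUSES:
            if seen_ok:
                streak.append(p)
        else:
            streak = []  # any other response breaks the run
            seen_ok = seen_ok or p.status == 200
    return len(streak) >= 3 and streak[-1].at - streak[0].at >= timedelta(days=7)


def content_dropped(prev_len: int, curr_len: int, threshold: float = 0.30) -> bool:
    """Content shrank by more than `threshold` between successive captures.
    Assumption: size is measured as a length (e.g. extracted text)."""
    if prev_len <= 0:
        return False
    return (prev_len - curr_len) / prev_len > threshold


def redirect_to_generic(status: int, location: str, landing_pages: set) -> bool:
    """A 302 whose target is one of the agency's known landing pages.
    Flagged on first sighting, so no history is needed."""
    return status == 302 and location in landing_pages


def delisted(prev_links: set, curr_links: set, archived: set, min_links: int = 5) -> set:
    """Archived URLs that vanished from a roster page that is still up."""
    tracked = prev_links & archived
    if len(tracked) < min_links or not curr_links:
        return set()  # not a roster page, or the roster itself is gone
    return tracked - curr_links
```

Keeping each detector a pure function of capture history makes the rules easy to replay against the append-only capture log when a threshold changes.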
Transparency grading
Each agency's grade is computed daily from extracted incidents and detected purges. Agencies with fewer than 10 captured incidents are shown as "—" — insufficient data to grade fairly. A dash is not a passing grade; it means the archive hasn't seen enough of the agency's output yet. The lettered rubric is explicit:
- F — a confirmed purge event (the agency removed, redirected, or silently edited a public record), OR publication-ban rate ≥ 50% across at least 10 incidents.
- D — publication-ban rate ≥ 25% across at least 10 incidents.
- C — publication-ban rate ≥ 10% across at least 10 incidents, or finding rate at exactly zero across at least 10 incidents.
- B — finding rate ≥ 5% across at least 10 incidents.
- A — default once the archive clears the 10-incident floor and none of the penalties above apply.
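As a sketch, the rubric can be read as a cascade that checks the harshest grade first. The thresholds below are taken verbatim from the list above; the function name and parameters are hypothetical, "finding rate" is assumed to be findings divided by incidents, and the 10-incident floor is assumed to apply before every other rule, including the purge rule. `None` stands for the ungraded dash.

```python
from typing import Optional

MIN_INCIDENTS = 10  # below this floor the agency is not graded


def grade(incidents: int, bans: int, findings: int, confirmed_purge: bool) -> Optional[str]:
    """Return a letter grade, or None when there is insufficient data."""
    if incidents < MIN_INCIDENTS:
        # Assumption: the floor takes precedence over all penalties.
        return None
    ban_rate = bans / incidents
    finding_rate = findings / incidents
    if confirmed_purge or ban_rate >= 0.50:
        return "F"
    if ban_rate >= 0.25:
        return "D"
    if ban_rate >= 0.10 or findings == 0:
        return "C"
    if finding_rate >= 0.05:
        return "B"
    return "A"
```

For example, an agency with 20 incidents, 3 publication bans, and 1 finding has a ban rate of 15%, which falls in the C band before the finding-rate rule is consulted.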
Anonymization
The public product never publishes officer names. Every incident carries an HMAC-SHA256 token keyed with a salt held outside the application database, so a compromised app DB does not expose the name→token linkage; rotating the salt re-tokenizes the corpus. Dates are published to quarter-level precision only; records wait at least 60 days from disposition before publication; cells with fewer than five similar records are suppressed or aggregated upward. For agencies with fewer than 50 sworn members, only yearly aggregates are published. Publication bans are honored unconditionally.
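The tokenization and coarsening steps might look like the sketch below, using Python's standard `hmac` and `hashlib` modules. All function names are hypothetical, the name normalization (whitespace collapsing and case folding) is an assumption, and small cells are simply suppressed here rather than aggregated upward.

```python
import hashlib
import hmac
from datetime import date


def officer_token(name: str, salt: bytes) -> str:
    """HMAC-SHA256 of a normalized name, keyed with a secret salt.
    Assumption: names are whitespace-collapsed and case-folded first."""
    normalized = " ".join(name.split()).casefold().encode("utf-8")
    return hmac.new(salt, normalized, hashlib.sha256).hexdigest()


def to_quarter(d: date) -> str:
    """Coarsen a date to quarter-level precision, e.g. 2024-Q2."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def publishable(disposition: date, today: date, min_days: int = 60) -> bool:
    """True once the minimum wait since disposition has elapsed."""
    return (today - disposition).days >= min_days


def suppress_small_cells(counts: dict, k: int = 5) -> dict:
    """Replace counts below k with None so small cells are not exposed."""
    return {key: (n if n >= k else None) for key, n in counts.items()}
```

Since the salt is the HMAC key, the same name maps to a different token under a different salt, which is why rotating it re-tokenizes every record.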