A Canadian police-accountability archive. Private, names-retaining capture — public, anonymized surface — and a purge-detector watching every source.
Live · Bilingual · Federated · CC-BY 4.0 data
A production system that mirrors every public disciplinary record, tribunal decision, and oversight-body report from Canadian police — then detects when those records are removed, altered, or silently edited out of their sources.
Two products, one codebase. A private archive that retains names, full source documents, and long-term forensic provenance (capture → WARC → Wayback → archive.today → IPFS, SHA-256 hashed every step). A public surface that publishes only anonymized, aggregated views — HMAC officer tokens, quarterly date precision, k-anonymity floor of 5, 60-day publication lag, publication bans honored unconditionally.
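As a rough illustration of the public-surface rules above, the sketch below derives an HMAC officer token, coarsens a date to quarterly precision, and suppresses aggregate cells below the k=5 floor. It is not the project's actual implementation: the function names, the 16-character token truncation, and the (agency, quarter) cell shape are all assumptions made for this example.

```python
import hashlib
import hmac
from collections import Counter
from datetime import date

K_FLOOR = 5  # k-anonymity floor stated in the brochure


def officer_token(secret: bytes, officer_id: str) -> str:
    """Derive a stable pseudonym for an officer without exposing the identifier."""
    digest = hmac.new(secret, officer_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncation length is an illustrative choice


def to_quarter(d: date) -> str:
    """Coarsen an exact date to quarterly precision, e.g. '2025-Q3'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def suppress_small_cells(rows: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Count rows per (agency, quarter) cell and drop cells below the k floor."""
    counts = Counter(rows)
    return {cell: n for cell, n in counts.items() if n >= K_FLOOR}
```

For instance, `to_quarter(date(2025, 8, 14))` returns `"2025-Q3"`, and a cell holding only four incidents is omitted from the aggregated view. Because the token is keyed, the same officer maps to the same token across records, while the mapping cannot be recomputed without the secret.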
Canadian police-accountability data is fragmented across dozens of oversight bodies — each with its own publishing cadence, format, and willingness to keep older records online. Historical records are especially vulnerable: when an oversight body is dissolved or reorganized, its archives often move behind a new agency's website, and some quietly disappear. No single Canadian resource currently does all four of:
For a newsroom covering policing, The Ledger collapses 30+ FOI requests a year into a standing source. For a national desk, it's a federal-to-municipal query surface no individual reporter can assemble.
| Layer | Status | Notes |
|---|---|---|
| Capture infrastructure | Live | Playwright + WARC + Wayback + archive.today + IPFS + SHA-256 ledger |
| Tier-A extractors (hand-tuned) | 7 of 15 | SIU (ON), BEI (QC), IIO (BC), SIRT-NL, 3× CanLII tribunals |
| Tier-C LLM fallback | Live | Anthropic Claude + §9.3 review-gated redaction |
| Purge detection | 4 of 4 | HTTP 4xx, content removal, redirect-to-generic, index delisting |
| Anonymization | Live | HMAC officer tokens, k-anon floor (k=5), 60-day publication lag |
| Public frontend | Live | 14 routes, EN/FR parity (drift-guarded), RSS, JSON-LD federation |
| Admin review UI | Live | Officer merges, redaction templates, corrections triage |
| Bulk export | Live | Daily JSONL git commit, weekly Parquet, stable JSON Schema |
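
The four purge signals in the table (HTTP 4xx, content removal, redirect-to-generic, index delisting) can be pictured as a comparison between an earlier capture and a fresh one. The sketch below is a simplified guess at that logic, not the production detector: the `Observation` fields, the `GENERIC_PATHS` set, and the paragraph-matching heuristic for content removal are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from urllib.parse import urlparse


class PurgeSignal(Enum):
    HTTP_4XX = "http_4xx"
    CONTENT_REMOVAL = "content_removal"
    REDIRECT_TO_GENERIC = "redirect_to_generic"
    INDEX_DELISTING = "index_delisting"


# Hypothetical landing paths treated as "generic" redirect targets.
GENERIC_PATHS = {"", "/home", "/index.html"}


@dataclass
class Observation:
    requested_url: str
    final_url: str          # URL after following redirects
    status: int             # final HTTP status code
    text: str               # extracted page text
    listed_in_index: bool   # still linked from the source's listing page


def classify(prev: Observation, curr: Observation) -> Optional[PurgeSignal]:
    """Return the first purge signal that applies, or None if nothing changed."""
    if 400 <= curr.status < 500:
        return PurgeSignal.HTTP_4XX

    requested = urlparse(curr.requested_url).path.rstrip("/")
    final = urlparse(curr.final_url).path.rstrip("/")
    if final != requested and final in GENERIC_PATHS and requested not in GENERIC_PATHS:
        return PurgeSignal.REDIRECT_TO_GENERIC

    # Crude removal heuristic: a paragraph captured earlier is no longer present.
    prev_paragraphs = [p.strip() for p in prev.text.split("\n\n") if p.strip()]
    if any(p not in curr.text for p in prev_paragraphs):
        return PurgeSignal.CONTENT_REMOVAL

    if prev.listed_in_index and not curr.listed_in_index:
        return PurgeSignal.INDEX_DELISTING
    return None
```

A real detector would likely need fuzzier text comparison and per-source tuning; exact paragraph matching is used here only to keep the example short.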
The public archive at policedata.ca is open to everyone, forever, under CC-BY 4.0. Daily JSONL + weekly Parquet bulk exports are committed to a public git repo. Anyone can fork the full history. We can't and won't paywall accountability data — the project's PIPEDA journalism posture and its federation partnerships both depend on it.
What is for sale is the software, the editorial partnership, and engineering time. Tiers below sit on top of the free public surface — they don't gate it.
All figures CAD. Tax extra. Terms negotiable for Canadian journalism nonprofits.
policedata.ca) and the running Hetzner deployment.
healthdata.ca, prisondata.ca, a provincial cut).
policedata.ca brand + the running dataset (both stay open).
The capture side relies on PIPEDA's journalism exemption (s.4(2)(c)). Captured personal data never leaves the private archive, never exits the Canadian datacentre, and never reaches a public surface un-anonymized. Publication bans are honored unconditionally at every downstream stage. A right-of-reply workflow is live at /corrections: anyone named or cited can file a correction, and accepted corrections are logged publicly with editor notes.
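One way to read "honored unconditionally" is as a gate that every record must pass before it can reach the public surface. The sketch below combines the publication-ban check with the anonymization requirement and the 60-day publication lag stated earlier. The `Incident` fields are hypothetical, and measuring the lag from the capture date is an assumption; the brochure does not say which date the lag counts from.

```python
from dataclasses import dataclass
from datetime import date, timedelta

PUBLICATION_LAG = timedelta(days=60)  # lag stated in the brochure


@dataclass
class Incident:
    captured_on: date            # assumption: lag is counted from capture
    under_publication_ban: bool
    anonymized: bool


def may_publish(incident: Incident, today: date) -> bool:
    """Return True only if no ban applies, the record is anonymized, and the lag has elapsed."""
    if incident.under_publication_ban:
        return False  # checked first: nothing else can override a ban
    if not incident.anonymized:
        return False
    return today - incident.captured_on >= PUBLICATION_LAG
```

Putting the ban check first means no later condition can override it, which is the property the paragraph above describes.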
Publisher incorporation jurisdiction is open; recommended ON / BC / QC (all three carry anti-SLAPP statutes).
One Hetzner box (~$60 CAD/month, room to grow). Postgres 16 + MinIO + Meilisearch + Apache 2 on loopback behind TLS. Daily cron drives the capture → extract → score → publish → export chain. Weekly tier-1 crawl (15 agencies), biweekly tier-2 (30), monthly tier-3 (long tail). Editor time is the real recurring cost once volume grows.
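The tiered crawl cadence could be driven from the daily cron with a simple due-date check like the one sketched here. This is an illustration only: the `CADENCE_DAYS` mapping, the `is_due` helper, and the approximation of "monthly" as 30 days are assumptions, not details from the deployment.

```python
from datetime import date, timedelta
from typing import Optional

# Tier -> days between crawls (weekly, biweekly, monthly approximated as 30 days).
CADENCE_DAYS = {1: 7, 2: 14, 3: 30}


def is_due(tier: int, last_crawled: Optional[date], today: date) -> bool:
    """Decide whether an agency in the given tier should be crawled today."""
    if last_crawled is None:
        return True  # never crawled: always due
    return today - last_crawled >= timedelta(days=CADENCE_DAYS[tier])
```

A daily job would call `is_due` for each seeded agency and enqueue only those that return `True`, so a missed run is picked up on the following day.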
30-minute intro call to scope the right tier for your newsroom. Bring your editorial lead, a data-desk contact, and one piece of unfinished FOI reporting — we'll show the same question answered against the live archive.
Contact us via policedata.ca/about or the corrections intake at /corrections.
Public data is released under Creative Commons BY 4.0. Private archive access is governed by the PIPEDA journalism exemption; names never surface on public URLs. Figures are current as of the first production deploy in April 2026: 75 agencies seeded, 804 raw captures, 30 published incidents, 740 tests green.
Brochure version 1.0 · docs/BROCHURE.md in the source repo holds the plain-text equivalent.