The brief
A fintech research team was spending most of every analyst's week manually visiting 40+ data sources, copy-pasting into spreadsheets, and chasing inconsistent schemas. They wanted scale without losing the analyst's judgment loop — the LLM was meant to do the boring extraction, not replace the human review at the edges.
What we built
- Resilient scrapers — per-source extractors with retry/backoff, rotating user agents, structured-error reporting, and a kill-switch if a source's layout changes more than a configured threshold (a sketch of this logic follows the list).
- LLM normalization layer — pulls raw HTML/PDF/CSV into a single canonical schema; flags low-confidence rows for review rather than silently dropping them.
- Review queue UI — one analyst sweeps the day's flagged rows in ~30 minutes instead of three people each doing a full day.
- Audit trail — every cell's provenance traceable back to the source fetch + LLM call that produced it, so disputed numbers can be reproduced.
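To give a flavour of the per-source extractor shape, here is a minimal sketch of the retry/backoff and kill-switch behaviour described above. It assumes Python with the requests library; the user-agent pool, the marker-based drift check, and names like `LayoutDriftError` are illustrative assumptions, not the production code.

```python
import random
import time

import requests

# Illustrative user-agent pool to rotate through (assumption: the real list is larger).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0)",
]


class LayoutDriftError(Exception):
    """Raised when too few expected page markers are found: the per-source kill-switch."""


def fetch_with_retries(url: str, max_attempts: int = 4, base_delay: float = 1.0) -> str:
    """Fetch a page with exponential backoff and a rotating user agent."""
    last_error: Exception | None = None
    for attempt in range(max_attempts):
        try:
            resp = requests.get(
                url,
                headers={"User-Agent": random.choice(USER_AGENTS)},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            last_error = exc
            # Structured error report for monitoring (the exact shape is an assumption).
            print({"source": url, "attempt": attempt + 1, "error": repr(exc)})
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"giving up on {url}") from last_error


def check_layout_drift(html: str, expected_markers: list[str], threshold: float = 0.6) -> None:
    """Kill-switch: abort if the page no longer contains enough of the markers we rely on."""
    found = sum(1 for marker in expected_markers if marker in html)
    if expected_markers and found / len(expected_markers) < threshold:
        raise LayoutDriftError(
            f"only {found}/{len(expected_markers)} expected markers present; layout likely changed"
        )
```

In this framing, a drifted layout fails loudly rather than producing plausible-looking garbage for the normalization layer to ingest.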
The hard parts
The scrapers themselves were straightforward; the harder problem was the reviewability of the LLM step. An invisible normalization that silently mis-classifies a row would corrupt the downstream analysis with no way to detect it. Building the confidence scoring and review queue UI was where most of the project's design judgment went.
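As an illustration of how flagged rows reach the queue, a minimal sketch of the routing step. The field names, `REVIEW_THRESHOLD`, and the single scalar score are assumptions; the real scoring combines several signals, but the key property is the same: low-confidence rows are routed to review, never dropped.

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.85  # assumption: the real cutoff is tuned per field type


@dataclass
class NormalizedRow:
    source_id: str    # which source fetch produced this row
    llm_call_id: str  # which LLM call produced the normalization (feeds the audit trail)
    fields: dict      # values mapped into the canonical schema
    confidence: float # score in [0, 1] from the normalization step
    flags: list[str] = field(default_factory=list)


def route_row(row: NormalizedRow,
              review_queue: list[NormalizedRow],
              accepted: list[NormalizedRow]) -> None:
    """Low-confidence rows go to the review queue; nothing is silently discarded."""
    if row.confidence < REVIEW_THRESHOLD:
        row.flags.append("low_confidence")
        review_queue.append(row)
    else:
        accepted.append(row)
```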
We also run an evaluation harness on every prompt change: a fixed corpus of ~500 historical rows with known-good outputs is re-extracted, and any regression beyond 2% triggers a stop-the-line review before the change ships.
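A sketch of that stop-the-line gate, assuming the corpus is held as parallel lists of golden and re-extracted rows. Whole-row equality is a simplification here; the real harness may compare field by field.

```python
def regression_rate(golden: list[dict], reextracted: list[dict]) -> float:
    """Fraction of corpus rows whose re-extracted output no longer matches the known-good row."""
    assert len(golden) == len(reextracted), "corpus and re-extraction must be aligned"
    mismatches = sum(1 for g, r in zip(golden, reextracted) if g != r)
    return mismatches / len(golden)


def gate_prompt_change(golden: list[dict], reextracted: list[dict], budget: float = 0.02) -> None:
    """Stop-the-line: block a prompt edit that regresses more than the budget (2%)."""
    rate = regression_rate(golden, reextracted)
    if rate > budget:
        raise SystemExit(f"regression {rate:.1%} exceeds {budget:.0%} budget; prompt change blocked")
    print(f"regression {rate:.1%} within budget; ok to ship")
```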
Outcome
Analyst time on data collection dropped from ~90 hours/week (3 FTE) to roughly 5 hours/week (one reviewer). The recovered capacity went straight into the analysis work that's actually the team's value-add. Source coverage has grown from 40 to 60+ since launch with no headcount increase.