Audit log ALL the things?

My app ReadStuffLater fundamentally revolves around scraping web pages with the Microlink API. Sometimes that goes wrong: the target web page has a problem. Or Microlink does. Or the target throws up a captcha or blocks data center IPs or something.

I thought I'd done an alright job of handling all the cases that could go wrong. API errors are retried, and website errors are either retried or handled with something sensible based on common status codes.

Yet I still get links that intermittently fail to scrape for no apparent reason. There are a couple usual suspects where I thought I'd handled all the failure modes, but they keep going wrong. And random pages have problems sometimes, too.

My strategy has been “notice failed scrapes, watch the logs while reexecuting the scrape, then fix whatever I see”. This has problems:

  1. Detecting a legitimate failure is hard. Sometimes a scrape unrecoverably fails for reasons outside my control. False negatives are real.
  2. This doesn't let me diagnose transient failures, where everything is working again by the time I manually verify the problem.
  3. It's a pain since I don't have good tools to peek into the scrape process. I'm always setting up something ad-hoc like console.log in local dev.

I think the right call here is an audit log. Every scrape would get its raw result stored in the DB. Status codes, body, Microlink metadata. Everything.
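Here's a rough sketch of what one audit record and the "store everything" step might look like in TypeScript. All the names (`ScrapeAuditRecord`, `recordScrape`, the fields) are made up for illustration, and the in-memory class is only a stand-in for a real DB table so the example runs on its own. It is not how ReadStuffLater or Microlink actually structure anything.

```typescript
// Hypothetical shape for one audit record; every field name is an assumption.
interface ScrapeAuditRecord {
  linkId: string;
  url: string;
  attemptedAt: string; // ISO 8601 timestamp
  httpStatus: number | null; // null when no response arrived at all
  responseBody: string | null;
  metadata: Record<string, unknown>; // whatever the scraping API returned alongside the body
  error: string | null;
}

// Raw outcome of one scrape attempt, as the calling code saw it.
interface RawScrapeOutcome {
  httpStatus?: number;
  body?: string;
  metadata?: Record<string, unknown>;
  error?: unknown;
}

// Stand-in for a DB table. A real version would INSERT a row instead.
class InMemoryAuditLog {
  private records: ScrapeAuditRecord[] = [];

  append(record: ScrapeAuditRecord): void {
    this.records.push(record);
  }

  forLink(linkId: string): ScrapeAuditRecord[] {
    return this.records.filter((r) => r.linkId === linkId);
  }
}

// Store the raw outcome of every attempt, success or failure, without
// interpreting it. Interpretation happens later, when reading the log.
function recordScrape(
  log: InMemoryAuditLog,
  linkId: string,
  url: string,
  outcome: RawScrapeOutcome,
  now: Date = new Date(),
): ScrapeAuditRecord {
  const record: ScrapeAuditRecord = {
    linkId,
    url,
    attemptedAt: now.toISOString(),
    httpStatus: outcome.httpStatus ?? null,
    responseBody: outcome.body ?? null,
    metadata: outcome.metadata ?? {},
    error: outcome.error === undefined ? null : String(outcome.error),
  };
  log.append(record);
  return record;
}
```

The point of recording failures and successes alike is that a transient failure leaves a trace even if the next attempt works fine.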

I guess I'd need a way to look up the audit records for a given link. A CLI script on the prod box would probably be okay. Maybe reformat the data in a way that's convenient to munge with jq.
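One jq-friendly format is JSON Lines: one compact JSON object per line. Below is a minimal sketch of the formatting half of such a CLI script, assuming the audit rows have already been loaded from the DB. The `AuditRow` shape and the `toJsonLines` name are hypothetical.

```typescript
// Hypothetical, trimmed-down audit row; a real one would carry more fields.
interface AuditRow {
  linkId: string;
  attemptedAt: string; // ISO 8601 timestamp in UTC
  httpStatus: number | null;
  error: string | null;
}

// Emit the rows for one link as JSON Lines, oldest attempt first.
// ISO 8601 UTC timestamps in the same format sort correctly as plain strings.
function toJsonLines(rows: AuditRow[], linkId: string): string {
  return rows
    .filter((r) => r.linkId === linkId) // filter() copies, so the input is not mutated by sort()
    .sort((a, b) =>
      a.attemptedAt < b.attemptedAt ? -1 : a.attemptedAt > b.attemptedAt ? 1 : 0,
    )
    .map((r) => JSON.stringify(r))
    .join("\n");
}
```

Printing that string to stdout would let something like `jq -c 'select(.error != null)'` pull out only the failed attempts.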

The interesting question is: at what point should I have recognized that I needed an audit log?

Should I be logging ALL external API calls? What internal operations should I log? Do I just wait until I have problems and then start logging? I'm not sure.

The logical extreme of this is “just adopt event sourcing”. And yes, that would solve this problem. But what other problems would it cause? Maybe I should just adopt it piecemeal, and not for the whole system? But then I'm paying the implementation complexity cost for minimal benefit.

Idunno. All I know is right now I sure need an audit log for this one piece of the system 🙂