Project honeypots.work: kick-off

As part of my day job, I look after one of the top 5,000 websites in the world. We serve hundreds of terabytes of content monthly - and unsurprisingly, that traffic attracts plenty of bad actors, scrapers, and bots. They make engineers' lives harder.
There are commercial solutions for blocking bots and scrapers. But I want to build my own - specifically, a system that ranks inbound requests by their probability of being malicious.
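To make "ranks requests by probability of being malicious" concrete, here is a minimal sketch of the idea. Everything in it is an illustrative assumption - the signal names, weights, and bias are hand-picked placeholders, not a trained model; the eventual system would learn these from data.

```typescript
// Hypothetical request signals - names and semantics are assumptions for
// illustration, not a real schema.
interface RequestSignals {
  userAgentIsHeadless: boolean; // e.g. "HeadlessChrome" in the User-Agent
  requestsPerMinute: number;    // from a per-IP sliding-window counter
  sendsAcceptLanguage: boolean; // real browsers almost always send this header
}

// Combine the signals with a logistic function so the score reads as a
// probability in (0, 1). Weights and bias are made up for the sketch.
function maliciousScore(s: RequestSignals): number {
  const z =
    (s.userAgentIsHeadless ? 2.0 : 0) +
    Math.min(s.requestsPerMinute / 60, 2.0) * 1.5 +
    (s.sendsAcceptLanguage ? 0 : 1.0) -
    2.5; // negative bias: default to "probably human"
  return 1 / (1 + Math.exp(-z));
}
```

A quiet browser session scores low; a headless client hammering the site without an Accept-Language header scores high, and anything above a chosen threshold gets flagged for closer inspection.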
Why build when you can buy? Because I learn best by building.
While everyone’s excited about LLMs (myself included), I’m increasingly curious about the fundamentals underneath: the CPU vs GPU vs TPU trade-offs, the mechanics of training and inference. Theory alone doesn’t stick for me. I need a hard problem to chase, learning the tools and techniques through iteration. I develop opinions about technology by using it, not just reading about it.
Bot detection is that problem.
The data challenge
At work, I have access to petabytes of data - access logs, tracking events. On my personal laptop? Nothing. And finding quality, recent access log datasets is surprisingly difficult. Kaggle’s offerings are tiny and dated.
So I’ll make my own data. Here’s the plan:
- I'll instrument my personal sites - https://deploy.live, https://gcpcost.com, and gcpiam.com (once I build it).
- I'll add fingerprinting to capture rich client-side signals.
- Here's where it gets meta: I'm considering honeypots.work - a site about honeypots and bot detection that might attract the very bad actors I want to study.
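As a sketch of what "rich client-side signals" could mean, here is a small collector built on real browser Navigator properties (`userAgent`, `languages`, `hardwareConcurrency`, `webdriver`). The shape of the payload and the function names are my assumptions; production fingerprinting libraries gather far more (canvas hashes, WebGL, fonts, and so on).

```typescript
// Assumed payload shape for signals posted back to the server.
interface ClientSignals {
  userAgent: string;
  languages: readonly string[];
  hardwareConcurrency: number;
  webdriver: boolean; // navigator.webdriver is true under browser automation
}

// Takes any navigator-like object (rather than the global `navigator`)
// so the logic can run and be tested outside a browser.
function collectSignals(nav: {
  userAgent: string;
  languages?: readonly string[];
  hardwareConcurrency?: number;
  webdriver?: boolean;
}): ClientSignals {
  return {
    userAgent: nav.userAgent,
    languages: nav.languages ?? [],
    hardwareConcurrency: nav.hardwareConcurrency ?? 1,
    webdriver: nav.webdriver ?? false,
  };
}

// In the browser, something like (endpoint name is an assumption):
// fetch("/collect", { method: "POST", body: JSON.stringify(collectSignals(navigator)) });
```

Even these few fields are telling: an empty `languages` list or `webdriver: true` is a strong hint that no human is behind the request.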
I’ll document the journey here, which creates a feedback loop: writing attracts traffic, traffic generates data, data enables learning, learning produces writing.
Eventually, I want to contribute these datasets back to Kaggle and explore real-time access log streaming, so others can learn from them too.
The ML journey starts now.