branber.io

MMA Almanac Scrapers

A Dockerized UFC/MMA data-collection pipeline using Playwright with Tor IP rotation and Cloudflare bypass.

Last pushed Nov 2025 · Python, Shell, Dockerfile

About this project

What it is

A Python data-collection pipeline that scrapes Sherdog fighter profiles and UFC event and fight-statistics pages. It drives Playwright browser automation with Tor IP rotation, plus a custom Cloudflare-bypass HTTP client, so that scraping stays reliable. Scraped data passes through a set of parsers and enrichers (including a fighter-stats interpolator and a name-matcher) before being seeded into PostgreSQL via Prisma. Runs are scheduled through GitHub Actions and AWS EventBridge, and the whole scraper runs inside Docker for reproducible execution.
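The page does not say how the name-matcher works. As a rough illustration of the kind of problem it solves (the same fighter spelled differently on Sherdog and on UFC pages), here is a minimal sketch that assumes accent-insensitive fuzzy matching with the standard library. The function names `normalize_name` and `match_name` and the 0.85 cutoff are hypothetical and not taken from the repo.

```python
import difflib
import unicodedata
from typing import Optional, Sequence


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "José" and "Jose" compare equal.
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces so "St-Pierre" and "St. Pierre" align.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in unaccented.lower())
    return " ".join(cleaned.split())


def match_name(
    query: str, candidates: Sequence[str], cutoff: float = 0.85
) -> Optional[str]:
    """Return the candidate closest to the query, or None if nothing is close.

    The cutoff is an assumed value; a real matcher would tune it on known pairs.
    """
    by_normalized = {normalize_name(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize_name(query), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None


if __name__ == "__main__":
    roster = ["José Aldo", "Georges St. Pierre", "Jon Jones"]
    print(match_name("Jose Aldo", roster))
    print(match_name("Georges St-Pierre", roster))
```

Returning `None` below the cutoff, rather than the best available guess, keeps a mismatched fighter from silently inheriting someone else's statistics.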

Engineering highlights

  • Playwright browser automation with human-delay simulation to avoid bot detection
  • Tor IP rotation (rotate_tor_ip) to cycle exit nodes between scraping sessions
  • Cloudflare-bypass HTTP client for sites that block headless browsers
  • Session-state save/load to resume scraping without re-authenticating
  • Prisma ORM upsert seeders keep the PostgreSQL tables in sync with scraper output
  • Dockerized execution with GitHub Actions + EventBridge scheduling
  • Explicit data-leakage test suite confirming no future statistics bleed into training features
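To make the data-leakage highlight concrete, here is a minimal sketch of what such a check could look like. It assumes a feature built as a fighter's average over fights strictly before a cutoff date, then verifies that altering later fights leaves the feature unchanged. The `Fight` record, the `prior_average` feature, and the test name are hypothetical. The repo's actual test suite is not shown on this page.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fight:
    # Hypothetical record shape, for illustration only.
    fighter: str
    fought_on: date
    strikes_landed: int


def prior_average(fights: List[Fight], fighter: str, as_of: date) -> Optional[float]:
    """Average strikes landed in fights strictly before as_of (None if no history)."""
    history = [
        f.strikes_landed
        for f in fights
        if f.fighter == fighter and f.fought_on < as_of
    ]
    return sum(history) / len(history) if history else None


def test_no_future_leakage() -> None:
    fights = [
        Fight("A", date(2023, 1, 1), 40),
        Fight("A", date(2023, 6, 1), 60),
        Fight("A", date(2024, 1, 1), 200),
    ]
    as_of = date(2023, 6, 1)
    baseline = prior_average(fights, "A", as_of)

    # Tamper with every fight on or after the cutoff. If the feature moves,
    # it was reading data that would not have been known at prediction time.
    tampered = [
        replace(f, strikes_landed=f.strikes_landed + 1000)
        if f.fought_on >= as_of
        else f
        for f in fights
    ]
    assert prior_average(tampered, "A", as_of) == baseline


if __name__ == "__main__":
    test_no_future_leakage()
    print("no leakage detected")
```

The strict `<` comparison matters: the fight being predicted shares the cutoff date, so its own statistics must be excluded along with everything later.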

Stack

Python, Playwright, Tor, Prisma, PostgreSQL, Docker, GitHub Actions, AWS EventBridge

Part of the MMA Almanac system

This repo is one service in the four-part MMA Almanac platform. The system diagram below shows how the scrapers, ML engine, web UI, and AWS infrastructure fit together.