Autonomous Vision Agent for Live Esports

Esports is a billion-dollar global industry, yet its live broadcasts suffer from a surprising problem: Data Blindness.

Imagine a chaotic 3K clutch during the VCT Grand Finals. The caster yells, the crowd roars, but what are the real-time stats? Currently, broadcasters, analysts, and fans have to wait for the official, rate-limited match-history APIs to update long after the round is over. The live data is trapped entirely inside the pixels on the screen.

As a 4th-year student diving deep into Data Engineering, I realized that analyzing clean, static CSVs doesn't prepare you for the messy reality of live data. For the Vision Possible Hackathon, I decided to build a system that extracts its own data from the wild.

I built an autonomous, talkative Vision Agent that watches live Twitch and YouTube video feeds, builds its own real-time telemetry database using Edge AI and Cloud LLMs, and answers natural language questions using voice synthesis.

Here is how I built a production-grade Vision ETL (Extract, Transform, Load) pipeline from scratch.

1. The Edge Intelligence (Image Ingestion & Gatekeeping)

A real-time broadcast cannot wait for cloud latency. I used yt-dlp to pipe live 1080p YouTube streams directly into OpenCV memory. Every frame is processed locally on Edge hardware.

To optimize performance and save compute, I built a YOLO-based Gatekeeper. The neural network doesn't process every frame; it only triggers the expensive OCR extraction steps when it actively detects weapon icons in the killfeed. This saved over 90% of local compute costs and guaranteed zero-latency tracking.

2. Agentic Routing & The OCR Fallback

Player tags in games are highly ambiguous (s0m, xXCooLeRzXx), making traditional OCR unreliable. My Vision Agent implements Agentic Routing to maximize accuracy:

Primary Extractor: Local EasyOCR makes the first pass and generates a Confidence Score.
The Gatekeeper: If confidence drops below 70%, the Agent flags the data as a hallucination.
Autonomous Escalation: The Agent converts the cropped image to Base64 and sends it to Groq’s Llama Vision model on ultra-fast LPU hardware to correct the OCR failure.

3. Fighting Compression with Dynamic Scaling

During testing, I hit a massive Computer Vision roadblock: live streams buffer. If YouTube dynamically drops a 1080p stream to 720p, traditional OpenCV template matching crashes because the pixel boundaries mismatch.

Instead of allowing the pipeline to crash, I engineered a Dynamic Resolution Scaler. The pipeline calculates the incoming frame height in real-time and mathematically scales the reference templates to match it. Whether the stream is 4K or 480p, the pipeline never drops a headshot metric.

4. Master Data Management (MDM)

AI hallucinates. When feeding compressed video to an LLM, it sometimes guesses wild, incorrect player names. To protect the database integrity, I implemented an iron-clad MDM layer. Using difflib, the pipeline mathematically normalizes the messy OCR/LLM output (lowercasing and stripping whitespace) and snaps it strictly to the active match roster before inserting it into the PostgreSQL database.

5. The Live Dashboard & Voice Narrator

Data is useless if you can’t act on it. I built a broadcast-style web dashboard using Flask and Tailwind CSS that uses AJAX auto-polling to silently fetch new data every 2 seconds.

But I wanted to go a step further and build an Enterprise-grade feature: A Voice-Activated Narrator Agent. Users can type a natural language question like, "Who is dominating the match right now?" 1. Flask intercepts the query. 2. Python runs a targeted SQL query against Supabase for the most recent match telemetry. 3. The compact JSON context is passed to Groq's Llama 3.1 8B Instant model. 4. The browser's built-in Web Speech API instantly reads the AI's analysis out loud through the user's speakers.

Conclusion

This hackathon was an intense engineering sprint. The biggest takeaway? Agentic Routing is the future of data pipelines. You do not need to send every frame to the cloud. By building an intelligent, self-aware local Agent that can check its own confidence, dynamically scale to video compression, and fuzzily match names, you can create a system that is highly accurate, blazingly fast, and incredibly cheap to run.

Vision Agents are the missing link between the noisy, unstructured physical world of live video and the structured digital world of actionable analytics.

Check out the full open-source architecture on my GitHub: github.com/ig-mik1/vision-etl-pipeline

From Pixels to PostgreSQL: Building an Autonomous Vision Agent for Live Esports

1. The Edge Intelligence (Image Ingestion & Gatekeeping)

2. Agentic Routing & The OCR Fallback

3. Fighting Compression with Dynamic Scaling

4. Master Data Management (MDM)

5. The Live Dashboard & Voice Narrator

Conclusion

Comments

Command Palette

1. The Edge Intelligence (Image Ingestion & Gatekeeping)

2. Agentic Routing & The OCR Fallback

3. Fighting Compression with Dynamic Scaling

4. Master Data Management (MDM)

5. The Live Dashboard & Voice Narrator

Conclusion

Comments