explaingit

choucisan/nba_games

13 | Audience · data | Complexity · 2/5 | Active | License | Setup · easy

TLDR

Dataset linking 189 full-length NBA game YouTube videos to official NBA.com box scores and play-by-play data. The videos are not redistributed; the dataset stores only their IDs and URLs.

Mindmap

mindmap
  root((nba_games))
    Inputs
      YouTube playlist IDs
      NBA.com game pages
      Team abbreviation maps
    Outputs
      nba_games.jsonl index
      Box-score JSONL
      Play-by-play JSONL
      Per-game metadata.json
    Use Cases
      Sports analytics training data
      Video grounded QA
      Box-score parsing demos
    Tech Stack
      JSONL
      Python
      LLM assist
      MIT

Things people build with this

USE CASE 1

Train a sports highlight model using box-score labels aligned to video timestamps

USE CASE 2

Build a basketball stats lookup tool driven by the per-game JSONL files

USE CASE 3

Run analytics on 81k play-by-play rows across 166 games

Tech stack

JSONL · Python · YouTube · NBA.com

Getting it running

Difficulty · easy | Time to first run · 30 min

Video files are not included, so reproducing experiments requires downloading about 347 hours of YouTube footage yourself.
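Before committing to a download of that size, the index can be used to estimate how much footage is involved. The following is a minimal sketch, not the repository's own tooling. It assumes each record in nba_games.jsonl stores its length in seconds under a key named `duration`; the source only says the index has a duration-in-seconds field, so the key name is a guess and may need adjusting.

```python
import json
from io import StringIO


def total_hours(lines, duration_key="duration"):
    """Sum per-game durations (in seconds) from JSONL lines and return hours.

    `duration_key` is a hypothetical field name; check the real index file.
    """
    total_seconds = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        total_seconds += record.get(duration_key) or 0
    return total_seconds / 3600


# Two made-up records standing in for lines of nba_games.jsonl.
sample = StringIO('{"id": "a", "duration": 7200}\n{"id": "b", "duration": 5400}\n')
print(total_hours(sample))
```

On the real file, passing `open("nba_games.jsonl", encoding="utf-8")` in place of `sample` should give a figure close to the 347 hours quoted above, provided the assumed key name matches.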

MIT license: the dataset metadata is free to use, modify, and redistribute with attribution.

In plain English

This repository publishes a dataset that links full-length NBA game videos on YouTube to the matching official statistics from NBA.com. The videos themselves are not redistributed. Instead, the dataset stores YouTube video IDs and URLs, leaving it to each user to download the video files where their use case and local rules allow. Each entry pairs a video with a cleaned matchup string, a verified game date, an official box score, and, when available, official play-by-play event data.

The README walks through how the data was built. It starts from a well-known YouTube playlist of NBA full games and extracts 595 raw entries with IDs, URLs, titles, descriptions, durations, upload dates, and channels. Placeholder entries are dropped, leaving 217 candidates. Titles are normalized into a Team A vs. Team B format and dates are parsed from titles or descriptions, with LLM-assisted filling and manual checks for missing fields. Each candidate is matched against NBA.com date pages, with care for historical team abbreviations like PHL/PHI, NJN/BKN, NOH/NOP/CHA, and SEA/OKC. Games that cannot be confidently matched are dropped, and the final release contains 189 verified games covering about 347 hours of footage.

The folder layout puts each game in a directory named YYYY-MM-DD-away-vs-home, for example 2016-02-27-gsw-vs-okc. Each game folder includes an empty video subfolder for the user to fill in, a box-score JSONL file, a play-by-play JSONL file (sometimes empty for older games), and a metadata.json describing the link between sources. A top-level nba_games.jsonl gives one record per retained YouTube game with fields for ID, URL, cleaned title, date, duration in seconds, and description.

The box-score files contain two row types: team rows and player rows. Both include linkage fields like the NBA.com game URL, YouTube video ID, game date, and game ID. Team rows add side, team ID, name, tricode, final score, period-by-period scores, and a team statistics object. Player rows add player ID, names, slug, position, jersey number, a DNP or injury comment when present, and a per-player statistics object covering minutes, shooting, rebounds, assists, steals, blocks, turnovers, points, and plus-minus.

Headline figures from the overview section are 189 verified games, a total duration close to 347 hours, 5,194 box-score rows, and 81,355 play-by-play rows across 166 games (the remaining 23 older games have empty play-by-play files). The project is released under the MIT license and links out to a Hugging Face mirror, a blog post, and a RedNote post.
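The YYYY-MM-DD-away-vs-home folder naming described above can be unpacked into a date and two team codes. The sketch below is illustrative only: the helper name `parse_game_folder` is hypothetical, and the pattern assumes team codes in folder names are plain lowercase letters, which is inferred from the single example 2016-02-27-gsw-vs-okc.

```python
import re
from datetime import date

# Assumed pattern: ISO date, then lowercase away and home codes joined by "-vs-".
FOLDER_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-([a-z]+)-vs-([a-z]+)$")


def parse_game_folder(name):
    """Split a game folder name into its date, away team, and home team."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    year, month, day, away, home = match.groups()
    return {
        "date": date(int(year), int(month), int(day)),
        "away": away.upper(),
        "home": home.upper(),
    }


print(parse_game_folder("2016-02-27-gsw-vs-okc"))
```

Raising on a non-matching name, rather than returning None, makes it easier to spot folders that do not follow the assumed convention.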

Copy-paste prompts

Prompt 1
Write a Python script that reads nba_games.jsonl and downloads each YouTube video to the matching games/<date-away-vs-home>/video/ folder using yt-dlp
Prompt 2
Parse box-score.jsonl for one game and compute team shooting percentages from the statistics object
Prompt 3
Join the play-by-play JSONL events with the player rows from box-score.jsonl on person_id to get per-event player stats
Prompt 4
Filter nba_games.jsonl to games after 2016 and produce a CSV with title, date, and duration in minutes
Prompt 5
Build a small FastAPI endpoint that returns the box score for a given YYYY-MM-DD-away-vs-home folder name
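As a rough idea of what Prompt 4 might produce, here is a minimal sketch using only the standard library. The key names `title`, `date`, and `duration` are assumptions (the source lists the index fields but not their exact keys), dates are assumed to be ISO strings, and "after 2016" is read here as a game year greater than 2016.

```python
import csv
import io
import json


def games_after_year_to_csv(jsonl_lines, out, year=2016):
    """Write title, date, and duration in minutes for games dated after `year`.

    Returns the number of games written. Field names are hypothetical.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "date", "duration_minutes"])
    kept = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        game = json.loads(line)
        if int(game["date"][:4]) <= year:
            continue  # keep only games from later years
        minutes = round(game["duration"] / 60, 1)
        writer.writerow([game["title"], game["date"], minutes])
        kept += 1
    return kept


# Made-up records standing in for lines of nba_games.jsonl.
sample = [
    '{"title": "Team A vs. Team B", "date": "2016-02-27", "duration": 7200}',
    '{"title": "Team C vs. Team D", "date": "2018-05-28", "duration": 6930}',
]
buf = io.StringIO()
games_after_year_to_csv(sample, buf)
print(buf.getvalue())
```

To write a real file, pass an open file object such as `open("games.csv", "w", newline="", encoding="utf-8")` as `out`.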

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.