Train a sports highlight model using box-score labels aligned to video timestamps
Build a basketball stats lookup tool driven by the per-game JSONL files
Run analytics on 81k play-by-play rows across 166 games
Video files are not included, so reproducing experiments requires downloading roughly 347 hours of YouTube footage yourself.
This repository publishes a dataset that links full-length NBA game videos on YouTube to the matching official statistics from NBA.com. The videos themselves are not redistributed. Instead, the dataset stores YouTube video IDs and URLs, leaving it to each user to download the video files where their use case and local rules allow. Each entry pairs a video with a cleaned matchup string, a verified game date, an official box score, and, when available, official play-by-play event data.

The README walks through how the data was built. It starts from a well-known YouTube playlist of NBA full games and extracts 595 raw entries with IDs, URLs, titles, descriptions, durations, upload dates, and channels. Placeholder entries are dropped, leaving 217 candidates. Titles are normalized into a Team A vs. Team B format and dates are parsed from titles or descriptions, with LLM-assisted filling and manual checks for missing fields. Each candidate is matched against NBA.com date pages, with care for historical team abbreviations like PHL/PHI, NJN/BKN, NOH/NOP/CHA, and SEA/OKC. Games that cannot be confidently matched are dropped, and the final release contains 189 verified games covering about 347 hours of footage.

The folder layout puts each game in a directory named YYYY-MM-DD-away-vs-home, for example 2016-02-27-gsw-vs-okc. Each game folder includes an empty video subfolder for the user to fill in, a box-score JSONL file, a play-by-play JSONL file (sometimes empty for older games), and a metadata.json describing the link between sources. A top-level nba_games.jsonl gives one record per retained YouTube game with fields for ID, URL, cleaned title, date, duration in seconds, and description.

The box-score files contain two row types: team rows and player rows. Both include linkage fields like the NBA.com game URL, YouTube video ID, game date, and game ID. Team rows add side, team ID, name, tricode, final score, period-by-period scores, and a team statistics object. Player rows add player ID, names, slug, position, jersey number, a DNP or injury comment when present, and a per-player statistics object covering minutes, shooting, rebounds, assists, steals, blocks, turnovers, points, and plus-minus.

Headline figures from the overview section are 189 verified games, a total duration close to 347 hours, 5,194 box-score rows, and 81,355 play-by-play rows across 166 games (the remaining 23 older games have empty play-by-play files).

The project is released under the MIT license and links out to a Hugging Face mirror, a blog post, and a RedNote post.
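The historical-abbreviation handling mentioned in the build description can be pictured with a small lookup. This is only a sketch: the alias groups are copied from the summary as written, while the function names and the idea of treating every code in a group as interchangeable are assumptions for illustration, not the repository's actual matching logic. In particular, the sketch ignores which seasons each code was in use, which a real matcher would likely need for a group such as NOH/NOP/CHA.

```python
# Alias groups copied from the summary. Treating each group as fully
# interchangeable is an assumption; it ignores the seasons in which each
# code was actually used.
ALIAS_GROUPS = [
    {"PHL", "PHI"},
    {"NJN", "BKN"},
    {"NOH", "NOP", "CHA"},
    {"SEA", "OKC"},
]


def candidate_tricodes(code):
    """Return every tricode listed in the same alias group as `code`."""
    code = code.strip().upper()
    for group in ALIAS_GROUPS:
        if code in group:
            return set(group)
    # Codes without a listed alias only match themselves.
    return {code}


def same_team(code_a, code_b):
    """True when two abbreviations are equal or share an alias group."""
    return bool(candidate_tricodes(code_a) & candidate_tricodes(code_b))
```

With this, `same_team("sea", "OKC")` is true while `same_team("GSW", "OKC")` is false. Any real use should be checked against the game date before a match is accepted.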
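The folder naming scheme and the two box-score row types lend themselves to a short loading sketch. The helper names below are hypothetical, and the `player_id` key used to tell player rows from team rows is an assumption, since the summary does not give exact field names; check the real files before relying on it.

```python
import json
import re
from datetime import date

# Matches folder names such as 2016-02-27-gsw-vs-okc (YYYY-MM-DD-away-vs-home).
FOLDER_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+)-vs-([a-z0-9]+)$")


def parse_game_folder(name):
    """Split a game folder name into its date, away code, and home code."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    year, month, day, away, home = match.groups()
    return {"date": date(int(year), int(month), int(day)), "away": away, "home": home}


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts; empty input gives an empty list."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_box_score(rows, player_key="player_id"):
    """Separate team rows from player rows.

    Assumption: player rows carry `player_key` and team rows do not.
    """
    teams = [row for row in rows if player_key not in row]
    players = [row for row in rows if player_key in row]
    return teams, players
```

Because `parse_jsonl` returns an empty list for empty text, the same helper covers the older games whose play-by-play files are empty. To read from disk, pass it the file contents, for example via `pathlib.Path(...).read_text(encoding="utf-8")`, using whatever file names the game folder actually contains.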
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.