explaingit

builderio/gpt-crawler

22,248 | TypeScript | Audience · developer | Complexity · 2/5 | Quiet | License | Setup · moderate

TLDR

Automatically crawl a website and convert its pages into a JSON knowledge file that you can upload to OpenAI to build a custom chatbot.

Mindmap

mindmap
  root((repo))
    What it does
      Crawl websites
      Extract text content
      Generate JSON output
    How to use
      Set start URL
      Define link patterns
      Upload to OpenAI
    Configuration
      CSS selectors
      Page limits
      File size caps
    Outputs
      Custom GPT
      Custom Assistant
      JSON knowledge base
    Tech stack
      TypeScript
      Node.js
      Docker

Things people build with this

USE CASE 1

Build a chatbot trained on your product's documentation so customers can get instant answers.

USE CASE 2

Create an AI assistant for your company's help center or internal knowledge base.

USE CASE 3

Turn any public website into a custom GPT you can share with others without manual data entry.

Tech stack

TypeScript · Node.js · Docker

Getting it running

Difficulty · moderate | Time to first run · 30 min

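Requires Node.js (version 16 or higher) to run the crawler; Docker is an optional alternative. An OpenAI account is needed to use the output (plus an API key if you take the custom assistant route). Before the first crawl, set the start URL and link pattern in the configuration.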

Use freely for any purpose including commercial. Keep the copyright notice.

In plain English

GPT Crawler is a tool that automatically visits the pages of a website and saves their text content into a single JSON file. You can upload that file to OpenAI to create a custom AI assistant that draws on the site's content. In other words, it lets you turn any documentation site or web resource into the knowledge base for your own chatbot, without any manual copy-pasting.

The way it works: you give it a starting URL and a pattern for which links to follow (for example, "start at the developer docs homepage and follow any link that matches /docs/**"). You can also specify a CSS selector, which identifies the part of the page that contains the useful text, so the crawler skips navigation menus, footers, and other noise. The crawler visits each matching page, extracts the relevant text, and saves everything to a file called output.json. Configuration options let you cap how many pages it visits, limit the output file size, and exclude certain file types from being fetched.

Once you have the output file, you upload it to OpenAI's platform to power either a "custom GPT" (a shareable chatbot you can build through OpenAI's web interface) or a "custom assistant" (an AI you can integrate into your own product via the API). The README includes a step-by-step walkthrough for both paths.

You would use this when you want an AI assistant that knows the content of a specific website (a product's documentation, a company's help center, or your own site) and can answer questions about it.

It is written in TypeScript and runs on Node.js (version 16 or higher). It can also be run inside a Docker container or started as an API server.
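The crawl settings described above (start URL, link pattern, CSS selector, page cap, output file) can be sketched in TypeScript. This is an illustrative sketch only, not the repo's code: the `CrawlConfig` interface, its field names, and the `globToRegExp` and `selectLinks` helpers are assumptions made for this example, and the example.com URLs are placeholders. It shows how a pattern such as `/docs/**` could decide which discovered links are followed, and how a page cap bounds the crawl.

```typescript
// Hypothetical config shape for illustration; the field names are assumptions,
// so check the repo's own config file for the real ones.
interface CrawlConfig {
  url: string;             // page where the crawl starts
  match: string;           // glob pattern for links to follow
  selector?: string;       // CSS selector for the content to keep
  maxPagesToCrawl: number; // upper bound on pages visited
  outputFileName: string;  // where the collected text is written
}

const exampleConfig: CrawlConfig = {
  url: "https://example.com/docs",
  match: "https://example.com/docs/**",
  selector: "main",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

// Convert a simple glob into a RegExp: "**" matches any characters,
// "*" matches any characters except "/".
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters other than "*".
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000") // park "**" behind a placeholder
    .replace(/\*/g, "[^/]*")    // single star stays within one path segment
    .replace(/\u0000/g, ".*");  // double star spans path segments
  return new RegExp(`^${pattern}$`);
}

// Keep only unique links that match the pattern, up to the page cap.
function selectLinks(links: string[], config: CrawlConfig): string[] {
  const matcher = globToRegExp(config.match);
  const unique = Array.from(new Set(links));
  return unique
    .filter((link) => matcher.test(link))
    .slice(0, config.maxPagesToCrawl);
}

// Usage: filter links discovered on a page before queueing them.
const toVisit = selectLinks(
  ["https://example.com/docs/intro", "https://example.com/blog/post"],
  exampleConfig,
);
```

The sketch covers only link filtering. Fetching pages, applying the CSS selector to extract text, and writing the JSON file would be handled by a crawling library in a real tool.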

Copy-paste prompts

Prompt 1
How do I set up gpt-crawler to crawl my documentation site and create a custom GPT from it?
Prompt 2
Show me how to configure the CSS selector in gpt-crawler to extract only the main content and skip navigation.
Prompt 3
I want to crawl my website with gpt-crawler and integrate the output into my own app using OpenAI's API. What are the steps?
Prompt 4
How do I limit the number of pages gpt-crawler visits and control the output file size?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.