explaingit

harelba/q

10,347PythonAudience · dataComplexity · 2/5Setup · easy

TLDR

A command-line tool that lets you run SQL queries directly against CSV and text files, filter, group, join, and count data without importing it into a database.

Mindmap

mindmap
  root((q))
    What it does
      SQL on CSV files
      SQL on SQLite files
      Pipe query support
    Features
      Auto schema detect
      File caching index
      Cross-file joins
    Platforms
      Windows Mac Linux
    Audience
      Data analysts
      DevOps engineers
      Power users
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

Things people build with this

USE CASE 1

Query a large CSV export from your analytics tool with SQL to filter rows and calculate totals without opening Excel.

USE CASE 2

Join two CSV files together using SQL to combine data from different exports in a single terminal command.

USE CASE 3

Pipe shell command output (like a process list) into q and query it with SQL to find patterns or group results.

USE CASE 4

Cache a large file index so repeat SQL queries on a 4 GB+ CSV run in seconds instead of minutes.

Tech stack

Python

Getting it running

Difficulty · easy Time to first run · 5min

In plain English

q is a command-line tool that lets you run SQL queries directly against CSV and other delimited text files, without importing them into a database first. SQL is the standard language used to ask questions of structured data, for example to filter rows, count things, group results, or join two datasets together. Normally you need a database program to do that. q brings those same capabilities to plain files sitting on your hard drive. You run it from a terminal by writing a SQL statement and pointing it at a file path instead of a database table name. It detects column names and data types automatically, so you do not need to define a schema. It also lets you pipe the output of other commands into it, so for instance you could take the list of running processes and query it with SQL to find out which user is running the most processes. For larger files, q supports a caching feature that pre-processes the file and stores an index. The difference in speed can be dramatic: on a 4.8 GB file with five million rows, a query that takes nearly five minutes without caching takes under two seconds with caching enabled. Caching is turned off by default because it uses extra disk space, but can be enabled with a flag. Beyond CSV files, q can also query SQLite database files directly, including joining across tables that live in different database files. SQLite is a common format for storing structured data in a single file. The project is open source and free. Installation instructions for Windows, Mac, and Linux are available on the project website. The author donates Patreon proceeds to a domestic violence prevention center in Ramla, Israel.

Copy-paste prompts

Prompt 1
I have a CSV file called sales.csv with columns date, region, and revenue. Write a q command to show total revenue grouped by region, sorted descending.
Prompt 2
Show me how to use q to join two CSV files, orders.csv and customers.csv, on a shared customer_id column and output the combined result.
Prompt 3
I want to query the output of `ps aux` using SQL with q. Write the pipeline command to find which user has the most running processes.
Prompt 4
How do I enable caching in q so that repeat queries against a large CSV file run fast, and when should I turn caching off?
Open on GitHub → Explain another repo

← harelba on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.