explaingit

apache/hive

5,965 · Java · Audience · data · Complexity · 4/5 · License · Setup · hard

TLDR

Apache Hive is a data warehouse tool that lets you query massive datasets on a Hadoop cluster using familiar SQL, without writing distributed computing code yourself.

Mindmap

mindmap
  root((repo))
    What it does
      SQL on Hadoop
      Bulk data analytics
    Capabilities
      Standard SQL queries
      Custom UDF functions
      Analytics and subqueries
    Scale
      Billions of rows
      Multi-machine clusters
    Tech stack
      Java
      Hadoop 3.x
      MapReduce or Tez
    Use cases
      Data transformation
      Large-scale reporting
      ETL pipelines

Things people build with this

USE CASE 1

Run SQL aggregations across billions of rows stored in a Hadoop cluster without writing MapReduce jobs

USE CASE 2

Write custom Java functions to extend Hive SQL for specialized data transformations

USE CASE 3

Load and transform large raw datasets into structured reports for business intelligence tools

USE CASE 4

Build ETL pipelines that read from HDFS, transform data with HiveQL, and write to downstream systems
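
As a rough illustration of use case 2, the sketch below shows the kind of transformation logic a custom function might carry. It is plain Java so that it runs without Hive on the classpath. The class name `MaskEmailSketch` and the method `maskEmail` are hypothetical, and the Hive wrapper is deliberately left out; in a real deployment this logic would sit inside a class built on Hive's UDF interfaces (for example `GenericUDF`) and be registered with Hive before use.

```java
public class MaskEmailSketch {
    // Hypothetical helper: masks the local part of an email address.
    // Returns null for null input, mirroring how SQL functions usually propagate NULL.
    public static String maskEmail(String email) {
        if (email == null) {
            return null;
        }
        int at = email.indexOf('@');
        if (at <= 0) {
            // No '@' or an empty local part: leave the value unchanged.
            return email;
        }
        // Keep the first character, hide the rest of the local part, keep the domain.
        return email.charAt(0) + "***" + email.substring(at);
    }

    public static void main(String[] args) {
        System.out.println(maskEmail("alice@example.com")); // prints a***@example.com
    }
}
```

Keeping the logic in a static method like this makes it easy to unit test before wiring it into a cluster.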

Tech stack

Java · Hadoop · SQL · MapReduce · Apache Tez

Getting it running

Difficulty · hard · Time to first run · 1 day+

Requires a running Hadoop 3.x cluster and a matching Java version. It is not practical to run locally without significant infrastructure.

Apache License 2.0: use freely for any purpose, including commercial products, as long as you include the license and copyright notice.

In plain English

Apache Hive is a data warehouse system built on top of Apache Hadoop, which is a framework for storing and processing very large amounts of data spread across many computers. Hive's main job is to let analysts and engineers query that data using SQL, the same query language used in traditional databases, without needing to write the low-level distributed computing code that Hadoop normally requires.

When you write a SQL query in Hive, the system translates it into jobs that run across a cluster of machines. This means it can handle datasets far too large to fit on a single computer. Hive supports standard SQL features including analytics functions, subqueries, and common table expressions, and it can be extended with custom functions written in Java or other languages when the built-in functions are not enough.

Hive is not a replacement for a traditional relational database for everyday transactional work such as recording individual sales or user logins. It is designed for bulk analytical tasks: reading large datasets, transforming them, loading them into reports, or running aggregations across billions of rows. The README notes it is best suited for workloads where the scale of data justifies a distributed system.

This repository is the source code for the project. Getting it running requires Hadoop 3.x and a version of Java that matches the Hive version you want to use. The project is maintained by the Apache Software Foundation under the Apache License 2.0, and community support happens through mailing lists listed in the README.
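
To make the idea of a SQL query being translated into jobs that run across a cluster more concrete, here is a minimal conceptual sketch in plain Java. It mimics how a `GROUP BY category` with `SUM(amount)` could be split into a per-partition step and a merge step. This is not Hive's planner or any Hive API; the `Sale` type, the method names, and the sample data are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupBySketch {
    // Hypothetical row type: one sale with a category and an amount in cents.
    public static final class Sale {
        final String category;
        final long amountCents;

        public Sale(String category, long amountCents) {
            this.category = category;
            this.amountCents = amountCents;
        }
    }

    // "Map"-style step: each worker sums only the rows in its own partition.
    public static Map<String, Long> partialSum(List<Sale> partition) {
        Map<String, Long> partial = new TreeMap<>();
        for (Sale s : partition) {
            partial.merge(s.category, s.amountCents, Long::sum);
        }
        return partial;
    }

    // "Reduce"-style step: partial sums for the same key are combined.
    public static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((key, value) -> total.merge(key, value, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two partitions stand in for data stored on two different machines.
        List<Sale> partitionA = Arrays.asList(new Sale("books", 1200), new Sale("toys", 500));
        List<Sale> partitionB = Arrays.asList(new Sale("books", 800), new Sale("games", 300));

        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(partialSum(partitionA));
        partials.add(partialSum(partitionB));

        System.out.println(mergePartials(partials)); // prints {books=2000, games=300, toys=500}
    }
}
```

Because each partition is summed independently, the first step can run on many machines at once, and only the small partial results need to be combined at the end.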

Copy-paste prompts

Prompt 1
I have a Hadoop 3.x cluster with Hive installed. Write a HiveQL query to calculate daily revenue by product category from a 2-billion-row transactions table using window functions.
Prompt 2
I need a custom Hive UDF in Java that parses a proprietary timestamp format from a string column. Walk me through the GenericUDF implementation and how to register it.
Prompt 3
Set up an Apache Hive metastore backed by PostgreSQL instead of Derby. Provide the hive-site.xml configuration properties I need to change.

Verify against the repo before relying on details.