explaingit

pentaho/pentaho-kettle

8,339 · Java · Audience: data · Complexity: 4/5 · Setup: hard

TLDR

Pentaho Data Integration (Kettle/PDI) is a Java ETL tool with a visual drag-and-drop designer for building data pipelines that extract, transform, and load data between databases, files, and web services.

Mindmap

mindmap
  root((repo))
    What it does
      ETL data pipelines
      Visual designer
      CLI engine
    Data Sources
      Databases
      Flat files
      Web services
    Pipeline Stages
      Extract data
      Transform records
      Load destination
    Tech
      Java
      Maven build
    Modules
      Core library
      Execution engine
      Plugin system
      Desktop UI

Things people build with this

USE CASE 1

Build a visual ETL pipeline that pulls records from a MySQL database, cleans them, and loads them into a data warehouse without writing SQL by hand.

USE CASE 2

Schedule a nightly data migration job that moves records from a legacy system into a new database using the PDI command-line engine.

USE CASE 3

Connect multiple heterogeneous data sources (databases, CSV files, web services) and merge them into a unified dataset for reporting.

USE CASE 4

Extend Pentaho Kettle with a custom plugin step to handle a transformation that the built-in steps do not support.

Tech stack

Java · Maven

Getting it running

Difficulty: hard · Time to first run: 1 day+

Requires Java 11 and a full Maven build of multiple modules; building from source takes significant time.

In plain English

Pentaho Data Integration, also known as Kettle or PDI, is a tool for moving and transforming data between different systems. ETL stands for Extract, Transform, Load, which describes the basic idea: pull data out of one place, reshape or clean it, and put it somewhere else. This is a common task when combining data from multiple databases, migrating from one system to another, or preparing raw data for reporting and analysis.

The software has both a visual designer and a command-line engine. Users build data pipelines by dragging and dropping steps in a graphical interface and connecting them into a workflow that processes records row by row. The engine then runs those workflows, which can be scheduled or triggered programmatically. It supports connecting to databases, flat files, web services, and many other data sources.

This repository is the source code for the open-source community edition of the product. It is organized into several modules: a core library, the main execution engine, an engine extension layer, a database connection dialog, a user interface module, and a plugins folder that extends functionality. The codebase is built with Maven, a Java build tool, and requires Java 11. Developers who want to build it from source run a standard Maven build command.

The project includes unit tests and integration tests, and contributors are expected to link each pull request to an issue in the project's Jira tracker. Code style is enforced with a checkstyle configuration included in the project. The community forum for questions and support is hosted on the community site of Hitachi Vantara, which now maintains the project.
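To make the extract, transform, load idea concrete, here is a minimal sketch in plain Java of a pipeline that processes records row by row. It does not use the Kettle API; the class name `EtlSketch`, its methods, and the three-column `id,name,status` layout are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the ETL pattern; not part of Pentaho Kettle.
public class EtlSketch {

    /** Extract: split each "id,name,status" line into a row of fields. */
    public static List<String[]> extract(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
        }
        return rows;
    }

    /** Transform: trim fields, upper-case the name, keep only "active" rows. */
    public static List<String[]> transform(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (row.length != 3) {
                continue; // skip malformed rows
            }
            String id = row[0].trim();
            String name = row[1].trim().toUpperCase(Locale.ROOT);
            String status = row[2].trim();
            if (status.equals("active")) {
                out.add(new String[] {id, name, status});
            }
        }
        return out;
    }

    /** Load: join each row back into a CSV line (a stand-in for a real destination). */
    public static List<String> load(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(String.join(",", row));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> source = List.of(
            "1, alice ,active",
            "2,bob,inactive",
            "3,carol,active");
        for (String line : load(transform(extract(source)))) {
            System.out.println(line);
        }
    }
}
```

In PDI each of these stages would be a configurable step on the designer canvas rather than a hand-written method; the sketch only shows the shape of the data flow.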

Copy-paste prompts

Prompt 1
I want to build a Pentaho Kettle transformation that reads from a PostgreSQL table, filters rows where status is 'active', and writes the results to a CSV file. Walk me through setting up the steps in the visual designer.
Prompt 2
How do I run a Pentaho Kettle transformation from the command line using the Pan script? Show me the exact command syntax and the key options for specifying the transformation file and logging level.
Prompt 3
I need to schedule a PDI job to run every night at 2am on a Linux server. What are my options for scheduling the Kettle command-line tool and what does a basic shell script setup look like?
Prompt 4
How do I build Pentaho Kettle from source using Maven? Walk me through the build commands, which Java version to use, and how to run the unit tests to verify the build.

Verify against the repo before relying on details.