explaingit

kerner-lab/earthshift

23 Python Audience · researcher Complexity · 3/5 Setup · moderate

TLDR

A benchmarking testbed that measures how well AI models trained on satellite imagery hold up when deployed under new conditions like different regions, sensors, seasons, or resolutions they never saw during training.

Mindmap

mindmap
  root((EarthShift))
    Shift types
      Geographic region
      Satellite sensor
      Time and season
      Spatial resolution
      Data provider
    Tasks
      Classification
      Semantic segmentation
      Object detection
    Models tested
      8 foundation models
      General and specialized
    Purpose
      Robustness benchmarking
      Research baseline

Things people build with this

USE CASE 1

Benchmark a geospatial foundation model against 5 types of distribution shift using EarthShift's standardized paired datasets and command-line pipeline.

USE CASE 2

Reproduce the paper's finding that models drop about 20% in performance out-of-distribution, as a baseline for your own robustness research.

USE CASE 3

Run classification, semantic segmentation, or object detection tasks on EarthShift datasets to evaluate how robust a model is before deploying it in a new geographic region.
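When reproducing the roughly 20% out-of-distribution drop from Use Case 2, it helps to fix how the drop is computed. The summary does not say whether the figure is absolute or relative, so the sketch below computes both. `ShiftResult` and the helper functions are hypothetical names for illustration, not part of EarthShift, and the scores in the demo are made-up placeholders, not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ShiftResult:
    """Scores for one model on one dataset pair (hypothetical container)."""
    model: str
    shift_type: str
    in_dist: float   # metric on the in-distribution test split, e.g. accuracy in [0, 1]
    out_dist: float  # same metric on the shifted (out-of-distribution) split


def absolute_drop(r: ShiftResult) -> float:
    """Drop in raw metric points: in-distribution minus out-of-distribution."""
    return r.in_dist - r.out_dist


def relative_drop(r: ShiftResult) -> float:
    """Drop as a fraction of the in-distribution score."""
    if r.in_dist <= 0:
        raise ValueError("in-distribution score must be positive")
    return (r.in_dist - r.out_dist) / r.in_dist


def mean_relative_drop(results: Iterable[ShiftResult]) -> float:
    """Average the relative drop over several model/shift combinations."""
    results = list(results)
    if not results:
        raise ValueError("no results to average")
    return sum(relative_drop(r) for r in results) / len(results)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    demo = [
        ShiftResult("model_a", "geographic", 0.80, 0.64),
        ShiftResult("model_b", "sensor", 0.50, 0.40),
    ]
    for r in demo:
        print(f"{r.model} [{r.shift_type}]: "
              f"absolute={absolute_drop(r):.3f} relative={relative_drop(r):.1%}")
    print(f"mean relative drop: {mean_relative_drop(demo):.1%}")
```

Reporting both numbers avoids ambiguity: a fall from 0.80 to 0.64 is 16 points in absolute terms but 20% in relative terms.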

Tech stack

Python

Getting it running

Difficulty · moderate Time to first run · 1h+

Requires downloading the paired datasets for each shift type and setting up a matching Python environment. No GPU requirement is mentioned, but remote sensing datasets can be large.

No license information was mentioned in the explanation.

In plain English

EarthShift is a benchmarking testbed for measuring how well large AI models hold up when the conditions at deployment differ from the conditions they trained on. The specific focus is geospatial foundation models, meaning large AI models trained on satellite and remote sensing imagery to recognize land use, detect objects, or segment geographic features.

Most existing benchmarks for these models measure performance on data that looks similar to what the model saw during training. EarthShift tests something different: does the model still work well when it encounters a new geographic region it has never seen, a different satellite sensor, a different time period or season, a different data provider, or a different spatial resolution? These changes are called distribution shifts, and they happen constantly when models are used in the real world.

The researchers ran experiments across 8 geospatial foundation models and 11 different tasks covering all five shift types. Their finding was consistent: models perform about 20% worse out-of-distribution, and this holds regardless of model size, architecture, or how the model was fine-tuned. Notably, the robustness of specialized geospatial models was similar to that of general-purpose vision models, meaning the field-specific training did not make them meaningfully more reliable under changing conditions.

The testbed provides paired datasets for each shift type. Researchers can run the pipeline from the command line, specifying a model, a task (classification, semantic segmentation, or object detection), a shift type, and a dataset pair. Results are saved to a specified output directory. The code and datasets are released to give the community a standard way to measure and improve robustness in Earth observation AI. The repository accompanies a paper published on arXiv.
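The five inputs a run takes (model, task, shift type, dataset pair, output directory) can be pictured as a small record that is validated before any work starts. The sketch below is an assumption-laden illustration, not EarthShift's actual interface: `RunSpec`, `build_grid`, the task and shift-type identifiers, and the output path layout are all hypothetical, and the real command-line flags should be taken from the repository.

```python
from dataclasses import dataclass
from itertools import product
from pathlib import Path
from typing import Dict, List, Sequence, Tuple

# Hypothetical identifiers for the three tasks and five shift types in the summary.
TASKS = ("classification", "semantic_segmentation", "object_detection")
SHIFT_TYPES = ("geographic", "sensor", "temporal", "provider", "resolution")


@dataclass(frozen=True)
class RunSpec:
    """One benchmark run: a model evaluated on one paired dataset."""
    model: str
    task: str
    shift_type: str
    dataset_pair: Tuple[str, str]  # (in-distribution name, out-of-distribution name)
    output_dir: Path

    def __post_init__(self) -> None:
        # Fail early on typos instead of partway through a long run.
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task!r}")
        if self.shift_type not in SHIFT_TYPES:
            raise ValueError(f"unknown shift type: {self.shift_type!r}")
        if len(self.dataset_pair) != 2:
            raise ValueError("dataset_pair must hold exactly two dataset names")

    def result_path(self) -> Path:
        """Assumed layout: <output_dir>/<shift_type>/<task>/<model>.json"""
        return self.output_dir / self.shift_type / self.task / f"{self.model}.json"


def build_grid(
    models: Sequence[str],
    pairs: Dict[Tuple[str, str], Tuple[str, str]],
    output_dir: str,
) -> List[RunSpec]:
    """Cross every model with every (task, shift_type) -> dataset pair entry."""
    return [
        RunSpec(model, task, shift, pair, Path(output_dir))
        for model, ((task, shift), pair) in product(models, pairs.items())
    ]


if __name__ == "__main__":
    # Placeholder model and dataset names for illustration only.
    pairs = {("classification", "geographic"): ("region_a", "region_b")}
    for spec in build_grid(["model_a", "model_b"], pairs, "results"):
        print(spec.model, spec.dataset_pair, "->", spec.result_path())
```

Keeping the grid explicit makes it easy to see how many runs a full sweep implies before downloading any data.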

Copy-paste prompts

Prompt 1
Using EarthShift, run a benchmark for a geospatial model on a geographic shift task where it trains on one region and tests on another. Show me the command-line invocation and how to interpret the output metrics.
Prompt 2
I want to add my own geospatial model to the EarthShift benchmark. What format do model outputs need to be in, and how do I run it against all 5 shift types and 11 tasks?
Prompt 3
Help me reproduce the EarthShift paper's 20% out-of-distribution performance drop finding by running the benchmark for 3 of the 8 included foundation models across the sensor shift tasks.

Verify against the repo before relying on details.