Automatically generate hundreds of candidate features from sensor readings or stock prices to feed into a scikit-learn classifier without hand-crafting each metric.
Reduce manual feature engineering by letting tsfresh extract and statistically filter time series features before training a predictive model.
Process large collections of time series in parallel across multiple CPU cores when the dataset is too big for single-threaded extraction.
Parallel extraction across many long series can require significant RAM, so check available memory before running on large datasets.
tsfresh is a Python package that automatically extracts large numbers of descriptive characteristics from time series data. A time series is any sequence of measurements recorded over time, such as sensor readings, stock prices, or patient vital signs. Instead of a data scientist manually deciding which properties of those sequences to compute (averages, peaks, patterns of change), tsfresh does the extraction itself, producing hundreds of candidate features from each input series.

The motivation is to reduce the manual work involved in preparing data for machine learning. Before a model can be trained to classify or predict something from time series, the raw sequences have to be turned into a fixed set of numbers the model can work with. tsfresh automates that step. It applies methods from statistics, signal processing, and time series analysis to each input and produces a table of measurements that can be fed directly into standard machine learning libraries, including scikit-learn.

Because hundreds of automatically generated features are likely to include many that are irrelevant for any particular task, tsfresh also includes a filtering step. This step tests each feature statistically to determine how strongly it is associated with the outcome you are trying to predict, and removes features that carry no useful information. The filtering method is grounded in hypothesis testing theory and is described in academic papers cited in the README.

The package supports parallel processing, so extraction over large numbers of time series can be distributed across multiple CPU cores or machines. It also handles time series of different lengths, which is useful when the recordings in a dataset are not all the same duration.

Installation is through pip. Documentation is hosted on Read the Docs, and Jupyter notebook examples are available through the repository.
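The extract-then-filter idea described above can be illustrated with a small conceptual sketch in plain Python. It does not use tsfresh's API or its actual feature calculators and tests. The helper names (summarize_series, permutation_p_value), the four summary features, the synthetic data, and the permutation test with a 0.05 threshold are all assumptions chosen to keep the example self-contained.

```python
import random
import statistics


def summarize_series(series):
    """Turn one variable-length series into a fixed set of summary numbers."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    return {
        "mean": statistics.fmean(series),
        "std": statistics.pstdev(series),
        "max": max(series),
        "mean_abs_change": statistics.fmean([abs(d) for d in diffs]),
    }


def permutation_p_value(values, labels, rng, n_permutations=1000):
    """Two-sided permutation test on the gap between the two class means."""

    def mean_gap(lbls):
        pos = [v for v, lab in zip(values, lbls) if lab == 1]
        neg = [v for v, lab in zip(values, lbls) if lab == 0]
        return abs(statistics.fmean(pos) - statistics.fmean(neg))

    observed = mean_gap(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_gap(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


rng = random.Random(0)

# Synthetic data (an assumption): 20 noisy series of different lengths.
# Class 1 series have three times the noise amplitude of class 0 series.
labels = [i % 2 for i in range(20)]
collection = [
    [rng.gauss(0.0, 3.0 if label == 1 else 1.0) for _ in range(rng.randint(30, 60))]
    for label in labels
]

# Extraction: one row of features per series, regardless of its length.
table = [summarize_series(series) for series in collection]

# Filtering: keep only features whose values differ between the classes.
p_values = {
    name: permutation_p_value([row[name] for row in table], labels, rng)
    for name in table[0]
}
selected = [name for name, p in p_values.items() if p < 0.05]

print("p-values:", p_values)
print("selected features:", selected)
```

In this toy setup the classes differ only in noise amplitude, so spread-related features such as the standard deviation should survive the filter, while the mean is not expected to carry signal. A real pipeline would rely on the library's own extraction and selection routines and would also correct for testing many features at once.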
Verify against the repo before relying on details.