Convert a phone video of a room into a labeled 3D map showing furniture positions and room boundaries.
Build a robotics navigation system that understands indoor spaces from depth camera or LiDAR point cloud data.
Run the model on your own indoor scans to get structured room layouts for architecture or design tools.
Fine-tune the provided pretrained models on a custom 3D scene dataset for specialized object detection.
Requires Python 3.11, a CUDA-capable GPU, and the Poetry package manager. There is no CPU fallback for training.
SpatialLM is a research project, accepted at NeurIPS 2025, that trains large language models to understand 3D indoor spaces. You give it a point cloud, a set of 3D points captured by a camera or sensor that represents a room or building, and it produces a structured description of that space: walls, doors, windows, and the positions and sizes of furniture items, each labeled with what it is.
What makes this approach notable is that it works with point clouds from multiple sources. You can feed it data from a regular phone video (converted to a point cloud using other software), from depth cameras, or from professional LiDAR sensors. Earlier approaches to this kind of 3D scene understanding typically required specialized scanning equipment. SpatialLM lowers that barrier by accepting more common data sources.
The project provides four pretrained models: version 1.0 and version 1.1 of each of two base language models (Llama 1B and Qwen 0.5B). The version 1.1 models double the point cloud resolution and add a more capable point cloud encoder. They also support detection of specific object categories, so you can ask the model to identify only certain types of furniture rather than predicting all 59 supported categories.
The repository includes scripts for running inference on a point cloud file, for visualizing results using a tool called Rerun, and for evaluating model performance on a provided test set. A fine-tuning guide is also included for researchers who want to train the model on their own data. Installation requires Python 3.11, PyTorch, and CUDA, and uses the Poetry package manager. A training dataset is available separately on Hugging Face.
The intended applications described in the README include robotics, autonomous navigation, and general 3D scene analysis.
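To make the idea of a "structured description" more concrete, here is a minimal Python sketch of how such a layout could be represented and filtered by object category. It is illustrative only: the class names (`Wall`, `ObjectBox`, `Layout`), their fields, and the `only` method are hypothetical and are not taken from the SpatialLM code or its output format, which should be checked in the repository.

```python
from dataclasses import dataclass, field


@dataclass
class Wall:
    # Hypothetical fields: two floor-level endpoints (x, y, z) in meters and a height.
    start: tuple[float, float, float]
    end: tuple[float, float, float]
    height: float


@dataclass
class ObjectBox:
    # Hypothetical oriented bounding box for one labeled furniture item.
    label: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]  # width, depth, height in meters
    yaw: float = 0.0  # rotation about the vertical axis, in radians


@dataclass
class Layout:
    walls: list[Wall] = field(default_factory=list)
    objects: list[ObjectBox] = field(default_factory=list)

    def only(self, categories: set[str]) -> "Layout":
        """Return a copy that keeps all walls but only objects in the given categories."""
        kept = [o for o in self.objects if o.label in categories]
        return Layout(walls=list(self.walls), objects=kept)


# A toy room with one wall and two furniture items (made-up values).
layout = Layout(
    walls=[Wall((0.0, 0.0, 0.0), (4.0, 0.0, 0.0), 2.6)],
    objects=[
        ObjectBox("sofa", (1.0, 0.5, 0.4), (2.0, 0.9, 0.8)),
        ObjectBox("table", (2.5, 2.0, 0.35), (1.2, 0.6, 0.7)),
    ],
)

# Keep only the categories of interest, mirroring the idea of category-specific detection.
sofas = layout.only({"sofa"})
print([o.label for o in sofas.objects])  # prints ['sofa']
```

Note that this sketch filters after the fact, whereas the version 1.1 models are described as accepting the requested categories up front; the data shape is the point here, not the mechanism.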
Verify against the repo before relying on details.