Find key papers to read when starting research on how AI systems combine text, images, and other data types.
Use the accompanying CMU course materials as a structured curriculum for self-study in multimodal machine learning.
Locate code repositories that original research authors made available alongside their published papers.
Get a field overview through the linked tutorial paper before diving into individual research topics.
This is a curated reading list of research papers and resources on multimodal machine learning, maintained by Paul Liang at Carnegie Mellon University. Multimodal machine learning is the study of AI systems that work with multiple types of data at once, such as text and images, audio and video, or language and sensor readings, rather than processing a single kind of input.

The list is organized into broad topic areas. Core technical areas include how to combine information from different sources (multimodal fusion), how to align concepts across different types of data (multimodal alignment), how to train general-purpose models on several types of data together (multimodal pretraining), and how to translate between modalities, for example generating a description of an image. There are also sections on handling missing data, making models more interpretable, and addressing bias in multimodal datasets.

The applications section covers specific tasks where combining modalities matters: answering questions about images, generating video from text descriptions, recognizing emotions from speech and text together, navigating with language instructions, and using multimodal inputs in healthcare and robotics.

The list accompanies academic courses at CMU. An introductory graduate course (11-777) and an advanced follow-up course (11-877) both have their materials linked from the README. A tutorial paper titled "Foundations and Recent Trends in Multimodal Machine Learning" is highlighted as an entry point into the broader field.

This is a reference document for researchers and students rather than a software project, so there is no code to install or run. The repository contains a single long markdown file listing papers with links to preprints, published versions, and, where the original authors released them, code repositories. This summary is based on an excerpt; the full README is longer.
Verify against the repo before relying on details.