pliang279/awesome-multimodal-ml

★ 6,870 Audience · researcher Complexity · 1/5 Setup · easy

Mindmap

mindmap
  root((Multimodal ML List))
    Core topics
      Multimodal fusion
      Alignment across data
      Pretraining models
      Modality translation
    Applications
      Visual question answering
      Text to video
      Emotion recognition
      Healthcare and robotics
    Courses
      CMU 11-777 intro
      CMU 11-877 advanced
    Entry points
      Tutorial paper
      Paper links and code


Things people build with this

USE CASE 1

Find key papers to read when starting research on how AI systems combine text, images, and other data types.

USE CASE 2

Use the accompanying CMU course materials as a structured curriculum for self-study in multimodal machine learning.

USE CASE 3

Locate code repositories that original research authors made available alongside their published papers.

USE CASE 4

Get a field overview through the linked tutorial paper before diving into individual research topics.

Getting it running

Difficulty · easy Time to first run · 5 min

No license information was mentioned in the explanation.

In plain English

This is a curated reading list of research papers and resources on multimodal machine learning, maintained by Paul Liang at Carnegie Mellon University. Multimodal machine learning is the study of AI systems that work with multiple types of data at once, such as combining text and images, audio and video, or language and sensor readings, rather than processing a single kind of input.

The list is organized into broad topic areas. Core technical areas include how to combine information from different sources (multimodal fusion), how to align concepts across different types of data (multimodal alignment), how to train general-purpose models on multiple types of data at once (multimodal pretraining), and how to translate between modalities, such as generating a description of an image. There are also sections on handling missing data, making models more interpretable, and addressing bias in multimodal datasets.

The applications section covers specific tasks where combining modalities matters: answering questions about images, generating video from text descriptions, recognizing emotions from speech and text together, navigation using language instructions, and using multimodal inputs in healthcare and robotics contexts.

The list accompanies academic courses at CMU. An introductory graduate course (11-777) and an advanced follow-up course (11-877) both have their materials linked from the README. A tutorial paper titled "Foundations and Recent Trends in Multimodal Machine Learning" is highlighted as an entry point into the broader field.

This is a reference document for researchers and students rather than a software project. There is no code to install or run. The repository contains a single long markdown file listing papers with links to preprints, published versions, and code repositories where the original authors made them available. The full README is longer than what was shown.
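The repository itself ships no code, but the difference between fusion and alignment described above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the function names `concat_fusion` and `best_match` are hypothetical, the vectors are made-up stand-ins for real text and image embeddings, and concatenation and cosine similarity are only the simplest instances of each idea, not methods taken from any paper on the list.

```python
import math


def concat_fusion(text_vec, image_vec):
    """Feature-level fusion: join two feature vectors into one joint vector."""
    return list(text_vec) + list(image_vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(text_vec, image_vecs):
    """Alignment as retrieval: index of the image embedding closest to the text.

    Assumes both modalities were already embedded into a shared space.
    """
    scores = [cosine_similarity(text_vec, v) for v in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy, made-up embeddings (hypothetical values, not real model outputs).
text = [1.0, 0.0, 1.0]
images = [[0.0, 1.0, 0.0], [0.9, 0.1, 1.1]]

fused = concat_fusion(text, images[1])  # one 6-dimensional joint feature vector
match = best_match(text, images)        # index of the image that best fits the text
```

In this sketch, fusion produces a single combined representation that a downstream model could consume, while alignment leaves the representations separate and only asks which items correspond across modalities.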

Copy-paste prompts

Prompt 1

I want to learn multimodal machine learning from scratch. Based on the awesome-multimodal-ml list structure, give me a 4-week study plan covering fusion, alignment, and pretraining with specific paper recommendations for each week.

Prompt 2

I am building an image question-answering system. Which topic areas in the multimodal ML reading list should I prioritize and what architectures do the key papers propose?

Prompt 3

Explain multimodal fusion techniques for combining text and image data in plain English, then suggest the most relevant section of the awesome-multimodal-ml list to explore.

Prompt 4

What is the difference between multimodal alignment and multimodal fusion? Give me a concrete example of each and recommend two papers from this reading list for each concept.


Verify against the repo before relying on details.