DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding

Fudan University
arXiv

Abstract

Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, all of these aspects are essential, and they are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components (semantic, spatial, and motion), then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding into manageable sub-tasks, but also establishes a clearer connection between the learned representations and their biological counterparts, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods: 82.4% accuracy for semantic classification, 70.6% accuracy for spatial consistency, 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses of semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding.

Overview



Details of the DecoFuse framework. Neural features are extracted by an fMRI encoder and decomposed into semantic, spatial, and motion embeddings through three independent encoders. These components are then fused to generate video in three stages: (1) fMRI-to-image decoding, which uses Stable Diffusion and ControlNet to generate static images from the high-level semantic and low-level spatial embeddings; (2) fMRI-to-motion decoding, which predicts optical flow with an image- and fMRI-conditioned motion decoder to capture the dynamic elements of the video; (3) fMRI-to-video decoding, where the decoded image and optical flow are combined to generate the final video using a motion-conditioned video diffusion model.
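The three-stage data flow described above can be sketched as follows. This is a minimal toy illustration of the decompose-then-fuse pattern, not the paper's implementation: the shapes, the linear stand-ins for the fMRI encoder and the three heads, and the warping step are hypothetical placeholders for the actual Stable Diffusion / ControlNet and motion-conditioned video diffusion components.

```python
import numpy as np

# Toy sizes (hypothetical): embedding dim, frame height/width, frame count.
EMB_DIM, H, W, T = 64, 8, 8, 4

def fmri_encoder(fmri):
    # Stand-in for the shared fMRI encoder: project voxels to a feature vector.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((fmri.shape[-1], EMB_DIM))
    return fmri @ proj

def decompose(feat):
    # Three independent heads produce "what", "where", and "how" embeddings.
    return {"semantic": np.tanh(feat), "spatial": np.sin(feat), "motion": np.cos(feat)}

def decode_image(semantic, spatial):
    # Stage 1 stand-in: fuse "what" and "where" into one static frame (H, W).
    return np.outer(semantic[:H], spatial[:W])

def decode_motion(image, motion_emb):
    # Stage 2 stand-in: predict a per-frame flow field (dx, dy) -> (T, H, W, 2).
    return np.stack([image[..., None].repeat(2, -1) * m for m in motion_emb[:T]])

def decode_video(image, flow):
    # Stage 3 stand-in: displace the static frame along the flow -> (T, H, W).
    return np.stack([image + flow[t].sum(-1) for t in range(T)])

fmri = np.random.default_rng(1).standard_normal(512)   # one fMRI sample
parts = decompose(fmri_encoder(fmri))
img = decode_image(parts["semantic"], parts["spatial"])
flow = decode_motion(img, parts["motion"])
video = decode_video(img, flow)
print(video.shape)  # (4, 8, 8)
```

The point of the sketch is the interface: the static image depends only on the "what" and "where" embeddings, the flow only on the image and the "how" embedding, and the video only on the image and flow, so each sub-task can be trained and ablated separately.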


Main Results

fMRI-to-image Decoding


Results of fMRI-to-image reconstruction. Our model generates images that align well with the ground truth in both semantic and spatial aspects. Comparing results with and without the semantic ("what") and spatial ("where") embeddings shows that both embeddings significantly improve the model's ability to accurately reconstruct and localize objects within the image.

fMRI-to-motion Decoding


Results of fMRI-to-motion decoding. Our model effectively predicts optical flow based on fMRI and image data, demonstrating accurate motion decoding performance.

fMRI-to-video Decoding

Ground Truth

[Five ground-truth video clips]

Subject 1

[Five predicted video clips for Subject 1]

Subject 2

[Five predicted video clips for Subject 2]

Subject 3

[Five predicted video clips for Subject 3]

More Results for Subject 1

GT

[Six ground-truth video clips]

Pred

[Six predicted video clips]

GT

[Six ground-truth video clips]

Pred

[Six predicted video clips]

Differential Neural Encoding



Results of differential neural encoding. The differential encoding distribution for "what" and "where" is represented by p_{spa} and visualized on the medial view of the brain surface. Red indicates regions that encode "where" information, while blue indicates regions that encode "what" information. These results align with the two-streams hypothesis.
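The differential encoding idea can be illustrated with a small, self-contained sketch. This is a hedged reconstruction of the general technique, not the paper's exact analysis: we fit a simple least-squares encoding model per voxel from hypothetical "what" and "where" feature matrices and express p_spa as the spatial model's share of the (clipped) explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_feat = 200, 16
sem_feat = rng.standard_normal((n_samples, n_feat))  # "what" features (toy)
spa_feat = rng.standard_normal((n_samples, n_feat))  # "where" features (toy)

# Toy voxels: the first 5 are driven by semantic features, the last 5 by
# spatial features, plus a little noise.
w = rng.standard_normal((n_feat, 10))
voxels = np.concatenate([sem_feat @ w[:, :5], spa_feat @ w[:, 5:]], axis=1)
voxels += 0.1 * rng.standard_normal(voxels.shape)

def r2(X, y):
    # Least-squares encoding model; fraction of variance explained per voxel.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var(axis=0) / y.var(axis=0)

r2_sem = np.clip(r2(sem_feat, voxels), 0, None)
r2_spa = np.clip(r2(spa_feat, voxels), 0, None)
p_spa = r2_spa / (r2_sem + r2_spa + 1e-8)
print(p_spa.round(2))  # larger for spatially driven voxels
```

On the real data, high p_spa voxels would be shaded red ("where") and low p_spa voxels blue ("what") on the brain surface.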