
Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer


Abstract

Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present Brain-IT, a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally similar brain voxels. These functional clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features, which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features, which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables a direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves reconstructions that are faithful to the seen images, surpassing current state-of-the-art approaches both visually and on standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to those of current methods trained on full 40-hour recordings.

Brain-IT Overview

The Brain Interaction Transformer (BIT) transforms fMRI signals into semantic and VGG features using a shared Voxel-to-Cluster (V2C) mapping. Two branches are applied: (i) the Low-Level branch reconstructs a coarse image from VGG features, which is used to initialize (ii) the Semantic branch, which uses semantic features to guide the diffusion model. Each voxel from every subject is mapped to a functional cluster shared across subjects, enabling integration both within and across brains. The Brain-IT pipeline thus reconstructs images directly from fMRI activations by first predicting meaningful image features with BIT, then refining them through a diffusion model guided by semantic conditioning and a Deep Image Prior (DIP) that ensures structural fidelity.
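To make the shared V2C mapping concrete, here is a minimal numpy sketch of aggregating per-voxel activations into one token per functional cluster. The function name, the scalar-per-voxel input, and mean pooling are all illustrative assumptions for this sketch; in the actual model the Brain Tokenizer is learned, not a fixed average.

```python
import numpy as np

def voxel_to_cluster_tokens(activations, cluster_ids, n_clusters):
    """Aggregate per-voxel fMRI activations into one value per functional cluster.

    activations: (n_voxels,) scalar activation per voxel
    cluster_ids: (n_voxels,) shared functional-cluster index for each voxel
    Returns (n_clusters,) mean-pooled cluster summaries (illustrative stand-in
    for the learned Brain Tokenizer).
    """
    sums = np.bincount(cluster_ids, weights=activations, minlength=n_clusters)
    counts = np.bincount(cluster_ids, minlength=n_clusters)
    return sums / np.maximum(counts, 1)  # avoid division by zero for empty clusters

# Toy example: 5 voxels assigned to 3 shared clusters.
acts = np.array([1.0, 3.0, 2.0, 4.0, 10.0])
cids = np.array([0, 0, 1, 1, 2])
tokens = voxel_to_cluster_tokens(acts, cids, 3)
# tokens -> [2.0, 3.0, 10.0]
```

Because the cluster indices are shared across subjects, voxels from different brains land in the same slots of the output, which is what lets one model integrate data across subjects.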

Brain-Interaction Transformer (BIT)

The BIT model predicts image features from voxel activations (fMRI). The Brain Tokenizer maps the fMRI activations into Brain Tokens, which are representations of the aggregated information from all the voxels of a single cluster (one token per cluster). The Cross-Transformer Module integrates information from the Brain Tokens to refine their representation, and employs query tokens to retrieve information from the Brain Tokens and transform it into image features, with each query token predicting a single output image feature.
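The query-token mechanism described above can be sketched as single-head cross-attention: each query token attends over all Brain Tokens and emits one output feature. All shapes, weight matrices, and the single-head formulation here are assumptions for illustration, not the paper's exact architecture (which stacks such modules with learned parameters).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, brain_tokens, Wq, Wk, Wv):
    """Each query token retrieves information from the Brain Tokens
    and is transformed into a single output image feature."""
    Q = queries @ Wq                # (n_queries, d)
    K = brain_tokens @ Wk           # (n_tokens, d)
    V = brain_tokens @ Wv           # (n_tokens, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V                 # (n_queries, d): one feature per query

rng = np.random.default_rng(0)
d = 8
queries = rng.normal(size=(4, d))        # 4 query tokens -> 4 image features
brain_tokens = rng.normal(size=(16, d))  # one token per functional cluster
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
features = cross_attention(queries, brain_tokens, Wq, Wk, Wv)
assert features.shape == (4, d)
```

Tying one query token to one localized patch-level feature is what gives the direct flow of information from brain-voxel clusters to specific image locations.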

Results

[Figures: qualitative comparisons at 40 hours; qualitative comparisons with limited subject-specific data (1 hour); quantitative metrics]

Brain-IT demonstrates strong semantic fidelity and structural accuracy across multiple evaluation metrics. Importantly, with just 1 hour of data, Brain-IT is comparable to prior methods trained on the full 40 hours.