MultiSEMF: Multi-Modal Supervised Expectation-Maximization Framework

Author

Ilia Azizi

Abstract

This paper introduces MultiSEMF, a Multi-Modal Supervised Expectation-Maximization Framework that extends the Supervised Expectation-Maximization Framework (SEMF) to integrate heterogeneous data modalities within a unified probabilistic model. MultiSEMF allows modality-specific architectures, such as convolutional networks for images, transformers for text, and gradient boosting models for tabular data, to jointly learn a shared latent representation. Copulas are incorporated to model dependencies between modalities, capturing inter-modal relationships explicitly. To handle complex, non-Gaussian latent structures, MultiSEMF draws on normalizing flows and KL-divergence regularization. Empirical evaluations on real-world and curated datasets show that MultiSEMF achieves strong predictive performance and reliable uncertainty estimation while keeping the multi-modal learning pipeline flexible and interpretable.
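As a rough illustration of the modality-specific encoders described in the abstract, the sketch below (not the paper's implementation; all architectures, dimensions, and the averaging fusion rule are illustrative assumptions) shows convolutional, transformer, and tabular encoders projecting into a shared latent space. The paper's tabular component uses gradient boosting trained within the EM procedure; a small MLP stands in for it here so the example stays end-to-end differentiable.

```python
# Minimal sketch of modality-specific encoders feeding a shared latent space.
# Hypothetical choices throughout: input shapes, layer sizes, and averaging fusion.
import torch
import torch.nn as nn


class MultiModalEncoder(nn.Module):
    def __init__(self, latent_dim: int = 32, vocab_size: int = 1000, tab_dim: int = 10):
        super().__init__()
        # Convolutional encoder for images (assumed 1x28x28 inputs).
        self.image_enc = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, latent_dim),
        )
        # Transformer encoder for tokenized text (assumed fixed-length sequences).
        self.embed = nn.Embedding(vocab_size, 32)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.text_proj = nn.Linear(32, latent_dim)
        # MLP stand-in for the tabular learner (the paper uses gradient boosting,
        # which would be fit separately inside the EM loop rather than by backprop).
        self.tab_enc = nn.Sequential(
            nn.Linear(tab_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )

    def forward(self, image, tokens, tabular):
        z_img = self.image_enc(image)
        z_txt = self.text_proj(self.text_enc(self.embed(tokens)).mean(dim=1))
        z_tab = self.tab_enc(tabular)
        # Shared latent representation: here, a simple average of modality latents.
        return (z_img + z_txt + z_tab) / 3.0


if __name__ == "__main__":
    enc = MultiModalEncoder()
    z = enc(
        torch.randn(4, 1, 28, 28),        # images
        torch.randint(0, 1000, (4, 16)),  # token ids
        torch.randn(4, 10),               # tabular features
    )
    print(z.shape)  # torch.Size([4, 32])
```

In the full framework, the shared latent would additionally be modeled probabilistically (with copulas linking modality-specific components and flow/KL machinery shaping the latent distribution), which this sketch omits.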
This paper introduces MultiSEMF, a Multi-modal Supervised Expectation-Maximization Framework that extends SEMF to integrate heterogeneous data modalities within a unified probabilistic model. MultiSEMF allows modality-specific architectures, such as convolutional networks for images, transformers for text, and gradient boosting models for tabular data, to jointly learn a shared latent representation. Copulas are incorporated to model dependencies between modalities, improving the understanding of inter-modal relationships. To handle complex, non-Gaussian latent structures, MultiSEMF borrows ideas from normalizing flows and KL-divergence regularization. Empirical evaluations on real-world and curated datasets show that MultiSEMF achieves strong predictive performance and reliable uncertainty estimation while maintaining a flexible and interpretable multi-modal learning approach.