Training machine-learning models with synthetically generated data can
alleviate the problem of data scarcity when acquiring diverse and
sufficiently large datasets is costly and challenging. Here we show that
cascaded diffusion models can be used to synthesize realistic whole-slide
image tiles from latent representations of RNA-sequencing data from human
tumours. Alterations in gene expression affected the composition of cell
types in the generated synthetic image tiles, which accurately preserved
the distribution of cell types and maintained the cell fraction observed in
bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal
papillary cell carcinoma, cervical squamous cell carcinoma, colon
adenocarcinoma and glioblastoma. Machine-learning models pretrained with
the generated synthetic data performed better than models trained from
scratch. Synthetic data may accelerate the development of machine-learning
models in scarce-data settings and allow for the imputation of missing data
modalities.