Publications
2026
- Laure Ciernik*, Agnieszka Kraft*, Florian Barkmann, Josephine Yates, and Valentina Boeva. Genome Research, 2026
In the field of single-cell RNA sequencing (scRNA-seq), gene signature scoring is integral for pinpointing and characterizing distinct cell populations. However, challenges arise in ensuring the robustness and comparability of scores across various gene signatures and across different batches and conditions. Addressing these challenges, we evaluated the stability of established methods such as Scanpy, UCell, and JASMINE in the context of scoring cells of different types and states. Additionally, we introduced a new scoring method, the Adjusted Neighbourhood Scoring (ANS), that builds on the traditional Scanpy method and improves the handling of the control gene sets. We further exemplified the usability of ANS scoring in differentiating between cancer-associated fibroblasts and malignant cells undergoing epithelial-mesenchymal transition (EMT) in four cancer types and evidenced excellent classification performance (AUCPR train: 0.95-0.99, AUCPR test: 0.91-0.99). In summary, our research introduces the ANS as a robust and deterministic scoring approach that enables the comparison of diverse gene signatures. The results of our study contribute to the development of more accurate and reliable methods for analyzing scRNA-seq data.
- Flavia Pedrocchi*, Florian Barkmann*, Amir Joudaki, and Valentina Boeva. Gen^2 workshop at ICLR, 2026
Single-cell foundation models (scFMs) hold promise for applications in cell type annotation and data integration, but their internal mechanisms remain poorly understood. We investigate the structure of these models by training sparse autoencoders (SAEs) on the hidden representations of two widely used scFMs, scGPT and scFoundation. The learned features reveal diverse and complex biological and technical signals, which emerge even in pre-trained models. We also observe that the encoding of this information differs between scFMs with distinct training protocols and architectures. Further, we find that while many features capture the information about cell types across several studies, they often fall short of unifying it into a single generalized representation. Finally, we demonstrate that SAE-derived features are causally related to model behavior and can be intervened upon to reduce unwanted technical effects while steering model outputs to preserve the core biological signal. These findings provide a path toward more interpretable and controllable single-cell foundation models.
- Constantin Ahlmann-Eltze*, Florian Barkmann*, Jan Lause*, Valentina Boeva, and Dmitry Kobak. RNA, 2026
Single-cell RNA sequencing (scRNA-seq) has become a cornerstone experimental technique in cellular biology, with gene expression data for over 100 million sequenced cells available in public repositories. The high dimensionality, sparsity, and technical noise inherent to scRNA-seq data have motivated the development of a broad spectrum of representation learning approaches. These methods learn denoised, low-dimensional representations of single-cell transcriptomes that can then be used for clustering, visualization, trajectory inference, and other downstream analyses. Furthermore, methods have emerged that learn latent representations based on scRNA-seq data pooled across multiple experiments. In this review, we frame factor models, autoencoders, contrastive learning approaches, and transformer-based foundation models as distinct paradigms of representation learning for scRNA-seq. We provide a coherent taxonomy of these methods that articulates their conceptual foundations, shared assumptions, and key distinctions. We also discuss existing benchmarks and identify the major challenges and open questions that will shape the future of the field.
2025
- Florian Barkmann*, Josephine Yates*, Paweł Czyż, Agnieszka Kraft, Marc Glettig, Niko Beerenwinkel, and Valentina Boeva. Cancer Research, 2025
Single-cell RNA-sequencing (scRNA-seq) facilitates the discovery of gene expression signatures that define cell states across patients, which could be used in patient stratification and precision oncology. However, the lack of standardization in the computational methodologies used to analyze these data impedes the reproducibility of signature detection. To address this, we developed CanSig, a comprehensive benchmarking tool that evaluates methods for identifying transcriptional signatures in cancer. CanSig integrates metrics for batch correction and biological signal conservation with a transcriptional signature correlation metric to score methods according to signature rediscovery, cross-dataset reproducibility, and clinical relevance. CanSig was applied to thirteen methods on twelve scRNA-seq datasets from five human cancer types (glioblastoma, breast cancer, lung adenocarcinoma, rhabdomyosarcoma, and cutaneous squamous cell carcinoma), representing 185 patients and 174,000 malignant cells. The signatures identified with these methods correlated with clinically relevant outcomes, including patient survival and lymph node metastasis. These results identified Harmony, BBKNN, and fastMNN as the highest-scoring integration methods for discovering shared transcriptional states in cancer. Overall, CanSig provides a standardized, reproducible framework for uncovering clinically relevant cancer cell states in single-cell transcriptomics.
- Olga Ovcharenko*, Florian Barkmann*, Philip Toma*, Imant Daunhawer, Julia E Vogt, Sebastian Schelter, and Valentina Boeva. ICML, 2025 (Spotlight)
Self-supervised learning (SSL) has proven to be a powerful approach for extracting biologically meaningful representations from single-cell data. To advance our understanding of SSL methods applied to single-cell data, we present scSSL-Bench, a comprehensive benchmark that evaluates nineteen SSL methods. Our evaluation spans nine datasets and focuses on three common downstream tasks: batch correction, cell type annotation, and missing modality prediction. Furthermore, we systematically assess various data augmentation strategies. Our analysis reveals task-specific trade-offs: the specialized single-cell frameworks, scVI, CLAIRE, and the finetuned scGPT excel at uni-modal batch correction, while generic SSL methods, such as VICReg and SimCLR, demonstrate superior performance in cell typing and multi-modal data integration. Random masking emerges as the most effective augmentation technique across all tasks, surpassing domain-specific augmentations. Notably, our results indicate the need for a specialized single-cell multi-modal data integration framework. scSSL-Bench provides a standardized evaluation platform and concrete recommendations for applying SSL to single-cell analysis, advancing the convergence of deep learning and single-cell genomics.
- Agnieszka Kraft, Josephine Yates, Florian Barkmann, and Valentina Boeva. Under revision in Cancer Research, 2025
Intratumor transcriptional heterogeneity (ITTH), defined by the coexistence of diverse cell states within one tumor, complicates cancer treatment by contributing to variable therapeutic responses. Although single-cell RNA sequencing can resolve this complexity, its cost and technical demands limit its large-scale use. Bulk RNA-seq data provide a scalable alternative, but most deconvolution methods depend on predefined references, restricting their ability to detect novel malignant states. Unsupervised approaches avoid these constraints but are not tailored to capture heterogeneity within the malignant compartment. To address these limitations, we introduce CDState, an unsupervised method for inferring malignant cell subpopulations from bulk RNA-seq data. CDState uses non-negative matrix factorization, extended with a sum-to-one constraint and a cosine similarity-based optimization, to deconvolve bulk gene expression into distinct cell state profiles. We demonstrate the robustness of CDState on bulkified single-cell RNA-seq datasets from five cancer types, showing that it outperforms existing unsupervised deconvolution methods in the estimation of both cell state proportions and gene expression profiles. Applied to 33 cancer types from The Cancer Genome Atlas, CDState reveals recurrent gene programs, including epithelial-mesenchymal transition, MYC targets, and oxidative phosphorylation, as major contributors to malignant cell ITTH. We further link malignant states to patient clinical features, identifying states associated with poor prognosis. We propose an intratumor heterogeneity index and show its association with patient survival, clinical characteristics, and therapeutic response. Finally, we identify mutations and copy number alterations in genes such as TP53, KRAS, PIK3CA, SOX2, and SATB1 as potential genetic drivers of malignant cell ITTH across cancer types.
2024
- Moritz Vandenhirtz*, Florian Barkmann*, Laura Manduchi, Julia E Vogt, and Valentina Boeva. AccMLBio workshop at ICML & SPIGM workshop at ICML, 2024 (Spotlight)
We propose a novel method, scTree, for single-cell Tree Variational Autoencoders, extending a hierarchical clustering approach to single-cell RNA sequencing data. scTree corrects for batch effects while simultaneously learning a tree-structured data representation. This VAE-based method allows for a more in-depth understanding of complex cellular landscapes independently of the biasing effects of batches. We show empirically on seven datasets that scTree discovers the underlying clusters of the data and the hierarchical relations between them, and that it outperforms established baseline methods across these datasets. Additionally, we analyze the learned hierarchy to understand its biological relevance, thus underpinning the importance of integrating batch correction directly into the clustering procedure.
- Alexander Theus*, Florian Barkmann*, David Wissel, and Valentina Boeva. AIDrugX workshop at NeurIPS, 2024
We present CancerFoundation, a novel single-cell RNA-seq foundation model (scFM) trained exclusively on malignant cells. Despite being trained on only one million total cells, a fraction of the data used by existing models, CancerFoundation outperforms other scFMs in key tasks such as zero-shot batch integration and drug response prediction. During training, we employ tissue and technology-aware oversampling and domain-invariant training to enhance performance on underrepresented cancer types and sequencing technologies. We propose survival prediction as a new downstream task to evaluate the generalizability of single-cell foundation models to bulk RNA data and their applicability to patient stratification. CancerFoundation demonstrates superior batch integration performance and shows significant improvements in predicting drug responses for both unseen cell lines and drugs. These results highlight the potential of focused, smaller foundation models in advancing drug discovery and our understanding of cancer biology.
2023
- Michael Prummer, Anne Bertolini, Lars Bosshard, Florian Barkmann, Josephine Yates, Valentina Boeva, Daniel Stekhoven, and Franziska Singer. NAR Genomics and Bioinformatics, 2023
Identifying cell types based on expression profiles is a pillar of single-cell analysis. Existing machine-learning methods identify predictive features from annotated training data, which are often not available in early-stage studies. This can lead to overfitting and inferior performance when applied to new data. To address these challenges, we present scROSHI, which utilizes previously obtained cell type-specific gene lists and does not require training or the existence of annotated data. By respecting the hierarchical nature of cell type relationships and assigning cells consecutively to more specialized identities, excellent prediction performance is achieved. In a benchmark based on publicly available PBMC data sets, scROSHI outperforms competing methods when training data are limited or the diversity between experiments is large.
- Florian Barkmann, Yair Censor, and Niklas Wahl. Frontiers in Oncology, 2023
*: equal contribution, **: Co-supervision