Ha T.H. Vu, Han Sun, Seth Sharp, Parul Kudtarkar, Liza Brusman, Julie Jurgens, The PanKbase Consortium, Jason Flannick, Noel Burtt, Shuibing Chen, Jie Liu, Jean-Philippe Cartailler, Benjamin F. Voight, Michael Lee Stitzel, Marcela Brissova, Anna L. Gloyn, Kyle Gaulton, Stephen C.J. Parker
a. Reprocessed raw data so that runs of the same library were merged prior to alignment, enabling correct handling of PCR duplication.
b. Added hashtagged-oligo samples.
c. Relaxed thresholds for number of cells per sample, enabling more samples retained.
d. Added an additional round of ambient RNA correction where model parameters were optimized for every sample.
e. Multiple doublet detection methods for better separation of cell populations
f. Code availability: All code for the analyses are provided at https://github.com/PanKbase/snRNAseq-NextFlow and https://github.com/PanKbase/PanKbase-scRNA-seq.
| Cell type | Number of cells |
|---|---|
| Alpha | 194443 |
| Beta | 120502 |
| Acinar | 50105 |
| Ductal | 33686 |
| Active Stellate | 16907 |
| Delta | 11680 |
| Gamma + Epsilon | 7087 |
| MUC5B+ Ductal | 3515 |
| Endothelial | 3385 |
| Immune | 2755 |
| Cycling Alpha | 2738 |
| Quiescent Stellate | 2132 |
scRNA-seq data and metadata generated from islet tissues using the 10X Chromium platform were downloaded from https://hpap.pmacs.upenn.edu/ (accessed March 2024) [1], [2], [3], [4].
Literature searches were conducted in order to identify original studies involving scRNA-seq generated using islets distributed by IIDP and Prodo. Briefly, pancreatic islet datasets of expression profiling by high throughput sequencing were identified via GEO search for the date range before July 2024, then studies with scRNA-seq data were determined manually. Datasets were then downloaded from the studies' corresponding GEO page using the fastq-dump command from the SRA toolkit (v3.1.1) [5]. Data from the following GEO accession IDs were used: GSE142465 [6], GSE159556 [7], GSE183568 [8], GSE190447 [9], GSE201256 [10], GSE217837 [11], GSE221156 [12], GSE251730 [13], GSE251912 [14].
The quality of raw reads was assessed using FastQC (v0.11.9) [15], then individual samples' reports were compiled together using MultiQC (v1.11) [16]. These reports were inspected in order to detect technical issues such as flow cell problems, base calling errors, and the presence of unexpected technical sequences. Based on the Per Tile Quality and Quality Score plots, we subsequently included sequencing runs with more than 50% of tiles indicated as high quality (i.e., the quality score was at or above the average) across all bases.
STARsolo (v2.7.10a) [17] was used to align raw reads to the human genome build hg38. Comprehensive annotation from Gencode release 39 (https://www.gencodegenes.org/human/release_39.html) was used to build STAR index files with the command STAR --runMode genomeGenerate and default parameters. As a result, count matrices quantifying various features such as gene counts were obtained. For downstream analyses, count matrices GeneFull_ExonOverIntron were used. Additionally, MultiQC was used to compile log files output by STARsolo, and the report was used to examine various relevant metrics such as percent of uniquely mapped reads and mapping rates.
High quality reads were obtained using SAMtools (v1.14) [18] and the following command: samtools view -h -b -q 255 -F 4 -F 256 -F 2048 $bam.
For each sample, DoubletFinder (v2.0.3) [27] was used twice on the RNA ambient-corrected gene count matrix obtained using modified runs of CellBender, and the second round of DoubletFinder was run without doublets detected in the first round. Next, we jointly clustered libraries across each tissue source (HPAP, IIDP and Prodo) using Seurat with doublets included, using Harmony (v1.2.0) [28] to correct for the following covariates: sex, BMI, age, studies, treatments and chemistry, and resolution of 1.8. We removed doublets and doublet-enriched clusters which are defined as ones with doublet rates > 65%.
All data integration was performed using Seurat. In particular, due to the large number of samples, we first combined all samples per source (data generated using HPAP, IIDP, and Prodo samples) to create three Seurat objects, retaining cells that expressed at least one protein-coding gene, and protein-coding genes that are expressed in at least one cell. Subsequently, we merged all objects and carried out integration using Harmony, correcting for the following covariates: sex, BMI, age, studies, treatments, chemistry and tissue sources. As a result, we obtained a set of 481,154 cells expressing 19,546 protein-coding genes.
Next, we adopted an iterative approach to further clean up the data.
This work used data acquired from the database (https://hpap.pmacs.upenn.edu/) of the Human Pancreas Analysis Program (HPAP; RRID:SCR_016202; PMID: 31127054; PMID: 36206763; PMID: 35228745 and PMID: 37188822). HPAP is part of a Human Islet Research Network (RRID:SCR_014393) consortium (UC4-DK112217, U01-DK123594, UC4-DK112232, and U01-DK123716).
Additional islet samples and tissues were provided by the Integrated Islet Distribution Program (IIDP) (RRID:SCR_014387), funded by NIH Grant #2UC4DK098085.
[1] S. N. Shapira, A. Naji, M. A. Atkinson, A. C. Powers, and K. H. Kaestner, "Understanding islet dysfunction in type 2 diabetes through multidimensional pancreatic phenotyping: The Human Pancreas Analysis Program," Cell Metab., vol. 34, no. 12, pp. 1906–1913, Dec. 2022, doi: 10.1016/j.cmet.2022.09.013.
[2] K. H. Kaestner, A. C. Powers, A. Naji, HPAP Consortium, and M. A. Atkinson, "NIH Initiative to Improve Understanding of the Pancreas, Islet, and Autoimmunity in Type 1 Diabetes: The Human Pancreas Analysis Program (HPAP)," Diabetes, vol. 68, no. 7, pp. 1394–1402, Jul. 2019, doi: 10.2337/db19-0058.
[3] A. R. Patil, J. Schug, A. Naji, K. H. Kaestner, R. B. Faryabi, and G. Vahedi, "Single-cell expression profiling of islets generated by the Human Pancreas Analysis Program," Nat. Metab., vol. 5, no. 5, pp. 713–715, May 2023, doi: 10.1038/s42255-023-00806-x.
[4] M. Fasolino et al., "Single-cell multi-omics analysis of human pancreatic islets reveals novel cellular states in type 1 diabetes," Nat. Metab., vol. 4, no. 2, pp. 284–299, Feb. 2022, doi: 10.1038/s42255-022-00531-x.
[5] SRA Toolkit Development Team, SRA Toolkit. [Online]. Available: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
[6] B. Marquina-Sanchez et al., "Single-cell RNA-seq with spike-in cells enables accurate quantification of cell-specific drug effects in pancreatic islets," Genome Biol., vol. 21, no. 1, p. 106, Dec. 2020, doi: 10.1186/s13059-020-02006-2.
[7] X. Tang et al., "SARS-CoV-2 infection induces beta cell transdifferentiation," Cell Metab., vol. 33, no. 8, pp. 1577-1591.e7, Aug. 2021, doi: 10.1016/j.cmet.2021.05.015.
[8] S. Shrestha et al., "Combinatorial transcription factor profiles predict mature and functional human islet α and β cells," JCI Insight, vol. 6, no. 18, p. e151621, Sep. 2021, doi: 10.1172/jci.insight.151621.
[9] G. Basile et al., "Excess pancreatic elastase alters acinar-β cell communication by impairing the mechano-signaling and the PAR2 pathways," Cell Metab., vol. 35, no. 7, pp. 1242-1260.e9, Jul. 2023, doi: 10.1016/j.cmet.2023.05.007.
[10] W. Xu et al., "Architecture of androgen receptor pathways amplifying glucagon-like peptide-1 insulinotropic action in male pancreatic β cells," Cell Rep., vol. 42, no. 5, p. 112529, May 2023, doi: 10.1016/j.celrep.2023.112529.
[11] R. B. Kang et al., "Single-nucleus RNA sequencing of human pancreatic islets identifies novel gene sets and distinguishes β-cell subpopulations with dynamic transcriptome profiles," Genome Med., vol. 15, no. 1, p. 30, May 2023, doi: 10.1186/s13073-023-01179-2.
[12] K. Bandesh et al., "Single-cell decoding of human islet cell type-specific alterations in type 2 diabetes reveals converging genetic- and state-driven β-cell gene expression defects," Jan. 22, 2025, Genomics. doi: 10.1101/2025.01.17.633590.
[13] J. S. Stancill, M. Y. Kasmani, W. Cui, and J. A. Corbett, "Single Cell RNAseq Analysis of Cytokine-Treated Human Islets: Association of Cellular Stress with Impaired Cytokine Responsiveness," Function, vol. 5, no. 4, p. zqae015, Jul. 2024, doi: 10.1093/function/zqae015.
[14] E. K. Sokolowski et al., "Multi-omic human pancreatic islet endoplasmic reticulum and cytokine stress response mapping provides type 2 diabetes genetic insights," Cell Metab., vol. 36, no. 11, pp. 2468-2488.e7, Nov. 2024, doi: 10.1016/j.cmet.2024.09.006.
[15] S. Andrews, FastQC. [Online]. Available: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[16] P. Ewels, M. Magnusson, S. Lundin, and M. Käller, "MultiQC: summarize analysis results for multiple tools and samples in a single report," Bioinformatics, vol. 32, no. 19, pp. 3047–3048, Oct. 2016, doi: 10.1093/bioinformatics/btw354.
[17] B. Kaminow, D. Yunusov, and A. Dobin, "STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data," May 05, 2021, Bioinformatics. doi: 10.1101/2021.05.05.442755.
[18] H. Li et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, Aug. 2009, doi: 10.1093/bioinformatics/btp352.
[19] A. Fort et al., "MBV : a method to solve sample mislabeling and detect technical bias in large combined genotype and sequencing assay datasets," Bioinformatics, vol. 33, no. 12, pp. 1895–1897, Jun. 2017, doi: 10.1093/bioinformatics/btx074.
[20] O. Delaneau, H. Ongen, A. A. Brown, A. Fort, N. I. Panousis, and E. T. Dermitzakis, "A complete tool set for molecular QTL discovery and analysis," Nat. Commun., vol. 8, no. 1, p. 15452, May 2017, doi: 10.1038/ncomms15452.
[21] 10x Genomics, "Quality Assessment Using the Cell Ranger Web Summary." [Online]. Available: https://www.10xgenomics.com/analysis-guides/quality-assessment-using-the-cell-ranger-web-summary
[22] A. T. L. Lun et al., "EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data," Genome Biol., vol. 20, no. 1, p. 63, Dec. 2019, doi: 10.1186/s13059-019-1662-y.
[23] Abraham. Savitzky and M. J. E. Golay, "Smoothing and Differentiation of Data by Simplified Least Squares Procedures.," Anal. Chem., vol. 36, no. 8, pp. 1627–1639, Jul. 1964, doi: 10.1021/ac60214a047.
[24] J. G. Aaron Lun, DropletUtils. (2018). Bioconductor. doi: 10.18129/B9.BIOC.DROPLETUTILS.
[25] S. J. Fleming et al., "Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender," Nat. Methods, vol. 20, no. 9, pp. 1323–1335, Sep. 2023, doi: 10.1038/s41592-023-01943-7.
[26] Y. Hao et al., "Integrated analysis of multimodal single-cell data," Cell, vol. 184, no. 13, pp. 3573-3587.e29, Jun. 2021, doi: 10.1016/j.cell.2021.04.048.
[27] C. S. McGinnis, L. M. Murrow, and Z. J. Gartner, "DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors," Cell Syst., vol. 8, no. 4, pp. 329-337.e4, Apr. 2019, doi: 10.1016/j.cels.2019.03.003.
[28] I. Korsunsky et al., "Fast, sensitive and accurate integration of single-cell data with Harmony," Nat. Methods, vol. 16, no. 12, pp. 1289–1296, Dec. 2019, doi: 10.1038/s41592-019-0619-0.
Visit our Citation Policies for details on how to cite this dataset. Please cite all underlying resources whose data were used to generate the composite dataset.