Single cell expression map of pancreatic islets using data from HPAP, IIDP and Prodo (V1 Freeze)

Ha TH Vu, Han Sun, Seth Sharp, Marcela Brissova, Kyle Gaulton, Anna L Gloyn, Stephen CJ Parker

DOI: 10.5281/zenodo.15588240

Result summary

1. Cell annotations:

Cell typeNumber of cells
Alpha225637
Beta160063
Acinar54489
Ductal47221
Active Stellate28747
Delta17292
Gamma + Epsilon13163
Endothelial4031
MUC5B+ Ductal3828
Quiescent Stellate3663
Cycling Alpha3646
Immune (Macrophage)3608

Methods

1. Data collection:

1.1. Single-cell (sc) RNA-seq from the Human Pancreas Analysis Program (HPAP):

scRNA-seq data and metadata generated from islet tissues using the 10X Chromium platform were downloaded from https://hpap.pmacs.upenn.edu/ (accessed March 2024) [1], [2], [3], [4].

1.2. Single-cell (sc) RNA-seq generated using islets distributed by Integrated Islet Distribution Program (IIDP), and Prodo Labs (Prodo):

Literature searches were conducted in order to identify original studies involving scRNA-seq generated using islets distributed by IIDP and Prodo. Literature searches were conducted in order to identify original studies involving scRNA-seq generated using islets distributed by IIDP and Prodo. Briefly, pancreatic islet datasets of expression profiling by high throughput sequencing were identified via GEO search for the date range before July 2024, then studies with scRNA-seq data were determined manually. Datasets were then downloaded from the studies' corresponding GEO page using the fastq-dump command from the SRA toolkit (v3.1.1) [5]. Data from the following GEO accession IDs were used: GSE142465 [6], GSE159556 [7], GSE183568 [8], GSE190447 [9], GSE201256 [10], GSE217837 [11], GSE221156 [12], GSE251730 [13], GSE251912 [14].

2. Data preprocessing:

  • The quality of raw reads was assessed using FastQC (v0.11.9) [15], then individual samples' reports were compiled together using MultiQC (v1.11) [16]. These reports were inspected in order to detect technical issues such as flow cell problems, base calling errors, and the presence of unexpected technical sequences. Based on the Per Tile Quality and Quality Score plots, we subsequently included sequencing runs with more than 50% of tiles indicated as high quality (i.e., the quality score was at or above the average) across all bases.
  • STARsolo (v2.7.10a) [17] was used to align raw reads to the human genome build hg38. Comprehensive annotation from Gencode release 39 (https://www.gencodegenes.org/human/release_39.html) was used to build STAR index files with the command STAR --runMode genomeGenerate and default parameters. As a result, count matrices quantifying various features such as gene counts were obtained. For downstream analyses, count matrices GeneFull_ExonOverIntron were used. Additionally, MultiQC was used to compile log files output by STARsolo, and the report was used to examine various relevant metrics such as percent of uniquely mapped reads and mapping rates.
  • High quality reads were obtained using SAMtools (v1.14) [18] and the following command: samtools view -h -b -q 255 -F 4 -F 256 -F 2048 $bam.

3. Quality control:

  • In order to confirm donor metadata and check for sample swaps, we compared the read profiles with genotyping data using the mbv tool [19] included in QTLtools (v1.3.1) [20]. When donor unique identifiers were not supplied in the original studies, we assigned donor identities to samples when perc_het_consistent and perc_hom_consistent from mbv are both larger than 0.8. When donors did not have genotype data, we manually searched for individuals with matching age, sex, BMI and allocation centers to assign donor identifiers to samples. Subsequently, we retained only samples with a determined identifier.
  • To systematically inspect quality metrics underlying all libraries, we utilized the UMI numbers and their ranks using the following algorithm. We ranked every barcode b based on their UMI numbers (nb) in a decreasing order, then considered log(nb) as a function of the rank's logarithm. In typical single-cell libraries, we expect this function to have two "knee" points corresponding to the transition between three subsets of barcodes: those with large, intermediate, and small nb [21]. These barcode subsets generally represent cell-containing droplets, a mix of empty and non-empty droplets, and the barcodes associated with empty Gel Beads-in-emulsion, respectively [21], [22]. Next, we calculated the first-order discrete difference between two consecutive barcodes to obtain the rate of changes in nb along the rank axis, then smoothed out this difference signal profile using the Savitzky–Golay filter [23]. As a result, we identified points along the profile with the largest change rates, which aligned with the "knees", and quantified the "knee" point numbers. We retained all libraries with at least two "knee" points. For the libraries with one knee point, we manually inspected not only the barcode rank behaviors but also other quality control metrics (listed below) to decide if they would be retained.
  • Ambient RNA was corrected using CellBender (v0.3.0) [24] with FPR of 0.05 and other parameters set to default settings.
  • To distinguish potential cells from empty droplets, EmptyDrops [22] from the DropletUtils (v1.26.0) [25] was used. Droplets with significant deviations from the ambient profile at FDR < 5% were regarded as non-empty.
  • In order to identify high-quality cells, we defined the following criteria:
    • Probability that a barcode corresponds to a true cell (obtained from CellBender) ≥ 0.99.
    • Cells were detected as non-empty using EmptyDrops.
    • Cells with fractions of ambient reads less than a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm on non-empty cells detected using EmptyDrops.
    • Cells with percent of chrMT reads less than a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm. From the barcode rank plots of each sample, we adapted EmptyDrops' approaches to detect the end of the cliff point. This cliff end is defined as the point with log-rank between inflection points (detected using EmptyDrops) and the plateau point (detected using CellBender) and minimizes the signed curvatures. Next, we used Multi-Otsu to find thresholds on the combination of two sets: non-empty cells, and cells whose chrMT read rates < 0.3 and rank higher than that of cliff end points.
  • We retained samples with a number of qualified cells > 500 for downstream analyses.

4. Doublet detection:

For each sample, DoubletFinder (v2.0.3) [26] was used on the RNA ambient-corrected gene count matrix. Next, we jointly clustered libraries using Seurat (v4.4.0) [27] with doublets included, and then filtered out any cells that were marked as doublets or were in clusters with doublet rates > 37.5%.

5. Data integration:

All data integration was performed using Seurat. In particular, due to the large number of samples, we first combined all samples per source (data generated using HPAP, IIDP, and Prodo samples) to create three Seurat objects. For each object, we then retained a) protein-coding genes that were in ≥ 0.01% of the cells, and b) cells that expressed ≥ 5% of the protein-coding genes. Subsequently, we merged all objects and carried out integration using Harmony (v1.2.0) [28], correcting for the following covariates: samples, studies, donor identifier, tissue sources, and chemistry. As a result, we obtained a set of 665,256 cells expressing 18,704 protein-coding genes.

Next, we adopted an iterative approach to further clean up the data:

  • Step 1: We clustered the cells (resolution = 0.6), inspected their number of genes, number of UMIs, and chrMT read rates, and removed clusters with different metric profiles than the rest. We repeated this step one additional time.
  • Step 2: Using the remaining cells, we detected doublets on a per-sample level using DoubletFinder and filtered out any cells that were marked as doublets or were in clusters with doublet rates > 30%.
  • Step 3: We repeated step 1 one additional time. As a result, we retained 565,388 high-quality cells.
  • Step 4: We re-clustered and harmonized the data (resolution = 0.7), resulting in 19 clusters corresponding to 13 major cell types.

Acknowledgment

This work used data acquired from the database (https://hpap.pmacs.upenn.edu/) of the Human Pancreas Analysis Program (HPAP; RRID:SCR_016202; PMID: 31127054; PMID: 36206763; PMID: 35228745 and PMID: 37188822). HPAP is part of a Human Islet Research Network (RRID:SCR_014393) consortium (UC4-DK112217, U01-DK123594, UC4-DK112232, and U01-DK123716).

Additional islet samples and tissues were provided by the Integrated Islet Distribution Program (IIDP) (RRID:SCR_014387), funded by NIH Grant #2UC4DK098085.

Citation

[1] S. N. Shapira, A. Naji, M. A. Atkinson, A. C. Powers, and K. H. Kaestner, "Understanding islet dysfunction in type 2 diabetes through multidimensional pancreatic phenotyping: The Human Pancreas Analysis Program," Cell Metab., vol. 34, no. 12, pp. 1906–1913, Dec. 2022, doi: 10.1016/j.cmet.2022.09.013.

[2] K. H. Kaestner, A. C. Powers, A. Naji, HPAP Consortium, and M. A. Atkinson, "NIH Initiative to Improve Understanding of the Pancreas, Islet, and Autoimmunity in Type 1 Diabetes: The Human Pancreas Analysis Program (HPAP)," Diabetes, vol. 68, no. 7, pp. 1394–1402, Jul. 2019, doi: 10.2337/db19-0058.

[3] A. R. Patil, J. Schug, A. Naji, K. H. Kaestner, R. B. Faryabi, and G. Vahedi, "Single-cell expression profiling of islets generated by the Human Pancreas Analysis Program," Nat. Metab., vol. 5, no. 5, pp. 713–715, May 2023, doi: 10.1038/s42255-023-00806-x.

[4] M. Fasolino et al., "Single-cell multi-omics analysis of human pancreatic islets reveals novel cellular states in type 1 diabetes," Nat. Metab., vol. 4, no. 2, pp. 284–299, Feb. 2022, doi: 10.1038/s42255-022-00531-x.

[5] SRA Toolkit Development Team, SRA Toolkit. [Online]. Available: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

[6] B. Marquina-Sanchez et al., "Single-cell RNA-seq with spike-in cells enables accurate quantification of cell-specific drug effects in pancreatic islets," Genome Biol., vol. 21, no. 1, p. 106, Dec. 2020, doi: 10.1186/s13059-020-02006-2.

[7] X. Tang et al., "SARS-CoV-2 infection induces beta cell transdifferentiation," Cell Metab., vol. 33, no. 8, pp. 1577-1591.e7, Aug. 2021, doi: 10.1016/j.cmet.2021.05.015.

[8] S. Shrestha et al., "Combinatorial transcription factor profiles predict mature and functional human islet α and β cells," JCI Insight, vol. 6, no. 18, p. e151621, Sep. 2021, doi: 10.1172/jci.insight.151621.

[9] G. Basile et al., "Excess pancreatic elastase alters acinar-β cell communication by impairing the mechano-signaling and the PAR2 pathways," Cell Metab., vol. 35, no. 7, pp. 1242-1260.e9, Jul. 2023, doi: 10.1016/j.cmet.2023.05.007.

[10] W. Xu et al., "Architecture of androgen receptor pathways amplifying glucagon-like peptide-1 insulinotropic action in male pancreatic β cells," Cell Rep., vol. 42, no. 5, p. 112529, May 2023, doi: 10.1016/j.celrep.2023.112529.

[11] R. B. Kang et al., "Single-nucleus RNA sequencing of human pancreatic islets identifies novel gene sets and distinguishes β-cell subpopulations with dynamic transcriptome profiles," Genome Med., vol. 15, no. 1, p. 30, May 2023, doi: 10.1186/s13073-023-01179-2.

[12] K. Bandesh et al., "Single-cell decoding of human islet cell type-specific alterations in type 2 diabetes reveals converging genetic- and state-driven β-cell gene expression defects," Jan. 22, 2025, Genomics. doi: 10.1101/2025.01.17.633590.

[13] J. S. Stancill, M. Y. Kasmani, W. Cui, and J. A. Corbett, "Single Cell RNAseq Analysis of Cytokine-Treated Human Islets: Association of Cellular Stress with Impaired Cytokine Responsiveness," Function, vol. 5, no. 4, p. zqae015, Jul. 2024, doi: 10.1093/function/zqae015.

[14] E. K. Sokolowski et al., "Multi-omic human pancreatic islet endoplasmic reticulum and cytokine stress response mapping provides type 2 diabetes genetic insights," Cell Metab., vol. 36, no. 11, pp. 2468-2488.e7, Nov. 2024, doi: 10.1016/j.cmet.2024.09.006.

[15] S. Andrews, FastQC. [Online]. Available: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

[16] P. Ewels, M. Magnusson, S. Lundin, and M. Käller, "MultiQC: summarize analysis results for multiple tools and samples in a single report," Bioinformatics, vol. 32, no. 19, pp. 3047–3048, Oct. 2016, doi: 10.1093/bioinformatics/btw354.

[17] B. Kaminow, D. Yunusov, and A. Dobin, "STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data," May 05, 2021, Bioinformatics. doi: 10.1101/2021.05.05.442755.

[18] H. Li et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, Aug. 2009, doi: 10.1093/bioinformatics/btp352.

[19] A. Fort et al., "MBV : a method to solve sample mislabeling and detect technical bias in large combined genotype and sequencing assay datasets," Bioinformatics, vol. 33, no. 12, pp. 1895–1897, Jun. 2017, doi: 10.1093/bioinformatics/btx074.

[20] O. Delaneau, H. Ongen, A. A. Brown, A. Fort, N. I. Panousis, and E. T. Dermitzakis, "A complete tool set for molecular QTL discovery and analysis," Nat. Commun., vol. 8, no. 1, p. 15452, May 2017, doi: 10.1038/ncomms15452.

[21] 10x Genomics, "Quality Assessment Using the Cell Ranger Web Summary." [Online]. Available: https://www.10xgenomics.com/analysis-guides/quality-assessment-using-the-cell-ranger-web-summary

[22] A. T. L. Lun et al., "EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data," Genome Biol., vol. 20, no. 1, p. 63, Dec. 2019, doi: 10.1186/s13059-019-1662-y.

[23] Abraham. Savitzky and M. J. E. Golay, "Smoothing and Differentiation of Data by Simplified Least Squares Procedures.," Anal. Chem., vol. 36, no. 8, pp. 1627–1639, Jul. 1964, doi: 10.1021/ac60214a047.

[24] S. J. Fleming et al., "Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender," Nat. Methods, vol. 20, no. 9, pp. 1323–1335, Sep. 2023, doi: 10.1038/s41592-023-01943-7.

[25] J. G. Aaron Lun, DropletUtils. (2018). Bioconductor. doi: 10.18129/B9.BIOC.DROPLETUTILS.

[26] C. S. McGinnis, L. M. Murrow, and Z. J. Gartner, "DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors," Cell Syst., vol. 8, no. 4, pp. 329-337.e4, Apr. 2019, doi: 10.1016/j.cels.2019.03.003.

[27] Y. Hao et al., "Integrated analysis of multimodal single-cell data," Cell, vol. 184, no. 13, pp. 3573-3587.e29, Jun. 2021, doi: 10.1016/j.cell.2021.04.048.

[28] I. Korsunsky et al., "Fast, sensitive and accurate integration of single-cell data with Harmony," Nat. Methods, vol. 16, no. 12, pp. 1289–1296, Dec. 2019, doi: 10.1038/s41592-019-0619-0.


Visit our Citation Policies for details on how to cite this dataset. Please cite all underlying resources whose data were used to generate the composite dataset.