Single cell expression map of pancreatic islets using data from HPAP, IIDP and Prodo (V3 Freeze)

Ha T.H. Vu, Han Sun, Seth Sharp, Parul Kudtarkar, Liza Brusman, Julie Jurgens, The PanKbase Consortium, Jason Flannick, Noel Burtt, Shuibing Chen, Jie Liu, Jean-Philippe Cartailler, Benjamin F. Voight, Michael Lee Stitzel, Marcela Brissova, Anna L. Gloyn, Kyle Gaulton, Stephen C.J. Parker

DOI: 10.5281/zenodo.15596314

Result summary

1. Release notes (major updates since V1 freeze)

a. Reprocessed raw data so that runs of the same library were merged prior to alignment, enabling correct handling of PCR duplication.

b. Added hashtagged-oligo samples.

c. Relaxed thresholds for number of cells per sample, enabling more samples retained.

d. Added an additional round of ambient RNA correction where model parameters were optimized for every sample.

e. Multiple doublet detection methods for better separation of cell populations

f. Code availability: All code for the analyses are provided at https://github.com/PanKbase/snRNAseq-NextFlow and https://github.com/PanKbase/PanKbase-scRNA-seq.

2. Cell annotations:

Cell typeNumber of cells
Alpha194443
Beta120502
Acinar50105
Ductal33686
Active Stellate16907
Delta11680
Gamma + Epsilon7087
MUC5B+ Ductal3515
Endothelial3385
Immune2755
Cycling Alpha2738
Quiescent Stellate2132

Methods

1. Data collection:

1.1. Single-cell (sc) RNA-seq from the Human Pancreas Analysis Program (HPAP):

scRNA-seq data and metadata generated from islet tissues using the 10X Chromium platform were downloaded from https://hpap.pmacs.upenn.edu/ (accessed March 2024) [1], [2], [3], [4].

1.2. Single-cell (sc) RNA-seq generated using islets distributed by Integrated Islet Distribution Program (IIDP), and Prodo Labs (Prodo):

Literature searches were conducted in order to identify original studies involving scRNA-seq generated using islets distributed by IIDP and Prodo. Briefly, pancreatic islet datasets of expression profiling by high throughput sequencing were identified via GEO search for the date range before July 2024, then studies with scRNA-seq data were determined manually. Datasets were then downloaded from the studies' corresponding GEO page using the fastq-dump command from the SRA toolkit (v3.1.1) [5]. Data from the following GEO accession IDs were used: GSE142465 [6], GSE159556 [7], GSE183568 [8], GSE190447 [9], GSE201256 [10], GSE217837 [11], GSE221156 [12], GSE251730 [13], GSE251912 [14].

2. Data preprocessing:

The quality of raw reads was assessed using FastQC (v0.11.9) [15], then individual samples' reports were compiled together using MultiQC (v1.11) [16]. These reports were inspected in order to detect technical issues such as flow cell problems, base calling errors, and the presence of unexpected technical sequences. Based on the Per Tile Quality and Quality Score plots, we subsequently included sequencing runs with more than 50% of tiles indicated as high quality (i.e., the quality score was at or above the average) across all bases.

STARsolo (v2.7.10a) [17] was used to align raw reads to the human genome build hg38. Comprehensive annotation from Gencode release 39 (https://www.gencodegenes.org/human/release_39.html) was used to build STAR index files with the command STAR --runMode genomeGenerate and default parameters. As a result, count matrices quantifying various features such as gene counts were obtained. For downstream analyses, count matrices GeneFull_ExonOverIntron were used. Additionally, MultiQC was used to compile log files output by STARsolo, and the report was used to examine various relevant metrics such as percent of uniquely mapped reads and mapping rates.

High quality reads were obtained using SAMtools (v1.14) [18] and the following command: samtools view -h -b -q 255 -F 4 -F 256 -F 2048 $bam.

3. Quality control:

  • In order to confirm donor metadata and check for sample swaps, we compared the read profiles with genotyping data using the mbv tool [19] included in QTLtools (v1.3.1) [20]. When donor unique identifiers were not supplied in the original studies, we assigned donor identities to samples when "perc_het_consistent" and "perc_hom_consistent" from mbv are both larger than 0.8. When donors did not have genotype data, we manually searched for individuals with matching age, sex, BMI and allocation centers to assign donor identifiers to samples. Subsequently, we retained only samples with a determined identifier.
  • To systematically inspect quality metrics underlying all libraries, we utilized the UMI numbers and their ranks using the following algorithm. We ranked every barcode b based on their UMI numbers (nb) in a decreasing order, then considered log(nb) as a function of the rank's logarithm. In typical single-cell libraries, we expect this function to have two "knee" points corresponding to the transition between three subsets of barcodes: those with large, intermediate, and small nb[21]. These barcode subsets generally represent cell-containing droplets, a mix of empty and non-empty droplets, and the barcodes associated with empty Gel Beads-in-emulsion, respectively [21], [22]. Next, we calculated the first-order discrete difference between two consecutive barcodes to obtain the rate of changes in nb along the rank axis, then smoothed out this difference signal profile using the Savitzky–Golay filter [23]. As a result, we identified points along the profile with the largest change rates, which aligned with the "knees", and quantified the "knee" point numbers. We retained all libraries with at least two "knee" points. For the libraries with one knee point, we manually inspected not only the barcode rank behaviors but also other quality control metrics (listed below) to decide if they would be retained.
  • To distinguish potential cells from empty droplets, EmptyDrops [22] from the DropletUtils (v1.26.0) [24] was used. Droplets with significant deviations from the ambient profile at FDR < 5% were regarded as non-empty.
  • Adopting the mathematical models underlying EmptyDrops, we detected the end of the cliff point from the barcode rank plots of each sample. The cliff point is defined as the point with log-rank higher than that of inflection points (detected using EmptyDrops) and minimizes the signed curvatures.
  • Ambient RNA was corrected using CellBender (v0.3.0) [25] in two rounds.
    • First round (default run): Used a false positive rate (FPR) of 0.05 with all other parameters set to default.
    • Second round (modified run): Set parameters as follows: --expected_cells: number of barcodes identified as corresponding to cells and --total_droplets_included: average of UMI numbers between inflection and end of cliff point.
  • In order to identify barcodes that correspond to cells, we defined the following criteria:
    • Probability that a barcode corresponds to a true cell (obtained from CellBender using default settings) ≥ 0.99.
    • Cells were detected as non-empty using EmptyDrops.
    • Cells with fractions of ambient reads < a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm on non-empty cells detected using EmptyDrops.
    • Cells with percent of chrMT reads < a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm on the combination of two sets: non-empty cells, and cells whose chrMT read fraction < 0.3 and rank higher than that of cliff end points.
  • We retained samples with a number of qualified cells > 200 for downstream analyses.

a. Processing of HTO samples:

2. Doublet detection:

For each sample, DoubletFinder (v2.0.3) [27] was used twice on the RNA ambient-corrected gene count matrix obtained using modified runs of CellBender, and the second round of DoubletFinder was run without doublets detected in the first round. Next, we jointly clustered libraries across each tissue source (HPAP, IIDP and Prodo) using Seurat with doublets included, using Harmony (v1.2.0) [28] to correct for the following covariates: sex, BMI, age, studies, treatments and chemistry, and resolution of 1.8. We removed doublets and doublet-enriched clusters which are defined as ones with doublet rates > 65%.

3. Data integration:

All data integration was performed using Seurat. In particular, due to the large number of samples, we first combined all samples per source (data generated using HPAP, IIDP, and Prodo samples) to create three Seurat objects, retaining cells that expressed at least one protein-coding gene, and protein-coding genes that are expressed in at least one cell. Subsequently, we merged all objects and carried out integration using Harmony, correcting for the following covariates: sex, BMI, age, studies, treatments, chemistry and tissue sources. As a result, we obtained a set of 481,154 cells expressing 19,546 protein-coding genes.

Next, we adopted an iterative approach to further clean up the data.

  • Step 1: We clustered the cells (resolution = 1.8), and tested for clusters with significantly different profiles of gene numbers and UMI numbers using Wilcoxon rank sum test. Here, we define a fold change as median(tested cluster's metrics) / median(all other clusters' metrics). A cluster is determined to be significantly different if their adjusted p-value < 0.05 and fold change > 2, and they are removed from the subsequent analysis. As a result, we obtained a set of 476,429 cells, expressing 19,546 protein-coding genes.
  • Step 2: We focused on identifying doublet-like cells using the remaining cells. For each sample, we began by clustering cells at a resolution of 4. We pinpointed clusters that were enriched with cells exhibiting high UMI counts and expressing markers from at least two cell populations in 50% or more of the cells. To achieve this, we performed Binomial tests to assess whether a cluster had more cells with UMI counts exceeding the 90th percentile of the samples than the expected rate of 0.1, using a Benjamini-Hochberg adjusted P-value threshold of < 0.05. Cells within these enriched clusters were labeled as "doublet-like cells." Next, we employed Fisher's exact test to identify clusters in the integrated map that were enriched with doublet-like cells (Benjamini-Hochberg adjusted P-value threshold of < 0.05). Clusters in the integrated map that both were enriched with doublet-like cells and expressed markers from at least two cell populations in 50% or more of the cells were subsequently removed. Additionally, all doublet-like cells were removed.
  • Step 3: We re-clustered and harmonized the data (resolution = 1.8), resulting in 48 clusters. Using the following list of markers, we annotated the cell clusters to 13 major cell types:
    • Beta: 'INS', 'IAPP'
    • Alpha: 'GCG'
    • Delta: 'SST'
    • Gamma: 'PPY'
    • Epsilon: 'GHRL'
    • Ductal: 'CFTR', KRT19
    • Acinar: 'REG1A', 'CTRB2', 'PRSS1', 'PRSS2', 'CPA1'
    • Active Stellate: 'PDGFRB', 'COL6A1'
    • Quiescent Stellate: 'PDGFRB', 'COL6A1', 'RGS5'
    • Endothelial: 'PECAM1', 'PLVAP', 'ESAM', 'VWF'
    • Immune: 'PTPRC'
    • Cycling Alpha: 'GCG', 'MKI67', 'CDK1'

Acknowledgment

This work used data acquired from the database (https://hpap.pmacs.upenn.edu/) of the Human Pancreas Analysis Program (HPAP; RRID:SCR_016202; PMID: 31127054; PMID: 36206763; PMID: 35228745 and PMID: 37188822). HPAP is part of a Human Islet Research Network (RRID:SCR_014393) consortium (UC4-DK112217, U01-DK123594, UC4-DK112232, and U01-DK123716).

Additional islet samples and tissues were provided by the Integrated Islet Distribution Program (IIDP) (RRID:SCR_014387), funded by NIH Grant #2UC4DK098085.

Citation

[1] S. N. Shapira, A. Naji, M. A. Atkinson, A. C. Powers, and K. H. Kaestner, "Understanding islet dysfunction in type 2 diabetes through multidimensional pancreatic phenotyping: The Human Pancreas Analysis Program," Cell Metab., vol. 34, no. 12, pp. 1906–1913, Dec. 2022, doi: 10.1016/j.cmet.2022.09.013.

[2] K. H. Kaestner, A. C. Powers, A. Naji, HPAP Consortium, and M. A. Atkinson, "NIH Initiative to Improve Understanding of the Pancreas, Islet, and Autoimmunity in Type 1 Diabetes: The Human Pancreas Analysis Program (HPAP)," Diabetes, vol. 68, no. 7, pp. 1394–1402, Jul. 2019, doi: 10.2337/db19-0058.

[3] A. R. Patil, J. Schug, A. Naji, K. H. Kaestner, R. B. Faryabi, and G. Vahedi, "Single-cell expression profiling of islets generated by the Human Pancreas Analysis Program," Nat. Metab., vol. 5, no. 5, pp. 713–715, May 2023, doi: 10.1038/s42255-023-00806-x.

[4] M. Fasolino et al., "Single-cell multi-omics analysis of human pancreatic islets reveals novel cellular states in type 1 diabetes," Nat. Metab., vol. 4, no. 2, pp. 284–299, Feb. 2022, doi: 10.1038/s42255-022-00531-x.

[5] SRA Toolkit Development Team, SRA Toolkit. [Online]. Available: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

[6] B. Marquina-Sanchez et al., "Single-cell RNA-seq with spike-in cells enables accurate quantification of cell-specific drug effects in pancreatic islets," Genome Biol., vol. 21, no. 1, p. 106, Dec. 2020, doi: 10.1186/s13059-020-02006-2.

[7] X. Tang et al., "SARS-CoV-2 infection induces beta cell transdifferentiation," Cell Metab., vol. 33, no. 8, pp. 1577-1591.e7, Aug. 2021, doi: 10.1016/j.cmet.2021.05.015.

[8] S. Shrestha et al., "Combinatorial transcription factor profiles predict mature and functional human islet α and β cells," JCI Insight, vol. 6, no. 18, p. e151621, Sep. 2021, doi: 10.1172/jci.insight.151621.

[9] G. Basile et al., "Excess pancreatic elastase alters acinar-β cell communication by impairing the mechano-signaling and the PAR2 pathways," Cell Metab., vol. 35, no. 7, pp. 1242-1260.e9, Jul. 2023, doi: 10.1016/j.cmet.2023.05.007.

[10] W. Xu et al., "Architecture of androgen receptor pathways amplifying glucagon-like peptide-1 insulinotropic action in male pancreatic β cells," Cell Rep., vol. 42, no. 5, p. 112529, May 2023, doi: 10.1016/j.celrep.2023.112529.

[11] R. B. Kang et al., "Single-nucleus RNA sequencing of human pancreatic islets identifies novel gene sets and distinguishes β-cell subpopulations with dynamic transcriptome profiles," Genome Med., vol. 15, no. 1, p. 30, May 2023, doi: 10.1186/s13073-023-01179-2.

[12] K. Bandesh et al., "Single-cell decoding of human islet cell type-specific alterations in type 2 diabetes reveals converging genetic- and state-driven β-cell gene expression defects," Jan. 22, 2025, Genomics. doi: 10.1101/2025.01.17.633590.

[13] J. S. Stancill, M. Y. Kasmani, W. Cui, and J. A. Corbett, "Single Cell RNAseq Analysis of Cytokine-Treated Human Islets: Association of Cellular Stress with Impaired Cytokine Responsiveness," Function, vol. 5, no. 4, p. zqae015, Jul. 2024, doi: 10.1093/function/zqae015.

[14] E. K. Sokolowski et al., "Multi-omic human pancreatic islet endoplasmic reticulum and cytokine stress response mapping provides type 2 diabetes genetic insights," Cell Metab., vol. 36, no. 11, pp. 2468-2488.e7, Nov. 2024, doi: 10.1016/j.cmet.2024.09.006.

[15] S. Andrews, FastQC. [Online]. Available: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

[16] P. Ewels, M. Magnusson, S. Lundin, and M. Käller, "MultiQC: summarize analysis results for multiple tools and samples in a single report," Bioinformatics, vol. 32, no. 19, pp. 3047–3048, Oct. 2016, doi: 10.1093/bioinformatics/btw354.

[17] B. Kaminow, D. Yunusov, and A. Dobin, "STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data," May 05, 2021, Bioinformatics. doi: 10.1101/2021.05.05.442755.

[18] H. Li et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, Aug. 2009, doi: 10.1093/bioinformatics/btp352.

[19] A. Fort et al., "MBV : a method to solve sample mislabeling and detect technical bias in large combined genotype and sequencing assay datasets," Bioinformatics, vol. 33, no. 12, pp. 1895–1897, Jun. 2017, doi: 10.1093/bioinformatics/btx074.

[20] O. Delaneau, H. Ongen, A. A. Brown, A. Fort, N. I. Panousis, and E. T. Dermitzakis, "A complete tool set for molecular QTL discovery and analysis," Nat. Commun., vol. 8, no. 1, p. 15452, May 2017, doi: 10.1038/ncomms15452.

[21] 10x Genomics, "Quality Assessment Using the Cell Ranger Web Summary." [Online]. Available: https://www.10xgenomics.com/analysis-guides/quality-assessment-using-the-cell-ranger-web-summary

[22] A. T. L. Lun et al., "EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data," Genome Biol., vol. 20, no. 1, p. 63, Dec. 2019, doi: 10.1186/s13059-019-1662-y.

[23] Abraham. Savitzky and M. J. E. Golay, "Smoothing and Differentiation of Data by Simplified Least Squares Procedures.," Anal. Chem., vol. 36, no. 8, pp. 1627–1639, Jul. 1964, doi: 10.1021/ac60214a047.

[24] J. G. Aaron Lun, DropletUtils. (2018). Bioconductor. doi: 10.18129/B9.BIOC.DROPLETUTILS.

[25] S. J. Fleming et al., "Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender," Nat. Methods, vol. 20, no. 9, pp. 1323–1335, Sep. 2023, doi: 10.1038/s41592-023-01943-7.

[26] Y. Hao et al., "Integrated analysis of multimodal single-cell data," Cell, vol. 184, no. 13, pp. 3573-3587.e29, Jun. 2021, doi: 10.1016/j.cell.2021.04.048.

[27] C. S. McGinnis, L. M. Murrow, and Z. J. Gartner, "DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors," Cell Syst., vol. 8, no. 4, pp. 329-337.e4, Apr. 2019, doi: 10.1016/j.cels.2019.03.003.

[28] I. Korsunsky et al., "Fast, sensitive and accurate integration of single-cell data with Harmony," Nat. Methods, vol. 16, no. 12, pp. 1289–1296, Dec. 2019, doi: 10.1038/s41592-019-0619-0.


Visit our Citation Policies for details on how to cite this dataset. Please cite all underlying resources whose data were used to generate the composite dataset.

R Object's Metadata README

  1. 'cell_type': Cell annotations.
  2. 'ncount_rna': Number of UMIs for barcodes that expressed at least one non-duplicate protein-coding gene
  3. 'nfeature_rna': Number of non-duplicate protein-coding genes that are expressed in at least one barcode
  4. 'rna_total_reads': Total number of reads aligned to all genes (not just non-duplicate protein-coding genes) obtained using STARsolo GeneFull_ExonOverIntron setting.
  5. 'rna_uniquely_mapped_reads': Number of uniquely aligned reads to all genes (not just non-duplicate protein-coding genes) obtained using STARsolo GeneFull_ExonOverIntron setting
  6. 'rna_secondary_alignments': Number of supplementary aligned reads to all genes (not just non-duplicate protein-coding genes) obtained using STARsolo GeneFull_ExonOverIntron setting
  7. 'rna_umis': Number of UMIs per barcode prior to CellBender correction, using all genes (not just non-duplicate protein-coding genes)
  8. 'rna_pct_mitochondrial': Fraction of mitochondrial reads, calculated as number of reads aligned to chrMT divided by total number of uniquely aligned reads
  9. 'samples': Sample names
  10. 'barcodes': Original barcodes, without sample name appended
  11. 'treatments': Treatments of samples. Samples that did not go through any treatments are "no_treatment", otherwise, the treatments are noted according to individual studies. Samples from the study "GSE251912" were generated using HTO assays; therefore, cells from the same samples may have different treatments
  12. 'chemistry': Chemistry of 10x kits used for data generation
  13. 'source': Tissue source
  14. 'center_donor_id': Donor center ID identifier for cross-referencing
  15. 'rrid': Research Resource Identifier, with "RRID:" as prefix
  16. 'donor_accession': Unique identifier (prefixed with PKB, assigned by server)
  17. 'sex': Self-reported gender
  18. 'age_(years)': Age of the donor in years (numeric value).
  19. 'aliases': Lab-specific identifiers
  20. 'bmi': Body mass index in kg/m²
  21. 'c_peptide_(ng/ml)': C-Peptide concentration in ng/ml.
  22. 'cause_of_death': Primary medical condition that led to death.
  23. 'reported_ethnicity': Self-reported ethnicity of the donor
  24. 'description_of_diabetes_status': Indicates the specific type of diabetes or absence of it. It provides more detail than just a yes/no classification.
  25. 'diabetes_duration_(years)': Duration of diabetes in years.
  26. 'diabetes_status': Ontology Term for diabetes status (MONDO IDs)
  27. 'hba1c_(percentage)': HbA1C percentage measurement
  28. 'family_history_of_diabetes': Boolean indicating family history of diabetes
  29. 'family_history_of_diabetes_relationship': Array describing relationship to diabetic family members
  30. 'glucose_lowering_therapy': Type of therapy for managing blood glucose levels
  31. 'aab_gada_positive': Boolean indicating presence of GADA autoantibodies
  32. 'aab_gada_value_(unit/ml)': Numeric value of GADA autoantibodies in unit/ml
  33. 'aab_ia2_positive': Boolean indicating presence of IA2 autoantibodies
  34. 'aab_ia2_value_(unit/ml)': Numeric value of IA2 autoantibodies in unit/ml
  35. 'aab_znt8_positive': Boolean indicating presence of ZNT8 autoantibodies
  36. 'aab_znt8_value_(unit/ml)': Numeric value of ZNT8 autoantibodies in unit/ml
  37. 'aab_iaa_positive': Boolean indicating presence of IAA autoantibodies
  38. 'aab_iaa_value_(unit/ml)': Numeric value of IAA autoantibodies in unit/ml
  39. 'biosample_accession': Unique identifier for biosamples (prefixed with PKB, assigned by server)
  40. 'biosample_cold_ischaemia_time_(hours)': Duration in hours that the pancreas was kept at a low temperature after removal from the donor
  41. 'biosample_hand_picked': Whether the pancreas or its components were manually selected or processed
  42. 'islet_function_available': Whether functional assays or data are available for the isolated pancreatic islets pre- or post-shipping (boolean)
  43. 'ieq_pancreas_weight_(grams)': Ratio of Islet Equivalents to the weight of the pancreas
  44. 'islet_yield_ieq': Total number of Islet Equivalents obtained from the pancreas
  45. 'organ_source': Type of organ donor from which the pancreas or pancreatic tissue was obtained (deceased, living, other classifications, unknown)
  46. 'isolation_center': Facility or location where the islets were isolated, including islet isolation centers affiliated with the Integrated Islet Distribution Program (IIDP), IsletCore at the University of Alberta (Edmonton), Human Pancreas Analysis Program (HPAP), and Prodo.
  47. 'biosample_warm_ischaemia_duration_down_time_(hours)': Duration in hours that the pancreas was without blood supply at body temperature
  48. 'hospital_stay_(hours)': Total hours of hospitalization
  49. 'donation_type': Type of organ donation (Donation after Circulatory Death, Donation after Brain Death, Natural Death Donation, Medical Assistance in Dying)