HLAProfiler utilizes k-mer profiles to improve HLA calling accuracy for rare and common alleles in RNA-seq data
- Martin L. Buchkovich†1Email authorView ORCID ID profile,
- Chad C. Brown†1,
- Kimberly Robasky1,
- Shengjie Chai2, 3,
- Sharon Westfall1,
- Benjamin G. Vincent2, 3, 4,
- Eric T. Weimer5 and
- Jason G. Powers1
© The Author(s). 2017
Received: 5 May 2017
Accepted: 11 September 2017
Published: 27 September 2017
The human leukocyte antigen (HLA) system is a genomic region involved in regulating the human immune system by encoding cell membrane major histocompatibility complex (MHC) proteins that are responsible for self-recognition. Understanding the variation in this region provides important insights into autoimmune disorders, disease susceptibility, oncological immunotherapy, regenerative medicine, transplant rejection, and toxicogenomics. Traditional approaches to HLA typing are low throughput, target only a few genes, are labor intensive and costly, or require specialized protocols. RNA sequencing promises a relatively inexpensive, high-throughput solution for HLA calling across all genes, with the bonus of complete transcriptome information and widespread availability of historical data. Existing tools have been limited in their ability to accurately and comprehensively call HLA genes from RNA-seq data.
We created HLAProfiler (https://github.com/ExpressionAnalysis/HLAProfiler), a k-mer profile-based method for HLA calling in RNA-seq data which can identify rare and common HLA alleles with > 99% accuracy at two-field precision in both biological and simulated data. For 68% of novel alleles not present in the reference database, HLAProfiler can correctly identify the two-field precision or exact coding sequence, a significant advance over existing algorithms.
HLAProfiler allows for accurate HLA calls in RNA-seq data, reliably expanding the utility of these data in HLA-related research and enabling advances across a broad range of disciplines. Additionally, by using the observed data to identify potential novel alleles and update partial alleles, HLAProfiler will facilitate further improvements to the existing database of reference HLA alleles. HLAProfiler is available at https://expressionanalysis.github.io/HLAProfiler/.
The human leukocyte antigen (HLA) complex is the set of genes on chromosome 6 encoding proteins of the major histocompatibility complex (MHC). These genes are divided into multiple classes with similar but distinct functions. Class I genes, such as HLA-A, HLA-B, and HLA-C, are expressed in nearly all nucleated cells and are important for recognizing endogenous foreign antigens. These antigens can arise via infection or from somatic variations, such as those introduced in cancer. Class II genes, expressed on antigen-presenting cells, generally recognize exogenous foreign antigens, such as viral peptides entering the cytoplasm after apoptosis of infected cells. Other genes in the HLA complex can have more specialized functions (e.g., HLA-E for cell recognition by natural killer cells) or are associated with HLA (e.g., TAP1/TAP2 for antigen peptide transport) [1, 2].
HLA genes are highly polymorphic, and the number of known alleles continues to grow . Accurately identifying which alleles are present in an individual is important in many areas, such as drug safety , disease susceptibility [5, 6], neoantigen prediction for cancer treatment , regenerative medicine , and organ transplantation . Traditional “gold standard” assays for HLA typing, such as sequence-specific oligonucleotide probe PCR (PCR-SSOP), sequence-specific primed PCR (PCR-SSP), and Sanger-based sequence-based typing (SBT), are labor intensive, costly, and relatively low throughput [10, 11].
A new class of assays has emerged in recent years which uses targeted next-generation sequencing (NGS)-based methods as a new gold standard while increasing throughput and decreasing both the reagent and labor costs of HLA typing .
Most of these NGS assays for HLA typing use customized PCR, designed to specifically interrogate the HLA region, such as TruSight HLA (Illumina), Holotype (Omixon), and NGSgo (GenDx). Published data suggest that these platforms are highly accurate for calling HLA types [12, 13]. Due to their targeted nature, however, these assays do not provide information outside the HLA region, preventing the identification of other biologically meaningful data such as somatic variation found in other genes, thereby limiting their general utility. Also, only a limited amount of HLA-specific targeted-capture sequencing data is publicly available, making HLA typing of historical data using these methods difficult. To overcome the limitations inherent to HLA targeted DNA approaches, methods have been developed to identify HLA types in whole-genome sequencing (WGS) and whole-exome sequencing (WES) data [14, 15], but these methods also have limitations.
Due to the high variability found in the HLA loci, these regions must have sufficient coverage to accurately ascertain HLA types. Both studies found that it is possible to call HLA types accurately from WGS; however, both also found that this high accuracy requires nearly half a billion sequencing reads . Although sequencing costs continue to drop, this depth of sequencing is currently not a practical way to resolve the thousands of possible HLA alleles for many investigators . Conversely, while WES and other large targeted panels require far fewer sequencing reads to achieve a high depth of coverage, the ability to correctly identify HLA types using general targeted DNA-capture methods is highly dependent on the sequences of the probes used to capture the DNA. Polymorphisms within HLA genes may disrupt capture and cause portions of or entire exons to be missing from the sequencing data. This can be especially true if rare alleles or as-yet-undiscovered alleles were not considered during probe design. Finally, heterozygous alleles may have different capture efficiencies, leading to non-equal sequencing coverage, potentially complicating analysis. Indeed, HLA-typing accuracy for exome sequencing has been shown to be poor [15, 16].
Based on these limitations, HLA typing in RNA sequencing (RNA-seq) data promises several advantages: i) it provides an unbiased dataset that fully covers both fully expressed parental alleles equally; ii) no special reagents or protocols are needed; and iii) RNA-seq data offer a myriad of uses beyond HLA typing. Existing tools have been used to perform HLA typing using RNA-seq data with limited success, suffering from poor accuracy or the inability to call rare or novel alleles or offering HLA calling for only a limited number of genes.
To overcome these limitations, we have developed HLAProfiler to accurately determine HLA types using RNA-seq data. HLAProfiler uses the k-mer content of RNA-seq reads to identify the most likely HLA type for the sample. Using both simulated RNA-seq data and RNA-seq data from lymphoblastoid cell lines, we have evaluated the performance of HLAProfiler. In data simulating both common and rare alleles, HLAProfiler correctly identified > 99% of alleles. In the biological dataset, HLAProfiler correctly identified HLA alleles with > 98% concordance to gold standard Sanger sequencing calls. After using a third technology to resolve the discrepancies with Sanger sequencing, HLAProfiler accuracy increased to > 99% in the biological data. We identified multiple cases where HLAProfiler correctly identified alleles associated with disease risks that Sanger missed, including ankylosing spondylitis (AS). Additionally, HLAProfiler has the ability to correctly identify the full protein sequences of both novel alleles and alleles with an incomplete reference sequence. Finally, we demonstrate that HLAProfiler can be extended to identify the alleles of other immunologically important gene classes, such as killer-cell immunoglobulin-like receptor (KIR) genes. Overall, we demonstrate that HLAProfiler excels at correctly identifying HLA alleles in the wealth of RNA-seq data already generated and being generated daily, opening the door for further advances in immunological and disease research.
To assess the accuracy of HLAProfiler and five other HLA-typing software tools, two sets of data were used. The first was a set of 358 real samples that were HLA typed using Sanger sequencing and whose RNA-seq data were also publicly available. The second set comprised data simulated to assess the accuracy of each tool for common, rare, and novel alleles, as well as KIR genes.
RNA-seq data from lymphoblastoid cell lines from the Geuvadis study
Publicly available RNA-seq data from the Geuvadis study were used . Briefly, 462 lymphoblastoid cell lines (LCLs) were sequenced on the Illumina HiSeq2000 platform using paired-end 75-bp reads, with 29.9 million reads on average (range 8.6–83 million, with an interquartile range of 24–34 million). HLA types for 358 of these samples from diverse populations (Additional file 1: Table S1) were determined using Sanger sequencing for HLA-A, HLA-B, HLA-C, HLA-DRB1, and HLA-DQB1 [18, 19]. Sanger sequencing uses select exons to assign an HLA type, introducing ambiguity into the gold standard HLA calls. To account for this ambiguity, we downloaded the curated set of ambiguous HLA alleles from IMGT/IPD  and converted all predicted and gold standard HLA calls to the “G” group, when applicable, before calculating concordance. To assess the accuracy of HLA calling programs under reduced read numbers, this dataset was also downsampled to five million paired reads.
Simulated RNA-seq data
Five different types of simulated datasets were created. In all cases, a transcript FASTA file was generated from the human transcriptome extracted from GENCODE v24 [20, 21], with the HLA genes replaced with specific HLA alleles unique to each subject. HLA allele sequences were obtained from the IMGT/HLA database (v24) . Each sequence was over-represented by repeating it 100 times in the FASTA file so that the total number of HLA reads roughly matched those observed in the Geuvadis study. This modified transcript file was then used to generate simulated FASTQ files using the program wgsim. Specifically, 20 million paired-end 50-bp reads were generated for each individual, using an error rate of 0.001 (base qualities of Q30), a mean insert size of 250, and an insert standard deviation of 75. Wgsim did not allow for the simulation of genes with sequence lengths less than 400 with the current parameter set. Although this restriction did not affect HLA or KIR alleles, it did affect many other real human transcripts, potentially making the simulated data less diverse than a truly biological sample. For this reason, wgsim was altered slightly to allow for simulation of genes with sequence lengths down to 200 bp using the current parameter set. All simulated datasets are available at http://www.q2labsolutions.com/genomics-laboratories/bioinformatics (select “Request Example Data”).
The first dataset simulated RNA-seq data from 109 subjects with HLA types made to be consistent with the two-field HLA types for samples from the GeT-RM program hosted by the Centers for Disease Control [23, 24]. For the alleles where two-field precision matches several actual HLA alleles (e.g., A*01:01, matches A*01:01:01:01, A*01:01:01:02, etc.), the first allele with a “complete” status (i.e., lowest accession number and the entire transcript sequence is known) from IMGT was chosen. Most of the simulated HLA alleles for this dataset were previously categorized as “common” or “well documented” [25, 26].
The second dataset simulated rare alleles. This set was generated using alleles with complete status (entire sequence known) from IMGT that were not previously listed as “common” or “well documented” (Additional file 1: Table S2). For this set, HLA types for 100 individuals (200 alleles) were simulated by randomly drawing from these rare alleles with replacement.
Historically, only exons 2 and 3 of class I genes or exon 2 of class II genes have been submitted to the IMGT/HLA database, because these exons contain the variability influencing peptide binding . As a result, over 87% of the alleles listed in the IMGT/HLA database have partial status, where the entire sequence has not been submitted. Typing for these alleles is challenging since many reads will match much better to similar alleles whose complete sequence is in the database than to the incomplete, correct sequence. To simulate this unique situation, the third simulated dataset was identical to the second dataset except that each sample had one allele swapped with an allele that had a partial, or incomplete, sequence in IMGT/HLA release 3.24.0 but later had a complete sequence in release 3.26.0 (Additional file 1: Table S2). In these cases, the full allele sequences from the later release were used to simulate the data, but information from only the earlier, incomplete release was included in the profile database used for calling HLA types.
The fourth simulation dataset includes novel alleles (i.e., not in the IMGT/HLA database) and was created similarly to the third simulation dataset. Alleles included in release 3.26.0 with a complete sequence but absent in 3.24.0 were considered novel alleles. The novel allele dataset was generated by taking the “rare allele” dataset and replacing exactly one allele for each individual with a novel allele (Additional file 1: Table S2).
The fifth simulated dataset was for alleles from the KIR genes . For this simulation, full alleles from the IMGT/KIR database were used to generate two alleles randomly for each of the 17 different KIR genes across ten samples. The simulation parameters were the same as for the other datasets.
HLAProfiler algorithm overview
Creating a reference for HLAProfiler
The reference HLA k-mer profiles were constructed in four steps: reference profile creation step (RC)1) simulate FASTQ reads using a reference FASTA file of HLA sequences; RC2) create a Kraken database using a custom HLA taxonomy; RC3) use Kraken to assign the simulated reads to HLA genes in the taxonomy; and RC4) aggregate the k-mer counts for each allele into gene-level, reference k-mer profiles. Once created, the reference profiles can be reused to identify types in various new RNA-seq samples. HLAProfiler has a build option to build a reference database. Because reads are simulated for every allele in the database (>14K alleles), database creation can take as long as 24 h using 48 cores. The database used in this study was created using the complete IPD-IMGT/HLA version 3.24.0 reference. Database creation steps are detailed below.
RC1: simulate FASTQ using a reference FASTA of HLA sequences
The gene-level reference profiles are created using an HLA reference nucleotide FASTA file following the standard IPD-IMGT/HLA nomenclature. While the work described here uses the nucleotide FASTA from IPD-IMGT/HLA release 3.24.0, reference profile creation will also work with custom reference sequences as long as the same naming conventions are followed.
A simulated FASTQ file is made from each unique allele sequence in the reference FASTA. Identical allele sequences are merged together, and the merge is reported any time that sequence is included among the predicted HLA types. The paired-end read fragment lengths are simulated assuming a Pareto distribution, with a user-defined scale, shape, and maximum insert size. Additionally, the user specifies read length and number of reads. Reads simulated for this work used the following default parameters: scale 80, shape 0.7, max_insert 1000, num_reads 500,000, read_length 50. For all simulations, read 1 is reverse complemented, comparable to a stranded sequencing protocol.
RC2: create a Kraken database using a custom HLA taxonomy
A custom HLA taxonomy is created from the same reference FASTA used for the simulation of allele reads (Additional file 1: Figure S1). The taxonomic organization is based on the standard HLA nomenclature. The tree has two main branches: HLA alleles and the distractome. The distractome comprises all gene transcripts except transcripts falling within the HLA region and is important for accounting for homology with non-HLA genes. The HLA common ancestor has three branches: class I genes, class II genes, and other HLA genes. Leaves of the tree on these branches represent each individual allele (i.e., A*02:02:01:01), with all other nodes, or common ancestors, on each branch representing the various fields of precision for that leaf (i.e., A, A*02, A*02:02, A*02:02:01). The distractome was created using transcripts from GENCODE version 24. This custom taxonomy was built into a Kraken database using the default HLAProfiler parameters for minimizer length (-mi 3) and k-mer length (-k 31) using 24 threads (-c 24). The taxonomy uses the NCBI database format with node names and node relationships found in names.dmp and nodes.dmp, respectively, located in the taxonomy subdirectory of the built database.
RC3: use Kraken to assign the simulated reads to HLA genes
Kraken is typically used to understand species diversity in sequencing data, assuming a given taxonomy. This methodology was applied to HLAProfiler using only the HLA taxonomy rather than the standard NCBI species taxonomy. Leveraging the k-mers unique to each taxonomic unit, Kraken outputs a taxonomic classification for each sequence read. For HLA typing, we have modified Kraken to split the reads into individual FASTQ files based on these classifications . For HLAProfiler, reads are placed into FASTQ files corresponding to the gene level, or higher, node in the tree.
RC4: aggregate the k-mer counts for each allele into gene-level profiles
All observed k-mers of length 50 are counted independently for each HLA gene-specific FASTQ file and collected across all alleles for a particular gene to create a list of k-mer hash table counts. This same procedure is used for calling HLA types, where the profile of the k-mers observed in the sample FASTQ is calculated and can be compared to the k-mer profile for each allele of the gene. The k-mer classification approach of Kraken, combined with the custom taxonomy helps to overcome homology between HLA genes. While k-mers unique to each gene enable Kraken to correctly classify the read, Kraken also considers the lowest common ancestors when making a classification. A read with many k-mers shared across genes and just a few unique k-mers can still be filtered to the proper gene. Kraken also processes paired-end reads together, further increasing the likelihood of gene-specific k-mers and successful classification.
Calling HLA alleles using HLAProfiler and an HLAProfiler-specific reference
Using the HLA k-mer profiles to type samples involves five steps: HC1) use Kraken to assign sample FASTQ reads to HLA genes; HC2) aggregate the k-mer counts for each allele into gene-level sample profiles; HC3) compare sample k-mer count profiles to the reference profiles to determine top allele pair candidates; HC4) competitively align each pair of candidate alleles and count the number of reads aligning uniquely to only one of the pairs; and HC5) refine the allele call to identify novel or updated alleles. The final score of the predicted HLA type is based on the strength of the sample k-mer profile match to the reference profile and the competitive pairwise alignment counts.
HC1: use Kraken to assign sample FASTQ reads to HLA genes
See “RC3: use Kraken to assign the simulated reads to HLA genes”.
HC2: aggregate the k-mer counts for each allele into gene-level sample profiles
See “RC4: Aggregate the k-mer counts for each allele into gene-level reference profiles”.
HC3: compare sample k-mer count profiles to reference profiles to determine top allele pair candidates
HC4: competitive alignment between the top 20 candidate allele pairs
HC5: allele refinement
This step is optional. Paired reads are mapped using a custom aligner a final time to the predicted allele pair, this time allowing for one mismatch. During this process, the read support for each possible nucleotide is recorded. Positions that have more support for a mismatching nucleotide than for the reference nucleotide (default > 75% of coverage mismatch) are substituted into a putative novel allele sequence. All reference alleles for the same gene from IMGT are then checked to determine whether they are an exact substring of the novel allele and are also listed as having only a “partial” sequence with one or more missing exons. If any allele satisfies these requirements, the predicted allele is changed to the partial reference allele, with a “U” (for updated) added to the accession ID. If no alleles satisfy these requirements, the accession ID of the top-scoring candidate is appended with an “N” (for novel). In either case, the modified sequence is output to file by the program, and a new simulation profile is created for the novel allele. If the allele refinement option is specified as “recalculate” or “all”, then all metrics are calculated for the novel allele to obtain an updated prediction score.
Considerations of k-mer size
HLAProfiler utilizes k-mers in two different ways: 1) sequence read filtering; and 2) profile creation. The longer the k-mer, the more likely it is to be unique to a specific HLA gene or allele. Using shorter k-mer sizes will decrease the number of unique k-mers across the HLA genes, which in turn will decrease the memory footprint of the Kraken database, but also reduce the sensitivity of filtering. We recommend using the maximum k-mer size allowed by Kraken of 31 bp.
The fewer the number of HLA alleles that contain a given k-mer, the greater the contribution of that k-mer for identifying the correct HLA alleles. As a profile captures the combination and quantity of k-mers of each allele, even k-mers shared across many alleles can contribute valuable information. The k-mer size is mainly limited by sequence read length, as HLAProfiler will not work properly if k is greater than the sequence length. While there is no lower bound on k, small values for k will negatively impact algorithm performance. Additionally, using a k that is much smaller than the sequence length can increase runtime of HLAProfiler. A final consideration is that each value for k requires the creation of a new set of reference k-mer profiles, a computationally intense process. Given these considerations, we have characterized HLAProfiler performance using a k-mer size of 50, which maximizes k-mer content of the profiles while accommodating many of the current and historic RNA-seq read lengths.
TruSight HLA confirmation to update gold standard truth
A total of 324 Geuvadis samples were concordant between Sanger sequencing, RNA-seq with HLAProfiler, and RNA-seq with OptiType (when available) for all five reported genes (HLA-A, HLA-B, HLA-C, DRB1, and DQB1). For these, we assumed that this reflected the true state of the sample. For the 34 samples that had at least one allele from any gene that was discordant between Sanger sequencing and either OptiType or HLAProfiler, discrepancies were resolved by further sequencing using Illumina’s TruSight HLA assay. Two samples with only one-field accuracy reported by Sanger sequencing were also tested, In all cases, the TruSight HLA result matched either one of the RNA-seq results or the Sanger result.
Whenever TruSight HLA represented a consensus (matching either Sanger or RNA-seq), we assumed that the consensus type represents the true state. This information was used in the establishment of an updated truth for the Geuvadis data, from which accuracy, not just concordance to another technology, can be estimated. The results of TruSight HLA are available at http://www.q2labsolutions.com/genomics-laboratories/bioinformatics (select “Request Example Data”).
HLAProfiler is written entirely in Perl, with the exception of Jellyfish , a Kraken dependency, and the modified version of Kraken. All analyses were performed on a Linux cluster with eight CPU cores and 8 GB of memory. Computational performance was assessed from a random sample of 30 Geuvadis FASTQ pairs using 8 GB of RAM and 12 cores of a Dell PowerEdge R720 server with 256 GB of total RAM and two 12-core CPUs. Total runtimes ranged from 7 to 17 min, with a mean of 12.5 min, making HLAProfiler’s speed comparable to the other available tools (Additional file 1: Figure S2). These analyses used HLAProfiler v1.0.0. The latest version of HLAProfiler can be downloaded from GitHub (see “Availability of data and materials”) or as part of the bioconda repository (https://bioconda.github.io/recipes/hlaprofiler/README.html).
Running other callers
All callers were run on a Linux cluster using eight CPU cores. OptiType  was installed as directed and run using razers3-3.5.3 and glpk and default parameters. We ran seq2HLA v2.2  against a reference using only the “common” and “well-documented” alleles and default parameters. HLAForest  was installed as directed, and a new reference was created using IMGT/HLA v3.24.0 sequences. CallHaplotypesPE.sh was run using default parameters to type each sample. HLAMiner  was installed as directed and run using default parameters and a reference database. Phlat-1.0  used IMGT/HLA version 3.8.0 and Bowtie2 version 2.3.0. The paired-end parameter “-pe” was set to 1, under which the data are treated as paired-end. Orientation parameter “-orientation” was set to “—fr”, under which the orientation of the mate pair reads are in forward and reverse. Where documentation existed, every attempt was made to update the reference for these tools to be comparable to IMGTv3.24.0. HLAForest and HLAMiner were successfully updated, but HLAMiner performance decreased with the updated reference, so the reference packaged with the software was used.
HLA types were predicted in five simulated RNA-seq datasets and from biological RNA-seq data from lymphoblastoid cells using HLAProfiler. For comparison, types were also predicted using five other tools capable of detecting HLA types in RNA-seq data: OptiType, PHLAT, seq2hla, HLAMiner, and HLAForest. The accuracy of calls from simulated data was calculated as the proportion of alleles that matched exactly to the truth at the specified precision. Likewise, accuracy for the biological data was calculated by comparing the allele calls to the Sanger sequencing/TruSight HLA consensus described in “Implementation”. One-field precision refers to the allele group (results not shown), two-field precision refers to the protein sequence, and exact precision refers to an exact match of the allele at the highest precision possible (some alleles have only two-field precision defined). Allele calls unable to be predicted by a tool in samples with truth information available for the gene were considered incorrect or discordant.
HLAProfiler perfectly called HLA types in RNA-seq data simulated from GeT-RM HLA types
HLAProfiler accurately predicted the HLA types of rare alleles
We observed that the only allele missed by OptiType in the GeT-RM simulation was a rare allele. This motivated an effort to quantify the ability of each caller to correctly identify rare alleles. We calculated the call rate and accuracy of HLA typing in 100 simulated RNA-seq datasets comprising only rare alleles. HLAProfiler had > 99.5% two-field precision accuracy and > 94.0% exact-allele accuracy for every gene, while other callers performed poorly (Fig. 2a; Additional file 2: Table S3). Indeed, with the exception of HLAForest, all callers had two-field accuracies less than 78%, and for HLAForest, accuracies ranged from 35–95%. Whether this drop in performance for rare alleles is caused by the algorithm or the use of outdated references distributed with the other tools remains unclear.
HLAProfiler demonstrated high concordance with orthogonal methods in lymphoblastoid cell lines
Having evaluated performance in two simulated datasets, we next predicted HLA types in RNA-seq data generated by the Geuvadis consortium. These RNA-seq data, generated from 358 different LCLs, have gold standard HLA types available from orthogonal methods for genes A, B, C, DRB1, and DQB1 . Once again, OptiType and HLAProfiler performed similarly on class I genes, with 99.0 and 98.9% average concordance, respectively, and > 98% concordance across all individual genes. This result is a marked improvement over other callers, which had < 96% average concordance for class I alleles (Fig. 2b; Additional file 2: Table S4; Additional file 2: Table S5). HLAProfiler also had an average concordance of 99.1% for class II genes, with 99.2% for DRB1 and 98.9% for DQB1. To evaluate HLA typing performance in samples with a low number of reads, we downsampled all 358 RNA-seq datasets to five million reads and predicted HLA types (Fig. 2b; Additional file 2: Table S4). Reduced read depths had little influence on HLAProfiler predictions for HLA-A, DRB1, and DQB1 (0.1–0.3% lower), while genes B and C were more affected, with 1.8 and 1% lower concordances, respectively. For comparison, read depth had less influence on OptiType performance, with a 0.1% increase for HLA-A and a 0.6% decrease for HLA-B and no change for HLA-C. The number of filtered, HLA gene-specific reads depends on HLA gene expression and can vary between samples and tissue types. To understand the influence of filtered read depth on accuracy, we selected at random 50 samples from the Geuvadis dataset for which all alleles were correctly predicted. After downsampling these filtered reads, we found that HLAProfiler can identify HLA type for all genes with greater than 90% accuracy with as few as 1000 reads (Additional file 2: Table S3). For comparison the number of filtered reads range from 3–100K for the full dataset from these Geuvadis samples (total read depths vary). Accuracy varies across genes for a specific read depth and factors such as the sample’s tissue type and the HLA genes of interest should be considered carefully when setting minimum read depth thresholds for HLA calling.
Multiple factors can influence HLA calling, which makes accurate HLA identification for every gene across every sample extremely difficult when using RNA-seq data. Because accurate HLA determination relies on sequencing of the RNA associated with each HLA gene, accuracy is heavily influenced by transcription. The transcript levels of each HLA gene can vary greatly between tissues—for example, class II expression is much higher in the blood than in the liver—making HLA calling more difficult to predict for certain gene and tissue combinations. Transcript levels can also vary for the same gene and tissue across individuals, influencing the success of HLA calling. The combination of alleles expressed in an individual can also influence HLA calling success. For example, in some samples there may be allelic expression imbalance, a situation in which one particular allele may be expressed at a significantly higher level than the partner allele. Additionally, some alleles may be rare and not included in the database or two combinations of different alleles have very similar nucleotide content. In these cases, the HLA call rates may suffer for a specific subset of HLA alleles. Finally, some algorithms are only limited to calling HLA types for a subset of genes and alleles, often through database design, and these algorithms will be limited in their ability to call HLA alleles. Of the tools evaluated, only HLAProfiler and HLAForest were able to provide > 99% call rates for all of the major HLA genes (HLA-A, HLA-B, HLA-C, HLA-DP, HLA-DM, HLA-DO, HLA-DQ, and HLA-DR) (Additional file 2: Table S5). Of the two tools, HLAProfiler had slightly higher call rates for some important genes, such as MICA, MICB, TAP1, and TAP2 (99–100 vs. 42–96%), while HLAForest had better rates for HLA-H (25 vs. 100%) and many pseudo genes (e.g., J, K, L).
TruSight HLA typing resolved discrepancies and increased HLAProfiler accuracy
Resolving HLA types using Sanger sequencing, while very accurate, is not perfect. For this reason, the true state of the sample is unclear whenever RNA sequencing results and Sanger results disagree. In this scenario, a third highly accurate technology can be used as the arbitrator. This approach will yield an improved understanding of the true state of the sample, provided that this technology has biases and errors that are reasonably independent of the first two. We attempted this arbitration using Illumina’s TruSight HLA panel. We believe that this method yields results that are sufficiently different from RNA sequencing because 1) it uses DNA as the genomic material; 2) it utilizes a targeted, rather than whole-transcriptome, approach; and 3) the software uses a different algorithm than any of the other tools. Including all samples discordant between any caller and the gold standard was cost prohibitive; therefore, we chose to investigate only discordances from the top two performing algorithms.
We collected all samples that had any discordance between either HLAProfiler or OptiType and Sanger sequencing, which amounted to 35 samples and 39 discordant calls. We sequenced 34 of these samples with Illumina’s TruSight HLA panel (Additional file 2: Table S6), in addition to two samples with only one-field HLA precision, previously available for one of the genes. After replacing the Sanger calls with the TruSight HLA calls for these 40 (38 discordant, two one-field precision only) allele calls in the truth data, performance metrics were recalculated (Fig. 2b; Additional file 2: Table S4). Using this updated truth, HLAProfiler had 99.9, 99.0, and 99.6% accuracy for HLA-A, HLA-B, and HLA-C, while OptiType had 99.6, 99.4, and 100% accuracy. In addition, HLAProfiler had 99.3 and 99.9% accuracy for DRB1 and DQB1. Finally, we used this updated truth to estimate the accuracy of the Sanger sequencing calls for these samples, which was found to be 99.7, 99.1, and 99.3% for HLA-A, HLA-B, and HLA-C and 99.0 and 99.4% for DQB1 and DRB1. For every gene except HLA-B, HLAProfiler produced a more accurate result than Sanger sequencing for this dataset.
New analytical methods find disease alleles missed by gold standard technologies
HLAProfiler successfully identifies novel and partial alleles
Two-field accuracy and exact sequence matching for novel and partial alleles
Sequence identified or two-field accuracy
Sequence identified or two-field accuracy
Similar to the partial alleles, we also leveraged the updated database to evaluate the ability of HLAProfiler to identify novel alleles using the sequence data. We simulated novel allele sequences by replacing one allele per sample in the rare allele simulation with an allele present in the updated reference but not in the database reference (Table 1). After allele refinement, HLAProfiler predicted a novel allele sequence for 75 (75%) of the samples. For 63 (63%) of these samples the predicted sequence matched exactly to the sequence in the updated database (edit distance = 0). Combined with an additional 5% of alleles for which HLAProfiler was unable to identify the novel sequence but identified the correct two-field precision this represents a total of 68% of novel alleles with enough information to accurately identify the protein sequence. For the remaining 12 samples with incorrectly predicted novel sequences and 25 samples without predicted novel sequences the median edit distance between the predicted allele and the truth allele was 2 (Additional file 2: Table S7).
For comparison with other callers, we also determined the number of novel alleles correctly identified at two-field precision without allele refinement. Of the other callers, PHLAT had the highest two-field accuracy in novel alleles (26%). Of the 100 samples with a novel allele simulated using the updated database, only 36 contained a novel allele that shared two-field precision with an allele in the original database, highlighting the challenge of identifying the correct two-field precision of novel alleles. It is important to note that because of the missing two-field precisions in the database, the two-field precision of the base allele (reported by HLAProfiler) used to predict the correct protein sequence was often incorrect. Overall, HLAProfiler’s allele refinement algorithm is an important step in overcoming the challenges of predicting the correct protein group of partial and novel alleles. TruSight HLA also predicted the presence of novel alleles. HLAProfiler correctly predicted the sequences of all three novel alleles identified by TruSight HLA (Additional file 1: Table S8).
HLAProfiler can be expanded to call KIR types
While the underlying database of k-mer profiles is specific to HLA, the algorithms inherent to HLAProfiler are more general and can be expanded to other genes. As a proof of concept, we applied HLAProfiler as-is to KIR genes, making no effort to optimize the code or tune parameters. For this, we created a KIR gene k-mer reference profile from IMGT and then randomly simulated types for 17 different KIR genes in ten samples (Additional file 1: Table S9). Using k-mer profiles for the KIR genes, HLAProfiler correctly identified > 90% of alleles for 12 of the 17 genes (70.6%). Two of the remaining four genes had greater than 65% accuracy. The final two genes, KIR2DL5A and KIR2DL5B, had 40 and 50% accuracy, respectively. Greatly reduced sequence depth after filtering, caused by high homology between these two genes, is the likely cause of the reduced accuracy for these two genes (Additional file 1: Figure S4). This proof of concept illustrates the flexibility of HLAProfiler in identifying alleles present from other immune-related genes, and optimizing HLAProfiler for KIR typing will undoubtedly yield much improved performance.
Accurate and precise HLA typing is critical in a variety of medical applications, such as organ transplantation, drug safety, disease susceptibility, and neoantigen prediction. RNA sequencing is an important source of transcriptome-wide gene detection and quantification and a promising source of HLA calls. Leveraging RNA-seq data for accurate HLA typing can not only reduce the cost, time, and input material requirements of HLA typing but also limit the need for specialized and separate laboratory protocols and reagents. Current algorithms have been limited in their ability to accurately predict HLA types from RNA-seq data in a wide range of HLA genes.
We have presented HLAProfiler, a k-mer profiling tool for accurately identifying the types of a variety of HLA genes in RNA-seq data. Our algorithm utilizes k-mer filtering, k-mer matching and competitive sequence matching and does not rely on traditional alignment, phasing or assembly tools. HLAProfiler was able to correctly identify HLA types with 100% accuracy at exact-allele precision in simulated RNA-seq data. These data were based on HLA types from 109 individuals, guaranteeing that the allele combinations exist in nature and are biologically relevant. This accuracy at such a high precision level represents a significant advancement over existing methods, most of which report only two-field accuracy.
HLAProfiler also performed well with RNA-seq data generated from lymphoblastoid cells. HLAProfiler called HLA alleles with > 98% concordance to gold standard typing at two-field precision in both class I and II alleles. The only tool with comparable performance, OptiType, also identified alleles with > 98% concordance but is limited to class I alleles. In the case of discordant calls, observed sequence reads often supported HLAProfiler calls rather than the gold standard. Orthogonal typing using Illumina’s TruSight HLA kit resolved these discrepancies and improved HLAProfiler concordance to greater than 99% for all genes.
HLAProfiler represents significant increases over existing methodologies when handling rare alleles, novel alleles, and alleles with partial reference sequences. When creating the reference k-mer profile database, HLAProfiler uses all available reference sequences and does not discriminate against rare alleles. This allows HLAProfiler to outperform all other callers, including OptiType, when calling rare alleles. HLAProfiler also includes partial reference allele sequences in the database. The percentage of alleles with a partial reference sequence in IMGT/HLA is staggering and, as demonstrated with NA11840, having partial allele calls can negatively impact allele calling. By identifying differences between the observed data and the initial prediction, HLAProfiler can correct these incorrect predictions due to partial alleles and identify the full allele sequence.
In addition, HLAProfiler can identify the presence of novel alleles. Identifying the presence of a novel HLA allele in an individual is important, especially in cases when the resulting protein sequence differs from the predicted allele. HLAProfiler can not only identify the presence of novel alleles but also predict the complete protein sequence over 60% of the time. Novel alleles are especially difficult to type correctly, resulting in multiple novel alleles being predicted by TruSight HLA in samples with discordant HLA types. In these three samples, HLAProfiler’s novel allele predictions matched exactly with the TruSight HLA prediction. In both the simulated and biological data, HLAProfiler successfully predicted novel HLA transcripts in RNA-seq data, which none of the other evaluated tools were able to do.
Despite the success of HLAProfiler in calling HLA alleles, HLA calling in RNA-seq data has limitations. Mainly, the technique is limited by the expression level of the HLA genes in the sample being evaluated. HLA calling, especially for class II genes, in tissues with decreased HLA expression will be more difficult than in whole blood or peripheral blood mononuclear cells, where the HLA genes are highly expressed. Additionally, HLAProfiler is sensitive to the coverage of the HLA genes, as demonstrated by the slight decrease in performance in downsampled reads. Samples with degraded RNA, such as formalin-fixed paraffin-embedded samples, where coverage across the genes is less uniform, might prove to be especially problematic. Improving both laboratory procedures and algorithms to increase the sequencing depth and coverage of HLA genes is an area of active research.
Our work offers a prime example of the utility of calling HLA alleles using RNA-seq data. One of the strongest putative associations between MHC allele and disease is the link between the B*27 antigen and AS, with approximately 90% of AS patients being B*27-positive . For three samples, Sanger sequencing provided an HLA type of B*27:03, while RNA-seq gave a B*27:05 result, which was confirmed with targeted DNA sequencing. Although B*27:05 shows a clear association with AS, the association with B*27:03 is unclear . This example illustrates how new analytical methods can provide accurate calls in lieu of traditional gold standard methods.
Overall, HLAProfiler offers a simple, flexible, and user-friendly approach to calling HLA alleles in RNA-seq data. We have also shown that HLAProfiler can be easily adapted to identify the alleles of other classes of genes, such as KIR genes, provided that the genes do not have extreme homology with other transcripts, the alleles for the gene are well curated, and only two allele states (e.g., germline diploid) are expressed.
Precise and accurate identification of HLA alleles in RNA-seq data is difficult. Using RNA-seq data, we have demonstrated that HLAProfiler can accurately identify common and well documented alleles as well as rare alleles for class I and II HLA genes. Additionally, HLAProfiler utilizes the observed reads to detect when novel alleles are present or when the reference allele is incomplete and output the correct coding sequence for the allele. Finally, HLAProfiler successfully identified alleles for KIR genes, which are also important to immune response. Using HLAProfiler to reliably type HLA, and other gene classes in RNA-seq data will further increase the utility of these data and open the door to important biological discoveries without an increase in finite resources such as time and cost.
Availability and requirements
Project name: HLAProfiler
Project home page: https://expressionanalysis.github.io/HLAProfiler/
Archived version: v1.0.5
Operating system: Linux, Unix
Programming language: Perl
Other requirements: EA-modified Kraken, Jellyfish
Any restrictions to use by non-academics: The software is limited to non-commercial use.
This work is supported by the NC University Cancer Research Fund.
Availability of data and materials
The simulated datasets, TruSight HLA types, reference database generated and analyzed for this manuscript are available at http://www.q2labsolutions.com/genomics-laboratories/bioinformatics (select “Request Example Data” at the bottom of page and choose HLAProfiler from the “Area of Interest” menu), or from the corresponding author on reasonable request. The Geuvadis data analyzed during the current study are available in ArrayExpress (https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/).
MLB designed and developed the algorithm, primarily implemented and tested the code, ran competing algorithms, and prepared the manuscript. CCB designed and developed the algorithm, created the simulated dataset, and prepared the manuscript. KR assisted in preliminary algorithm design, ran competing algorithms, and contributed to the manuscript preparation. SC ran competing algorithms and contributed to the manuscript preparation. SW contributed to coding. BGV advised on scientific content and contributed to the manuscript. ETW advised on scientific content, performed orthogonal validations, and contributed to manuscript writing. JGP assisted in algorithm design and implementation and testing and contributed to manuscript preparation. All authors read and approved the final manuscript.
CCB is currently Senior Scientisit, OmicSoft Corporation, Cary, NC, 27513, USA.
KR is currently Senior Translational Scientist at Renaissance Computing Institute (RENCI), University of North Carolina, Chapel Hill, NC, USA.
Ethics approval and consent to participate
Consent for publication
MLB, CCB, KR, SW, and JGP contributed to this manuscript as employees of Q2 Solutions|EA Genomics, which offers genomic services to a variety of clients, including pharmaceutical companies. The submitted work was performed independently of these client relationships. CCB is currently an employee of OmicSoft Corporation, and this work was completed independently of that role. ETW reports personal fees and non-financial support from Illumina, outside the submitted work. BGV and SC declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Coico R, Sunshine G. Immunology: a short course. Canada: Wiley; 2015.Google Scholar
- Owen J, Punt J, Kuby J, Stranford S. Kuby Immunology. New York: Freeman; 2013.Google Scholar
- Robinson J, Halliwell JA, Hayhurst JD, Flicek P, Parham P, Marsh SGE. The IPD and IMGT/HLA database: allele variant databases. Nucleic Acids Res. 2015;43:D423–31.View ArticlePubMedGoogle Scholar
- Daly AK. Pharmacogenomics of adverse drug reactions. Genome Med. 2013;5:5.View ArticlePubMedPubMed CentralGoogle Scholar
- Gough SCL, Simmonds MJ. The HLA region and autoimmune disease: associations and mechanisms of action. Curr Genomics. 2007;8:453–65.View ArticlePubMedPubMed CentralGoogle Scholar
- Gottenberg J-E, Busson M, Loiseau P, Cohen-Solal J, Lepage V, Charron D, et al. In primary Sjögren’s syndrome, HLA class II is associated exclusively with autoantibody production and spreading of the autoimmune response. Arthritis Rheum. 2003;48:2240–5.View ArticlePubMedGoogle Scholar
- Hackl H, Charoentong P, Finotello F, Trajanoski Z. Computational genomics tools for dissecting tumour–immune cell interactions. Nat Rev Genet. 2016;17:441–58.View ArticlePubMedGoogle Scholar
- Charron D. Allogenicity & immunogenicity in regenerative stem cell therapy. Indian J Med Res. 2013;138:749–54.PubMedPubMed CentralGoogle Scholar
- Bartels MC, Otten HG, Elske Van Gelderen B, Van Der Lelij A. Influence of HLA-A, HLA-B, and HLA-DR matching on rejection of random corneal grafts using corneal tissue for retrospective DNA HLA typing. Br J Ophthalmol. 2001;85:1341–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Erlich H. HLA DNA, typing: past, present, and future. Tissue Antigens. 2012;80:1–11.View ArticlePubMedGoogle Scholar
- Dunn PPJ. Human leucocyte antigen typing: techniques and technology, a critical appraisal. Int J Immunogenet. 2011;38:463–73.View ArticlePubMedGoogle Scholar
- Weimer ET, Montgomery M, Petraroia R, Crawford J, Schmitz JL. Performance characteristics and validation of next-generation sequencing for human leucocyte antigen typing. J Mol Diagnostics. 2016;18:668–75.View ArticleGoogle Scholar
- Wittig M, Anmarkrud JA, Kässens JC, Koch S, Forster M, Ellinghaus E, et al. Development of a high-resolution NGS-based HLA-typing and analysis pipeline. Nucleic Acids Res. 2015;43:e70.View ArticlePubMedPubMed CentralGoogle Scholar
- Nariai N, Kojima K, Saito S, Mimori T, Sato Y, Kawai Y, et al. HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data. BMC Genomics. 2015;16:S7.View ArticlePubMedPubMed CentralGoogle Scholar
- Major E, Rigo K, Hague T, Berces A, Juhos S. HLA typing from 1000 genomes whole genome and whole exome illumina data. PLoS One. 2013;8:e78410.Google Scholar
- Warren RL, Choe G, Freeman DJ, Castellarin M, Munro S, Moore R, et al. Derivation of HLA types from shotgun sequence datasets. Genome Med. 2012;4:95.Google Scholar
- Lappalainen T, Sammeth M, Friedländer MR, ‘t Hoen PAC, Monlong J, Rivas MA, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Gourraud P-A, Khankhanian P, Cereb N, Yang SY, Feolo M, Maiers M, et al. HLA diversity in the 1000 Genomes dataset. PLoS One. 2014;9:e97282.View ArticlePubMedPubMed CentralGoogle Scholar
- Gourraud P-A, Khankhanian P, Cereb N, Yang SY, Feolo M, Maiers M, et al. 20140725_hla_genotypes. 2014. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140725_hla_genotypes/. Accessed 25 May 2016.
- Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22:1760–74.View ArticlePubMedPubMed CentralGoogle Scholar
- GENCODE. 2012. https://www.gencodegenes.org/. Accessed 10 Jun 2016.
- IMGT HLA Genes. 2010. ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/. Accessed 1 Nov 2016.
- Pratt VM, Zehnbauer B, Wilson JA, Baak R, Babic N, Bettinotti M, et al. Characterization of 107 genomic DNA reference materials for CYP2D6, CYP2C19, CYP2C9, VKORC1, and UGT1A1. J Mol Diagnostics. 2010;12:835–46.View ArticleGoogle Scholar
- RM Materials—material availability. 2010. https://wwwn.cdc.gov/clia/Resources/GETRM/MaterialsAvailability.aspx. Accessed 31 May 2016.
- Mack SJ, Cano P, Hollenbach JA, He J, Hurley CK, Middleton D, et al. Common and well-documented HLA alleles: 2012 update to the CWD catalogue. Tissue Antigens. 2013;81:194–203.View ArticlePubMedPubMed CentralGoogle Scholar
- Common and well-documented alleles. 2013. http://igdawg.org/cwd.html. Accessed 28 Nov 2016.
- IMGT KIR Genes. 2010. http://ftp.ebi.ac.uk/pub/databases/ipd/kir/. Accessed 21 Dec 2016.
- Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46. BioMed Central.View ArticlePubMedPubMed CentralGoogle Scholar
- ExpressionAnalysis/kraken. 2016. https://github.com/ExpressionAnalysis/kraken/releases/tag/v0.10.5-beta-ea.2. Accessed 28 Jun 2017.
- Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–70.View ArticlePubMedPubMed CentralGoogle Scholar
- Szolek A, Schubert B, Mohr C, Sturm M, Feldhahn M, Kohlbacher O. OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics. 2014;30:3310–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Boegel S, Scholtalbers J, Löwer M, Sahin U, Castle JC. In Silico HLA Typing using standard RNA-Seq sequence reads. Methods Mol Biol. 2015;1310:247–58.View ArticlePubMedGoogle Scholar
- Kim HJ, Pourmand N. HLA typing from RNA-seq data using hierarchical read weighting [corrected]. PLoS One. 2013;8:e67885.View ArticlePubMedPubMed CentralGoogle Scholar
- Bai Y, Ni M, Cooper B, Wei Y, Fury W. Inference of high resolution HLA types using genome-wide RNA or DNA sequencing reads. BMC Genomics. 2014;15:325.View ArticlePubMedPubMed CentralGoogle Scholar
- López de Castro JA, Alvarez-Navarro C, Brito A, Guasp P, Martín-Esteban A, Sanz-Bravo A. Molecular and pathogenic effects of endoplasmic reticulum aminopeptidases ERAP1 and ERAP2 in MHC-I-associated inflammatory disorders: Towards a unifying view. Mol Immunol. 2016;77:193–204.View ArticlePubMedGoogle Scholar
- Díaz-Peña R, López-Vázquez A, López-Larrea C. Old and new HLA associations with ankylosing spondylitis. Tissue Antigens. 2012;80:205–13. A.View ArticlePubMedGoogle Scholar