- Open Access
Development and validation of a new high-throughput method to investigate the clonality of HTLV-1-infected cells based on provirus integration sites
© Firouzi et al.; licensee BioMed Central Ltd. 2014
Received: 11 March 2014
Accepted: 20 June 2014
Published: 27 June 2014
Transformation and clonal proliferation of T-cells infected with human T-cell leukemia virus type-I (HTLV-1) cause adult T-cell leukemia. We took advantage of next-generation sequencing technology to develop and internally validate a new methodology for isolating integration sites and estimating the number of cells in each HTLV-1-infected clone (clone size). Initial analysis was performed with DNA samples from infected individuals. We then used appropriate controls with known integration sites and clonality status to confirm the accuracy of our system, which indeed had the least errors among the currently available techniques. Results suggest potential clinical and biological applications of the new method.
It has been more than 30 years since human T-cell leukemia virus type-I (HTLV-1) was shown to be the causative agent of adult T-cell leukemia (ATL) [1, 2]. However, understanding the true nature of the multiple leukemogenic events  that are essential for this aggressive transformation remains elusive [4–9]. Although approximately 5% of HTLV-1-infected individuals develop ATL after a long latency period, the majority remain asymptomatic carriers (ACs) throughout their lifetimes. However, there are not enough clear determinants to distinguish between individuals who eventually develop ATL and those who remain as ACs [10, 11]. To discover the factors associated with disease development, long-term prospective studies have assessed the correlation between disease outcome and proviral load (PVL), that is, the percentage of infected cells among the total peripheral blood mononuclear cells (PBMCs) [10–12]. The ‘Joint Study on Predisposing Factors of ATL Development’ (JSPFAD)  showed that a PVL higher than 4% is one of the indications of risk for progression to ATL . Although an elevated PVL is currently the best characterized factor associated with a high risk of ATL development, a high PVL alone is not sufficient for disease prediction, suggesting the need to discover additional predictive factors [10, 11].
Because ATL is a malignancy caused by HTLV-1 infection, both the integration of provirus into the host genome and the clonal expansion of infected cells are highly critical leukemogenic events [6, 7, 14, 15]. Although many studies have addressed these aspects, the mechanism of HTLV-1 clonal expansion has not been elucidated [15–35]. Accurate monitoring for changes in clonality occurring before, during, and after ATL development is of great interest and of major clinical significance not only to clarify the underlying mechanisms but also to discover reliable predictive biomarkers for disease progression.
A broad range of evidence strongly supports that most neoplasms are composed of clonally expanded cell populations [36–38]. Owing to its biological significance, the concept of clonal expansion in cancer biology has been investigated using a variety of approaches in many tumor types [36–39], including ATL [6, 15, 16, 18–20, 22, 24, 29–32]. Clonal proliferation of HTLV-1-infected cells was first detected as monoclonal-derived bands by southern blotting . Early studies found that monoclonal integration of HTLV-1 is a hallmark of ATL cells . Furthermore, it was suggested that detecting a monoclonal band is useful for diagnosis and is associated with a high risk of ATL development [29, 30]. Subsequent PCR-based methods included inverse PCR, linker-mediated PCR, and inverse long PCR, which enabled analysis of samples with clonality below the detection threshold of southern blotting [17, 25, 31, 34]. Based on the observed banding patterns, the clonality of the samples was described as having undergone monoclonal, oligoclonal, or polyclonal expansion. Such PCR-based analyses revealed that, in addition to a monoclonal proliferation of infected cells, a monoclonal or polyclonal proliferation occurs even in non-malignant HTLV-1 carriers [31, 35]. Moreover, considering the stability of the HTLV-1 proviral sequence, it was hypothesized that maintaining a high PVL is achieved by persistent clonal proliferation of infected cells in vivo. This hypothesis was further supported by the detection of a particular HTLV-1 clone in the same carrier over the course of several years . Two Miyazaki cohort studies focused on the maintenance and establishment of clonal expansion: Okayama et al. analyzed the maintenance of a pre-leukemic clone in an AC state several years prior to ATL onset , and Tanaka et al. assessed the establishment of clonal expansion by comparing the clonality status of long-term carriers with that of seroconverters. They showed that some of the clones from long-term carriers were stable and large enough to be consistently detectable by inverse long PCR; however, those from seroconverters were unstable and rarely detectable over time .
Knowledge provided by conventional studies has shed light on the next challenges worthy of further investigation. Owing to technical hurdles, however, previous studies isolated small numbers of integration sites from highly abundant clones and detected low abundant clones in a non-reproducible manner [22, 34]. Furthermore, conventional techniques could not provide adequate information regarding the number of infected cells in each clone (clone size) . To effectively track and monitor HTLV-1 clonal composition and dynamics, we considered devising a new method that would not only enable the high-throughput isolation of integration sites but also provide an accurate measurement of clone size.
PCR is a necessary step for the integration site isolation and clonality analysis. However, bias in the amplification of DNA fragments (owing to issues such as extreme fragment length and high GC content) is intrinsic to any PCR-based method [40–45]. Different fragment amplification efficiencies make it difficult to calculate the amount of starting DNA (the original distribution of template DNA) from PCR products. Hence, estimating HTLV-1 clonal abundance, which requires calculating the number of starting DNA fragments, is only achievable by avoiding the PCR bias.
Recently, Bangham’s research group analyzed HTLV-1 clonality and integration site preference by a high-throughput method . In the method developed by Gillet et al., clone sizes were estimated using length of DNA fragments (shear sites generated by sonication) as a strategy for removing PCR bias . Owing to the limited variation in DNA fragment size observed with shearing, the probability of generating starting fragments of the same lengths is high, leading to a nonlinear relationship between fragment length and clone size [22, 46]. Therefore, Gillet et al. used a calibration curve to statistically correct the shear site data . Later, Berry et al. introduced a statistical approach, and further addressed the difficulties of estimating clone size from shear site data . Their approach estimates the size of small clones with little error, but estimates for larger clones have greater error . A parameter adopted from the Gini coefficient [47, 48] and termed the oligoclonality index was used to describe the size and distribution of HTLV-1 clones . It has been demonstrated that the oligoclonality index differs between malignant and non-malignant HTLV-1 infections, and also a high PVL of HTLV-1-associated myelopathy is due to cells harboring large numbers of unique integration sites . Furthermore, genome-wide integration site profiling of clinical samples revealed that the abundance of a given clone in vivo correlates with the features of the flanking host genome [22, 24]; although there was not a specific hotspot, HTLV-1 more frequently integrated in transcriptionally active regions of the host genome [22, 24]. These findings further clarified the characteristics of HTLV-1 integration sites, and strongly suggested the importance of HTLV-1 clonal expansion in vivo.
Here we introduce a method that overcomes many of the limitations of currently available methods. Taking advantage of next-generation sequencing (NGS) technology, nested-splinkerette PCR, and a tag system, we designed a new high-throughput method that enables specific isolation of HTLV-1 integration sites and, most importantly, allows for the quantification of clonality not only from the major clones and high-PVL samples but also from low-abundance clones (minor clones) and samples with low PVLs. Moreover, we conducted comprehensive internal validation experiments to assess the effectiveness and accuracy of our new methodology. A preliminary validation was conducted by analyzing DNA samples from HTLV-1-infected individuals with different PVLs and disease status. Subsequently, an internal validation was performed that included an appropriate control with known integration sites and clonality patterns. We present our methodology, which illustrates that employing the tag system is effective for improving quantification of clonal abundance.
Our clonality analysis method included two main aspects: (1) wet experiments, and (2) in silico analysis (Additional file 1: Figure S1). A general explanation of materials and methods is provided here, and detailed protocols of the wet experiments are included in Additional file 1: Notes. The in silico analysis is further described in Results and discussion.
NGS data have been deposited in the Sequence Read Archive of NCBI with access number of (SRP038906).
Biological samples: specimens and cell lines
Specimens: In total five clinical samples were provided by a biomaterial bank of HTLV-1 carriers, JSPFAD [13, 49]. The clinical samples were a part of those collected with an informed consent as a collaborative project of JSPFAD. The project was approved by the Institute of Medical Sciences, the University of Tokyo (IMSUT) Human Genome Research Ethics Committee. Information about the disease status of samples was obtained from JSPFAD database in which HTLV-1-infected individuals were diagnosed based on the Shimoyama criteria . In brief, genomic DNA from PBMCs was isolated using a QIAGEN Blood kit. PVLs were measured by real-time PCR using the ABI PRISM 7000 Sequence Detection System as described in .
Cell lines: An IL2-dependent TL-Om1 cell line  was maintained in RPMI 1640 medium supplemented with 10% heat-inactivated fetal calf serum (GIBCO), 1% penicillin-streptomycin (GIBCO), and 10 ng/mL IL2 (R&D systems). The same conditions as those of patient samples were used to extract DNA and measure PVL.
Illumina-specific library construction
We employed a library preparation protocol specifically designed to isolate HTLV-1 integration sites. The final products in the library that we generated contained all the specific sequences necessary for the Illumina HiSeq 2000 platform (Additional file 1: Figure S2). These products included a 5′-flow cell binding sequence, a region compatible with read-1 sequencing primer, 5-bp random nucleotides, 5-bp known barcodes for multiplexing samples, HTLV-1 long terminal repeat (LTR), human or HTLV-1 genomic DNA, a region compatible with read-2 and read-3 sequencing primers, 8-bp random tags, and a 3′-flow cell binding sequence from 5′ to 3′, respectively (Additional file 1: Figure S2B).
Incorporating the 5-bp random nucleotides downstream of the region compatible with the read-1 sequencing primer was critical and resulted in high-quality sequence data. We used a library designed without the first 5-bp of random nucleotides as input for the HiSeq 2000 sequencer in our first samples (S-1, S-2, S-3, and S-4). Because all fragments began with the same LTR sequence, clusters generated in the flow cells could not be differentiated appropriately. These samples resulted in low-quality sequence data (see Additional file 1: Notes). Designing the first 5-bp randomly resulted in high-quality sequence data for the remaining samples because clusters were differentiated with no problem during the first five cycles of sequencing (data not shown).
DNA isolation: DNA was extracted as described above, and the concentration of extracted DNA was measured with a NanoDrop 2000 spectrophotometer (Thermo Scientific). We recommend using 10 μg of DNA as the starting material. However, in practice there are some rare clinical samples with limited DNA available. In order to be able to handle those samples, the method was also optimized for 5 μg and 2 μg of starting DNA.(2) Fragmentation: According to the protocol provided in Supplementary Notes, the starting template DNA was sheared by sonication. The resulting fragments represented a size range of 300 to 700 bp as checked by an Agilent 2100 Bioanalyzer and DNA 7500 kit (Figure 1B).
Pre-PCR manipulations: Four steps of end repair, A-tailing, adaptor ligation, and size selection were performed as described in Additional file 1: Notes.
PCR: To amplify the junction between the genome and the viral insert, we used nested-splinkerette PCR (a variant of ligation-mediated PCR [52, 53]) (Additional file 1: Figure S2). We confirmed that the technique specifically amplifies HTLV-1 integration sites; since there was no non-specific amplification neither from human endogenous retroviruses nor from an exogenous retrovirus such as HIV (see Additional file 1: Table S1 and Additional file 2: Figure S1).
Information on oligonucleotides, including adaptors and primers, and the LTR and HTLV-1 reference sequences  are provided in Additional file 1: Table S1. The final PCR products were sequenced using the HiSeq 2000 platform.
In silico analysis
Raw sequencing data were processed according to the workflow described in the Results and discussion section. The initial forward read (100-bp) was termed Read-1 and the reverse read (100-bp) was termed Read-3 and an index read (8-bp) was termed Read-2. In brief, analysis programs were written in Perl language and run on a supercomputer system provided by The University of Tokyo’s Human Genome Center at The Institute of Medical Science . The sequencing output was check for quality using the FastQC tool . The regions corresponding to the LTR and HTLV-1 genome were subjected to a blast search against the reference sequences described in Additional file 1: Table S1. Following isolation of the integration sites, the flanking human sequences were mapped to the human genome (hg19) (the UCSC genome browser ) by Bowtie 1.0.0 . The final processed data included information about shear sites (R1R3), tags (R1R2), and a combination of tags and shear sites (R1R2R3). Fitting the data to the zero truncated Poisson distribution for retrieving correlation coefficients were done by the R-package ‘gamlss.tr’ . The Gini coefficient was calculated by StatsDirect medical statistics software .
Results and discussion
We originally designed our method to overcome the limitations of conventional techniques [31, 34] and to make improvements in the only existing high-throughput method . In general, our method includes two main sets of wet experiments and an in-silico analysis. We used genomic DNA as the starting material to prepare an appropriate library for Illumina sequencing. Subsequently, deep-sequencing data were analyzed by a supercomputer. The resulting information represents the clonality status of each sample (Additional file 1: Figure S1).
There are complex populations of infected clones and uninfected cells in a given HTLV-1 infected individual. High-throughput clonality analysis requires monitoring two main characteristics of clones: HTLV-1 integration sites and the number of infected cells in each clone (clone size). Each HTLV-1-infected cell naturally harbors only a single integration site . Therefore, the number of detected unique integration sites corresponds to the number of infected clones. Based on our analysis, which is consistent with the data of Gillet et al. , employing high-sensitivity deep sequencing allowed for the isolation of a large number of unique integration sites (UISs), including samples with low PVLs (Figure 1). We analyzed four samples from HTLV-1-infected individuals with different PVLs, disease status, and expected clonality patterns. The samples include S-1: AC (8% PVL); S-2: smoldering ATL (SM) (9% PVL); S-3: SM (31% PVL); and S-4: acute ATL (33% PVL). Based on the final optimized conditions, 1030, 39, 265, and 384 UISs were isolated from each sample, respectively (Figure 1).
The most challenging aspect of our clonality analysis was estimating the number of infected cells in each clone. Although a necessary step in the analysis, PCR introduces a bias in the frequency of starting DNA material [40–45]. Because amplification causes significant changes in the initial frequency of starting materials, PCR products cannot be used directly to estimate the amount of the starting DNA material. To overcome this problem, we needed to manipulate DNA fragments to make them unique prior to PCR amplification. Thus, if each DNA fragment could be marked with a unique feature, it would then be possible to calculate its frequency based on the frequency of that unique feature. When a single unique stretch of DNA is amplified by PCR, the resulting product is a cluster of identical fragments termed PCR duplicates. Therefore, to estimate the frequency of starting DNA fragments, one should count the number of clusters with unique features. The remaining technical question then becomes how to mark the starting DNA prior to PCR amplification. In the following section, we compare and discuss two main strategies, namely (1) shear sites and (2) a tag system, which enable DNA fragments to be uniquely marked.
Estimating the size of clones by shear sites
The first strategy, described by Gillet et al., relies on shearing DNA by sonication, resulting in fragments of random length . Sonication-derived shear sites were thus used as a distinguishing feature to make fragments unique prior to PCR. Clone sizes were then estimated by statistical approaches [22, 46].
To directly assess the effectiveness of the shear site strategy, we analyzed the clonality of the aforementioned clinical samples (S-1, S-2, S-3, and S-4). Genomic DNA was cleaved by sonication with fragments in the 300- to 700-bp range, theoretically providing approximately 400 possible variations in fragment size (Figure 1A and B). Following library construction, however, the final product represented smaller size ranges, implying a relatively limited number of variations (Figure 1C). Finally, the number of PCR amplicons with unique shear sites was retrieved from deep-sequencing data. See Additional file 2: Figure S2 for a simple image from an integration site and its shear sites. The data obtained from the shear site experiments were not fitted to calibration curves or statistical treatments, which were used by Gillet et al. and Berry et al., respectively (See Additional file 1: Notes) [22, 46]. For clarity, only the information relating to the major clone of each sample is provided in Figure 1D. The shear-site variations of the major clone were 209, 119, 242, and 222 for samples S-1 through S-4, respectively. Even in the case of control samples with 100% PVLs, the shear sites did not provide more than 225 variations (see Validation of the methodology). However, it was expected that samples with differing PVLs and disease status would harbor varying numbers of sister cells, at least in their major clones. Similar variations of shear sites were observed in major clones of AC, SM, and acute samples. These data suggest that, because the number of sister cells in each clone exceeded the shear site variations, the size of the clones was underestimated (Figure 1). This is most problematic in the case of large clones and leads to an underestimation of the clone size.
Measuring the size of clones by the tag system
We analyzed samples S-1, S-2, S-3, and S-4 to assess the effectiveness of our tag system for estimating clone size. The major clone of each sample showed tag variations of 393, 142, 1751, and 2675, respectively (Figure 2D). Similar variations of tags and shear sites were observed in the largest clones of S-1 and S-2 ((shear sites vs. tags): (209 vs. 393) and (119 vs. 142)) (Figure 1D and Figure 2D). In all four samples, those variations were also similar in the minor clones of which the clone sizes did not exceed shear sites variations (approximately <200 variations) (See Additional file 1: Table S3 and Additional file 2: Table S1 for information on the ten largest clones). However, the variations covered by tags were significantly greater than those of shear sites, especially for large clones like those observed in the major clones of S-3 and S-4 ((shear sites vs. tags): (242 vs. 1751) and (222 vs. 2675)). The variations covered by tags and combinations were almost the same for all four samples ((tags vs. combinations): (393 vs. 296), (142 vs. 119), (1751 vs. 1192), and (2675 vs. 2038)).
Upon comparison of the tag system data with the shear site data, it was clear that both strategies yield essentially the same results when the size of clones is small enough to be covered by the number of shear site variations generated. However, the tag system provides a much better estimation of clonality when the number of sister cells in each clone exceeds shear site variations. Therefore, clone size was underestimated when considering only shear sites in expanded clones like samples S-3 and S-4. Given this, our tag system should be used for samples with different clonality status to avoid underestimation of the size of clones. See Additional file 2: Figure S3 for a simple comparison of shear site and tag variations.
Validation of the methodology
Evaluating the accuracy of the clonality analyzed based on shear sites vs. tags system
The errors were reduced in the data using the tag system: 7.27, 5.23, 14.49, and 6.50 for the first trial, and 6.67, 7.07, 10.07, and 13.16 for the second trial. In the case of the combination of tags and shear sites, errors were: 6.98, 4.06, 0.21, and 1.31 for the first trial and 3.42, 10.51, 12.26, and 5.83 for the second trial. Interestingly, the samples ‘tags only’ and ‘combinations’ showed similar error levels. Based on these data, our system showed lower absolute errors than when considering only shear sites (Figure 5) (Additional file 1: Figure S4). Owing to differences in analyzed samples and system setups, we could not directly compare our data with published data [22, 46]. Indirect evidence, however, provided by shear site analysis of our own data illustrated that our system has lower absolute errors than using the shear site-based methodology.
Removing background noise
Data obtained from next-generation sequencers are not error free [40, 62–65]. There are many reports on the error rate of Illumina sequencers [66, 67]. Teemu Kivioja et al. recently developed a system named unique molecular identifiers (UMIs) for quantifying mRNAs and employed filtering criteria to remove false UMIs generated by sequencing errors . In our study, consistent with the data of Kivioja et al. , the sequencing errors produced false tags with low frequencies. A filtering system was required to remove those tags, which could affect interpretation of our clonality data and reduce the accuracy of the clone size measurement. To minimize the effect of sequencing errors on data interpretation, we tested different filtering conditions to remove background noise. Here, we report our proven filtering approach (Additional file 1: Figure S4).
Considering that tags are designed randomly, each tag has an equal probability of being observed. Hence, the distribution of tags should be fitted to the zero truncated Poisson distribution [59, 68]. Therefore, we test data fit to the Poisson distribution to determine the efficacy of each filtering condition. The distribution of tags for each sample was measured by the R-package ‘gamlss.tr’ , and the correlation coefficient was compared before and after filtering (Additional file 1: Figure S6).
We used a filtering system, which we named the merging approach. The merging approach was conducted by clustering the tags and allowing only one mismatch so that unique tags, differing only in one nucleotide (one-mismatch permission), were merged. Subsequently, if the frequency of observed tag reads (PCR duplicates) was greater than 10, those unique tags were employed in further analysis. Otherwise, they were considered as artifacts. We referred to this filtering approach as ‘Join Tag- remove10’ (JT-10) in the Figure legends. To facilitate understanding, these filtering conditions are illustrated in Additional file 1: Figure S4.
The advent of NGS technologies holds promise to reveal the complex nature of neoplasms and to move past the limitations of previous methods. Using different approaches starting from early cytogenetic analysis to later, more elaborate studies with NGS technologies, the clonal composition of different tumors has been analyzed [36–39]. Robust monitoring and tracking of clonal dynamics using provirus integration sites allow for the assessment of clonal composition of HTLV-1-infected individuals from early infection to the final stage of ATL development. To meet the technical requirements for such type of analysis, we combined our expertise in the field of HTLV-1 research and NGS analysis and developed the high-throughput methodology described herein.
Gillet et al. also recently introduced a high-throughput method to extensively characterize HTLV-1 integration site preferences and quantify clonality (further discussed in Additional file 1: Notes) . They statistically analyzed shear site data to estimate clone size. According to their published data [22, 46] and as well as our current data, the limited variation in shear sites leads to an underestimation of the size of large clones. Considering that the incidence of large clones increases with disease progression from the healthy AC state to the malignant states of smoldering, chronic, or acute [22, 46], an accurate measurement of clone size - particularly large clones - is of great clinical significance.
Our study is the first in which the size of large clones was experimentally measured without using statistical estimation. We have provided details of the method design, optimized experiment protocols, and in-silico data processing workflow. To validate our methodology and assess its accuracy, we analyzed eight control samples with known integration sites and clone sizes, and four clinical samples. We subjected the samples to deep sequencing so that they had enough read coverage for each integration site and to ensure accurate measurement of clone size (See Additional file 1: Notes). We proved our methodology to be reliable for isolating large numbers of integration sites and to be accurate for quantifying clone size. Because the tag system could provide a sufficient number of variations regardless of clone size, we were able to demonstrate that the measurements are accurate.
Preliminary experiments on the clinical samples with differing PVLs and disease status showed different clonality patterns specific to AC and different ATL disease subtypes. S-1 was selected to represent still-healthy but infected individuals (ACs), S-2 and S-3 to represent indolent types of ATL, and S-4 to represent a typical aggressive type of ATL. Despite similar PVLs, S-1 and S-2 could be distinguished based on clonality patterns (polyclonal vs. a shift towards oligoclonal): S-1: AC, 8% PVL, and S-2: SM, 9% PVL. The clones of AC showed a uniform distribution pattern with no large difference in clone size; clones of S-2, however, had non-uniform size. S-2 and S-3 (S-3: SM, 31% PVL) are both smoldering subtypes of ATL progression with differing PVLs (9% vs. 31%) and showed similar clonality patterns but a different number of infected cells in each clone. S-3 and S-4 had similar PVL (S-4: acute, 33% PVL) but exhibited different clonality patterns: oligoclonal for S-3 (three or four relatively large clones at the top surrounded with other clones) vs. monoclonal for S-4 (a large major clone surrounded with some small clones in the background). After ranking the clones in order of descending size, we noted that the size of the largest clone in the acute sample was 10 times that of the next clone (tags: (chr X: 83705328 (-)) = 2675 vs. (chr 14: 30655896 (+)) = 209). Relative size of the major clone (chr X: 83705328 (-)) was also estimated by another method (PCR-southern) (detailed information is provided in Additional file 2: Figure S3 and Additional file 2: Supporting experiments). Samples with distinct disease status (AC, SM, and acute) manifested different clone sizes (Additional file 1: Table S3 and Additional file 2: Table S1 include the number of infected cells in the top 10 clones), but S-1 vs. S-2 (0.60 vs. 0.67) and S-3 vs. S-4 (0.84 vs. 0.80) could not be discriminated based on their oligoclonality index (Additional file 1: Figure S7) (See Additional file 1: Notes for further discussion). Therefore, it can be inferred that, with an accurate measurement of clone size, the application of this method will aid in the discrimination of ATL subtypes. These results suggest a possible association between disease status, PVLs, and clonality patterns. Hence, HTLV-1-infected individuals could be classified in different groups based on their clonality patterns, which could ultimately affect their choice of therapy and estimation of prognosis.
Moreover, by interpreting information from previous studies on HTLV-1 clonality [15, 18–20, 22, 27, 31, 32, 35] and considering the data provided in our present paper, it appears that ACs harbor a polyclonal population of HTLV-1-infected cells, whereas ATL patients show monoclonal patterns. Thus, changes in the clonality pattern and onset of a clonal expansion of HTLV-1-infected cells seem to be potentially applicable as a prognostic indicator of ATL onset. For these purposes, it is necessary to analyze appropriate pools of samples from ACs and different subtypes of ATL and to conduct a cohort study on the clonality patterns of the sequential samples available over time.
We took advantage of next-generation sequencing technology, a tag system, and an in-silico analysis pipeline to develop and internally validate a new high-throughput methodology. The method was proved to accurately measure the size of clones by analyzing control samples with already known clone sizes and clinical samples. We also discussed the novelty, significance, and applications of our method, and compared it with the only existing high-throughput method devised by Gillet et al.. Employing our new methodology and the analysis of an appropriate pool of samples provided by JSPFAD  will be helpful not only for diagnosis and prediction but also for elaborated understanding of the underlying mechanism of ATL development. The methodology described here could be adapted to investigate and quantify other genome-integrating elements (such as proviruses, transposons, and vectors in gene therapy). In addition, the tag system can be used for quantifying DNA/RNA fragments in RNA expression  or in metagenomics for determining the size of bacterial populations.
We gratefully appreciate: JSPFAD for providing clinical samples; M. Nakashima and T. Akashi for maintenance of JSPFAD; Sung-Joon Park, Riu Yamashita, and Kuo-ching Liang for their invaluable advice on in-silico analysis; K. Abe, K. Imamura, T. Horiuchi, and M. Tosaka for sequencing technical support; Sara Firouzi and Unes Firouzi for comments on the design of figures.
SF expresses deep respect and gratitude to the NITORI scholarship foundation for supporting her during undergraduate studies.
Computational analyses were provided by the Super Computer System, Human Genome Center, Institute of Medical Science, at The University of Tokyo.
This work was supported by the Japanese Society for the Promotion of Science (JSPS) - DC1 (24.6916 to SF); Third Term Comprehensive Control Research for Cancer, Ministry of Health, Labour and Welfare (H24-G-004 to TW); JSPS KAKENHI (23390250 to TW, 24591383 to TY); and MEXT KAKENHI (221S0001to TW, 221S0002 to YS).
- Takatsuki K: Discovery of adult T-cell leukemia. Retrovirology. 2005, 2: 16-PubMed CentralView ArticlePubMedGoogle Scholar
- Gallo RC: The discovery of the first human retrovirus: HTLV-1 and HTLV-2. Retrovirology. 2005, 2: 17-PubMed CentralView ArticlePubMedGoogle Scholar
- Okamoto T, Ohno Y, Tsugane S, Watanabe S, Shimoyama M, Tajima K, Miwa M, Shimotohno K: Multi-step carcinogenesis model for adult T-cell leukemia. Jpn J Cancer Res. 1989, 80: 191-195.View ArticlePubMedGoogle Scholar
- Yamagishi M, Watanabe T: Molecular hallmarks of adult T cell leukemia. Front Microbiol. 2012, 3: 334-PubMed CentralView ArticlePubMedGoogle Scholar
- Matsuoka M, Jeang K-T: Human T-cell leukaemia virus type 1 (HTLV-1) infectivity and cellular transformation. Nat Rev Cancer. 2007, 7: 270-280.View ArticlePubMedGoogle Scholar
- Yamaguchi K, Watanabe T: Human T lymphotropic virus type-I and adult T-cell leukemia in Japan. Int J Hematol. 2002, 76: 240-245.View ArticlePubMedGoogle Scholar
- Bangham CRM: HTLV-1. eLS. Edited by: John W, Sons L. 2011, Chichester, UK: John Wiley & Sons, Ltd,Google Scholar
- Tsukasaki K, Tobinai K: Biology and treatment of HTLV-1 associated T-cell lymphomas. Best Pract Res Clin Haematol. 2013, 26: 3-14.View ArticlePubMedGoogle Scholar
- Nakano K, Watanabe T: HTLV-1 Rex: the courier of viral messages making use of the host vehicle. Front Microbiol. 2012, 3: 330-PubMed CentralView ArticlePubMedGoogle Scholar
- Iwanaga M, Watanabe T, Utsunomiya A, Okayama A, Uchimaru K, Koh KR, Ogata M, Kikuchi H, Sagara Y, Uozumi K, Mochizuki M, Tsukasaki K, Saburi Y, Yamamura M, Tanaka J, Moriuchi Y, Hino S, Kamihira S, Yamaguchi K: Human T-cell leukemia virus type I (HTLV-1) proviral load and disease progression in asymptomatic HTLV-1 carriers: a nationwide prospective study in Japan. Blood. 2010, 116: 1211-1219.View ArticlePubMedGoogle Scholar
- Iwanaga M, Watanabe T, Yamaguchi K: Adult T-Cell Leukemia: A Review of Epidemiological Evidence. Front Microbiol. 2012, 3: 322-PubMed CentralView ArticlePubMedGoogle Scholar
- Taylor GP, Tosswill JH, Matutes E, Daenke S, Hall S, Bain BJ, Davis R, Thomas D, Rossor M, Bangham CR, Weber JN: Prospective study of HTLV-I infection in an initially asymptomatic cohort. J Acquir Immune Defic Syndr. 1999, 22: 92-100.View ArticlePubMedGoogle Scholar
- Yamaguchi K, Uozumi K, Taguchi H, Kikuch H, Okayama A, Kamihira S, Hino S, Nosaka K, Watanabe T: Nationwide Cohort Study of HTLV-1 Carriers in Japan: Joint Study on Predisposing Factors of ATL Development (JSPFAD). AIDS Res Hum Retroviruses. 2007, 23: 581-600.View ArticleGoogle Scholar
- Taylor GP, Matsuoka M: Natural history of adult T-cell leukemia/lymphoma and approaches to therapy. Oncogene. 2005, 24: 6047-6057.View ArticlePubMedGoogle Scholar
- Tsukasaki K, Tsushima H, Yamamura M, Hata T, Murata K, Maeda T, Atogami S, Sohda H, Momita S, Ideda S, Katamine S, Yamada Y, Kamihira S, Tomonaga M: Integration Patterns of HTLV-I Provirus in Relation to the Clinical Course of ATL: Frequent Clonal Change at Crisis From Indolent Disease. Blood. 1997, 89: 948-956.PubMedGoogle Scholar
- Yoshida M, Seiki M, Yamaguchi K, Takatsuki K: Monoclonal integration of human T-cell leukemia provirus in all primary tumors of adult T-cell leukemia suggests causative role of human T-cell leukemia virus in the disease. Proc Natl Acad Sci U S A. 1984, 81: 2534-2537.PubMed CentralView ArticlePubMedGoogle Scholar
- Cavrois M, Gessain A, Wain-Hobson S, Wattel E: Proliferation of HTLV-1 infected circulating cells in vivo in all asymptomatic carriers and patients with TSP/HAM. Oncogene. 1996, 12: 2419-2423.PubMedGoogle Scholar
- Etoh K, Tamiya S, Yamaguchi K, Okayama A, Tsubouchi H, Ideta T, Mueller N, Takatsuki K, Matsuoka M: Persistent clonal proliferation of human T-lymphotropic virus type I-infected cells in vivo. Cancer Res. 1997, 57: 4862-4867.PubMedGoogle Scholar
- Okayama A, Stuver S, Matsuoka M, Ishizaki J, Tanaka G-i, Kubuki Y, Mueller N, Hsieh C-c, Tachibana N, Tsubouchi H: Role of HTLV-1 proviral DNA load and clonality in the development of adult T-cell leukemia/lymphoma in asymptomatic carriers. Int J Cancer. 2004, 110: 621-625.View ArticlePubMedGoogle Scholar
- G-i T, Okayama A, Watanabe T, Aizawa S, Stuver S, Mueller N, Hsieh C-C, Tsubouchi H: The clonal expansion of human T lymphotropic virus type 1-infected T cells: a comparison between seroconverters and long-term carriers. J Infect Dis. 2005, 191: 1140-1147.View ArticleGoogle Scholar
- Derse D, Crise B, Li Y, Princler G, Lum N, Stewart C, McGrath CF, Hughes SH, Munroe DJ, Wu X: Human T-cell leukemia virus type 1 integration target sites in the human genome: comparison with those of other retroviruses. J Virol. 2007, 81: 6731-6741.PubMed CentralView ArticlePubMedGoogle Scholar
- Gillet NA, Malani N, Melamed A, Gormley N, Carter R, Bentley D, Berry C, Bushman FD, Taylor GP, Bangham CRM: The host genomic environment of the provirus determines the abundance of HTLV-1–infected T-cell clones. Blood. 2011, 117: 3113-3122.PubMed CentralView ArticlePubMedGoogle Scholar
- Cook LB, Rowan AG, Melamed A, Taylor GP, Bangham CRM: HTLV-1-infected T cells contain a single integrated provirus in natural infection. Blood. 2012, 120: 3488-3490.PubMed CentralView ArticlePubMedGoogle Scholar
- Melamed A, Laydon DJ, Gillet NA, Tanaka Y, Taylor GP, Bangham CRM: Genome-wide determinants of proviral targeting, clonal abundance and expression in natural HTLV-1 infection. PLoS Pathog. 2013, 9: e1003271-PubMed CentralView ArticlePubMedGoogle Scholar
- Wattel E, Vartanian JP, Pannetier C, Wain-Hobson S: Clonal expansion of human T-cell leukemia virus type I-infected cells in asymptomatic and symptomatic carriers without malignancy. J Virol. 1995, 69: 2863-2868.PubMed CentralPubMedGoogle Scholar
- Gabet AS, Mortreux F, Talarmin A, Plumelle Y, Leclercq I, Leroy A, Gessain A, Clity E, Joubert M, Wattel E: High circulating proviral load with oligoclonal expansion of HTLV-1 bearing T cells in HTLV-1 carriers with strongyloidiasis. Oncogene. 2000, 19: 4954-4960.View ArticlePubMedGoogle Scholar
- Kamihira S, Iwanaga M, Doi Y, Sasaki D, Mori S, Tsurda K, Nagai K, Uno N, Hasegawa H, Yanagihara K, Morinaga Y, Tsukasaki K, Taniguchi H: Heterogeneity in clonal nature in the smoldering subtype of adult T-cell leukemia: continuity from carrier status to smoldering ATL. Int J Hematol. 2012, 95: 399-408.View ArticlePubMedGoogle Scholar
- Satoh M, Toma H, Sugahara K, Etoh K, Shiroma Y, Kiyuna S, Takara M, Matsuoka M, Yamaguchi K, Nakada K, Fujita K, Kojima S, Hori E, Tanaka Y, Kamihira S, Sato Y, Watanabe T: Involvement of IL-2/IL-2R system activation by parasite antigen in polyclonal expansion of CD4(+)25(+) HTLV-1-infected T-cells in human carriers of both HTLV-1 and S-stercoralis. Oncogene. 2002, 21: 2466-2475.View ArticlePubMedGoogle Scholar
- Ikeda S, Momita S, Kinoshita K-i, Kamihira S, Moriuchi Y, Tsukasaki K, Ito M, Kanda T, Moriuchi R, Nakamura T: Clinical course of human T-lymphotropic virus type I carriers with molecularly detectable monoclonal proliferation of T lymphocytes: defining a low-and high-risk population. Blood. 1993, 82: 2017-2024.PubMedGoogle Scholar
- Yamaguchi K, Seiki M, Yoshida M, Nishimura H, Kawano F, Takatsuki K: The detection of human T cell leukemia virus proviral DNA and its application for classification and diagnosis of T cell malignancy. Blood. 1984, 63: 1235-1240.PubMedGoogle Scholar
- Takemoto S, Matsuoka M, Yamaguchi K, Takatsuki K: A novel diagnostic method of adult T-cell leukemia: monoclonal integration of human T-cell lymphotropic virus type I provirus DNA detected by inverse polymerase chain reaction. Blood. 1994, 84: 3080-3085.PubMedGoogle Scholar
- Imaizumi Y, Iwanaga M, Tsukasaki K, Hata T, Tomonaga M, Ikeda S: Natural course of HTLV-1 carriers with monoclonal proliferation of T lymphocytes (“pre-ATL”) in a 20-year follow-up study. Blood. 2005, 105: 903-904.View ArticlePubMedGoogle Scholar
- Wongstaal F, Hahn B, Manzari V, Colombini S, Franchini G, Gelmann EP, Gallo RC: A survey of human leukemias for sequences of a human retrovirus. Nature. 1983, 302: 626-628.View ArticleGoogle Scholar
- Cavrois M, Wain-Hobson S, Wattel E: Stochastic events in the amplification of HTLV-I integration sites by linker-mediated PCR. Res Virol. 1995, 146: 179-184.View ArticlePubMedGoogle Scholar
- Ohshima K, Mukai Y, Shiraki H, Suzumiya J, Tashiro K, Kikuchi M: Clonal integration and expression of human T-cell lymphotropic virus type I in carriers detected by polymerase chain reaction and inverse PCR. Am J Hematol. 1997, 54: 306-312.View ArticlePubMedGoogle Scholar
- Nowell PC: The clonal evolution of tumor cell populations. Science. 1976, 194: 23-28.View ArticlePubMedGoogle Scholar
- Greaves M, Maley CC: Clonal evolution in cancer. Nature. 2012, 481: 306-313.PubMed CentralView ArticlePubMedGoogle Scholar
- Sprouffske K, Merlo Lauren MF, Gerrish Philip J, Maley Carlo C, Sniegowski Paul D: Cancer in Light of Experimental Evolution. Curr Biol. 2012, 22: R762-R771.PubMed CentralView ArticlePubMedGoogle Scholar
- Melo FDSE, Vermeulen L, Fessler E, Medema JP: Cancer heterogeneity–a multifaceted view. EMBO Rep. 2013, 14: 686-695.PubMed CentralView ArticleGoogle Scholar
- Aird D, Ross MG, Chen W-S, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12: R18-PubMed CentralView ArticlePubMedGoogle Scholar
- Acinas SG, Sarma-Rupavtarm R, Klepac-Ceraj V, Polz MF: PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Appl Environ Microbiol. 2005, 71: 8966-8969.PubMed CentralView ArticlePubMedGoogle Scholar
- Polz MF, Cavanaugh CM: Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microbiol. 1998, 64: 3724-3730.PubMed CentralPubMedGoogle Scholar
- Suzuki MT, Giovannoni SJ: Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl Environ Microbiol. 1996, 62: 625-630.PubMed CentralPubMedGoogle Scholar
- Dabney J, Meyer M: Length and GC-biases during sequencing library amplification: A comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques. 2012, 52: 87-View ArticlePubMedGoogle Scholar
- Raabe CA, Tang TH, Brosius J, Rozhdestvensky TS: Biases in small RNA deep sequencing data. Nucleic Acids Res. 2014, 42: 1414-1426.PubMed CentralView ArticlePubMedGoogle Scholar
- Berry CC, Gillet NA, Melamed A, Gormley N, Bangham CRM, Bushman FD: Estimating abundances of retroviral insertion sites from DNA fragment length data. Bioinformatics. 2012, 28: 755-762.PubMed CentralView ArticlePubMedGoogle Scholar
- Gini C: Sulla misura della concentrazione e della variabilita dei caratteri. Transactions of the Real Istituto Veneto di Scienze. 1914, LIII: 1203-Google Scholar
- Gini C: Concentration and dependency ratios (in Italian). English Translation in Rivista di Politica Economica. 1997, 87: 769-789.Google Scholar
- Biomaterial resource bank of HTLV-1 carriers, Joint Study on Predisposing Factors of ATL Development (JSPFAD). [http://htlv1.org/old/bank-en.html], 
- Shimoyama M: Diagnostic criteria and classification of clinical subtypes of adult T-cell leukaemia-lymphoma. A report from the Lymphoma Study Group (1984–87). Br J Haematol. 1991, 79: 428-437.View ArticlePubMedGoogle Scholar
- Sugamura K, Fujii M, Kannagi M, Sakitani M, Takeuchi M, Hinuma Y: Cell surface phenotypes and expression of viral antigens of various human cell lines carrying human T-cell leukemia virus. Int J Cancer. 1984, 34: 221-228.View ArticlePubMedGoogle Scholar
- Devon RS, Porteous DJ, Brookes AJ: Splinkerettes–improved vectorettes for greater efficiency in PCR walking. Nucleic Acids Res. 1995, 23: 1644-1645.PubMed CentralView ArticlePubMedGoogle Scholar
- Uren AG, Mikkers H, Kool J, van der Weyden L, Lund AH, Wilson CH, Rance R, Jonkers J, van Lohuizen M, Berns A, Adams DJ: A high-throughput splinkerette-PCR method for the isolation and sequencing of retroviral insertion sites. Nat Protoc. 2009, 4: 789-798.PubMed CentralView ArticlePubMedGoogle Scholar
- Seiki M, Hattori S, Hirayama Y, Yoshida M: Human adult T-cell leukemia virus: complete nucleotide sequence of the provirus genome integrated in leukemia cell DNA. Proc Natl Acad Sci. 1983, 80: 3618-3622.PubMed CentralView ArticlePubMedGoogle Scholar
- Human Genome Center (HGC), the institute of medical Science, the University of Tokyo. [http://www.hgc.jp/], 
- FastQC: A quality control tool for high throughput sequence data. [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/], 
- UCSC Genome Browser. [http://genome.ucsc.edu], 
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-PubMed CentralView ArticlePubMedGoogle Scholar
- Stasinopoulos DM, Rigby RA: Generalized additive models for location scale and shape (GAMLSS) in R. J Stat Softw. 2007, 23: 1-46.View ArticleGoogle Scholar
- StatsDirect Medical Statistics Software. [http://www.statsdirect.com/],
- Chatburn RL: Evaluation of instrument error and method agreement. AANA J. 1996, 64: 261-268.PubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36: e105-PubMed CentralView ArticlePubMedGoogle Scholar
- Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER: Whole-genome sequencing and variant discovery in C. elegans. Nat Methods. 2008, 5: 183-188.View ArticlePubMedGoogle Scholar
- Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ: A large genome centre’s improvements to the Illumina sequencing system. Nat Methods. 2008, 5: 1005-1010.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18: 1851-1858.PubMed CentralView ArticlePubMedGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011, 39: e90-e90.PubMed CentralView ArticlePubMedGoogle Scholar
- Minoche AE, Dohm JC, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biol. 2011, 12: R112-PubMed CentralView ArticlePubMedGoogle Scholar
- Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J: Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2012, 9: 72-74.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.