A note on signal quality control for Illumina SNP genotyping platforms

 

This note describes several scenarios of “low-quality” signal patterns in high-density SNP genotyping platforms, using samples that we genotyped on the Illumina HumanHap550 array as examples. Typically, samples with low-quality signals complicate copy number variation (CNV) detection and analysis, so users of CNV algorithms should pay special attention to these samples: one can either exclude them from CNV analysis, or keep them in the analysis but focus only on large, more confident calls.

 

1.1       Large variance of LRR values

1.2       Large variance of BAF values

1.3       Large variance of both LRR and BAF values

1.4       Failure of one allele

1.5       Upshift and downshift of BAF values

1.6       Random failures of BAF and LRR values

1.7       Complete genotyping failure

1.8       Heterosomic deletion or duplications

1.9       Genomic wave

1.10          Final notes

 

1.1      Large variance of LRR values

 

Sometimes we observe large variation of Log R Ratio (LRR) values around zero, which indicates a low-quality sample. In the PennCNV software, we use the LRR_SD measure, which is calculated as the standard deviation of autosomal LRR values between -2 and 2. Typically, good-quality samples genotyped on the HumanHap550 array have LRR_SD below 0.20. If the LRR_SD is between 0.20 and 0.30, the CNV calls are still acceptable but contain an elevated number of false positives. (Note that by default the PennCNV program applies an SD-adjustment procedure that matches the LRR_SD of the sample to the HMM model in order to reduce false positives.) When LRR_SD is over 0.30, most of the CNV calls are no longer reliable, so it is better to use the BeadStudio software to visually check any CNV calls that are of particular interest. If related family members have been genotyped with good quality, family information can increase the confidence of a CNV call when the same CNV is also present in related individuals.
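The LRR_SD definition and the thresholds above can be sketched in a few lines of Python. This is only an illustration, not PennCNV's own code: the function names are hypothetical, the input is assumed to contain autosomal markers only, the window bounds are treated as inclusive, and the sample (n-1) standard deviation is an assumption.

```python
from statistics import stdev

def lrr_sd(autosomal_lrr):
    """Standard deviation of autosomal LRR values that lie within [-2, 2].

    Assumptions: inclusive bounds and the sample (n-1) estimator.
    """
    kept = [v for v in autosomal_lrr if -2 <= v <= 2]
    return stdev(kept)

def lrr_quality(sd):
    """Map an LRR_SD value onto the three ranges described in the text."""
    if sd < 0.20:
        return "good"
    if sd <= 0.30:
        return "acceptable"   # usable, but expect more false positives
    return "unreliable"       # check calls of interest visually

# Values outside [-2, 2] (here -3 and 5) do not contribute to the SD.
print(round(lrr_sd([-3, -0.1, 0.1, 5]), 4))
print(lrr_quality(0.27))
```

Restricting the calculation to values between -2 and 2 keeps a handful of failed markers from dominating the estimate.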

 

The typical LRR signal pattern for a low-quality sample is shown below. This sample has an LRR_SD of 0.27.

 

 

 

One possible cause of this scenario is the use of non-optimal canonical clustering files. Although Illumina provides standard clustering files (built on HapMap individuals), changes in the genotyping protocol (for example, the reagents used in experiments) may lead to deviations from the canonical clustering positions and therefore to increased variance of LRR values. After using a custom clustering file (built on reference samples genotyped with the same protocol and the same reagents), we have found that the variance of LRR values can be dramatically decreased. However, when comparing CNV calls between two groups (such as cases and controls), it is always a good idea to make sure that both groups have been clustered with the same clustering file.

 

 

1.2      Large variance of BAF values

 

Sometimes we observe normal and clean LRR values, but the BAF values have very large variance. One example is shown below. As we can see, many BAF values are slightly higher than zero or slightly lower than one, and the BAF values scattered around 0.5 have very large variance. In the PennCNV software, we use the BAF_SD score to measure the variation in BAF values: it is calculated as the standard deviation of all autosomal BAF values between 0.25 and 0.75. Samples affected by this scenario will have increased false positive calls for duplication CNVs.

 

 

 

User comments: Kurt Hetrick (Center for Inherited Disease Research, Johns Hopkins University) pointed out that “This figure appears to be the result of mixing two different genomes before going through the Illumina process. When two genomes are not similar and are mixed before processing you typically see a shift in BAF (and can estimate the level of contamination…for example your figure appears to be about 20% contaminated), but your Log R ratio should still be around 0 after smoothing because the average number of copies per probe is still 2. Sometimes it helps to look at the X chromosome in this graph because if the mixture is male/female then your smoothing average should be at 0 (as opposed to +/- 0.25)”.

 

 

This figure shows a more extreme example (possibly multiple samples are mixed):

 

 

 

 

 

1.3      Large variance of both LRR and BAF values

 

Sometimes we observe large variance of both LRR and BAF values. The LRR_SD and BAF_SD values are both large in this case. Samples affected by this scenario will have increased false positive calls for both duplications and deletions.

 

 

This sample has LRR_SD=0.3 and BAF_SD=0.07. Samples like this will generate increased false positive duplications and deletions, so small CNV calls may not be reliable.

 

 

1.4      Failure of one allele

 

See the example below. It appears that the A alleles are generally not measured well, or not normalized well, so that signal intensities from the A allele are generally lower than those from the B allele, leading to decreased LRR values and a slight shift of BAF values for heterozygotes toward 1. The LRR_SD and BAF_SD values are both large in this case. There is no good way to salvage such samples, so it is better to exclude them from CNV analysis.

 

The exact cause of this scenario is unknown.

 

 

1.5      Upshift and downshift of BAF values

 

In this scenario, the BAF values for the genotyped sample tend to be shifted upwards or downwards, as can be seen from the samples below. To capture such events, the PennCNV program uses the BAF_median measure, which is the median of all autosomal BAF values between 0.25 and 0.75. Typically, a BAF_median below 0.45 or above 0.55 causes problems in CNV calling, and many false positive duplication calls will be generated. Right now there is no remedy for this situation, but in the future the PennCNV program may implement a simple “median adjustment” procedure that moves the BAF values upwards or downwards so that BAF_median is 0.5.
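As a rough illustration, BAF_median and the BAF_SD score from section 1.2 can be computed from the same 0.25 to 0.75 window. The sketch below is not PennCNV's implementation: the function names are hypothetical, the input is assumed to hold autosomal BAF values only, and inclusive bounds and the sample standard deviation are assumptions.

```python
from statistics import median, stdev

def baf_summary(autosomal_baf, low=0.25, high=0.75):
    """Median and SD of BAF values inside the heterozygote window.

    Homozygous markers (BAF near 0 or 1) fall outside the window and
    therefore do not influence either statistic.
    """
    window = [b for b in autosomal_baf if low <= b <= high]
    return {"BAF_median": median(window), "BAF_SD": stdev(window)}

def baf_is_shifted(baf_median):
    """Flag the upshift/downshift scenario using the 0.45 and 0.55 cutoffs."""
    return baf_median < 0.45 or baf_median > 0.55

summary = baf_summary([0.0, 0.01, 0.40, 0.42, 0.44, 0.99, 1.0])
print(summary, baf_is_shifted(summary["BAF_median"]))
```

In this toy input only three markers fall inside the window, and their median of 0.42 would be flagged as a downshift.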

 

An example with downshift of BAF values is shown below:

 

 

 

An example with upshift of BAF values is shown below:

 

 

 

The cause of upshifts and downshifts of BAF values is unknown.

 

Sometimes, when there is an upshift or downshift of BAF values, we can observe a clear drop in mean LRR values, indicating that one allele is not contributing as much signal as it should. For example, the figure below shows an obvious drop in LRR values. This is somewhat similar to situation 1.4, but here we do not see an asymmetrical distribution of LRR values around the running mean.

 

 

 

 

 

 

1.6      Random failures of BAF and LRR values

 

This scenario refers to random failures of a very small percentage of BAF and LRR values, such that some non-consecutive markers in the BAF graph are distributed randomly, as opposed to being around 0, 0.5 or 1. This scenario will cause PennCNV to give false positive calls of homozygous deletions (copy number of zero). In the PennCNV software, the BAF_DRIFT measure is used to capture such situations: it is calculated as the median value (across autosomes) of the per-chromosome fraction of markers with BAF between 0.2 and 0.25 or between 0.75 and 0.8. (This is not a perfect measure, so in the future it may be replaced by other measures that examine LRR values instead.) There is no perfect remedy for this situation, but users should pay more attention to copy number zero calls in samples with large BAF_DRIFT.
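The BAF_DRIFT calculation can be sketched as follows, assuming that the two windows apply to BAF values, as the name of the measure suggests. The function name and the dictionary input (chromosome name mapped to a list of BAF values, autosomes only) are hypothetical, and the bounds are treated as inclusive.

```python
from statistics import median

def baf_drift(baf_by_autosome):
    """Median over autosomes of the fraction of markers whose BAF lies in
    [0.2, 0.25] or [0.75, 0.8], i.e. between the expected genotype clusters.
    """
    fractions = []
    for chrom, bafs in baf_by_autosome.items():
        if not bafs:
            continue  # skip chromosomes without markers
        drifting = sum(1 for b in bafs if 0.2 <= b <= 0.25 or 0.75 <= b <= 0.8)
        fractions.append(drifting / len(bafs))
    return median(fractions)

example = {
    "1": [0.0, 0.22, 0.5, 1.0],   # 1 of 4 markers drifts
    "2": [0.0, 0.5, 0.5, 1.0],    # none drift
    "3": [0.78, 0.22, 0.5, 1.0],  # 2 of 4 markers drift
}
print(baf_drift(example))
```

Taking the median across chromosomes rather than pooling all markers keeps a single chromosome with a real large aberration from inflating the score.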

 

This is an extremely rare scenario, and it is not obvious by visual examination except to experienced users. From the following graph, we can see that many markers have random BAF values that do not fall into the AA, AB or BB clusters. Indeed, I have also examined the LRR values and found that 6K markers (out of 550K) have LRR values less than -2, further indicating random failure of these markers.

 

 

 

 

One possible cause is that a small portion of the genotyping array was not correctly hybridized or scanned, or had dried, so SNPs in this region do not give any signal and appear as homozygous deletions. For the above example, I have sorted the markers by their “address”, and found that many of these 6K failed markers cluster together with similar addresses (see below).
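The address check described above can be approximated by sorting markers by address, splitting them into equal-sized bins, and comparing the failure rate per bin. This is a hypothetical sketch: the (address, LRR) tuple layout, the function name and the number of bins are assumptions made for illustration, and a marker is counted as failed when its LRR is below -2, as in the example.

```python
def failure_rate_by_address_bin(markers, n_bins=10, lrr_cutoff=-2.0):
    """Fraction of failed markers in each bin of address-sorted markers.

    markers: iterable of (address, lrr) pairs (hypothetical layout).
    A marker is counted as failed when its LRR is below lrr_cutoff.
    """
    ordered = sorted(markers, key=lambda m: m[0])
    n = len(ordered)
    rates = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            rates.append(0.0)
            continue
        failed = sum(1 for _, lrr in chunk if lrr < lrr_cutoff)
        rates.append(failed / len(chunk))
    return rates

# Toy data: the 10 markers with the highest addresses have failed.
toy = [(addr, -3.0 if addr >= 90 else 0.0) for addr in range(100)]
print(failure_rate_by_address_bin(toy))
```

If failures were spread evenly over the array, all bins would show a similar rate; a few bins with a much higher rate point to a physically localized problem.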

 

 

User comments: Kurt Hetrick (Center for Inherited Disease Research, Johns Hopkins University) pointed out that “these addresses are physically located on one end of the array…which can predispose it to drying during the hybridization process.”, and that “it could be scanning related (we’ve had to rescan arrays in many instances b/c of this…sometimes you can see it when looking at the control codes…one bar of the array will be well out of bounds for these codes on the array)”.

1.7      Complete genotyping failure

 

The signals are pretty much random dots in this scenario. There is a complete genotyping failure for this sample and there is no remedy for that. Either re-type the sample or exclude it from any SNP and CNV analysis.

 

 

 

 

1.8      Heterosomic deletion or duplications

 

Heterosomic aberration refers to chromosome deletion or duplication in sub-populations of cells. It is typically a cell-line specific phenomenon, and it typically affects entire chromosomes or entire chromosome arms. Such aberrations can also be easily spotted by eye in the BeadStudio software. Some examples are shown below:

 

 

 

 

 

The last example in the above graphs is very intriguing: we do not know why there is absolutely no change in LRR values, when there is such a large change in BAF values for this sample. This is probably not a heterosomic aberration but something else. If readers have comments please let us know.

 

Note that this scenario is especially common in chromosome X (some examples are shown in supplementary materials of our PennCNV paper in Genome Research). When a sample is of unknown sex (based on some software that “imputes” sex using chrX genotypes), it is better to check the actual signal patterns: quite possibly you can see a female with heterosomic deletions in chrX.

 

Some extreme examples in chrX are shown below. The LRR in the q-arm is quite strong, and it does not look like a female losing the q-arm in some cells and the p-arm in all cells. The exact cause is unknown, but I put it under this category temporarily.

 

 

 

1.9      Genomic wave

 

Genomic waves refer to the wavy patterns that can be observed in many genomics applications that measure signal intensities along the chromosomes. They are quite common in high-density SNP genotyping platforms such as the HumanHap550 platform, and we have also observed similar patterns in the Affymetrix platform.

 

To facilitate our description, we shall make a few arbitrary definitions for genomic waves. First, we select the q-arm of chromosome 11 as our primary target in visual inspection for the presence of genomic waves: it turns out that this region is one of the waviest regions in the entire genome, and it has a very characteristic cosine pattern, with about one and a quarter periods in the q-arm of chr11. The waves in the p-arm, however, have a much smaller period and are difficult to discern visually, despite the clear presence of genomic waves.

 

A sample with "positive genomic wave" is defined as having downward patterns starting from ~55MB in chr11 q-arm:

 

 

A sample with "negative genomic wave" is defined as having upward patterns starting from ~55MB in chr11 q-arm:

 

 

Genomic waves cause false positive predictions of both duplications and deletions by the PennCNV program. However, we have developed a locus-specific, model-based approach that adjusts the LRR values and generates clean signals for CNV detection. Some examples of wave adjustment are shown below (left panel: before adjustment; right panel: after adjustment):

 

An example of wave adjustment for samples with positive waves is shown below. As we can see, a deletion CNV near 25Mb can hardly be discerned from unadjusted LRR signals, but is very clear from adjusted LRR signals.

 

 

 

 

 

An example of wave adjustment for samples with negative waves:

 

 

 

 

Note that our wave adjustment method only operates on LRR values, without affecting BAF values. Typically, BAF values are much less sensitive to genomic waves and do show consistent patterns even when the LRR is wavy. However, in certain extreme cases, BAF values can also be affected (see the example below). Note that this is an extremely rare scenario, and samples like this should generally be excluded from analysis because the genotype calls are not even reliable (with >10% non-call rate), except to infer LOH (for example, the distal p-arm seems to have LOH) or to do family-based inheritance analysis.

 

 

 

Interestingly, sometimes the same waves are observed, but without any BAF patterns. Such samples should be excluded from any analysis.

 

 

The exact cause of genomic waves is not yet fully understood. Empirically, however, they are associated with too much or too little DNA being used in the hybridization experiment. The more the quantity of DNA deviates from the "normal amount" (750ng in the Illumina Infinium platform), the more obvious the wavy pattern is. Whole-genome amplification can easily cause genomic waves because the quantity of DNA is not well controlled. On the other hand, DNA samples ordered from commercial cell line repositories are much less susceptible to genomic waves due to the stringent quality control used in DNA extraction and quantification.

 

Note that a “positive wave” is caused by too much DNA, while a “negative wave” is caused by too little DNA. As a result, the variance for positive waves is usually much smaller than for negative waves; that is, the signal variation around the smooth line is larger for negative waves. This trend remains after signal adjustment, leading to scenario 1.1 (large variance of LRR values) for samples with negative waves.

 

 

1.10 Final notes

There are many ways in which a genotyped sample may fail to yield satisfactory CNV calls. Generally speaking, however, these signal issues do not affect genotype calls: a high call rate can still be achieved and the calls are highly reproducible. In a genome-wide association study, it is therefore typical to use only a subset of high-quality samples for CNV analysis, rather than the full set of samples that can be used in SNP-based analysis. On the other hand, it would be a waste of information to exclude low-quality samples from CNV analysis; therefore, we will be developing novel ways to reduce the effects of low-quality signals and achieve high-confidence CNV calls.