PennCNV - an algorithm for copy number variation detection

Abstract: Comprehensive identification and cataloging of copy number variations (CNVs) is required to provide a complete view of human genetic variation. The resolution of CNV detection in previous experimental designs has been limited to tens or hundreds of kilobases. Here we present PennCNV, a hidden Markov model (HMM) based approach, for kilobase-resolution detection of CNVs from Illumina high-density SNP genotyping data. This algorithm incorporates multiple sources of information, including total signal intensity and allelic intensity ratio at each SNP marker, the distance between neighboring SNPs, the allele frequency of SNPs and the pedigree information where available. We applied PennCNV to genotyping data generated for 112 HapMap individuals; on average we detected ~27 CNVs for each individual with a median size of ~12kb. Excluding common rearrangements in lymphoblastoid cell lines, the fraction of CNVs in offspring not detected in parents (CNV-NDPs) was 3.3%. Our results demonstrate the feasibility of whole-genome fine-mapping of CNVs via high-density SNP genotyping.
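
To make the HMM idea in the abstract more concrete, here is a minimal sketch of Viterbi decoding over a per-SNP intensity signal, with a transition probability that depends on the distance between neighboring SNPs. It is a toy illustration, not the PennCNV model: it uses only three states and a single intensity value per SNP, it ignores the allelic intensity ratio, allele frequency, and pedigree information, and every name and parameter value (STATES, LRR_MEAN, LRR_SD, DIST_SCALE, the 0.01 and 0.4 constants) is a hypothetical choice made for this example.

```python
import math

# Hypothetical toy parameters, chosen for illustration only (not PennCNV's values).
STATES = ("loss", "normal", "gain")
LRR_MEAN = {"loss": -0.6, "normal": 0.0, "gain": 0.4}  # assumed mean intensity per state
LRR_SD = 0.2                                           # assumed shared standard deviation
DIST_SCALE = 100_000                                   # assumed length scale in base pairs


def log_gauss(x, mean, sd):
    """Log density of a normal distribution, used as the emission model."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def log_transition(prev, cur, distance):
    """Log transition probability; a state change is more likely across larger gaps."""
    p_change = 0.01 + 0.4 * (1.0 - math.exp(-distance / DIST_SCALE))
    if prev == cur:
        return math.log(1.0 - p_change)
    return math.log(p_change / (len(STATES) - 1))


def viterbi(positions, lrr):
    """Return the most likely state sequence for the given SNP positions and intensities."""
    if not lrr:
        return []
    # Uniform prior over states for the first SNP.
    score = {s: math.log(1.0 / len(STATES)) + log_gauss(lrr[0], LRR_MEAN[s], LRR_SD)
             for s in STATES}
    back = []
    for i in range(1, len(lrr)):
        d = positions[i] - positions[i - 1]
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur, d))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur, d)
                              + log_gauss(lrr[i], LRR_MEAN[cur], LRR_SD))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up example: eight SNPs spaced 1 kb apart, with a four-SNP dip in intensity.
example_positions = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
example_lrr = [0.02, -0.05, -0.62, -0.58, -0.65, -0.60, 0.03, 0.01]
example_path = viterbi(example_positions, example_lrr)
```

With these made-up values, the four consecutive low-intensity SNPs are decoded as a "loss" segment, while a single isolated low value would be absorbed into the "normal" state because two state changes cost more than one poorly fitting observation.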

Software: The latest PennCNV software (2008jun26 version) can be downloaded as a ZIP file or a tar.gz file. PennCNV runs on 32/64-bit Linux and 32-bit Windows. Other system architectures require compiling the software from source (the source code is included in the downloaded package). The ZIP file is the preferred choice for Windows users, since many unzipping programs on Windows have problems handling tar.gz files.

Tutorial: This tutorial gives step-by-step instructions on using PennCNV. An example data set containing genotyping data for three individuals is used in the tutorial: download this ~100MB example data set (as BeadStudio project files) here. Alternatively, if you do not have the BeadStudio software, you can download a compressed ~20MB example data set (as a tab-delimited text file, exported from the BeadStudio project file) here and skip "step 1" in the tutorial.

Tutorial: This tutorial illustrates some quality control issues in genotyping data and may help users better handle low-quality samples in CNV analysis.

Tutorial: This tutorial describes the procedure for running PennCNV within BeadStudio on Windows. This procedure is currently in beta testing.

Data set: This "Serial dilution data set" (as a BeadStudio project file) contains a sample genotyped five times, with a 2-fold dilution each time (courtesy of the Maris lab). It provides solid evidence that DNA quantity, not quality, leads to wavy signal patterns. It may also serve to evaluate the consistency of CNV calling algorithms on duplicated samples.


The CNV calls generated on 112 HapMap individuals can be visualized in Genome Browser (human May 2004 assembly) here. Alternatively, they can be downloaded as text files in BED format here for visualization in Genome Browser.
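
For readers who want to inspect the BED files outside Genome Browser, the following is a minimal sketch of reading BED lines and summarizing CNV sizes. It assumes only the standard BED layout (chromosome, 0-based start, end-exclusive end in the first three columns) and skips "track", "browser", and comment lines; the function names parse_bed and summarize, and the sample lines, are hypothetical and do not come from the PennCNV package or the actual HapMap call files.

```python
import statistics


def parse_bed(lines):
    """Return (chrom, start, end) tuples from an iterable of BED-format lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and Genome Browser header or comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def summarize(intervals):
    """Count the intervals and compute their median size in base pairs."""
    # BED coordinates are 0-based and end-exclusive, so size is end - start.
    sizes = [end - start for _, start, end in intervals]
    return {
        "count": len(sizes),
        "median_size": statistics.median(sizes) if sizes else None,
    }


# Made-up lines for illustration; real files would be read with open(path).
sample_lines = [
    "track name=example_cnv",
    "chr1\t100\t1100\tcall_a",
    "chr2\t5000\t17000\tcall_b",
    "",
    "chr3\t0\t500",
]
summary = summarize(parse_bed(sample_lines))
print(summary)
```

To run this on a downloaded file, pass an open file handle to parse_bed, since it accepts any iterable of lines.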

Version history (read the manual with "detect_cnv.pl -m" for more information):