PennCNV-Affy Protocol for CNV detection in Affymetrix SNP arrays

                                      
PennCNV is a software package for copy number variation (CNV) detection from signal intensity of SNP genotyping arrays. PennCNV-Affy is a data-preprocessing protocol for Affymetrix genome-wide 5.0, genome-wide 6.0, and potentially Mapping 500K SNP arrays. The protocol converts raw CEL files into a signal intensity file that contains Log R Ratio (LRR) and B Allele Frequency (BAF) values that can be subsequently used by PennCNV for CNV calling.

Affymetrix CNV calling overview
Step 1. Generate the signal intensity data based on raw CEL files
Substep 1.1 Generate genotyping calls from CEL files
Substep 1.2 Allele-specific signal extraction from CEL files
Substep 1.3 Generate canonical genotype clustering file
Substep 1.4 LRR and BAF calculation
Step 2: Split the signal file into individual files for CNV calling by PennCNV

 

Affymetrix CNV calling overview

The procedure below outlines how to process raw CEL files to generate canonical genotype clusters, convert the signal intensity for each sample to LRR/BAF values, and then generate CNV calls. For this protocol to work, one needs at least 100 CEL files to generate a reasonably good clustering file. If the user has only a few CEL files, it is necessary to use the default canonical clustering file in the PennCNV-Affy package, but in this case the CNV calls may not be reliable.

 

Step 1. Generate the signal intensity data based on raw CEL files

The goal of the first step is to write the cross-marker normalized signal intensity data from an Affymetrix genotyping project to a text file, so that it can be analyzed subsequently by the PennCNV software. This step has 4 substeps.

Suppose all the files from a genotyping project are stored in a directory called gw6/. Under this directory, there are several sub-directories: a CEL/ directory that stores the raw CEL files for each genotyped sample, and a lib/ directory that stores library and annotation files provided by Affymetrix and by PennCNV-Affy. Output files will be written to the apt/ directory.

We need to download the PennCNV software from http://www.neurogenome.org/cnv/penncnv/penncnv.latest.tar.gz and uncompress the file.

Next download the PennCNV-Affy programs and library files from http://www.neurogenome.org/cnv/penncnv/gw6.tar.gz and uncompress the file. These files are required for signal pre-processing and also for CNV calling. There will be a lib/ directory that contains some PennCNV-specific library files for genome-wide 6.0 array; in addition, the library files for the genome-wide 5.0 array are in the libgw5/ directory.

Next download the Affymetrix Power Tools (APT) software package from http://www.affymetrix.com/support/developer/powertools/index.affx. We need to log into the website to download the software (the registration is free).

 

Substep 1.1 Generate genotyping calls from CEL files

This step uses the apt-probeset-genotype program in Affymetrix Power Tools (APT) to generate genotyping calls from the raw CEL files using the Birdseed algorithm (for the genome-wide 6.0 array) or the BRLMM-P algorithm (for the genome-wide 5.0 array). Note that genotype calling requires a large number of CEL files.

Genome-wide 6.0 array

Before performing this step, we need to download the library files for the genome-wide 6.0 array from http://www.affymetrix.com/Auth/support/downloads/library_files/genomewidesnp6_libraryfile.zip, and save the decompressed files to the lib/ directory. Several files in this directory, including a CDF file and a Birdseed model file, will be used in the genotype calling step.

[kai@node126 ~/project/affycnv/gw6]$ apt-probeset-genotype -c lib/GenomeWideSNP_6.cdf -a birdseed --read-models-birdseed lib/GenomeWideSNP_6.birdseed.models --special-snps lib/GenomeWideSNP_6.specialSNPs --out-dir apt --cel-files listfile

The above command generates genotyping calls using all CEL files specified in the listfile, and generates several output files in the apt/ directory. The listfile contains a list of CEL file names, with one name per line, and with the first line being “cel_files”. The output files for this command include birdseed.confidences.txt, birdseed.report.txt and birdseed.calls.txt.
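
The listfile can be written by hand, but for large projects it may be easier to generate it. The following is a minimal sketch, assuming the CEL files sit directly under one directory and end in .CEL; the function name make_cel_list is hypothetical and is not part of APT or PennCNV-Affy.

```shell
# Hypothetical helper (not part of APT or PennCNV-Affy): write a list file
# with the required "cel_files" header followed by one CEL path per line.
# Assumes the CEL files sit directly under the given directory and end in .CEL.
make_cel_list() {
    cel_dir=$1
    out=$2
    printf 'cel_files\n' > "$out"
    for f in "$cel_dir"/*.CEL; do
        # Skip the unexpanded pattern when the directory has no CEL files.
        if [ -e "$f" ]; then
            printf '%s\n' "$f" >> "$out"
        fi
    done
}
```

For example, running make_cel_list CEL listfile from the gw6/ directory would produce a listfile whose first line is cel_files and whose remaining lines are paths such as CEL/10918.CEL.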

On a typical modern computer, the command should take less than one day for 1000-2000 CEL files. The command usually works well with the default parameters set by APT. However, the default --block-size value sometimes does not work and crashes the computer; in this case, the user is advised to check the error log files, find the block-size value automatically chosen by APT, halve it, and run the program again.

Genome-wide 5.0 array

For genome-wide 5.0 arrays, the command line is slightly different. First download the CDF and model files for GW5 array from http://www.affymetrix.com/Auth/support/downloads/library_files/genomewidesnp5_libraryfile_rev1.zip and http://www.affymetrix.com/Auth/support/downloads/library_files/GenomeWideSNP_5.r2.zip. Then save decompressed files to the lib/ directory.  There are several CDF files but we will need to use the GenomeWideSNP_5.Full.r2.cdf file. The genotype calling can be done using a command like this:

[kai@node126 ~/project/affycnv/gw5]$ apt-probeset-genotype -c lib/GenomeWideSNP_5.Full.r2.cdf --chrX-snps lib/GenomeWideSNP_5.Full.chrx --read-models-brlmmp lib/GenomeWideSNP_5.models -a brlmm-p --out-dir apt --cel-files listfile

Mapping 500K array

For Mapping 500K array set with Nsp and Sty arrays, the genotype calling and signal extraction need to be done separately for each array. The command for genotype calling should use brlmm (instead of brlmm-p) as the algorithm.

 

Substep 1.2 Allele-specific signal extraction from CEL files

This step uses the Affymetrix Power Tools software to extract allele-specific signal values from the raw CEL files. Here “allele-specific” refers to the fact that for each SNP, we have a signal measure for the A allele and a separate signal measure for the B allele.

Genome-wide 6.0 array

An example command is given below:

[kai@node123 ~/project/affycnv/gw6]$ apt-probeset-summarize --cdf-file lib/GenomeWideSNP_6.cdf --analysis quant-norm.sketch=50000,pm-only,med-polish,expr.genotype=true --target-sketch lib/hapmap.quant-norm.normalization-target.txt --out-dir apt --cel-files listfile

The above command reads signal intensity values for PM probes in all the CEL files specified in listfile, applies quantile normalization to the values, applies median polish to the data, and then generates signal intensity values for the A and B alleles of each SNP. The file hapmap.quant-norm.normalization-target.txt is provided in the PennCNV-Affy package: it was generated using all HapMap samples and serves as a reference quantile distribution in the normalization process, so that the quantile normalization procedures for different genotyping projects are more comparable to each other.

Genome-wide 5.0 array

For genome-wide 5.0 arrays, the target-sketch can be found in the libgw5/ directory. An example command is given below:

[kai@node123 ~/project/affycnv/gw5]$ apt-probeset-summarize --cdf-file lib/GenomeWideSNP_5.Full.r2.cdf --analysis quant-norm.sketch=50000,pm-only,med-polish,expr.genotype=true --target-sketch libgw5/agre.quant-norm.normalization-target.txt --out-dir apt --cel-files listfile

Mapping 500K array

The signal extraction needs to be done for each array separately. The pm-only option needs to be used in the --analysis argument, since the Mapping 500K array contains both PM and MM probes for each probe set.

A note on normalization target: the “--target-sketch” argument above gives a reference signal distribution, such that a new array can be normalized using the percentiles in the reference distribution. It is useful and necessary if the user has only a few dozen CEL files, and wants to use the default clustering file provided in the PennCNV-Affy package.

On a typical modern computer, the command should take less than one day for 1000-2000 CEL files. Several output files will be generated in the apt/ directory, including quant-norm.pm-only.med-polish.expr.summary.txt, which contains the signal values (one allele per line, one sample per column). For SNP probes, two lines are used per probe, for the A and B alleles, respectively. For non-polymorphic probes (so-called CN probes), only one line is used per probe.
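
A quick sanity check on the summary file is to count how many lines belong to each allele and how many belong to CN probes. The sketch below is only illustrative and rests on assumptions that the protocol does not state: that metadata lines start with "#", that the column header line starts with probeset_id, and that the two lines of a SNP probe end in -A and -B. The function name count_summary_probes is hypothetical.

```shell
# Hypothetical helper: count SNP and CN probe lines in an APT summary file.
# Assumptions (not stated in the protocol): metadata lines start with "#",
# the column header line starts with "probeset_id", and the two lines for a
# SNP carry "-A" and "-B" suffixes on the probe name.
count_summary_probes() {
    awk -F '\t' '
        /^#/ || $1 == "probeset_id" { next }   # skip metadata and header
        $1 ~ /-A$/ { a++; next }               # A-allele line of a SNP probe
        $1 ~ /-B$/ { b++; next }               # B-allele line of a SNP probe
        { cn++ }                               # anything else: one-line CN probe
        END { printf "A_lines=%d B_lines=%d CN_lines=%d\n", a, b, cn }
    ' "$1"
}
```

If the assumptions hold, the A and B counts should be equal, since every SNP probe contributes one line per allele.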

 

Substep 1.3 Generate canonical genotype clustering file

This step generates a file that contains the parameters for the canonical clustering information for each SNP or CN marker, such that this file can be used later on to calculate LRR and BAF values.

If the user has only a few dozen CEL files, it is unlikely that a clustering file can be generated successfully and accurately. In that case, one can skip this step and go directly to substep 1.4, using the default clustering file provided in the PennCNV-Affy package (hapmap.genocluster for the GW6 array, and agre.genocluster for the GW5 array).

Genome-wide 6.0 array
                                                                                                                                                                                                                                                                                                                                                                                     
To generate canonical genotype clusters, use the generate_affy_geno_cluster.pl program in the downloaded PennCNV-Affy package (see gw6/bin/ directory).

[kai@beta ~/project/affycnv/gw6/apt]$ generate_affy_geno_cluster.pl birdseed.calls.txt birdseed.confidences.txt quant-norm.pm-only.med-polish.expr.summary.txt -locfile ../lib/affygw6.hg18.pfb -sexfile file_sex -out gw6.genocluster

The affygw6.hg18.pfb file, provided in the PennCNV-Affy package, contains the annotated marker positions in the hg18 (NCBI 36) human genome assembly. The file_sex file is a two-column file that annotates the sex of each CEL file, one file per line; each line contains the file name and the sex separated by a tab. The file_sex file is important for chrX and chrY markers: only females are used for constructing canonical clusters for chrX markers, and only males are used for constructing canonical clusters for chrY markers. For example, the first few lines of a file_sex file are shown below:

10918.CEL       male
10924.CEL       male
11321_2.CEL     female
10998.CEL       female
11039.CEL       female
11345.CEL       female
10909.CEL       female
11035.CEL       female
11569_2.CEL     female

Alternatively, one can use 1 to specify male and 2 to specify female in the sexfile. If the sex of a CEL file is not known, the file does not need to be included in the sexfile.
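
Because a malformed sexfile can silently affect the chrX and chrY clusters, it may be worth checking the file before running generate_affy_geno_cluster.pl. The sketch below is one possible check, not part of PennCNV-Affy: the function name check_sexfile is hypothetical, and it only enforces the format described above (two tab-separated fields, with the sex given as male, female, 1 or 2).

```shell
# Hypothetical helper: report sexfile lines that do not have exactly two
# tab-separated fields or whose sex is not male, female, 1 or 2.
# Returns a non-zero status if any such line is found.
check_sexfile() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^(male|female|1|2)$/ {
            printf "line %d: %s\n", NR, $0
            bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

Running check_sexfile file_sex prints nothing for a well-formed file and lists the offending lines otherwise.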

If the --sexfile argument is not provided, then chrX and chrY markers will not be processed and the resulting cluster file is only suitable for autosome CNV detection!

For a typical modern computer, the command should take several hours to process files generated from 1000-2000 CEL files.

Genome-wide 5.0 array

An example command is given below:

[kai@beta ~/project/affycnv/gw5/apt]$ generate_affy_geno_cluster.pl brlmm-p.calls.txt brlmm-p.confidences.txt quant-norm.pm-only.med-polish.expr.summary.txt -locfile ../libgw5/affygw5.hg18.pfb -sexfile file_sex -out gw5.genocluster

Mapping 500K array

A command similar to those for the genome-wide arrays should be run for the Nsp and Sty arrays separately.

 

Substep 1.4 LRR and BAF calculation

This step uses the allele-specific signal intensity measures generated in substep 1.2 to calculate the Log R Ratio (LRR) values and the B Allele Frequency (BAF) values for each marker in each individual. The normalize_affy_geno_cluster.pl program in the downloaded PennCNV-Affy package (see gw6/bin/ directory) is used:

[kai@adenine ~/project/affycnv/gw6/apt]$ normalize_affy_geno_cluster.pl gw6.genocluster quant-norm.pm-only.med-polish.expr.summary.txt -locfile ../lib/affygw6.hg18.pfb -out gw6.lrr_baf.txt

The above command generates LRR and BAF values using the summary file generated in substep 1.2 and the cluster file gw6.genocluster generated in substep 1.3. The location file specifies the chromosome position of each SNP or CN probe, and this information is printed in the output file as well to facilitate future data processing.

For a typical modern computer, the command should take several hours to process files generated from 1000-2000 CEL files. A new tab-delimited file called gw6.lrr_baf.txt will be generated that contains one SNP per line and one sample per two columns (LRR column and BAF column).

For Mapping 500K arrays, the two signal intensity files can now be concatenated into one single file for subsequent analysis.

If the user does not have a sufficient number of CEL files for substeps 1.1 and 1.3 above, the default canonical clustering file provided in the PennCNV-Affy package can be used instead. Right now several files are provided: hapmap.genocluster for GW6 arrays, agre.genocluster for GW5 arrays, and affy500k.nsp.genocluster/affy500k.sty.genocluster for Mapping 500K arrays. The results will not be optimal and are probably highly unreliable (the QC measures during PennCNV calling can give some clue on the signal-to-noise ratio of the resulting signal intensity files).

 

Step 2: Split the signal file into individual files for CNV calling by PennCNV

The first few lines and first few columns of the tab-delimited gw6.lrr_baf.txt file may look like this:


Name            Chr   Position   NA06985_GW6_C.CEL.Log R Ratio   NA06985_GW6_C.CEL.B Allele Freq   NA06991_GW6_C.CEL.Log R Ratio   NA06991_GW6_C.CEL.B Allele Freq
SNP_A-2131660   1     1145994    0.0068                          0.5156                            -0.3452                         0.4954
SNP_A-1967418   1     2224111    0.1564                          0.9621                            -0.0146                         1
SNP_A-1969580   1     2319424    0.0129                          1                                 0.259                           0.9814
SNP_A-4263484   1     2543484    0.0189                          0.5012                            0.1096                          0
SNP_A-1978185   1     2926730    -0.2488                         0.0392                            -0.107                          0
SNP_A-4264431   1     2941694    0.0273                          0                                 0.0116                          0
SNP_A-1980898   1     3084986    -0.0671                         1                                 0.2788                          1
SNP_A-1983139   1     3155127    0.0988                          0                                 -0.1636                         0.0813
SNP_A-4265735   1     3292731    -0.0172                         0.5129                            -0.2476                         0.5207

 

The first line is referred to as the “header” line, which contains information on the meaning of each column. Each subsequent line contains the values for one SNP across all individuals.
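
To confirm which samples are present before splitting, the sample names can be read off the header line. The sketch below is illustrative only: it assumes the layout shown above (three leading columns, Name, Chr and Position, followed by a "Log R Ratio" and a "B Allele Freq" column for each sample), and the function name list_signal_samples is hypothetical.

```shell
# Hypothetical helper: print the sample names found in the header line of a
# combined LRR/BAF file. Assumes three leading columns (Name, Chr, Position)
# followed by a "<sample>.Log R Ratio" / "<sample>.B Allele Freq" pair per sample.
list_signal_samples() {
    head -n 1 "$1" | awk -F '\t' '{
        for (i = 4; i <= NF; i += 2) {
            name = $i
            sub(/\.Log R Ratio$/, "", name)   # drop the column-type suffix
            print name
        }
    }'
}
```

For the example file above, list_signal_samples gw6.lrr_baf.txt would print NA06985_GW6_C.CEL and NA06991_GW6_C.CEL, one per line.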

After this file is generated, we need to split this large file into individual signal intensity files (one for each subject). We can then follow the procedures outlined in the PennCNV tutorial and generate the CNV calls using a procedure similar to that for Illumina arrays. The only difference is that the Affymetrix-specific HMM file (affygw6.hmm) and PFB file (affygw6.hg18.pfb or affygw5.hg18.pfb) should be used for Affymetrix CNV calling. Several basic steps are briefly described below:

Use the kcolumn.pl program to split the gw6.lrr_baf.txt file. The “split 2” argument should be used, since every two columns (Log R Ratio and B Allele Frequency) should go into one output file. An example is given below:

kcolumn.pl gw6.lrr_baf.txt split 2 -tab -head 3 -name -out gw6

Many files will be generated, each containing LRR/BAF values for one sample. The output file names are gw6.NA06985_GW6_C, gw6.NA06991_GW6_C, and so on (since the "-out gw6" argument is specified). Now generate a text file called signallistfile, which contains one signal file name per line, to be used by PennCNV. For example, first line is gw6.NA06985_GW6_C, second line is gw6.NA06991_GW6_C. Unlike the list file used by APT in previous steps, no header line is needed in the signallistfile to be processed by PennCNV.
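
The signallistfile can be generated from the split output rather than typed by hand. The following is a minimal sketch, assuming all split files share the prefix given by -out (here gw6.) and live in the same directory as other gw6.* files from this protocol, which are skipped by name; the function name make_signal_list is hypothetical.

```shell
# Hypothetical helper: write one signal file name per line, with no header.
# Assumes the split files all start with the given prefix (for example "gw6.").
# Other files from this protocol that share the prefix are skipped by suffix.
make_signal_list() {
    prefix=$1
    out=$2
    : > "$out"
    for f in "$prefix"*; do
        case $f in
            *.lrr_baf.txt|*.genocluster|*.log|*.rawcnv) continue ;;
        esac
        # Skip the unexpanded pattern when nothing matches the prefix.
        if [ -e "$f" ]; then
            printf '%s\n' "$f" >> "$out"
        fi
    done
}
```

For example, make_signal_list gw6. signallistfile would list gw6.NA06985_GW6_C, gw6.NA06991_GW6_C and so on. If other files in the directory share the prefix, the exclusion list needs to be extended accordingly.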

Tip: In the above command, if the -name argument is not used, the output file names will be something like gw6.split1, gw6.split2 and so on. The -name argument tells the program that the output file name should be based on the first word in the first line of the input file (for example, NA06985_GW6_C and NA06991_GW6_C). In addition, you can add the "--beforestring _GW6" argument to the above command; this means that the string before "_GW6" should be used as the output file name, so that the output file names are gw6.NA06985, gw6.NA06991 and so on.

Finally generate the CNV calls. An example is given below:

detect_cnv.pl -test -hmm lib/affygw6.hmm -pfb lib/affygw6.hg18.pfb -list signallistfile -log gw6.log -out gw6.rawcnv

In the above command, the signallistfile is the file that contains all the signal file names, one per line. Alternatively, we can just specify a few signal file names in the command line without using the -list argument.

If there are obvious patterns of waviness in the input signal intensity files, one can supply the --gcmodel argument in detect_cnv.pl, and use the affygw6.hg18.gcmodel or affygw5.hg18.gcmodel or affy500k.hg18.gcmodel as model files for GC adjustment. (Furthermore, the affygw6.hg18.snpgenemap and affygw5.hg18.snpgenemap and affy500k.hg18.snpgenemap files in PennCNV-Affy provide mapping of SNP markers or CN markers to nearby or overlapping RefSeq genes. They are not useful in CNV calling though.)