Introduction

This introduction will demonstrate usage of the main commands required to run geneCNV and step you through an example analysis using data provided in the package.

Install geneCNV

Go to GitHub to download the source code and see full installation instructions.

Briefly, to install the package and any unsatisfied dependencies:

git clone https://github.com/GenePeeks/geneCNV.git
cd geneCNV
pip install -r requirements.txt
python setup.py install

Get coverage counts

To get started, generate coverage counts across relevant targets and samples using the create-matrix command.

You must first provide a BED file of relevant targets in this format:

X   32867834    32867947    Ex3 DMD
X   33038245    33038327    Ex2 DMD
X   33229388    33229673    Ex1 DMD

The first four fields (chromosome, start position, end position, label) are required, while the fifth is optional. An example BED file for the DMD gene (example_dmd_baseline.bed) is provided in the test_data directory of the package.

Note

Baseline target specification If you are using baseline targets, these targets should have Baseline as part of the target label.

In addition to a BED file, you must provide a text file of paths to the sample BAM files in this format:

/path/to/file1.bam
/path/to/file2.bam

An example create-matrix command looks like:

genecnv create-matrix test_data/example_dmd_baseline.bed training_samples.fofn \
training_sample_coverage.csv --targetArgfile dmd_baseline_targets.pickle

Serialized target/argument files (targetArgfiles) can be optionally produced with this command. You only need to produce a target/argument file once for a specific set of targets. An example output CSV for this command is provided in test_data. This can be used to run the subsequent train-model command.

Train the model with normal samples

Next you’ll estimate the model hyperparameters (train the model) using the samples included in the coverage count matrix. These should be “normal” samples without known CNVs in the (non-baseline) targets of interest.

Note

How should I select “normal” samples?

  • If you are unsure which of your samples should be considered “normal”, you can generate a coverage matrix and then examine the coverage distributions across samples using a dataframe analysis tool. You can then produce another coverage matrix after removing any noisy or problematic samples.
  • It is important that all sequencing data used for analysis with geneCNV has been produced with the same sequencing pipeline.

To train the model, run the following:

genecnv train-model dmd_baseline_targets.pickle test_data/training_sample_coverage.csv \
dmd_baseline_params.pickle --use_baseline_sum

Baseline autosomal targets are used to identify absolute copy number when no CNVs are present, and help provide more accurate results overall. Including baseline targets can also allow you to identify the sex of a sample when targets on the X chromosome are being tested. Baseline targets are not analyzed for copy number and are assumed to have copy number of 2.

If you are using a large number of baseline targets (>20), it’s recommended to use the optional --use_baseline_sum argument when calling train-model. This reduces the total number of baseline targets to one during training.

Evaluate samples for CNVs

Once parameters have been estimated from an appropriate set of training samples, they can be used to perform copy number analysis for the relevant targets on a test sample with the evaluate-sample command. Here you can pass simply a sample BAM file or a coverage matrix CSV (generated using the same targets).

To evaluate the first test sample in the file test_data/test_female_sample_coverage.csv use the following command:

genecnv evaluate-sample test_data/test_female_sample_coverage.csv dmd_baseline_params.pickle \
normal_female_results

This command will produce three output files with the provided prefix, normal_female_results.txt, which provides the posterior probabilities and copy numbers for all relevant targets, normal_female_results_summary.txt which provides a summary of any CNVs detected, and normal_female_results.pdf, which provides a visualization of the copy numbers and posterior probabilities across targets.

Depending on the number of total targets and MCMC iterations needed for convergence, the sample evaluation may take up to 10-12 minutes to complete. By default it takes advantage of multiple cores, but this can be turned off with the option --use_single_process.

For further detail and options, see the CLI documentation.