Creating CCRs with gnomAD 2.1 and new Ensembl builds.

This repo will allow you to create the model for the new Ensembl format and the new INFO-field-heavy gnomAD. This is a great way to get a quickstart on how the model is run. Run these two simple bash commands and it will run all the python scripts and bash code you need to grab all files and create the CCR model files.

First, run:

bash newfiles.sh

to grab everything you need. This will also run bash varmake.sh and create vt decomposed and normalized files for the gnomAD VCF annotated with appropriate VEP fields. It may take a while. This also uses get-chain.py to create the self-chain file and creates the results and data folders in this folder. The output of this script goes entirely into the data folder. The segmental duplication and self-chain files segmental.bed.gz and self-chains.id90.bed.gz are placed into this folder, the GRCh37 FASTA file grch37.fa, the final version of GRCh37 Ensembl exons GTF file Homo_sapiens.GRCh37.85.gtf.gz, the version 2.1 gnomAD VCF gnomad.exomes.r2.1.sites.vcf.bgz, and lastly the coverage file for gnomAD 2.1 labeled as gnomad.exomes.coverage.summary.tsv.bgz. The varmake.sh script takes the gnomAD VCF file and outputs a decomposed, normalized, VEP-annotated file called gnomad2.1-vep-vt.vcf.gz in the same folder.

Then run:

bash newrun.sh

to run newccrs.sh for all versions of the model found in the manuscript. It contains examples of how the code is run to obtain the versions used in the manuscript. It creates output using newccrs.sh in two folders. results/newgnomAD10.x5 is the full autosome-based model. Lastly, results/newXchrom is the X chromosome only version of CCR, dubbed X-CCR, which is also referenced in the paper's supplement.

In each results folder, weightedresiduals-cpg-synonymous-novariant.txt is the final file containing all CCRs with appropriately adjusted percentiles. The resids-cpg-synonymous-novariant.txt is the file before size-weighted percentile adjustment, and exac-regions-novariant.txt is the file of all the regions generated with no model applied to ranking them at all.

Lastly, to create the bgzipped, sorted and tabixed bedfiles, as seen on the CCR browser run:

bash newbeds.sh

which will create the appropriately versioned bgzipped bedfiles in the corresponding results folders.

newccrs.sh

So to run the full autosomal model much as we did in the manuscript, we used the following command:

bash newccrs.sh -c -w -s -v newgnomAD10x.5 -d 10 -d 0.5 -x data/segmental.bed.gz -x data/self-chains.id90.bed.gz -q X -q Y -u

So the options for running newccrs.sh are as follows:

About the model code

newccrs.sh will run exac-regions.py and resid-plot.py (it also runs weightpercentile.py) using the above. If you want more information on the inputs to those individual functions, just run the python scripts with -h to get the descriptions of all the inputs. You do not have to use regions.sh to run the model.

exac-regions.py is the python code that generates the regions. It utilizes the functions in the script utils.py to do so. The functions in utils.py would need to be modified if you wanted to change the way regions are calculated. exac-regions.py runs using parallel processing, so debugging may be difficult for those less experienced.

resid-plot.py is the code to create the model and calculate synonymous density. Synonymous density calculation currently takes a very long time, so it may be in the interest of the user to forgo the calculation if iterating through several different model versions.

weightpercentile.py calculates the length-scaled percentiles off of the far-less balanced ones generated by resid-plot.py. It is highly recommened you use this script to create the final percentiles.