CCR Scripts

This repository lets you build the model as we did. To run the model as described in the paper, go to Examples --> gnomAD above. Newer versions are under Examples --> New Versions.

exac-regions.py

This script generates the regions devoid of genetic variation. The regions themselves are computed by functions in utils.py, which is described in its own section below.

Save the output to a file; by default it is printed to stdout. exac-regions.py uses the multiprocessing package, so debugging may be difficult for those less experienced with it.
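One common way to make multiprocessing code easier to debug is to keep a serial code path that skips the worker pool entirely. The sketch below illustrates that general pattern only; `process_chunk` and `run` are hypothetical names and do not come from exac-regions.py.

```python
from multiprocessing import Pool


def process_chunk(chunk):
    # Hypothetical stand-in for the per-chunk work; not a function from this repo.
    return sum(chunk)


def run(chunks, processes=1):
    """Apply process_chunk to every chunk, serially when processes == 1."""
    if processes == 1:
        # Serial path: tracebacks and breakpoints behave as in ordinary code.
        return [process_chunk(c) for c in chunks]
    # Parallel path: same work, distributed over a pool of worker processes.
    with Pool(processes) as pool:
        return pool.map(process_chunk, chunks)


if __name__ == "__main__":
    print(run([[1, 2], [3, 4]], processes=1))
```

With a switch like this, a failure can first be reproduced on the serial path before the parallel path is turned back on.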

resid-plot.py

This script takes the output of exac-regions.py and builds the model for the regions. It also calculates synonymous density. That calculation currently takes a very long time, so you may want to skip it when iterating through several model versions. In the future it should be straightforward to add more terms to the model, or to switch from linear regression to logistic regression or another ML model. The script uses one function from utils.py, issynonymous, to determine whether a variant is synonymous.
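As a rough illustration of what a linear-regression step followed by a percentile column can look like, here is a minimal pure-Python sketch. It is not the code in resid-plot.py: the predictor `xs`, the response `ys`, and the choice to rank regions by residual are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_percentiles(xs, ys):
    """Return (residuals, percentiles); percentile = 100 * rank / n (assumed)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Rank regions from the smallest residual to the largest.
    order = sorted(range(len(residuals)), key=residuals.__getitem__)
    percentiles = [0.0] * len(residuals)
    for rank, index in enumerate(order, start=1):
        percentiles[index] = 100.0 * rank / len(residuals)
    return residuals, percentiles


if __name__ == "__main__":
    _, pct = residual_percentiles([0, 1, 2, 3], [0, 2, 1, 3])
    print(pct)
```

Swapping in a different model would only change how the predicted value inside `residual_percentiles` is computed; the ranking step would stay the same.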

This script writes its output to stdout. Capture it in a file, then sort that file by percentile (the last column), keeping the header as the first line, before passing it to weightpercentile.py.
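The sorting step can be sketched in a few lines of Python. This assumes the captured output is tab-separated with a numeric percentile in the last column, and that the most constrained (highest percentile) regions should come first; check both assumptions against your own output before relying on it. The function name `sort_by_percentile` is made up for this example.

```python
def sort_by_percentile(lines, descending=True, sep="\t"):
    """Sort data rows by the last column, keeping the header as the first line."""
    # Drop blank lines and trailing newlines before splitting off the header.
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    header, rows = cleaned[0], cleaned[1:]
    # Sort numerically on the last field; the direction is an assumption.
    rows.sort(key=lambda row: float(row.split(sep)[-1]), reverse=descending)
    return [header] + rows


if __name__ == "__main__":
    example = ["region\tpercentile\n", "a\t10.0\n", "b\t90.0\n", "c\t50.0\n"]
    print("\n".join(sort_by_percentile(example)))
```

To use it on a real file, pass an open file handle as `lines` and write the returned list back out, one entry per line.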

weightpercentile.py

This script takes a single input: the file from resid-plot.py, sorted by percentile, with the header kept as the first line.

This script rescales the percentiles produced by resid-plot.py. It starts at 100 (the most constrained region) and scales the value down for each subsequent region in proportion to the fraction of the protein-coding exome length already covered by the preceding regions. We highly recommend using this script to create the final percentiles, because the percentiles created by resid-plot.py are very imbalanced. The output of this script contains the final CCRs for your input variant set and exons of choice. This is the end of the pipeline.
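One reading of the rescaling described above can be written as a short function. This is an interpretation of the description, not the code in weightpercentile.py: it assumes the regions arrive ordered from most to least constrained, and that each region's value is 100 times the fraction of total region length not yet covered by the regions before it.

```python
def weighted_percentiles(lengths):
    """Length-weighted percentiles for regions ordered most constrained first.

    lengths: region lengths in base pairs (assumed input format).
    """
    total = sum(lengths)
    covered = 0  # bases covered by the regions seen so far
    result = []
    for length in lengths:
        # The first region gets 100; later regions drop by the covered fraction.
        result.append(100.0 * (total - covered) / total)
        covered += length
    return result


if __name__ == "__main__":
    print(weighted_percentiles([10, 30, 60]))
```

Under this reading, a long region uses up a large share of the scale, so the regions after it drop further than they would under a purely rank-based percentile.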

utils.py

This module contains all the functions needed to create the regions, as well as a few other useful functions. Each function has several doctests that check that it works as expected.
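For readers new to doctests, the sketch below shows the general shape of a doctested function and how to run its tests. The `overlaps` helper is hypothetical and is not taken from utils.py; it only illustrates the convention.

```python
import doctest


def overlaps(a, b):
    """Return True if half-open intervals a and b share at least one position.

    Hypothetical helper for illustration; it is not part of utils.py.

    >>> overlaps((1, 5), (4, 8))
    True
    >>> overlaps((1, 5), (5, 8))
    False
    """
    return a[0] < b[1] and b[0] < a[1]


if __name__ == "__main__":
    # Runs every doctest in this module and prints a summary of the results.
    print(doctest.testmod())
```

Assuming utils.py follows the standard doctest conventions, its tests can be run from the command line with `python -m doctest -v utils.py`.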