cellspec: Single-cell mutation spectrum analysis

A Python package for analyzing variant calls from high-throughput single-cell genome sequencing experiments. It provides a convenient scanpy-style API for loading joint-calling VCF files into AnnData objects and performing downstream processing and analysis tasks, including:

  • Coverage analysis and filtering

  • Annotating the ancestral trinucleotide sequence context of SNPs

  • Computing trinucleotide mutation spectra

  • Visualizing mutation spectra

In development:

  • Sequencing error / artifact correction

  • Mutation signature fitting and de novo signature discovery

  • Phylogenetic analysis

    • Distance-based

    • Maximum likelihood

    • Bayesian

  • eQTL analysis (using genome-transcriptome coassay data)
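To make the trinucleotide mutation spectrum mentioned in the feature list concrete, here is a minimal pure-Python sketch. It is not cellspec's implementation: the function names (`mutation_type`, `spectrum`) are hypothetical, and the choice to collapse each SNP onto the strand where the ancestral base is a pyrimidine (C or T) is an assumption borrowed from one common convention. cellspec may label or collapse mutation types differently.

```python
from collections import Counter
from itertools import product

# Translation table used to complement a DNA sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mutation_type(context: str, alt: str) -> str:
    """Label a SNP as e.g. 'ACA>ATA'.

    context: ancestral trinucleotide with the mutated base in the middle.
    alt:     derived (mutant) base.
    Assumption: SNPs with a purine ancestral base are flipped to the
    opposite strand, so the middle base is always C or T.
    """
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    return f"{context}>{context[0]}{alt}{context[2]}"


# The 96 strand-collapsed mutation types: 2 ancestral bases x 3 derived
# bases x 16 flanking-base combinations.
ALL_TYPES = [
    f"{five}{ref}{three}>{five}{alt}{three}"
    for ref in "CT"
    for alt in "ACGT"
    if alt != ref
    for five, three in product("ACGT", repeat=2)
]


def spectrum(snps):
    """Count mutation types over an iterable of (context, alt) pairs."""
    counts = Counter(mutation_type(ctx, alt) for ctx, alt in snps)
    return {t: counts.get(t, 0) for t in ALL_TYPES}


# Toy input: the first two SNPs are the same event seen on opposite strands.
toy_snps = [("ACA", "T"), ("TGT", "A"), ("AGC", "T")]
toy_spectrum = spectrum(toy_snps)
```

In this toy input, the first two SNPs collapse to the same type, so the spectrum has two nonzero entries. Computing such a dictionary per cell would give one row of a cells-by-types matrix.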

Installation

Install the latest development version:

git clone https://github.com/harrispopgen/cellspec.git
cd cellspec
pip install -e .

Getting started

As an homage to the semi-permeable capsule (SPC) technology that spurred the need for this package, I encourage the following convention when importing cellspec:

import cellspec as spc

cellspec uses the AnnData class to store joint-calling data.

AnnData schema diagram: https://raw.githubusercontent.com/scverse/anndata/main/docs/_static/img/anndata_schema.svg

At the most basic level, an AnnData object adata stores a data matrix adata.X, annotation of observations adata.obs and variables adata.var as pd.DataFrame and unstructured annotation adata.uns as dict. Names of observations and variables can be accessed via adata.obs_names and adata.var_names, respectively. AnnData objects can be sliced like dataframes, for example, adata_subset = adata[:, list_of_gene_names].

In cellspec, observations are cells (or samples), and variables are bi-allelic sites. Genotype calls from the VCF file are stored in adata.X, and depth information is stored in adata.layers: total read depth at each site in each cell is in adata.layers["DP"], and alternate allele read depth is in adata.layers["AD"].

To load a VCF file into an AnnData object:

adata = spc.pp.load_vcf(filename)

This initial step can be slow, especially for datasets with many alleles. It is therefore a good idea to save your data in .h5ad format for more convenient loading in the future:

adata.write_h5ad(filename)

Please refer to the documentation and tutorials for further instructions, and to the API documentation for details on specific functionality.

Contents