cellspec.pp.load_vcf

cellspec.pp.load_vcf(vcf_path, show_progress=True, sparse=True, skip_missing_alt=True)

Load a VCF file into an AnnData object.

This function converts a joint-called VCF (bulk or single-cell) into an AnnData object with sparse matrices for memory efficiency. The resulting object has:

  • .X: Genotype calls (0=HOM_REF or missing, 1=HET, 3=HOM_ALT)

  • .layers['DP']: Total depth per variant per sample/cell

  • .layers['AD']: Alternate allele depth per variant per sample/cell

  • .var_names: Variant IDs in the format 'chr-pos-ref>alt' (pos is 1-based, matching VCF)

  • .var['chrom']: Chromosome name for each variant

  • .var['pos']: Genomic position for each variant (1-based, matching VCF)

  • .obs_names: Sample/cell names from the VCF

Duplicate variants (same CHROM-POS-REF>ALT) are detected and only the first occurrence is kept, with a warning issued.

Parameters:
  • vcf_path (str) – Path to VCF file (can be .vcf or .vcf.gz)

  • show_progress (bool, default True) – Show progress bar during loading

  • sparse (bool, default True) – Use sparse matrices (recommended for single-cell scale data). When True, matrices are built incrementally in CSR chunks so peak memory scales with the number of nonzero entries rather than (n_cells × n_variants).

  • skip_missing_alt (bool, default True) – Skip variants with missing ALT alleles (represented as “.” in VCF). If False, these variants are included with “.” as the ALT allele.
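The chunked construction mentioned for sparse=True can be illustrated with a small sketch. This is not cellspec's actual implementation; build_csr_in_chunks, its records argument, and chunk_size are hypothetical names chosen for illustration. It assumes variants arrive one at a time (as when streaming a VCF), buffers only the nonzero entries of each chunk, and transposes to cells × variants at the end.

```python
import numpy as np
from scipy import sparse


def build_csr_in_chunks(records, n_cells, chunk_size=2):
    """Assemble a cells x variants CSR matrix from per-variant nonzero entries.

    records: iterable with one item per variant, each a list of
             (cell_index, value) pairs for the nonzero entries only.
    """
    chunks = []
    rows, cols, vals = [], [], []
    n_rows = 0  # variants buffered in the current chunk

    def flush():
        # Convert the buffered triplets into one variants x cells CSR chunk.
        return sparse.csr_matrix(
            (
                np.asarray(vals, dtype=np.int32),
                (np.asarray(rows, dtype=np.int64), np.asarray(cols, dtype=np.int64)),
            ),
            shape=(n_rows, n_cells),
        )

    for entries in records:
        for cell_idx, value in entries:
            rows.append(n_rows)
            cols.append(cell_idx)
            vals.append(value)
        n_rows += 1
        if n_rows == chunk_size:
            chunks.append(flush())
            rows, cols, vals = [], [], []
            n_rows = 0

    if n_rows:
        chunks.append(flush())
    if not chunks:
        return sparse.csr_matrix((n_cells, 0), dtype=np.int32)

    # Stack the variant-major chunks, then transpose to cells x variants.
    return sparse.vstack(chunks, format="csr").T.tocsr()


# Three variants, three cells; the second variant has no nonzero entries.
toy_records = [[(0, 1)], [], [(1, 3), (2, 1)]]
X_toy = build_csr_in_chunks(toy_records, n_cells=3)
```

Only the triplets of the current chunk are held as Python lists, so the temporary overhead is bounded by the chunk size rather than by the full matrix.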

Return type:

AnnData

Returns:

AnnData object with variant calls

Examples

>>> import cellspec as spc
>>> # Load VCF, skipping variants without ALT alleles (default)
>>> adata = spc.pp.load_vcf("joint_calls.vcf.gz")
>>> print(f"Loaded {adata.n_vars} variants across {adata.n_obs} samples/cells")
>>>
>>> # Include variants with missing ALT alleles
>>> adata = spc.pp.load_vcf("joint_calls.vcf.gz", skip_missing_alt=False)

Notes

For large single-cell datasets, sparse=True (the default) is required at realistic joint-VCF scale. Memory usage is then O(nnz), that is, proportional to the actual number of called (cell, variant) entries rather than to the dense matrix size. This makes the in-memory footprint comparable to (and often smaller than) the source VCF.gz on disk.
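To make the O(nnz) claim concrete, the following sketch compares the bytes held by a CSR matrix with the bytes a dense array of the same shape would need. The matrix is synthetic (1% density, chosen arbitrarily) and the helper names csr_nbytes and dense_nbytes are illustrative, not part of cellspec.

```python
import numpy as np
from scipy import sparse


def csr_nbytes(m):
    """Bytes used by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


def dense_nbytes(m):
    """Bytes a dense array of the same shape and dtype would occupy."""
    n_rows, n_cols = m.shape
    return n_rows * n_cols * m.dtype.itemsize


# Synthetic stand-in for a cells x variants matrix with 1% nonzero entries.
X_demo = sparse.random(
    1000, 5000, density=0.01, format="csr", dtype=np.float32, random_state=0
)

sparse_bytes = csr_nbytes(X_demo)
dense_bytes = dense_nbytes(X_demo)
print(f"CSR: {sparse_bytes / 1e6:.2f} MB, dense: {dense_bytes / 1e6:.2f} MB")
```

The same two helpers can be applied to adata.X or to any layer of a loaded object, as long as it is stored in CSR format.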

Genotype encoding in .X:

  • 0: HOM_REF or missing (use layers['DP'] > 0 to distinguish)

  • 1: HET (0/1)

  • 3: HOM_ALT (1/1)

UNKNOWN VCF calls (./.) and sites with no read coverage are treated as missing: their entries in .X, .layers['DP'], and .layers['AD'] are all 0. To identify the cells/sites where a real call was made, mask on adata.layers['DP'] > 0.
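As an illustration of the DP mask, the sketch below separates HOM_REF from missing entries and counts each genotype class per variant. It uses small toy matrices in place of adata.X and adata.layers['DP'], and genotype_counts is a hypothetical helper rather than a cellspec function.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for adata.X and adata.layers['DP'] (3 cells x 3 variants).
X = sparse.csr_matrix(np.array([[0, 1, 3], [0, 0, 1], [0, 3, 0]]))
DP = sparse.csr_matrix(np.array([[5, 8, 4], [0, 3, 6], [2, 7, 0]]))


def genotype_counts(X, DP):
    """Per-variant counts of HOM_REF, HET, HOM_ALT and missing entries."""

    def col_sum(m):
        # Sparse sums return a matrix or an array; flatten to 1-D either way.
        return np.asarray(m.sum(axis=0)).ravel()

    called = col_sum(DP > 0)   # entries with at least one read
    het = col_sum(X == 1)
    hom_alt = col_sum(X == 3)
    hom_ref = called - het - hom_alt  # called, but stored as 0 in X
    missing = X.shape[0] - called
    return {"hom_ref": hom_ref, "het": het, "hom_alt": hom_alt, "missing": missing}


counts = genotype_counts(X, DP)
```

With a loaded object, the same function could be called as genotype_counts(adata.X, adata.layers['DP']), assuming both are stored as SciPy sparse matrices.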

Variant positions use 1-based indexing matching VCF format (POS field).
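Tools that expect 0-based, half-open coordinates (such as BED) need a conversion from these 1-based positions. A minimal sketch, assuming the 'chr-pos-ref>alt' ID format described above and that the ALT allele contains no hyphen; variant_to_bed is a hypothetical helper, not part of cellspec:

```python
def variant_to_bed(variant_id):
    """Convert 'chr-pos-ref>alt' (1-based pos) to a 0-based, half-open interval."""
    # Split from the right so chromosome names containing '-' stay intact.
    chrom, pos, alleles = variant_id.rsplit("-", 2)
    # REF cannot contain '>', so the first '>' separates REF from ALT.
    ref, _, _alt = alleles.partition(">")
    start = int(pos) - 1     # 1-based POS -> 0-based start
    end = start + len(ref)   # the interval covers the REF allele
    return chrom, start, end


snv_interval = variant_to_bed("chr1-1000-A>G")
deletion_interval = variant_to_bed("chr2-500-ACG>A")
```

When only the chromosome and position are needed, adata.var['chrom'] and adata.var['pos'] already provide them without parsing the ID.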

By default, variants without ALT alleles (represented as “.” in VCF) are skipped during loading. Set skip_missing_alt=False to include them; their variant IDs then end in “.”, for example 'chr1-1000-A>.'

If duplicate variant records are found (same CHROM-POS-REF>ALT), only the first occurrence is kept. This can happen when merging VCFs or from overlapping variant calls. Use bcftools norm -d exact to deduplicate VCFs.
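To check a VCF for duplicates before loading, a scan such as the following may help. It is a sketch under assumptions: the key is built from the raw CHROM, POS, REF and ALT columns, which may differ from how cellspec handles multi-allelic records, and find_duplicate_variants is a hypothetical helper. The demo writes a small temporary VCF so the block runs on its own.

```python
import gzip
import os
import tempfile
from collections import Counter


def find_duplicate_variants(vcf_path):
    """Return {variant_key: count} for keys that occur more than once."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    seen = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            seen[f"{chrom}-{pos}-{ref}>{alt}"] += 1
    return {key: n for key, n in seen.items() if n > 1}


# Demo on a tiny temporary VCF with one duplicated record.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vcf")
    with open(demo_path, "w") as out:
        out.write("##fileformat=VCFv4.2\n")
        out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        out.write("chr1\t1000\t.\tA\tG\t.\t.\t.\n")
        out.write("chr1\t1000\t.\tA\tG\t.\t.\t.\n")
        out.write("chr1\t2000\t.\tC\tT\t.\t.\t.\n")
    duplicates = find_duplicate_variants(demo_path)
```

An empty result means no record shares a CHROM-POS-REF>ALT key with another, so load_vcf should not need to drop anything as a duplicate.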