cellspec.pp.load_vcf
- cellspec.pp.load_vcf(vcf_path, show_progress=True, sparse=True, skip_missing_alt=True)
Load a VCF file into an AnnData object.
This function converts a joint-called VCF (bulk or single-cell) into an AnnData object with sparse matrices for memory efficiency. The resulting object has:
- .X: Genotype calls (0=HOM_REF or missing, 1=HET, 3=HOM_ALT)
- .layers['DP']: Total depth per variant per sample/cell
- .layers['AD']: Alternate allele depth per variant per sample/cell
- .var_names: Variant IDs in format 'chr-pos-ref>alt' (pos is 1-based, matching VCF)
- .var['chrom']: Chromosome name for each variant
- .var['pos']: Genomic position for each variant (1-based, matching VCF)
- .obs_names: Sample/cell names from VCF
Duplicate variants (same CHROM-POS-REF>ALT) are detected and only the first occurrence is kept, with a warning issued.
- Parameters:
vcf_path (str) – Path to VCF file (can be .vcf or .vcf.gz)
show_progress (bool, default True) – Show progress bar during loading
sparse (bool, default True) – Use sparse matrices (recommended for single-cell scale data). When True, matrices are built incrementally in CSR chunks so peak memory scales with the number of nonzero entries rather than (n_cells × n_variants).
skip_missing_alt (bool, default True) – Skip variants with missing ALT alleles (represented as “.” in VCF). If False, these variants are included with “.” as the ALT allele.
- Return type:
ad.AnnData
- Returns:
AnnData object with variant calls
Examples
>>> import cellspec as spc
>>> # Load VCF, skipping variants without ALT alleles (default)
>>> adata = spc.pp.load_vcf("joint_calls.vcf.gz")
>>> print(f"Loaded {adata.n_vars} variants across {adata.n_obs} samples/cells")
>>>
>>> # Include variants with missing ALT alleles
>>> adata = spc.pp.load_vcf("joint_calls.vcf.gz", skip_missing_alt=False)
Notes
For large single-cell datasets, sparse=True (the default) is required at realistic joint-VCF scale. Memory usage is then O(nnz), that is, proportional to the actual number of called (cell, variant) entries rather than to the dense matrix size. This makes the in-memory footprint comparable to (and often smaller than) the source VCF.gz on disk.
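To make the O(nnz) claim concrete, here is a rough back-of-the-envelope sketch. It is not cellspec code: the helper names `csr_bytes` and `dense_bytes`, the 4-byte value and index sizes, the dataset dimensions, and the 1% fill rate are all assumptions chosen for illustration. It also treats .X, DP, and AD as if they shared one sparsity pattern, which is only an approximation.

```python
def csr_bytes(n_rows: int, nnz: int, data_itemsize: int = 4, index_itemsize: int = 4) -> int:
    """Approximate bytes held by one CSR matrix: data + column indices + row pointers."""
    return nnz * (data_itemsize + index_itemsize) + (n_rows + 1) * index_itemsize


def dense_bytes(n_rows: int, n_cols: int, itemsize: int = 4) -> int:
    """Bytes held by a dense matrix of the same shape."""
    return n_rows * n_cols * itemsize


# Hypothetical dataset: 10,000 cells x 1,000,000 variants, 1 in 100 entries covered.
n_cells, n_variants = 10_000, 1_000_000
nnz = (n_cells * n_variants) // 100

# Three matrices are stored (.X, layers['DP'], layers['AD']).
sparse_total = 3 * csr_bytes(n_cells, nnz)
dense_total = 3 * dense_bytes(n_cells, n_variants)

print(f"sparse: {sparse_total / 1e9:.1f} GB, dense: {dense_total / 1e9:.1f} GB")
```

Under these assumed numbers the dense layout would need tens of times more memory than the CSR layout; the actual ratio depends on coverage and on the dtypes the loader uses.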
Genotype encoding in .X:
- 0: HOM_REF or missing (use layers['DP'] > 0 to distinguish)
- 1: HET (0/1)
- 3: HOM_ALT (1/1)
UNKNOWN VCF calls (./.) and sites with no read coverage are treated as missing: their entries in .X, .layers['DP'], and .layers['AD'] are all 0. To identify the cells/sites where a real call was made, mask on adata.layers['DP'] > 0.
Variant positions use 1-based indexing matching VCF format (POS field).
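The DP mask described above can be sketched on toy data as follows. The two small matrices are hypothetical stand-ins for `adata.X` and `adata.layers['DP']` (2 cells by 4 variants), built directly with SciPy so the snippet runs without a VCF; the sketch assumes the loader returns SciPy sparse matrices when sparse=True.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for adata.X and adata.layers['DP'] (rows = cells, columns = variants).
X = sparse.csr_matrix(np.array([[0, 1, 3, 0],
                                [0, 0, 1, 3]], dtype=np.int8))
DP = sparse.csr_matrix(np.array([[12, 8, 5, 0],
                                 [0, 7, 9, 4]], dtype=np.int32))

# Entries with read coverage, i.e. where a real call was made.
called = DP > 0

# Per-cell counts: called sites, sites carrying an ALT allele, and the remainder.
n_called = np.asarray(called.sum(axis=1)).ravel()
n_alt = np.asarray((X > 0).sum(axis=1)).ravel()
n_hom_ref = n_called - n_alt          # X == 0 but DP > 0
n_missing = X.shape[1] - n_called     # X == 0 and DP == 0

print("called:", n_called, "HOM_REF:", n_hom_ref, "missing:", n_missing)
```

Working with counts keeps everything sparse; comparing `X == 0` directly would produce a nearly dense result, since most entries are zero.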
By default, variants without ALT alleles (represented as “.” in VCF) are skipped during loading. Set skip_missing_alt=False to include them, with variant IDs like 'chr1-1000-A>.'.
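When variants with a missing ALT are included, downstream code may want to split variant IDs back into their fields and flag the "." entries. The parser below is a hypothetical helper, not part of cellspec; it assumes only the 'chr-pos-ref>alt' format described above and anchors on the numeric position so that contig names containing hyphens still parse.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based, as in the VCF POS field
    ref: str
    alt: str


# chrom may contain '-', so match it greedily and anchor on "-<digits>-<ref>>".
_VARIANT_ID = re.compile(r"^(?P<chrom>.+)-(?P<pos>\d+)-(?P<ref>[^>-]+)>(?P<alt>.+)$")


def parse_variant_id(variant_id: str) -> Variant:
    """Split a 'chr-pos-ref>alt' ID into its four fields."""
    m = _VARIANT_ID.match(variant_id)
    if m is None:
        raise ValueError(f"not a chr-pos-ref>alt variant ID: {variant_id!r}")
    return Variant(m["chrom"], int(m["pos"]), m["ref"], m["alt"])


ids = ["chr1-1000-A>.", "chr1-1200-G>T", "HLA-A-2000-AT>G"]
parsed = [parse_variant_id(v) for v in ids]
missing_alt = [v for v, p in zip(ids, parsed) if p.alt == "."]
print(missing_alt)
```

In practice the same list could come from `adata.var_names`, and the resulting boolean flags could be used to subset the object.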
If duplicate variant records are found (same CHROM-POS-REF>ALT), only the first occurrence is kept. This can happen when merging VCFs or from overlapping variant calls. Use bcftools norm -d exact to deduplicate VCFs.
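The keep-first behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: `keep_first` and the record tuples are hypothetical, and the only details taken from the documentation are the ID format and the rule that the first occurrence wins with a warning.

```python
import warnings
from typing import Iterable, List, Tuple

Record = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def keep_first(records: Iterable[Record]) -> List[str]:
    """Return variant IDs in input order, dropping repeats of the same CHROM-POS-REF>ALT."""
    seen = set()
    kept = []
    n_dup = 0
    for chrom, pos, ref, alt in records:
        variant_id = f"{chrom}-{pos}-{ref}>{alt}"
        if variant_id in seen:
            n_dup += 1
            continue
        seen.add(variant_id)
        kept.append(variant_id)
    if n_dup:
        warnings.warn(
            f"Dropped {n_dup} duplicate variant record(s); kept the first occurrence",
            stacklevel=2,
        )
    return kept


records = [
    ("chr1", 1000, "A", "T"),
    ("chr1", 1000, "A", "T"),  # exact duplicate, dropped
    ("chr1", 1000, "A", "G"),  # same site, different ALT: a distinct variant
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    variant_ids = keep_first(records)
print(variant_ids, len(caught))
```

Note that records sharing a position but differing in ALT are kept, since the key is the full CHROM-POS-REF>ALT string.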