How to create a SNP distance matrix for microbial samples
Introduction
The analysis of single nucleotide polymorphisms (SNPs) is a cornerstone of microbial genomic epidemiology, enabling researchers to infer transmission networks, track outbreaks, and understand evolutionary relationships. SNP matrices are often used to visualize the relatedness and clonality of bacterial or fungal samples. This guide provides a walkthrough of generating SNP distance matrices using command-line tools, and a streamlined alternative in the cloud using the Solu Platform.
How to generate a SNP matrix
Option 1: using command-line tools
1. Install required tools
Ensure the following tools are installed and accessible via your command-line environment:
- snippy: A rapid variant calling and core genome alignment tool for bacterial genomes
- snippy-core: A companion tool to snippy that generates a core genome alignment from multiple samples
- snp-dists: Computes pairwise SNP distances from a core genome alignment
Example installation via Bioconda:
conda install -c bioconda snippy snp-dists
2. Prepare your input data
- Reference genome: Select a high-quality reference genome in GenBank (.gbk) or FASTA (.fa) format. Ensure the reference is representative of your sample population
- Sample reads: Organize paired-end or single-end FASTQ files for each isolate
3. Run snippy for variant calling
snippy --outdir sample1 --ref reference.fa --R1 sample1_R1.fastq.gz --R2 sample1_R2.fastq.gz
Repeat this step for all samples. Key outputs per sample include:
- snps.vcf: Variant calls
- snps.aligned.fa: Alignment to the reference
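Running the command by hand for every isolate gets tedious, so the per-sample step can be scripted. The sketch below is one possible approach, assuming a hypothetical layout in which paired reads sit in one directory as <sample>_R1.fastq.gz and <sample>_R2.fastq.gz. The function name and the layout are illustrative and not part of snippy. It only prints the snippy commands, so they can be reviewed before anything runs.

```shell
#!/usr/bin/env bash
# Print one snippy command per sample found in a reads directory.
# Assumed (hypothetical) layout:
#   <reads_dir>/<sample>_R1.fastq.gz and <reads_dir>/<sample>_R2.fastq.gz
print_snippy_commands() {
  local ref="$1" reads_dir="$2" r1 r2 sample
  for r1 in "$reads_dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue                      # the glob matched nothing
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"           # expected mate file
    sample="$(basename "$r1" _R1.fastq.gz)"       # sample name doubles as --outdir
    if [ ! -e "$r2" ]; then
      echo "warning: no R2 file for $sample, skipping" >&2
      continue
    fi
    # %q shell-quotes each argument so unusual file names stay intact
    printf 'snippy --outdir %q --ref %q --R1 %q --R2 %q\n' \
      "$sample" "$ref" "$r1" "$r2"
  done
}
```

For example, print_snippy_commands reference.fa reads > snippy_commands.sh writes the commands to a file that can be inspected and then executed with bash snippy_commands.sh.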
4. Generate a multiple sequence alignment with snippy-core
snippy-core --prefix core sample1 sample2 ... sampleN
Snippy-core outputs:
- core.aln: core genome alignment
- core.full.aln: full alignment
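snippy-core expects every directory it is given to hold a finished snippy run. As a hedged convenience, the sketch below lists only the sample directories that contain a non-empty snps.aligned.fa, the per-sample alignment that snippy writes. Using that file as the completion check is an assumption of this example, and the helper name completed_samples is hypothetical.

```shell
#!/usr/bin/env bash
# List sample directories under a parent directory that look like finished
# snippy runs, judged (as an assumption) by a non-empty snps.aligned.fa.
completed_samples() {
  local parent="${1:-.}" dir
  for dir in "$parent"/*/; do
    dir="${dir%/}"                                # drop the trailing slash
    [ -d "$dir" ] || continue                     # the glob matched nothing
    if [ -s "$dir/snps.aligned.fa" ]; then
      printf '%s\n' "$dir"
    else
      echo "warning: $dir has no snps.aligned.fa, skipping" >&2
    fi
  done
}
```

The printed directories can then be passed to snippy-core in place of the hand-typed sample list, for instance by reading them into a bash array with mapfile -t samples < <(completed_samples .) and calling snippy-core --prefix core "${samples[@]}".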
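5. Compute the SNP distance matrix with snp-dists
snp-dists core.full.aln > snp_distance_matrix.tsv
By default snp-dists writes a tab-separated matrix. Add the -c flag if you want comma-separated output instead.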
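Once the matrix exists, a common next step is to list the sample pairs that fall within some SNP threshold. The sketch below is a minimal example, assuming a square matrix with sample names in the first row and first column, as snp-dists produces. The function name pairs_within is hypothetical, and the threshold is whatever the analyst chooses, since no cutoff fits every organism. The separator defaults to a tab; pass a comma as the third argument if the matrix was written with -c.

```shell
#!/usr/bin/env bash
# Print each unordered sample pair whose SNP distance is <= a threshold.
# Usage: pairs_within <matrix_file> <threshold> [separator]
pairs_within() {
  local matrix="$1" threshold="$2" sep=$'\t'
  if [ -n "${3:-}" ]; then sep="$3"; fi
  awk -F"$sep" -v t="$threshold" '
    # Header row: remember the sample name for each column
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
    # Data rows: column NR is the diagonal, so start one past it
    # to visit only the upper triangle and report each pair once
    {
      for (i = NR + 1; i <= NF; i++)
        if ($i + 0 <= t + 0) print $1 "\t" name[i] "\t" $i
    }
  ' "$matrix"
}
```

For example, pairs_within snp_distance_matrix.tsv 10 prints one tab-separated line per pair (first sample, second sample, distance) with at most 10 SNPs between them.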
Advanced considerations
- Reusing the same reference in multiple runs ensures comparability across studies
- With snippy-core's core genome definition (the core.aln approach), adding or removing samples alters the core and can therefore change SNP distances. Including divergent outgroups may shrink the core to zero.
- Recombinant regions are usually masked with tools like Gubbins to avoid overestimating SNP distances
- It is good practice to examine how well your samples align to the reference.
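One rough way to examine how well samples align is to count, for each sequence in core.full.aln, how many positions are an uppercase A, C, G, or T. Treating every other character (gaps, N, lowercase or masked bases) as uncalled is an assumption of this sketch rather than an official snippy metric, and the function name aligned_fraction is hypothetical.

```shell
#!/usr/bin/env bash
# For each sequence in a FASTA alignment, report:
#   name, uppercase A/C/G/T count, total length, called fraction
# Anything that is not uppercase A/C/G/T is counted as uncalled (assumption).
aligned_fraction() {
  awk '
    function report() {
      if (name != "")
        printf "%s\t%d\t%d\t%.4f\n", name, called, total, (total ? called / total : 0)
    }
    /^>/ { report(); name = substr($1, 2); called = 0; total = 0; next }
    {
      seq = $0
      total += length(seq)
      called += gsub(/[ACGT]/, "", seq)   # gsub returns the number of matches
    }
    END { report() }
  ' "$1"
}
```

Running aligned_fraction core.full.aln gives one line per sample. A sample whose called fraction is far below the others may be a poor match for the chosen reference or may have low sequencing coverage.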
Option 2: automate with Solu
For researchers seeking a faster, more user-friendly solution, the Solu Platform simplifies SNP distance matrix computation into just one click. Solu eliminates the need for manual setup and ensures accurate, reproducible results.
- Upload data: Upload your sequencing reads or assembled genomes to Solu in FASTA or FASTQ format.
- Automated analysis: Solu automatically selects a reference genome, runs snippy or SKA, generates a multiple sequence alignment, filters out low quality SNPs and recombination regions, and calculates the whole genome SNP distances.
- Export results: View the results in the UI or download the SNP distance matrix as an Excel or CSV file.
By automating the process, Solu saves time, makes different runs comparable, reduces errors, and makes advanced bioinformatics accessible to all researchers.


Conclusion
SNP distance analysis is a vital tool in bioinformatics for understanding relatedness of samples in an outbreak context. Constructing high-quality SNP distance matrices can be technically challenging and resource-intensive. Solu Platform offers a seamless, automated solution, enabling researchers to focus on their scientific goals rather than computational complexities. Explore the platform today and see how it can streamline your research workflow.
FAQs
Q: Can I use this method for large datasets?
A: Yes, but for extremely large datasets, consider setting up a cloud computing instance instead of running the analysis on your own computer. Solu Platform is optimized to handle large datasets efficiently.
Q: What if my files are in a different format?
A: Both snippy and Solu Platform work with genomic data in FASTA or FASTQ format. Solu Platform also automatically standardizes the files to the correct format to reduce errors.
Q: Is Solu Platform suitable for beginners?
A: Absolutely! Solu has an intuitive interface and requires no installation or configuration, making it ideal for researchers at all skill levels.
Q: Can I customize the analysis parameters on Solu?
A: No. Solu is designed as a zero-configuration tool, which keeps results reproducible, and it has been validated against real outbreak scenarios.
Q: How secure is my data on Solu Platform?
A: Solu prioritizes data security and complies with industry standards (HIPAA, ISO 27001) to ensure your data is protected at all times.
Q: Can Solu handle mixed datasets (e.g., reads and assembled genomes)?
A: Yes, Solu can process mixed datasets seamlessly, making it a versatile tool for diverse research needs.
Get started for free
Create your free Solu Platform account today to start analyzing genomes.