Title: roblanf/sarscov2phylo: 22-7-20

Type: Software

Citation: roblanf (2020): roblanf/sarscov2phylo: 22-7-20. Zenodo. Software. https://zenodo.org/record/3958884

Author: roblanf

Summary

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_07_22_07.fasta -o global.fa -t 34

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 22nd of July 2020 at 9 PM Canberra (Australia) time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Filtering statistics

Sequences downloaded from GISAID: 44915

Alignment stats of global alignment

  • Alignment number: 1
  • Format: aligned FASTA
  • Number of sequences: 44446
  • Alignment length: 29903
  • Total # residues: 1326206221
  • Smallest: 29146
  • Largest: 29903
  • Average length: 29838.6
  • Average identity: 100%

Alignment stats of global alignment after masking sites

  • Alignment number: 1
  • Format: aligned FASTA
  • Number of sequences: 44446
  • Alignment length: 29903
  • Total # residues: 1318812088
  • Smallest: 29059
  • Largest: 29680
  • Average length: 29672.2
  • Average identity: 100%

Alignment stats after filtering out short/ambiguous sequences

  • Alignment number: 1
  • Format: aligned FASTA
  • Number of sequences: 44278
  • Alignment length: 29903
  • Total # residues: 1313831108
  • Smallest: 29059
  • Largest: 29680
  • Average length: 29672.3
  • Average identity: 100%

Alignment stats of global alignment after trimming sites that are >50% gaps

  • Alignment number: 1
  • Format: aligned FASTA
  • Number of sequences: 44278
  • Alignment length: 29661
  • Total # residues: 1310443036
  • Smallest: 28457
  • Largest: 29661
  • Average length: 29595.8
  • Average identity: 100%

After filtering sequences with TreeShrink

  • Type: Phylogram
  • #nodes: 79266
  • #leaves: 44233
  • #dichotomies: 33504
  • #leaf labels: 44233
  • #inner labels: 35031
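The sequence counts in the filtering statistics above imply how many sequences dropped out at each stage. The following is a minimal Python sketch of that arithmetic, using only the numbers reported above. The step labels, variable names, and the helper `removed_per_step` are illustrative shorthand and are not part of the release scripts. The masking step is omitted because it leaves the sequence count unchanged, and the statistics do not say why the global alignment holds fewer sequences than were downloaded.

```python
# Counts copied from the filtering statistics above; labels are shorthand.
counts = [
    ("downloaded from GISAID", 44915),
    ("global alignment", 44446),
    ("after filtering short/ambiguous sequences", 44278),
    ("after TreeShrink (leaves in tree)", 44233),
]


def removed_per_step(steps):
    """Return (label, sequences removed relative to the previous step)."""
    return [
        (label, prev_n - n)
        for (_, prev_n), (label, n) in zip(steps, steps[1:])
    ]


removed = removed_per_step(counts)
for label, n in removed:
    print(f"{label}: {n} sequences removed")

# Overall loss between the raw download and the final tree.
total_removed = counts[0][1] - counts[-1][1]
share = total_removed / counts[0][1]
print(f"total removed: {total_removed} ({share:.1%} of downloads)")

# Alignment columns dropped when trimming sites that are >50% gaps.
alignment_len_before, alignment_len_after = 29903, 29661
columns_trimmed = alignment_len_before - alignment_len_after
print(f"columns trimmed: {columns_trimmed}")
```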

Notable changes to the scripts in this release

None

Notable aspects of the trees

The trees contain a few long branches, particularly on sequences from India. These could be real, or they could be the result of a high rate of sequencing error. If they are real, they would suggest that there are some highly diverged sequences in India. Either way, these sequences should be treated with more caution than the others.
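One way to locate tips like these in a downloaded tree is to compare each terminal branch length with the median. The sketch below is only an illustration under stated assumptions: it is not the method used in this release (which filters with TreeShrink), the function names and the cutoff of 10 times the median are arbitrary choices, the regular expression handles only simple unquoted Newick labels, and the tree string is a made-up toy example rather than data from the release.

```python
import re
from statistics import median

# Matches "name:length" for tips only: in Newick, a tip label directly
# follows "(" or ",". Quoted labels and bracketed comments are not handled.
TIP_PATTERN = re.compile(r"(?<=[(,])\s*([^(),:;\s]+):([0-9.eE+-]+)")


def tip_branch_lengths(newick):
    """Map each tip label to its terminal branch length."""
    return {name: float(length) for name, length in TIP_PATTERN.findall(newick)}


def long_tips(newick, factor=10.0):
    """Return tips whose terminal branch exceeds factor * median length.

    The factor is an assumed, arbitrary cutoff for illustration only.
    """
    lengths = tip_branch_lengths(newick)
    cutoff = factor * median(lengths.values())
    return sorted(name for name, length in lengths.items() if length > cutoff)


# Toy tree with invented labels and branch lengths (not from the release).
toy_tree = "((tipA:0.0001,tipB:0.0002):0.0001,(tipC:0.0001,tipD:0.0050):0.0002);"

print(long_tips(toy_tree))
```

A flagged tip is only a candidate for closer inspection. As noted above, a long branch may reflect real divergence rather than error.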

More information

  • DOI: 10.5281/zenodo.3958884

Dates

  • Publication date: 2020
  • Issued: July 24, 2020

Rights

  • info:eu-repo/semantics/openAccess Open Access

Format

electronic resource

Related items

  • IsSupplementTo: https://github.com/roblanf/sarscov2phylo/tree/22-7-20
  • IsVersionOf: https://doi.org/10.5281/zenodo.3958883
  • IsPartOf: https://zenodo.org/communities/covid-19
  • IsPartOf: https://zenodo.org/communities/zenodo