Scripts

We write lots of simple Python and R scripts for our research. Most solve fairly specific problems, but here is a smattering that might be useful:

  • ncbitax_msa.py — Annotates the taxon names in a multiple sequence alignment (MSA) with some information about their classification from the NCBI taxonomy database. For example, you can use it to append family names to all of the taxon names, which could be helpful when setting up monophyly constraints for a BEAST search.
  • ncbtax_tips.py — This is similar to the previous script, but it appends taxonomic information to the tip labels of a phylogeny.
  • ncbitax_guide_tree.py — Gets the NCBI classification for each taxon in an MSA, and then builds a Newick tree from those classifications. You could then use that tree as a constraint in a phylogeny estimate, for cases in which you trust the classification and are more interested in relationships within taxa, or in divergence times. For this script to work, this R script needs to be in the working directory.
  • msa-cat.py — Concatenates all of the MSA files in a directory, for example those spit out by a PHLAWD run.
  • data_miner_3.py — A pipeline that gets a directory full of PhyLoTA FASTA files ready for a supermatrix tree estimate. It keeps one sequence per species per locus, writes the GenBank accession number of each retained sequence to a file, aligns the sequences with MAFFT, and purges hypervariable regions with Gblocks. We usually use PHLAWD for phyloinformatics, but occasionally we have problems with PHLAWD that we can’t figure out. When that happens, we may use this script.
  • get_best_exemplars.py — You can constrain a PHLAWD run to work on a list of target species. But what if you want to estimate relationships among a set of target genera? This script queries a list of genus names against PhyLoTA to find the species that is the best exemplar (has the most phylogenetically informative data) for each genus.
  • labeler.py — Writes data labels for specimens in an insect collection.
  • systers.py — Performs sister-clade contrasts. Calculates diversity contrasts with multiple metrics, and evaluates significance with multiple statistical tests.
  • treeky.py — A Python phyloinformatic pipeline. Specify a taxon by name. The script retrieves phylogenetic sequence clusters from PhyLoTA, removes introns from nuclear protein-coding loci, aligns clusters with MAFFT, filters alignments through Gblocks, concatenates alignments, estimates approximately-maximum-likelihood trees with FastTree, and saves the phylogeny as a PDF with nodes color-coded according to SH-like local branch support. For the last step, this R script needs to be in the working directory.
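
To give a feel for what concatenating alignments involves (the job msa-cat.py is described as doing), here is a minimal sketch. It is not the code in msa-cat.py. The function names parse_fasta and concatenate, the in-memory example loci, and the choice to pad missing taxa with gaps are all assumptions made for illustration; a real run would read the FASTA files from a directory.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the taxon name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def concatenate(alignments, gap="-"):
    """Join alignments end to end; a taxon missing from a locus is padded with gaps."""
    # Collect taxon names in order of first appearance across all loci.
    taxa = []
    for aln in alignments:
        for name in aln:
            if name not in taxa:
                taxa.append(name)

    supermatrix = {name: "" for name in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must all have the same length")
        width = lengths.pop()
        for name in taxa:
            supermatrix[name] += aln.get(name, gap * width)
    return supermatrix


# Two toy loci (made-up sequences); a real run would read files from disk.
locus1 = parse_fasta(">Apis_mellifera\nACGT\n>Bombus_terrestris\nAC-T\n")
locus2 = parse_fasta(">Apis_mellifera\nGG\n>Vespa_crabro\nGA\n")

matrix = concatenate([locus1, locus2])
for name, seq in matrix.items():
    print(name, seq)
```

Every taxon ends up with a sequence of the same total length, which is what most tree-estimation programs expect from a supermatrix.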
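
The idea behind ncbitax_guide_tree.py, turning a classification into a tree, can also be sketched in a few lines. This is an illustration under stated assumptions, not the script itself: the lineages are hard-coded here rather than fetched from NCBI, the function name lineages_to_newick is hypothetical, and ranks with a single child are collapsed so that the output contains only the groupings the classification actually implies.

```python
def lineages_to_newick(lineages):
    """Build a Newick string from {tip: [ranks ordered from the root downward]}."""
    # Nest the lineages into a dict-of-dicts; a value of None marks a tip.
    root = {}
    for tip, lineage in lineages.items():
        node = root
        for rank in lineage:
            node = node.setdefault(rank, {})
        node[tip] = None

    def render(name, children):
        if children is None:
            return name
        parts = [render(n, c) for n, c in sorted(children.items(), key=lambda kv: kv[0])]
        if len(parts) == 1:
            # A rank with one child adds no grouping information, so collapse it.
            return parts[0]
        return "(" + ",".join(parts) + ")"

    return render("", root) + ";"


# Hard-coded example lineages (order, family); an assumption standing in for an NCBI lookup.
lineages = {
    "Apis_mellifera": ["Hymenoptera", "Apidae"],
    "Bombus_terrestris": ["Hymenoptera", "Apidae"],
    "Vespa_crabro": ["Hymenoptera", "Vespidae"],
    "Drosophila_melanogaster": ["Diptera", "Drosophilidae"],
}

newick = lineages_to_newick(lineages)
print(newick)
# (Drosophila_melanogaster,((Apis_mellifera,Bombus_terrestris),Vespa_crabro));
```

Taxa that share a family come out as an unresolved clade, so a tree like this constrains only the groupings in the classification and leaves relationships within each group free to be estimated.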