This document describes how to set up a MySQL version of the NCBI taxxonomy tree so that it can be queried by higher taxa. This could all be automated using a script, but here are the manual steps.
Download the taxonomy from the NCBI FTP server (ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz).
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Unpack it:
gunzip taxdump.tar.gz tar -xvf taxdump.tar
Load the schema into a MySQL database.
# CocoaMySQL dump # Version 0.5 # http://cocoamysql.sourceforge.net # # Host: localhost (MySQL 4.0.21 Complete MySQL by Server Logistics) # Database: ncbi # Generation Time: 2006-03-24 13:20:36 +0000 # ************************************************************ # Dump of table names # ------------------------------------------------------------ DROP TABLE IF EXISTS `ncbi_names`; CREATE TABLE `ncbi_names` ( `tax_id` mediumint(11) unsigned NOT NULL default '0', `name_txt` varchar(255) NOT NULL default '', `unique_name` varchar(255) default NULL, `name_class` varchar(32) NOT NULL default '', KEY `tax_id` (`tax_id`), KEY `name_class` (`name_class`), KEY `name_txt` (`name_txt`) ) TYPE=MyISAM; # Dump of table nodes # ------------------------------------------------------------ DROP TABLE IF EXISTS `ncbi_nodes`; CREATE TABLE `ncbi_nodes` ( `tax_id` mediumint(11) unsigned NOT NULL default '0', `parent_tax_id` mediumint(8) unsigned NOT NULL default '0', `rank` varchar(32) default NULL, `embl_code` varchar(16) default NULL, `division_id` smallint(6) NOT NULL default '0', `inherited_div_flag` tinyint(4) NOT NULL default '0', `genetic_code_id` smallint(6) NOT NULL default '0', `inherited_GC_flag` tinyint(4) NOT NULL default '0', `mitochondrial_genetic_code_id` smallint(4) NOT NULL default '0', `inherited_MGC_flag` tinyint(4) NOT NULL default '0', `GenBank_hidden_flag` smallint(4) NOT NULL default '0', `hidden_subtree_root_flag` tinyint(4) NOT NULL default '0', `comments` varchar(255) default NULL, PRIMARY KEY (`tax_id`), KEY `parent_tax_id` (`parent_tax_id`) ) TYPE=MyISAM;
We now load the node and name dumps into our database. The trick is to do this from the command line, and make sure MySQL can access the directory the *.dmp files are in.
LOAD DATA INFILE '/Users/rpage/names.dmp' INTO TABLE ncbi_names FIELDS TERMINATED BY '\t|\t' LINES TERMINATED BY '\t|\n' (tax_id, name_txt, unique_name, name_class);
On my G4 iBook this took just under 2 minutes for the NCBI taxonomy downloaded 24 March 2006.
Now load the nodes:
LOAD DATA INFILE '/Users/rpage/nodes.dmp' INTO TABLE ncbi_nodes FIELDS TERMINATED BY '\t|\t' LINES TERMINATED BY '\t|\n' (tax_id, parent_tax_id,rank,embl_code,division_id,inherited_div_flag,genetic_code_id,inherited_GC_flag, mitochondrial_genetic_code_id,inherited_MGC_flag,GenBank_hidden_flag,hidden_subtree_root_flag,comments);
This is a lot quicker (about 22 seconds) on my machine.
We could do this in SQL, but I prefer to do this externally using scripts. The first step is to run a Perl script ncbi2gml.pl that generates a GML file of the complete NCBI tree, with nodes labelled by tax_id. GML is graph file format supported by graph libraries such as LEDA and GTL, and by programs such as the yEd graph editor. I use GTL in my code. Unfortunately GTL isn't free software nor open source, but is available free of charge for academics.
#!/usr/bin/perl
#
# Output NCBI tree as GML file with labelled nodes
#
use strict;
use warnings;
use DBI;
my $use_taxid = 1; # if 1 we output taxids, if 0 we use names
my $dbh = DBI->connect("DBI:mysql:ncbitree", , );
print "graph [\n";
print "comment \"Graph of NCBI taxonomy\"\n";
print "directed 1\n";
my $sth=$dbh->prepare(
"SELECT ncbi_nodes.tax_id, ncbi_names.name_txt " .
"FROM nodes " .
"INNER JOIN names ON ncbi_nodes.tax_id = ncbi_names.tax_id " .
"WHERE ncbi_names.name_class=\"scientific name\""
);
$sth->execute();
my @id;
my $count = 0;
while(my @ary=$sth->fetchrow_array())
{
$id[$ary[0]] = $count;
print "node [ id $count label \"";
if ($use_taxid eq '1')
{
print $ary[0];
}
else
{
print $ary[1];
}
print "\"]\n";
$count++;
}
$sth=$dbh->prepare("SELECT tax_id, parent_tax_id FROM ncbi_nodes WHERE (tax_id <> '1')");
$sth->execute();
while(my @ary=$sth->fetchrow_array())
{
print "edge [ source ", $id[$ary[1]], " target ", $id[$ary[0]], " ]\n";
}
print "]\n";
$sth->finish();
$dbh->disconnect;
Then we compute visitation numbers and path strings. I do this using a C++ program gml2nestedsql (get source). This produces a SQL file tree.sql that can be added to a MySQL database that has the ncbi_tree table defined above.
The visitation numbers model the tree as nested sets (see Aaron Mackey's O'Reilly article for background).
We can now load the file tree.sql into a MySQL database if that database has a table like this:
# Dump of table tree # ------------------------------------------------------------ DROP TABLE IF EXISTS `ncbi_tree`; CREATE TABLE `ncbi_tree` ( `tax_id` mediumint(9) NOT NULL default '0', `parent_id` mediumint(9) default NULL, `left_id` mediumint(9) NOT NULL default '0', `right_id` mediumint(9) NOT NULL default '0', `path` varchar(255) NOT NULL default '', PRIMARY KEY (`tax_id`), UNIQUE KEY `left_id` (`left_id`), UNIQUE KEY `right_id` (`right_id`), UNIQUE KEY `path` (`path`), KEY `parent_id` (`parent_id`) ) TYPE=MyISAM;
Assuming at least one other table has NCBI tax_ids, we can then use our tree to query that table using the NCBI taxonomy.