NCBI Taxonomy tree

Roderic Page (r.page@bio.gla.ac.uk)

This document describes how to set up a MySQL version of the NCBI taxxonomy tree so that it can be queried by higher taxa. This could all be automated using a script, but here are the manual steps.

NCBI Taxonomy

Get the NCBI Taxonomy

Download the taxonomy from the NCBI FTP server (ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz).

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Unpack it:

gunzip taxdump.tar.gz
tar -xvf taxdump.tar

Create MySQL database

Load the schema into a MySQL database.

# CocoaMySQL dump
# Version 0.5
# http://cocoamysql.sourceforge.net
#
# Host: localhost (MySQL 4.0.21 Complete MySQL by Server Logistics)
# Database: ncbi
# Generation Time: 2006-03-24 13:20:36 +0000
# ************************************************************

# Dump of table names
# ------------------------------------------------------------

DROP TABLE IF EXISTS `ncbi_names`;

CREATE TABLE `ncbi_names` (
  `tax_id` mediumint(11) unsigned NOT NULL default '0',
  `name_txt` varchar(255) NOT NULL default '',
  `unique_name` varchar(255) default NULL,
  `name_class` varchar(32) NOT NULL default '',
  KEY `tax_id` (`tax_id`),
  KEY `name_class` (`name_class`),
  KEY `name_txt` (`name_txt`)
) TYPE=MyISAM;



# Dump of table nodes
# ------------------------------------------------------------

DROP TABLE IF EXISTS `ncbi_nodes`;

CREATE TABLE `ncbi_nodes` (
  `tax_id` mediumint(11) unsigned NOT NULL default '0',
  `parent_tax_id` mediumint(8) unsigned NOT NULL default '0',
  `rank` varchar(32) default NULL,
  `embl_code` varchar(16) default NULL,
  `division_id` smallint(6) NOT NULL default '0',
  `inherited_div_flag` tinyint(4) NOT NULL default '0',
  `genetic_code_id` smallint(6) NOT NULL default '0',
  `inherited_GC_flag` tinyint(4) NOT NULL default '0',
  `mitochondrial_genetic_code_id` smallint(4) NOT NULL default '0',
  `inherited_MGC_flag` tinyint(4) NOT NULL default '0',
  `GenBank_hidden_flag` smallint(4) NOT NULL default '0',
  `hidden_subtree_root_flag` tinyint(4) NOT NULL default '0',
  `comments` varchar(255) default NULL,
  PRIMARY KEY  (`tax_id`),
  KEY `parent_tax_id` (`parent_tax_id`)
) TYPE=MyISAM;


Populate database

We now load the node and name dumps into our database. The trick is to do this from the command line, and make sure MySQL can access the directory the *.dmp files are in.

LOAD DATA INFILE '/Users/rpage/names.dmp' 
INTO TABLE ncbi_names 
FIELDS TERMINATED BY '\t|\t' 
LINES TERMINATED BY '\t|\n' 
(tax_id, name_txt, unique_name, name_class);

On my G4 iBook this took just under 2 minutes for the NCBI taxonomy downloaded 24 March 2006.

Now load the nodes:

LOAD DATA INFILE '/Users/rpage/nodes.dmp' 
INTO TABLE ncbi_nodes 
FIELDS TERMINATED BY '\t|\t' 
LINES TERMINATED BY '\t|\n' 
(tax_id, parent_tax_id,rank,embl_code,division_id,inherited_div_flag,genetic_code_id,inherited_GC_flag,
mitochondrial_genetic_code_id,inherited_MGC_flag,GenBank_hidden_flag,hidden_subtree_root_flag,comments);

This is a lot quicker (about 22 seconds) on my machine.

Generate tree

We could do this in SQL, but I prefer to do this externally using scripts. The first step is to run a Perl script ncbi2gml.pl that generates a GML file of the complete NCBI tree, with nodes labelled by tax_id. GML is graph file format supported by graph libraries such as LEDA and GTL, and by programs such as the yEd graph editor. I use GTL in my code. Unfortunately GTL isn't free software nor open source, but is available free of charge for academics.

#!/usr/bin/perl

#
# Output NCBI tree as GML file with labelled nodes
#

use strict;
use warnings;

use DBI;

my $use_taxid = 1; # if 1 we output taxids, if 0 we use names

my $dbh = DBI->connect("DBI:mysql:ncbitree", , );


print "graph [\n";
print "comment \"Graph of NCBI taxonomy\"\n";
print "directed 1\n";

my $sth=$dbh->prepare(
	"SELECT ncbi_nodes.tax_id, ncbi_names.name_txt " .
	"FROM nodes " .
	"INNER JOIN names ON ncbi_nodes.tax_id = ncbi_names.tax_id " .
	"WHERE ncbi_names.name_class=\"scientific name\""
	);
	
$sth->execute();

my @id;
my $count = 0;

while(my @ary=$sth->fetchrow_array())
{
	$id[$ary[0]] = $count;
	print "node [ id $count  label \"";
	
	if ($use_taxid eq '1')
	{
		print $ary[0];
	}
	else
	{
		print $ary[1];
	}
	
	print "\"]\n";
	
	$count++;
}

$sth=$dbh->prepare("SELECT tax_id, parent_tax_id FROM ncbi_nodes WHERE (tax_id <> '1')");
$sth->execute();


while(my @ary=$sth->fetchrow_array())
{
	print "edge [ source ", $id[$ary[1]], " target ", $id[$ary[0]], " ]\n";
}

print "]\n";

$sth->finish();
$dbh->disconnect;

Then we compute visitation numbers and path strings. I do this using a C++ program gml2nestedsql (get source). This produces a SQL file tree.sql that can be added to a MySQL database that has the ncbi_tree table defined above.

The visitation numbers model the tree as nested sets (see Aaron Mackey's O'Reilly article for background).

We can now load the file tree.sql into a MySQL database if that database has a table like this:

# Dump of table tree
# ------------------------------------------------------------

DROP TABLE IF EXISTS `ncbi_tree`;

CREATE TABLE `ncbi_tree` (
  `tax_id` mediumint(9) NOT NULL default '0',
  `parent_id` mediumint(9) default NULL,
  `left_id` mediumint(9) NOT NULL default '0',
  `right_id` mediumint(9) NOT NULL default '0',
  `path` varchar(255) NOT NULL default '',
  PRIMARY KEY  (`tax_id`),
  UNIQUE KEY `left_id` (`left_id`),
  UNIQUE KEY `right_id` (`right_id`),
  UNIQUE KEY `path` (`path`),
  KEY `parent_id` (`parent_id`)
) TYPE=MyISAM;

Assuming at least one other table has NCBI tax_ids, we can then use our tree to query that table using the NCBI taxonomy.