Monday, April 14, 2008

Python Blast SAX example

For the times when you have very large XML formatted Blast output files (use the -m7 option at run time), you can use a SAX python parser to read out the files.


There are two main methods for dealing with XML data, SAX and DOM. DOM parses the XML file into a tree based hierarchy. And while the tree based representation of the data is a very good way to organize the various facets of the data, it does require that you load the entire file into memory. If the file is very large, say a 1GB file from blasting one set of proteins against another, then loading the file into memory can become problematic. SAX on the other hand, treats an XML file as a stream of information. As it reads the file, it comes across opening and closing tags, and uses those identify the 'state' the parser is in using callbacks. Below is an example Python code that looks for tags in a blast file to identify which which data is being read. The BlastHandler class contains a set of call backs 'startElement', 'characters', and 'endElement'. Using these callbacks the program prints out the ID codes of the query and hit (a ID is assumed to be the first set of none-space characters after the '>' sign in the fasta file). It then prints out this information for every hit that it encounters in the file.
Also note the use of the 'BlankEntityHandler' class as the entity handler. XML files will often refer to external schema files that describe their format. If the default entity handler is left intact, then that schema may be downloaded as the file is parsed, causing extra bandwidth usage and computational time. We just return null data to bypass this process.



#!/usr/bin/python

import sys
import re
import xml.sax.saxutils
import xml.sax.xmlreader
import cStringIO


def get_def_id(a):
return (re.compile(r'\s+').split(a))[0]

class BlankEntityHandler( xml.sax.handler.EntityResolver):
def resolveEntity( self, publicId, systemId):
return cStringIO.StringIO()

class BlastHandler(xml.sax.handler.ContentHandler):
def __init__(self):
self.inRead = 0
self.buffer = ""

def startElement(self, name, attributes):
if name == "Hsp":
""
elif name == "Iteration_query-def":
self.inRead = 1
self.buffer = ""
elif name == "Iteration_query-len":
self.inRead = 1
self.buffer = ""
elif name == "BlastOutput_query-def":
self.inRead = 1
self.buffer = ""
elif name == "BlastOutput_query-len":
self.inRead = 1
self.buffer = ""
elif name == "Hit_def":
self.inRead = 1
self.buffer = ""
elif name == "Hsp_evalue":
self.inRead = 1
self.buffer = ""
elif name == "Hsp_align-len":
self.inRead = 1
self.buffer = ""

def characters(self, data):
if self.inRead:
self.buffer += data

def endElement(self, name):
if name == "Iteration_query-def":
self.inRead = 0
self.curQueryDef = self.buffer
elif name == "Iteration_query-len":
self.inRead = 0
self.curQueryLen = int(self.buffer)
elif name == "BlastOutput_query-def":
self.inRead = 0
self.curQueryDef = self.buffer
elif name == "BlastOutput_query-len":
self.inRead = 0
self.curQueryLen = int(self.buffer)
elif name == "Hit_def":
self.inRead = 0
self.curHitDef = self.buffer
elif name == "Hsp_evalue":
self.inRead = 0
self.curEvalue = self.buffer
elif name == "Hsp_align-len":
self.inRead = 0
self.curAlignLen = int(self.buffer)
elif name == "Hsp":
if ( self.curAlignLen >= float(self.curQueryLen) * 0.6 ):
print get_def_id(self.curQueryDef),\
get_def_id(self.curHitDef), \
self.curEvalue, \
self.curAlignLen, \
self.curQueryLen

parser = xml.sax.make_parser( )
parser.setEntityResolver( BlankEntityHandler() )
handler = BlastHandler( )
parser.setContentHandler(handler)
parser.parse( sys.argv[1] )



...Read more

Wednesday, January 09, 2008

Hacking your DNA

Drew Endy's talk about engineering DNA. Related to the work at The BioBricks Foundation



Also covered at:

Hackaday
i09

From Cracked: The 5 Current Genetic Experiments Most Likely to Destroy Humanity

Just in case you aren't paranoid enough, Cracked has a list of
"The 5 Current Genetic Experiments Most Likely to Destroy Humanity"