There are two main methods for dealing with XML data, SAX and DOM. DOM parses the XML file into a tree based hierarchy. And while the tree based representation of the data is a very good way to organize the various facets of the data, it does require that you load the entire file into memory. If the file is very large, say a 1GB file from blasting one set of proteins against another, then loading the file into memory can become problematic. SAX on the other hand, treats an XML file as a stream of information. As it reads the file, it comes across opening and closing tags, and uses those identify the 'state' the parser is in using callbacks. Below is an example Python code that looks for tags in a blast file to identify which which data is being read. The BlastHandler class contains a set of call backs 'startElement', 'characters', and 'endElement'. Using these callbacks the program prints out the ID codes of the query and hit (a ID is assumed to be the first set of none-space characters after the '>' sign in the fasta file). It then prints out this information for every hit that it encounters in the file.
Also note the use of the 'BlankEntityHandler' class as the entity handler. XML files will often refer to external schema files that describe their format. If the default entity handler is left intact, then that schema may be downloaded as the file is parsed, causing extra bandwidth usage and computational time. We just return null data to bypass this process.
#!/usr/bin/python
import sys
import re
import xml.sax.saxutils
import xml.sax.xmlreader
import cStringIO
def get_def_id(a):
return (re.compile(r'\s+').split(a))[0]
class BlankEntityHandler( xml.sax.handler.EntityResolver):
def resolveEntity( self, publicId, systemId):
return cStringIO.StringIO()
class BlastHandler(xml.sax.handler.ContentHandler):
def __init__(self):
self.inRead = 0
self.buffer = ""
def startElement(self, name, attributes):
if name == "Hsp":
""
elif name == "Iteration_query-def":
self.inRead = 1
self.buffer = ""
elif name == "Iteration_query-len":
self.inRead = 1
self.buffer = ""
elif name == "BlastOutput_query-def":
self.inRead = 1
self.buffer = ""
elif name == "BlastOutput_query-len":
self.inRead = 1
self.buffer = ""
elif name == "Hit_def":
self.inRead = 1
self.buffer = ""
elif name == "Hsp_evalue":
self.inRead = 1
self.buffer = ""
elif name == "Hsp_align-len":
self.inRead = 1
self.buffer = ""
def characters(self, data):
if self.inRead:
self.buffer += data
def endElement(self, name):
if name == "Iteration_query-def":
self.inRead = 0
self.curQueryDef = self.buffer
elif name == "Iteration_query-len":
self.inRead = 0
self.curQueryLen = int(self.buffer)
elif name == "BlastOutput_query-def":
self.inRead = 0
self.curQueryDef = self.buffer
elif name == "BlastOutput_query-len":
self.inRead = 0
self.curQueryLen = int(self.buffer)
elif name == "Hit_def":
self.inRead = 0
self.curHitDef = self.buffer
elif name == "Hsp_evalue":
self.inRead = 0
self.curEvalue = self.buffer
elif name == "Hsp_align-len":
self.inRead = 0
self.curAlignLen = int(self.buffer)
elif name == "Hsp":
if ( self.curAlignLen >= float(self.curQueryLen) * 0.6 ):
print get_def_id(self.curQueryDef),\
get_def_id(self.curHitDef), \
self.curEvalue, \
self.curAlignLen, \
self.curQueryLen
parser = xml.sax.make_parser( )
parser.setEntityResolver( BlankEntityHandler() )
handler = BlastHandler( )
parser.setContentHandler(handler)
parser.parse( sys.argv[1] )
...Read more
1 comment:
If you are still paying attention to this blog, then hopefully you'll see this.
I have to parse 3GBs of BLAST output but I can't seem to get this script working. The code does not print anything to screen and I tried writing results to file, but it just returns a blank file. I would appreciate some help.
Post a Comment