Tuesday, July 28, 2009

More Jython and BioJava

Continuing the exploration of how it use BioJava with Jython I've written up an example that takes a Genpept file an does alignments (both NeedlemanWunsch and SmithWaterman) against every Coding feature in a Genebank file.



Things to note:

  • CDS sequences are kept in the Annotations of Features. If you read the straight sequence from the Genebank file, you will get the DNA of the chromosome.
  • The BLOSUM62 matrix is stored in a string, so no extra file needed to load the substitution scores
  • The 'PROTEIN-TERM' alphabet is the 20 amino acids and the TERMinal (stop) symbol. If you use the regular 'PROTEIN' alphabet BioJava will throw an exception when creating the BLOSUM62 Substitution matrix.



#!/usr/bin/env jython

import sys
from java.io import *
from java.util import *
from org.biojava.bio import *
from org.biojava.bio.seq import *
from org.biojava.bio.seq.io import *
from org.biojava.bio.alignment import NeedlemanWunsch
from org.biojava.bio.alignment import SequenceAlignment
from org.biojava.bio.alignment import SmithWaterman
from org.biojava.bio.alignment import SubstitutionMatrix
from org.biojava.bio.symbol import AlphabetManager
from org.biojava.bio.symbol import FiniteAlphabet


BLOSUM62 = """# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
"""

if __name__ == '__main__':
alphabet = AlphabetManager.alphabetForName("PROTEIN-TERM")
matrix = SubstitutionMatrix(alphabet, BLOSUM62, "BLOSUM62" )

nw_aligner = NeedlemanWunsch(
0, #// match
3, #// replace
2, #// insert
2, #// delete
1, #// gapExtend
matrix #// SubstitutionMatrix
)
sw_aligner = SmithWaterman(
0, #// match
3, #// replace
2, #// insert
2, #// delete
1, #// gapExtend
matrix #// SubstitutionMatrix
)
br1 = BufferedReader(FileReader( sys.argv[1] ) )
seqs1 = SeqIOTools.readGenpept(br1)
while ( seqs1.hasNext() ):
query = seqs1.nextSequence()
ff = FeatureFilter.ByType("CDS")
br2 = BufferedReader(FileReader( sys.argv[2]) )
seqs2 = SeqIOTools.readGenbank(br2)
while(seqs2.hasNext()):
seq = seqs2.nextSequence()
fh = seq.filter(ff)
feats = fh.features()
while ( feats.hasNext() ):
feat = feats.next()
fAnno = feat.getAnnotation()
locus = fAnno.getProperty('locus_tag')
target = ProteinTools.createProteinSequence( fAnno.getProperty('translation'), locus )
nw_score = nw_aligner.pairwiseAlignment( query, target )
sw_score = sw_aligner.pairwiseAlignment( query, target )
print "%s - %s Needleman-Wunsch: %d Smith-Waterman: %d" % (query.getName(), target.getName(), nw_score, sw_score)


...Read more

BioJava and Jython

Jython is an exciting project to implement a Python parser and environment in Java. One of it's more exciting aspects is the ability to directly call existing Java classes with no need to write an additional wrapper layer (as compared to the situation with CPython and existing C libraries). This means the complex libraries, like BioJava, can be easily utilized in a scripting environment.
The cookbook example of opening and reading a Genbank file can easily be translated into Jython:

#!/usr/bin/env jython

import sys
from java.io import *
from java.util import *
from org.biojava.bio import *
from org.biojava.bio.seq.db import *
from org.biojava.bio.seq.io import *
from org.biojava.bio.symbol import *

if __name__ == "__main__":
br = BufferedReader( FileReader( sys.argv[1] ) )
sequences = SeqIOTools.readGenbank(br)
while sequences.hasNext():
seq = sequences.nextSequence()
print seq.seqString()

Wednesday, July 08, 2009

The complexity of Life

Three teams of scientists published three massive studies in Nature on the genes behind schizophrenia. They scanned thousands of people to find variants of genes that tended to show up more in people with schizophrenia than in those without it. And they found a heap of genes. There are thousands of different variants that each may raise your risk of schizophrenia by a tiny amount.


Via Discovery Magazine

NIH Expands Human Microbiome Project

The Human Microbiome Project has awarded more than $42 million to expand its exploration of how the trillions of microscopic organisms that live in or on our bodies affect our health, the National Institutes of Health (NIH) announced today. (June 23rd 2009)


From the NIH

Tuesday, February 24, 2009

UNIPOP: A universal operon predictor for prokaryotic genomes

The Journal of Bioinformatics and Computational Biology has published on article on a tool called UNIPOP. The operon prediction tool uses graph theory to figure out operons by mapping areas of chromosomes that experienced less shuffling between species. Because it's not a machine learning based method, there is no retraining that has to be done for different types of organisms. But it does require multiple related genome to detect signals.

You can find the source code and results on their website.  Input is the ppt files available from NCBI (for example the all.ptt.gz file for bacterial genomes).


Abstract:
Identification of operons at the genome scale of prokaryotic organisms represents a key step in deciphering of their transcriptional regulation machinery, biological pathways and networks. While numerous computational methods have been shown to be effective in predicting operons for well-studied organisms such as Escherichia coli (E. coli) K12 and Bacillus subtilis (B. subtilis) 168, these methods generally do not generalize well to genomes other than the ones used to train the methods because they rely heavily on organism-specific information. Several methods have been explored to address this problem through utilizing only genomic structural information conserved across multiple organisms, but they all suffer from the issue of low prediction sensitivity. In this paper, we report a novel operon prediction method that is applicable to any prokaryotic genome with accurate prediction accuracy. The key idea of the method is to predict operons through identification of conserved gene clusters across multiple genomes and through deriving a key parameter relevant to the distribution of intergenic distances in genomes. We have implemented this method using a graph-theoretic approach, called a maximum cardinality bipartite matching algorithm. Computational results have shown that our method has higher prediction sensitivity as well as specificity than any published method. We have carried out a preliminary study on operons unique to archaea and bacteria, respectively, and derived a number of interesting new insights about operons between these two kingdoms. The software and predicted operons of 365 prokaryotic genomes are available at http://csbl.bmb.uga.edu/~dongsheng/UNIPOP.

...Read more

Monday, February 23, 2009

Nanotech based DNA sequencing to lower costs

In the new issue of Nature, Oxford Nanopore describes their new DNA sequencing technology. It doesn't require fluorescent labeling, and achieves 99.8% accuracy. They say that this technology 'could' reduce costs and speed up sequencing (that is probably dependent on how the manufacturing details work out). They claim 50 base pair per second per pore, but I haven't yet seen what they estimate maximum read length would be, or how many pores per chip they expect to make. No word yet on time to market.

You can check out the corporate web site at http://nanoporetech.com/. They are pushing this technology as a possible solution to the personal genomes projects that have been springing up recently.

The technology is a combination of biochemical components and nanotechnology. You can find a nice overview animation at YouTube.



The full animation with no voice over:

Wednesday, February 11, 2009

Flash Molecular Biology Games

Sometimes you need a little fun. If you are looking for a way to kill some time, but want to be able to justify it as 'research', here are some flash games you can find on the web for free.

Microbe Kombat
You move your mouse around to guide you microbial cells to proteins that can be eaten. Enemy cells inhabit the same space and compete for the limited food source. Cute graphics/Nice Music.
Presentation: A
Fun: B
Science: A

Microbe Arena
This is a simple round based game where you customize your character and let him fight. Basically you adjust a bunch of sliders and then hit the 'fight' button. Less of a game, and more of a toy.
Presentation: B
Fun: D
Science: C-

Microbe War
Not really biology based. You are a little ship that flies around and shoots a gun. You have a small armada that goes with you.
Presentation: B
Fun: B-
Science: F


Winner: Microbe Kombat

Is Biology the future of computing?

Computer Science is a form of applied mathematics.  All the etched silicon and electricity is simply the most convenient form to express those ideas, for the time being.  Who's to say that in the future that computers won't take more inspiration from biological sources.  Speaking at the 2009 International Solid-State Circuits Conference, Intel Fellow Mark Bohr spoke about the possibility of integrated circuit design taking design queues from neuron design.

You can read more at Venture Beat

Tuesday, February 10, 2009

Large Scale Phylogenetic Rendering


The New York Times is running an article, "Crunching the Data for the Tree of Life":
For years now researchers have sequenced DNA from thousands of species from jungles, tundras and museum drawers. They have used supercomputers to crunch the genetic data and have gleaned clues to how today ’s diversity of species evolved over the past 450 million years. There’s just one problem. They have no way to visualize it...

The article mainly concerns itself with large scale phylogenetic analysis and rendering. The kind of rendering you would do with ATV or PAUP is for small sub branches of the total tree of life. This article analogizes the goals of these programs with Google Earth, programs that can quickly deal with large scale data sets.

One of the programs mentioned is Paloverde program from UC Davis(The site mentioned in the paper seemd to be down, but you can find a Arizona mirror to download the program).  They provide a compiled binary for Mac OSX.

Another program for this type of large scale rendering is Phlyo3D that works in conjunction with Walrus, a Java3D based graph rendering platform.

Monday, February 09, 2009

Careers in Computational Biology

If you are thinking about a future working in the field of Bioinformatics and Computational Biology, Nature has an article on Careers in Systems Biology.  They interviewed Malcolm Young, CEO of e-Therapeutics, and Hiroaki Kitano, Director of Sony Computer Science Laboratories, and President of the Systems Biology Institute, Tokyo, Japan.