Annotation
Team leader: Wei Fan
Research Summary
Our group focuses on genome annotation: identifying the structural and functional elements of a genome, and integrating and displaying this information at the genomic level. The work consists of four main parts: (1) identifying protein-coding genes with an automatic pipeline that combines evidence from different methods, such as de novo prediction, cDNA/EST mapping, and homologous protein alignment; (2) assigning functional descriptions to genes by searching against known databases (such as NT/NR and SwissProt/TrEMBL), predicting domains, and inferring ontology terms; (3) identifying ncRNAs such as rRNA, tRNA, miRNA, and other RNA genes by de novo prediction or similarity searching; (4) identifying repeat sequences: tandem repeats with TRF and transposable elements with RepeatMasker. Other analyses, such as regulatory elements and pseudogenes, are under development but not yet mature. All of these methods and software are included in the GACP (Genome Annotation and Comparison Pipeline) project, which has been under continuous development since 2006 and reached version 5.0 in June 2008.
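To illustrate the idea of combining evidence in part (1), here is a minimal Python sketch. It is not GACP code and does not reflect how GACP actually weighs evidence. It assumes a simple rule: overlapping predictions on the same sequence and strand are merged into a locus, and a locus is kept only if at least two independent evidence types support it. The names Evidence, merge_loci, and supported_loci, the source labels, and the min_sources threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One piece of gene evidence (hypothetical record layout)."""
    seq_id: str   # scaffold or chromosome name
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"
    source: str   # assumed labels: "denovo", "est", "homology"


def merge_loci(evidence):
    """Cluster overlapping evidence on the same sequence and strand into loci."""
    loci = []
    ordered = sorted(evidence, key=lambda e: (e.seq_id, e.strand, e.start))
    for ev in ordered:
        last = loci[-1] if loci else None
        same_track = (
            last is not None
            and last["seq_id"] == ev.seq_id
            and last["strand"] == ev.strand
        )
        if same_track and ev.start <= last["end"]:
            # Overlaps the current locus: extend it and record the source.
            last["end"] = max(last["end"], ev.end)
            last["sources"].add(ev.source)
        else:
            # No overlap: start a new locus.
            loci.append({
                "seq_id": ev.seq_id,
                "strand": ev.strand,
                "start": ev.start,
                "end": ev.end,
                "sources": {ev.source},
            })
    return loci


def supported_loci(evidence, min_sources=2):
    """Keep loci backed by at least min_sources distinct evidence types."""
    return [
        locus for locus in merge_loci(evidence)
        if len(locus["sources"]) >= min_sources
    ]
```

With this rule, a de novo prediction that overlaps an EST alignment on the same strand is retained as one locus, while a prediction supported by a single method is dropped. A real pipeline would work at the exon and splice-site level and weight each evidence type, which this sketch does not attempt.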
Projects and brief description
We have finished annotation projects for a broad range of species, including plants, animals, bacteria, and fungi. In 2004, we published a silkworm paper in Science, for which the gene prediction parameters were specially trained. In 2005, we published a rice paper in PLoS, which used a post-filtering method to remove TE contamination from gene models. In 2007, we published a bacteria paper in the Journal of Bacteriology, which presented a comprehensive annotation and analysis. Projects undertaken but not yet published include improved rice annotation, improved silkworm annotation, and a set of microbes and BACs. With the arrival of the large-scale sequencing era, many genomes will be sequenced and will call for fine annotation in the near future. By the end of 2008, the cucumber genome, the giant panda genome, the tobacco genome, and a new Drosophila genome will be fully sequenced and annotated.