Currently, I have developed a Perl program that prompts the user for a GenBank, EMBL, FASTA, or IntelliGenetics file name and then predicts the global folding energy of the resulting RNA molecule by invoking Mfold on the sequence. To continue my research, I plan to determine the global folding energy of a sequence, take a 100-nucleotide section of that sequence, and use a different program called Rfold to calculate the local base-pairing probabilities. Next, I will make synonymous substitutions based on a designated threshold, substituting high-probability positions with low-probability positions to maximize the folding energy. Then, I will put this "improved" section back into the sequence and run Mfold again to determine its effect on the global folding energy.
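The synonymous-substitution step depends on knowing which codons encode the same amino acid. Below is a minimal sketch of that lookup, written in Python rather than the Perl used in this project and assuming the standard genetic code. The names `CODON_TABLE`, `translate`, and `synonymous_codons` are hypothetical helpers for illustration. The sketch does not model the Rfold probability threshold or the choice of which alternative to use.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (third base varies fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def normalize(seq):
    """Upper-case the sequence and treat RNA 'U' as DNA 'T'."""
    return seq.upper().replace("U", "T")


def translate(seq):
    """Translate a DNA or RNA sequence codon by codon, ignoring a trailing partial codon."""
    seq = normalize(seq)
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def synonymous_codons(codon):
    """Return the other codons that encode the same amino acid, sorted alphabetically."""
    codon = normalize(codon)
    amino_acid = CODON_TABLE[codon]
    return sorted(
        other for other, aa in CODON_TABLE.items()
        if aa == amino_acid and other != codon
    )
```

A codon with no synonymous alternatives, such as ATG (methionine), returns an empty list, so those positions cannot be changed without altering the protein.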
This week I began developing a Perl script designed to automatically calculate the local and global folding energies of RNA. I have installed and configured both CPAN and BioPerl. However, I encountered several problems trying to incorporate the Bio::Tools::Run::PiseApplication::mfold module: it is an obsolete part of the BioPerl package and is no longer included in or supported by BioPerl. After several failed installation attempts and unsuccessful Perl scripts, I decided to explore alternatives and am currently researching other BioPerl modules in order to create a script that can simulate RNA folding.
This week I have been working on installing and understanding a collection of Perl modules called "BioPerl" to help me design a program that can calculate the local and global folding energies of RNA. One class of particular interest to me is "Bio::Tools::Run::PiseApplication::mfold", which will hopefully make it possible to run Mfold from the Perl script. Another module I have been looking at is "Bio::Align::DNAStatistics", which can calculate several useful statistics about an input sequence, such as the number of synonymous and non-synonymous mutations. This module could be very useful in developing a function that generates synonymous mutations.
I plan to continue researching BioPerl in search of additional useful modules and to begin constructing the program itself to determine exactly which processes are needed.
This past week, I have been examining how well we can predict the total folding energy of RNA. I have been using a program called Mfold, which predicts the secondary structure of RNA using thermodynamic methods. I began by folding an entire RNA molecule (Rattus rattus CCR5 gene, CCR5-G allele cds) and recording the total folding energy. Then, I broke the sequence into smaller sequences, each 30 bp long (30-mers), and determined their individual folding energies. I recorded my results in a Microsoft Excel spreadsheet to help identify how well the global energy correlates with the local energy.
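Splitting a sequence into 30-mers can be sketched as follows. This is an illustration in Python rather than the Perl used in this project. The function name `windows` and its `step` parameter are hypothetical. It assumes non-overlapping windows by default and drops any trailing fragment shorter than `k`, since the post does not say how the remainder was handled.

```python
def windows(seq, k=30, step=None):
    """Return (start, subsequence) pairs for every full-length window of size k.

    step defaults to k (non-overlapping windows); step=1 gives a sliding window.
    A trailing fragment shorter than k is dropped (an assumption of this sketch).
    """
    if step is None:
        step = k
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive")
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


if __name__ == "__main__":
    # Each window would then be passed to the folding program on its own.
    for start, sub in windows("ACGUACGUACGU", k=4):
        print(start, sub)
```

Keeping the start offset with each window makes it easier to line up the local energies with positions in the full sequence when comparing them against the global energy.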
Because these calculations are tedious and cumbersome, I plan to develop a Perl program that can automate this procedure and output the local energies of the k-mers of a given RNA sequence. Moreover, I plan to modify the RNA sequences by making synonymous changes and examining how these changes affect both the local and global folding energies. I have also continued practicing programming in Perl by creating a program that takes either a DNA or RNA sequence and converts it to its amino acid translation. However, I have found that there is a set of Perl tools called "BioPerl" designed to aid in the development of biological programs like these. In the following weeks I plan to better familiarize myself with BioPerl in the hope that it can be of some use in our research.
This week, I learned more about the specific topics we will be researching. To supplement my background in biology, I read Sean Eddy’s paper, “How do RNA folding algorithms work?”, which was published in November 2004 in Nature Biotechnology, as well as abstracts from several other papers about RNA folding prediction. I also read a report by a previous student who did research closely related to ours.
Additionally, because we will be working with the programming language Perl for much of our research, I became more familiar with its syntax and semantics. To accomplish this, I created a program that opens any specified FASTA file, concatenates the (DNA/RNA/protein) sequence of nucleotides or amino acids, and extracts substrings of length 5. The substrings are then stored in a hash along with the number of times they appear in the sequence. The output is a list of each substring along with the number of times it appears in the sequence. It also lists any unique substrings that were found. To test my program, I downloaded several complete (mainly viral) and partial genomes from the NCBI (National Center for Biotechnology Information) and found that the output was consistent with what I expected.
This blog will be used to document my undergraduate research on Algorithms and Tools for Synthetic Gene Design at The College of New Jersey.
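The substring-counting program described in the post above could look roughly like this. It is a sketch in Python rather than the original Perl, with a `Counter` standing in for the Perl hash. The function names are hypothetical, and "unique" is assumed to mean a substring that appears exactly once.

```python
from collections import Counter


def read_fasta_sequence(lines):
    """Concatenate all sequence lines, skipping '>' header lines and blank lines."""
    return "".join(
        line.strip() for line in lines
        if line.strip() and not line.startswith(">")
    )


def count_kmers(seq, k=5):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_kmers(counts):
    """Return the substrings seen exactly once (assumed meaning of 'unique')."""
    return sorted(kmer for kmer, n in counts.items() if n == 1)


if __name__ == "__main__":
    # An in-memory example; an open file handle can be passed the same way.
    example = [">example\n", "ACGTA\n", "CGTAC\n"]
    counts = count_kmers(read_fasta_sequence(example))
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
    print("unique:", unique_kmers(counts))
```

Because `read_fasta_sequence` accepts any iterable of lines, the same code works on an opened FASTA file without changes.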
Jay Villari
Undergraduate student studying Computer Science and Biology at The College of New Jersey
Archives: May 2016