BIOT630, FINAL EXAM
Important to note:
Exam answers submitted after the Due Date specified in the class Syllabus will not be accepted. There are no opportunities for revision after submission for a grade. So, take time to read the exam questions and the accompanying information carefully, and make sure to provide complete answers as asked. Also, this is an exam; please treat it as such. Your answers should be your own. You should not be working cooperatively with anyone to answer the questions. You should not be providing guidance or assistance to anyone asking questions. To do so is an Ethics violation, which will warrant an Ethics Committee review and corresponding action. My recommendation is to not take the chance, not take the risk, and complete the exam as if you were in class and the examination process were being directly monitored. If you are approached or asked to work cooperatively on the exam in any way, asked to provide guidance, and/or asked to provide or share answers, please inform me immediately.
You are an aspiring bioinformatics researcher on vacation. While taking a hike you come across an interesting plant. You notice that when you position yourself between it and the light coming from the sun, it appears to change color from green to purple. This fascinates you enough to uproot part of the plant, pot it, and take it back to your lab.
Back at the lab, you perform a literature search to see if any such plant has been previously registered, studied, or reported. To your surprise, it has not. You think to yourself, what a great opportunity to characterize something new!
You contemplate what to do next.
You think to yourself, the first logical step may be to sequence the genome of the plant.
Question Series 1 (3 points)
Describe the genomic sequencing strategies that you could possibly employ. Be sure to describe the strategies to the extent that reflects your understanding of them.
What classic DNA sequencing or Next Generation sequencing method would you use and why? Discuss the main differences between method types and defend your answer. Be sure to describe the methods to the extent that reflects your understanding of them.
[Hint from Dr. Johnson: Revisit Lecture 2]
After contemplating the approaches and technologies available to sequence a Genome, you realize your lab budget cannot support such a grandiose first step. Darn!
Instead, you next consider sequencing only those genomic sequences that are expressed.
Question Series 2 (1.25 points)
Why would sequencing genomic sequences that are expressed be a good alternative to sequencing the entire genome? What’s the advantage? What’s the disadvantage?
What does EST stand for?
[Hint from Dr. Johnson: Revisit Lecture 2]
Convinced sequencing expressed genes is a good alternative to genomic sequencing, you sample part of the plant while it is in its green state, isolate RNA from it, and construct a cDNA library.
You next send this cDNA library to a third party service provider for DNA sequencing.
Soon after, you start to receive sequence back from the provider, but notice the sequences are not in the traditional FASTA format you are used to working with.
One of the longest sequences you receive back is the following:
TITLE Sequence 1258 bases
5 10 15 20 25 30
1 G G C T C T G G A C T G G G G A C A C A G G G A T A G C T G
31 A G C C C C A G C T G G G G G T G G A A G C T G A G C C A G
61 G G A C A G T C A C G G A G G A A C A A G A T C A A G A T G
91 C G C T G T A A C T G A G A A G C C C C C A A G G C G G A G
121 G C T G A G A A T C A G A G A C A T T T C A G C A G A C A T
151 C T A C A A A T C T G A A A G A C A A A A C A T G G T T C A
181 A G C A T C C G G G C A C A G G C G G T C C A C C C G T G G
211 C T C C A A A A T G G T C T C C T G G T C C G T G A T A G C
241 A A A G A T C C A G G A A A T A C T G C A G A G G A A G A T
271 G G T G C G A G A G T T C C T G G C C G A G T T C A T G A G
301 C A C A T A T G T C A T G A T G G T A T T C G G C C T T G G
331 T T C C G T G G C C C A T A T G G T T C T A A A T A A A A A
361 A T A T G G G A G C T A C C T T G G T G T C A A C T T G G G
391 T T T T G G C T T C G G A G T C A C C A T G G G A G T G C A
421 C G T G G C A G G C C G C A T C T C T G G A G C C C A C A T
451 G A A C G C A G C T G T G A C C T T T G C T A A C T G T G C
481 G C T G G G C C G C G T G C C C T G G A G G A A G T T T C C
511 G G T C T A T G T G C T G G G G C A G T T C C T G G G C T C
541 C T T C C T G G C G G C T G C C A C C A T C T A C A G T C T
571 C T T C T A C A C G G C C A T T C T C C A C T T T T C G G G
601 T G G A C A G C T G A T G G T G A C C G G T C C C G T C G C
631 T A C A G C T G G C A T T T T T G C C A C C T A C C T T C C
661 T G A T C A C A T G A C A T T G T G G C G G G G C T T C C T
691 G A A T G A G G C G T G G C T G A C C G G G A T G C T C C A
721 G C T G T G T C T C T T C G C C A T C A C G G A C C A G G A
751 G A A C A A C C C A G C A C T G C C A G G A A C A G A G G C
781 G C T G G T G A T A G G C A T C C T C G T G G T C A T C A T
811 C G G G G T G T C C C T T G G C A T G A A C A C A G G A T A
841 T G C C A T C A A C C C G T C C C G G G A C C T G C C C C C
871 C C G C A T C T T C A C C T T C A T T G C T G G T T G G G G
901 C A A A C A G G T C T T C A G C A A T G G G G A G A A C T G
931 G T G G T G G G T G C C A G T G G T G G C A C C A C T T C T
961 G G G T G C C T A T C T A G G T G G C A T C A T C T A C C T
991 G G T C T T C A T T G G C T C C A C C A T C C C A C G G G A
1021 G C C C C T G A A A T T G G A G G A T T C T G T G G C G T A
1051 T G A A G A C C A C G G G A T A A C C G T A T T G C C C A A
1081 G A T G G G A T C T C A T G A A C C C A C G A T C T C T C C
1111 C C T C A C C C C C G T C T C T G T G A G C C C T G C C A A
1141 C A G A T C T T C A G T C C A C C C T G C C C C A C C C T T
1171 A C A T G A A T C C A T G G C C C T A G A G C A C T T C T A
1201 A G C A G A G A T T A T T T G T G A T C C C A T C C A T T C
1231 C C C A A T A A A G C A A G G C T T G T C C G A C A A A
Question Series 3 (1.25 points)
What tool is available for you to interchange sequence file formats if needed?
When you convert this file to FASTA format what do you get? Copy/paste the FASTA format for the sequence as your answer.
[Hint from Dr. Johnson: Revisit Lecture 2]
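For orientation only, the listing above is a position-numbered layout: each sequence row starts with a base position and is followed by space-separated bases, with a title row and a ruler row on top. The following is a minimal Python sketch of how such a layout can be collapsed into FASTA text. It is not the conversion tool discussed in lecture; the function name `numbered_block_to_fasta`, the default header, and the 60-column wrap are assumptions made for illustration.

```python
def numbered_block_to_fasta(text, header="Sequence", width=60):
    """Collapse a position-numbered, space-separated base listing into FASTA text.

    Hypothetical helper for illustration; not part of any sequence toolkit.
    """
    bases = []
    for line in text.splitlines():
        tokens = line.split()
        # Sequence rows (and the ruler row) start with a number;
        # the title row and blank lines do not, so they are skipped.
        if not tokens or not tokens[0].isdigit():
            continue
        # Keep only single-letter nucleotide tokens after the position number.
        # The ruler row contains only numbers, so it contributes nothing.
        bases.extend(
            t.upper() for t in tokens[1:]
            if len(t) == 1 and t.upper() in "ACGTUN"
        )
    seq = "".join(bases)
    rows = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(rows) + "\n"
```

Calling the function on the full listing and comparing the length of the resulting sequence with the base count stated in the title row is a quick sanity check on any conversion, whichever tool is used.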
You decide to explore this particular sequence.
You want to determine whether it is known and which protein sequences in other plants it might be similar to.
You decide to perform a BLAST search against the "Reference Proteins (refseq_protein)" database for "Organism" = "plants (taxid:3193)".
Question Series 4 (4 points)
What is BLAST? Describe the algorithm and discuss what an HSP is.
What is an E-value? How is it determined? What is considered a significant E-value for DNA BLAST Hits vs. PROTEIN BLAST Hits?
What are the differences between, and the relative advantages of, running BLAST using the blastx vs. blastn algorithm?
After running blast on the expressed sequence, how many sequences are returned with a significant E-value?
For the significant results returned, what non-hypothetical non-predicted gene is likely being coded for by the sequence you performed blast on?
For the significant results returned, what is the translation frame that the protein for the gene is likely being generated from?
[Hint from Dr. Johnson: Revisit Lecture 5]
You are pleased with the BLAST results, as there are a good number of sequences found to be similar to your query sequence.
Also, there appears to be a consensus across the BLAST hits as to what the expressed sequence likely is.
When considering the gene identity for the expressed sequence, you realize that it can serve as a "Molecular Clock" to explore what the evolutionary relationships might be between the plant you have discovered and those species represented in the top BLAST hits.
So, you go ahead and do so.
First, you generate the multiple alignment of the top BLAST hits in conjunction with the expressed sequence, without any editing, in FASTA format.
Afterward, you use the multiple alignment to construct a phylogenetic tree for examination using the “Neighbor Joining” algorithm.
Question Series 5 (4 points)
What is meant by "Molecular Clock"? Define, describe.
For the significant BLAST hits returned, what is the top scoring sequence for each of the following species, regardless of whether it is hypothetical or predicted:
To answer, copy/paste a concatenated FASTA list of these sequences. Important requirement: keep the FASTA formatted sequences in the exact order asked. Also, modify the FASTA description line to include only the species identifier (e.g., >Vitis vinifera, >Populus trichocarpa, etc.). Also very important: include, as the first FASTA sequence in your concatenated list, the translated query sequence you used in BLAST, using the most relevant observed "Frame" (Hint: use the EMBOSS transeq tool to translate your query sequence: http://www.ebi.ac.uk/Tools/st/emboss_transeq/)
What is the MSA for the concatenated sequences prepared in FASTA format? To answer, use the multiple sequence alignment tool we used as part of our class Exercises. Important to note: after you copy/paste your sequences into the tool, set the “OUTPUT FORMAT” to “Pearson/FASTA” before you run the tool. Copy/paste the resulting MSA in FASTA format as your answer.
Per question 5.c, what multiple sequence alignment tool did you use?
Why use Neighbor Joining as opposed to UPGMA? What's the difference?
From examining the resulting tree, what species does the plant you discovered appear to be most like? Provide the species name as your answer along with an image of the tree.
Per your answer provided to question 5.f, what distance tool, tree tool, and draw tool did you use?
[Hint from Dr. Johnson: Revisit Lecture 6 & Lecture 7]
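As background for the translation step mentioned in this question series, a "Frame" simply fixes where codon reading starts on the query. The following is a minimal Python sketch of frame-wise translation with the standard genetic code. It is not the EMBOSS transeq implementation; the function names are assumptions, and the negative frames here are just frames 1 to 3 of the reverse complement, which may not match how a given tool numbers its reverse frames.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=1):
    """Translate a nucleotide string starting at forward frame 1, 2, or 3."""
    seq = dna.upper().replace("U", "T")[frame - 1:]
    # Walk complete codons only; a trailing partial codon is dropped.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Codons with ambiguous bases are reported as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Map frame labels (1, 2, 3, -1, -2, -3) to translations.

    Assumption: frame -n is frame n of the reverse complement.
    """
    rc = reverse_complement(dna)
    frames = {f: translate(dna, f) for f in (1, 2, 3)}
    frames.update({-f: translate(rc, f) for f in (1, 2, 3)})
    return frames
```

For example, `translate("ATGGCCTAA", frame=1)` reads the codons ATG, GCC, and TAA. Comparing such output with the frame reported in the BLAST hits is one way to check that the translated query entered into the alignment is the intended one.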
You are pleased that you may have found what the plant is most similar to, but would like to corroborate the findings using a non-distance based method.
Question Series 6 (1.25 points)
What non-distance based methods for tree construction are available? Limit your answer to those methods discussed in the Lecture Slides.
Using one of the methods provided as your answer in 6.a, what species does the plant you discovered appear to be most like? Provide the species name as your answer, along with an image of the tree. Also comment on whether the results corroborate or refute the distance-based results.
[Hint from Dr. Johnson: Revisit Lecture 7]
After receiving all sequences from the third party vendor, you decide that you also want to sequence all the expressed genes when the plant is in the purple color state, as opposed to just the green state.
So you go ahead and do so.
After receiving all the sequences, you decide to deposit them into a public database.
Question Series 7 (0.75 point)
What is one of the (at least 3) databases you could deposit the sequences into such that the other two databases will automatically be populated with the sequences? Restrict your answer to those databases discussed as part of the class Lecture content.
What is the recommended tool for submitting large sets of sequences?
[Hint from Dr. Johnson: Revisit Lecture 2]
Sometime after depositing the sequences, you get a call from a colleague who has looked at the raw number of sequences you submitted vs. unique sequences.
Your colleague goes on to share that they identified one highly overexpressed gene in the purple color state compared to the green color state, and that they used the sequence to locate the position of the expressed gene in the genome.
Your colleague goes on to say they next used "primer walking" to generate a genomic sequence that likely contains the full gene.
Your colleague then asks if you would be willing to take the genomic sequence and predict the full gene and the corresponding protein structure. You of course say yes.
Here is the genomic sequence provided to you by your colleague:
Question Series 8 (5 points)
What is the sequence of the likely gene of interest contained within the genomic sequence provided? Copy/paste the predicted gene sequence in protein FASTA sequence format as your answer and the tool you used.
Per your answer to question 8.a, is the predicted gene coding or non-coding? What tool did you use to answer this question? Copy/paste the output from the tool used in addition to your answers.
Per your answer to question 8.a, what is the predicted protein most similar to? Provide the Gene name and the Species name as part of your answer along with what tool you used to determine the answer.
What is the likely structure for the predicted protein when using a homology-modeling method? Copy/paste the image of the predicted structure as your answer.
Per your answer to question 8.d, what tool did you use?
Per your answer to question 8.d, what homologous template was used to predict the protein structure? Provide the PDB id and name of protein as your answer.
[Hint from Dr. Johnson: Revisit Lecture 8 & Lecture 11]
After having identified the candidate gene and the corresponding protein structure, you share the results with your colleague.
As part of the discussion of results with your colleague, you learn that your colleague is working on another unrelated project that they feel you may be able to help out with.
Specifically, your colleague goes on to explain that they have sequences for non-coding RNAs observed to be overexpressed in a mouse research experiment they conducted.
Your colleague shares with you that the amount of AQP1 protein present across the experimental conditions studied appears to be a critical factor in the outcome, and that they suspect a miRNA to be involved in regulating its translation.
In turn, your colleague asks if you could use your bioinformatics expertise to determine which of the non-coding RNA sequences they have identified, if any, may in fact be regulating AQP1, along with what the secondary structure of the miRNA might be.
You of course say yes.
Here are the non-coding RNA sequences provided to you by your colleague:
>Mouse (mmu) candidate 1
>Mouse (mmu) candidate 2
>Mouse (mmu) candidate 3
>Mouse (mmu) candidate 4
>Mouse (mmu) candidate 5
Question Series 9 (4.5 points)
Using the non-coding RNA sequences provided, what is the top known miRNA search result found for each sequence?
Per your answer to question 9.a, which search tool did you use?
Per your answer to question 9.a, which sequence has a “conserved” binding sequence present in the 3'UTR for AQP1?
Per your answer to question 9.c, which search tool did you use?
Per your answer to question 9.c, what is the "Stem-Loop" structure for the sequence?
Per your answer to question 9.c, what is the "Mature" sequence?
Per your answer to question 9.c, what part of the "Stem Loop" sequence actually binds the UTR for AQP1? Underline the sequence as your answer.
[Hint from Dr. Johnson: Revisit Lecture 10]
You provide the results back to your colleague, and you mutually agree this has been a rewarding collaboration.
You go on to ponder how there are likely more differentially expressed sequences between the green color state and the purple color state than just the one sequence explored.
Your colleague agrees and suggests that the two of you continue working together.
You agree and quickly think that microarrays might be one possible way to proceed.
Question Series 10 (3 points)
Although capable of measuring expression for tens of thousands of genes simultaneously, what is a well understood restriction of microarray technology?
Given that the plant species you have is novel, what would you need to do in order to use this technology? Would you not use it? Or what could you do to make it usable?
If provided with microarrays capable of measuring gene expression in the plant species you have, what are the experiment and analysis steps involved, given samples of the plant collected in the green state and samples collected in the purple state, to determine which genes have differential expression significantly associated with the change in color?
[Hint from Dr. Johnson: Visit Lecture 12 – to be posted next week]
- Submitted On 02 Jun, 2017 01:42:59