This walkthrough illustrates how to apply the GEP annotation strategy for the Pathways Project to construct a gene model for the Ras homolog enriched in brain (Rheb) gene in Drosophila yakuba.
Biological systems are networks, and in these networks, we can define nodes (e.g., genes, proteins, metabolites) connected through edges (e.g., enzymatic/chemical reactions, transcription regulation). Networks have properties that can be measured using a mathematical approach, and we can make predictions about the evolution of a system based on some of those properties.
A “pathway” in a biological system can be defined as a relatively discrete (though never completely isolated) portion of a network. Generally, we view a pathway as a sequence of gene regulatory and enzymatic reactions that produce some important biological outcomes (e.g., synthesize an energy storage molecule, sense and regulate blood sugar levels).
In this project, we will use network analysis approaches to better understand the evolution and function of biological pathways. The Pathways Project focuses on annotating genes in well-characterized signaling and metabolic pathways across the Drosophila genus. The current focus is the insulin signaling pathway, which is well conserved across animals and critical to growth and metabolic homeostasis. The long-term goal of the Pathways Project is to analyze how the regulatory regions of genes evolve in the context of their positions within a network; we anticipate that other pathways will eventually be included in the analyses.
Pathways Project Overview provided by the Project Leader, Laura K. Reed (6 minutes), with an accompanying slide set.
The Annotation Workflow is a one-page summary of the annotation protocol for the Pathways Project.
The Reference Glossary includes definitions for terms that are frequently used in the Pathways Project.
The “Annotation Form” merges the “Annotation Report” and the “Annotation Notebook” into a single document; the latter two items are now archived.
The Annotation Form Exemplar is provided as an example of a completed Annotation Form ready for submission to the GEP’s Pathways Project. The optional questions were omitted from the exemplar.
Students can apply what they learned in the Annotation Walkthrough to construct a gene model for Rheb in D. pseudoobscura by completing the Pathways Project: Annotation Form. This answer key is provided to assist instructors in checking the accuracy of the annotation and includes potential areas of confusion throughout.
Pilot Project Curriculum
This exercise was created after a member reported that their students struggled with the genomic neighborhood, and that the problem went unnoticed until the students were too far into the annotation to correct their misconceptions. It is meant to be a quick in-class or homework assignment.
This module introduces students to the GEP UCSC Genome Browser. After completing this module, students will be able to navigate to a genomic region and control the display settings for different evidence tracks.
This module uses mRNA data to identify splice sites. After completing this module students will be able to identify intron-exon boundaries using canonical splice donor and acceptor sequences and determine which are best supported by RNA-Seq and TopHat splice junction predictions.
In this module students will learn how mRNA is translated into a string of amino acids. After completing this module students will be able to determine the codons for specific amino acids as well as start and stop codons. They will be able to identify open reading frames for a given gene, define the phases of splice donor and acceptor sites and describe how they impact the maintenance of the open reading frame.
This module explores how multiple different mRNAs and polypeptides can be encoded by the same gene. After completing this module students will be able to explain how alternative splicing of a gene can lead to different mRNAs and illustrate how alternative splicing can lead to the production of different polypeptides and result in drastic changes in phenotype.
This walkthrough serves as an introduction to key functionalities of NCBI BLAST.
This PowerPoint presentation provides a brief introduction to the different types of RNA-Seq evidence tracks (e.g., Bowtie, TopHat, Cufflinks) that are on the GEP UCSC Genome Browser.
- Rele, C. P., Sandlin, K. M., Leung, W., & Reed, L. K. (2022). Manual annotation of Drosophila genes: a Genomics Education Partnership protocol [version 1; peer review: 2 approved with reservations]. F1000Research, 11, 1579.
- Mudge, J. M., & Harrow, J. (2016). The state of play in higher eukaryote gene annotation. Nature Reviews Genetics, 17(12), 758-772.
- Weitz, J. S., Benfey, P. N., & Wingreen, N. S. (2007). Evolution, interactions, and biological networks. PLoS biology, 5(1), e11.
- Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., & Barabási, A. L. (2000). The large-scale organization of metabolic networks. Nature, 407(6804), 651-654.
- Alvarez-Ponce, D., Aguadé, M., & Rozas, J. (2009). Network-level molecular evolutionary analysis of the insulin/TOR signal transduction pathway across 12 Drosophila genomes. Genome research, 19(2), 234–242.
- Alvarez-Ponce, D., Guirao-Rico, S., Orengo, D. J., Segarra, C., Rozas, J., & Aguadé, M. (2012). Molecular population genetics of the insulin/TOR signal transduction pathway: a network-level analysis in Drosophila melanogaster. Molecular biology and evolution, 29(1), 123–132.
- Alvarez-Ponce, D., Aguadé, M., & Rozas, J. (2011). Comparative genomics of the vertebrate insulin/TOR signal transduction pathway: a network-level analysis of selective pressures. Genome biology and evolution, 3, 87–101.
- Alvarez-Ponce D. (2012). The relationship between the hierarchical position of proteins in the human signal transduction network and their rate of evolution. BMC evolutionary biology, 12, 192.
- Alvarez-Ponce, D., Feyertag, F., & Chakraborty, S. (2017). Position Matters: Network Centrality Considerably Impacts Rates of Protein Evolution in the Human Protein-Protein Interaction Network. Genome biology and evolution, 9(6), 1742–1756.
- Lynch, M., & Conery, J. S. (2000). The evolutionary fate and consequences of duplicate genes. Science (New York, N.Y.), 290(5494), 1151–1155.
- Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L., & Postlethwait, J. (1999). Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4), 1531–1545.
- Bhutkar, A., Schaeffer, S. W., Russo, S. M., Xu, M., Smith, T. F., & Gelbart, W. M. (2008). Chromosomal rearrangement inferred from comparisons of 12 Drosophila genomes. Genetics, 179(3), 1657–1680.
- Wang, M., Wang, Q., Wang, Z., Wang, Q., Zhang, X., & Pan, Y. (2013). The Molecular Evolutionary Patterns of the Insulin/FOXO Signaling Pathway. Evolutionary bioinformatics online, 9, 1–16.
- Grönke, S., Clarke, D. F., Broughton, S., Andrews, T. D., & Partridge, L. (2010). Molecular evolution and functional characterization of Drosophila insulin-like peptides. PLoS genetics, 6(2), e1000857.
- Brogiolo, W., Stocker, H., Ikeya, T., Rintelen, F., Fernandez, R., & Hafen, E. (2001). An evolutionarily conserved function of the Drosophila insulin receptor and insulin-like peptides in growth control. Current biology : CB, 11(4), 213–221.
In this example, Ilp6 is within the intron of Raf-PE; however, Raf-PA is upstream of Ilp6.
We define gene order based on the first (closest) protein-coding exon only. If a gene is nested in an intron that lies between two non-coding exons, we ignore those UTRs and define gene order based on the coding exons alone. If a gene is nested in an intron between two coding exons of another gene, we describe that as nesting. So in this example, Raf is upstream of Ilp6.
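The coding-exon rule above can be sketched in a few lines of Python. This is only an illustration, not a GEP tool: the function name `relative_position`, the interval representation, and all coordinates are hypothetical. The sketch reports positions along the scaffold ("left" or "right" of the target) and does not model strand, so translating the result into "upstream" or "downstream" is left to the annotator.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the scaffold, inclusive, start <= end


def relative_position(target: Interval, neighbor_cds_exons: List[Interval]) -> str:
    """Classify a neighboring gene relative to the target gene.

    Only the neighbor's protein-coding exons are passed in; UTR-only exons
    are deliberately left out, mirroring the rule described above.
    """
    if not neighbor_cds_exons:
        raise ValueError("the neighbor needs at least one coding exon")

    exons = sorted(neighbor_cds_exons)
    t_start, t_end = target

    # Nesting: the target sits entirely inside an intron that is flanked
    # by two coding exons of the neighbor.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if left_end < t_start and t_end < right_start:
            return "nested"

    coding_start, coding_end = exons[0][0], exons[-1][1]
    if coding_end < t_start:
        return "neighbor_left"   # neighbor's coding region ends before the target
    if t_end < coding_start:
        return "neighbor_right"  # neighbor's coding region starts after the target
    return "overlapping"


# Hypothetical isoform with coding exons at 5000-5500 and 6000-6400.
# A non-coding first exon (say at 100-200) is not listed, so a target at
# 1000-3000 is treated as a neighbor rather than as nested.
print(relative_position((1000, 3000), [(5000, 5500), (6000, 6400)]))
print(relative_position((5600, 5900), [(5000, 5500), (6000, 6400)]))
```

Leaving the UTR-only exons out of the input is the whole point of the design: a gene that sits in an intron between non-coding exons falls outside the coding span and is reported as a neighbor, not as nested.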
The Genome Browser Gateway should default to the correct assembly once you click on the Drosophila species in the left-hand table. To double-check that you are using the correct one, consult the “Genome Browsers” column of the Pathways Project Genome Assemblies web page.
For example, D. yakuba has three assembly options to choose from; according to the Genome Assemblies page, we should use the “Aug. 2021 (Princeton Prin_Dyak_Tai18E2_2.1/ DyakRefSeq3)” assembly when annotating D. yakuba.
- A suspected sequencing error must first be validated by running tblastn on the region in question against another assembly; it is rare for the same sequencing/assembly error to be present in two distinct assemblies.
- Once the sequencing error has been validated, you can use the Sequence Updater to make a VCF file.
- When using the Gene Model Checker, under “Model Details > Errors in Consensus Sequence”, select “Yes”, and upload the VCF file.
- Validate the model as usual, then include the VCF file in your submission along with the Annotation Form and the PEP, FASTA, and GFF files.
- Identify the last named isoform for the gene.
- Add “-PN” (for putative/protein novel) to the end of the gene name.
- Add the letter following the letter of the last named isoform.
For example, if the last named isoform for a gene is “first-PJ,” the novel isoform for that gene would be named “first-PNK.”
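The naming rule can be expressed as a small Python helper. This is a minimal sketch, assuming isoform names have the form `<gene>-P<letter>` with a single uppercase letter; the function name `novel_isoform_name` is hypothetical, and names with longer suffixes or a last isoform of "Z" are rejected rather than guessed at.

```python
import string


def novel_isoform_name(gene: str, last_isoform: str) -> str:
    """Build the name of a novel isoform from the last named isoform.

    Assumes names look like '<gene>-P<letter>', e.g. 'first-PJ'.
    """
    prefix = f"{gene}-P"
    if not last_isoform.startswith(prefix):
        raise ValueError(f"{last_isoform!r} is not an isoform of {gene!r}")

    suffix = last_isoform[len(prefix):]
    if len(suffix) != 1 or suffix not in string.ascii_uppercase:
        raise ValueError("this sketch only handles single-letter isoform suffixes")
    if suffix == "Z":
        raise ValueError("no letter follows 'Z'; this case is not covered here")

    # "-PN" marks the isoform as novel; the letter is the one after the
    # letter of the last named isoform.
    next_letter = chr(ord(suffix) + 1)
    return f"{gene}-PN{next_letter}"


print(novel_isoform_name("first", "first-PJ"))  # first-PNK
```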