Tips for the First Time Annotator

The following information was compiled to help first-time annotators get their bearings in the eukaryotic annotation process for the GEP. Special thanks to GEP TA D’Andrew Harrington (College of Southern Nevada) for creating the original version of this document. This page is a living document.

Getting Started

What do I need to do to succeed?

Success starts and ends with you; there is no other method or secret. Find a system that works within your time constraints, and be curious. By asking questions and spending time studying the materials, you will greatly increase your chances of success in your annotations and in any other scientific endeavor you pursue.

What are recommendations for time management?

Most of your Professors have probably offered advice on studying and time management. Everyone learns and applies information differently. Mnemonics such as “SOHCAHTOA” in Trigonometry or “Pure As Gold” for remembering purines and pyrimidines are examples of study methods that support long-term memory. Because everyone learns differently, it is crucial to be comfortable when studying and to genuinely set time aside to accomplish a task. Here are a few tips that we, as TAs, have found to work:
 
1. Set time aside to study. Those still getting into the habit of studying should start with small goals like, “This week, I will study two times for 30 minutes for my Biology class.” By starting small and working up to larger goals, you will naturally ease your way into more complex materials and longer study sessions. Be cautious, though: do NOT burn yourself out! If you are struggling with the material, or feel that revisiting something in a few days will help, that’s okay.
 
2. Don’t stop in the middle. Annotation can be an in-depth dive each time you review or unravel new information. Because of this, try to stop only when you reach a natural ending point. If you stop an annotation at a midpoint, say, in the middle of a BLAST search, it may be more difficult to get back up to speed and remember where you were.
 
3. The “full disclosure.” When you are assigned a contig or gene, the annotation process can take a considerable amount of time if you are not prepared. Budget for the time it will take to complete the annotation in its entirety. While this can also depend on your Professor and Facilitator, recognize that we are not setting you up to fail. The GEP has a wide range of resources to help you succeed and, most importantly, to retain the material. We genuinely hope you enjoy this course as much as each of us has enjoyed shaping it into what it is today.

Are there materials that will make this course easier?

Throughout this course, we have found that having the following materials handy makes a world of difference:
 
1. A lab notebook, digital or otherwise, to keep track of annotation information
 
2. A flash drive or other digital storage for screenshots and data files

Who can I reach out to if I need help?

If you are ever having a difficult time with your homework or annotations, or just need a sounding board while you work through assignments, the GEP TAs are here to help in any way we can. We recognize that a course like this might feel different from other labs you have done in the past because it isn’t cookie-cutter. These are actual annotations that other scientists are working on with you. Don’t feel discouraged, and stay persistent, because the end product is something unique and all your own!
 
The GEP TAs (Virtual TA Schedule) are reachable through the Zoom room link your Professor provided to you.

Breakdown of Annotation Tools

What tools do I have to help me in annotation?

The annotation process can be broken down into “stages” or “steps” that will create a more manageable timeframe for you to complete your assignments. Each tool is useful in its own way and shouldn’t be skipped or dodged. 

Annotation Files Merger

The Annotation Files Merger merges multiple files gathered during the annotation process. This is important because it keeps data collection consistent and gives you a quick way to review everything at a glance. The supported file types for the Annotation Files Merger are: GFF (Generic Feature Format), FASTA (FAST-ALL), PEP (Peptide File Format), and VCF (Variant Call Format).

These file types appear throughout your annotations, so it is important to recognize which file is which and why each one matters.
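As a rough illustration of how these file types differ, the sketch below guesses a file's type from its contents. It is not how the Annotation Files Merger works; the function name `guess_annotation_format` is hypothetical, and the rule for telling nucleotide FASTA from peptide (PEP) sequences is a simplifying assumption.

```python
def guess_annotation_format(text):
    """Guess whether text looks like GFF, FASTA, PEP, or VCF (heuristic sketch)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"

    # VCF files begin with a "##fileformat=VCF..." header line.
    if lines[0].startswith("##fileformat=VCF"):
        return "VCF"

    # FASTA-style files (nucleotide or peptide) begin with a ">" header line.
    if lines[0].startswith(">"):
        residues = "".join(
            ln.strip() for ln in lines if not ln.startswith(">")
        ).upper()
        if not residues:
            return "unknown"
        # Assumption: only A/C/G/T/N means nucleotide; anything else is
        # treated as a peptide sequence. Short peptides could be misjudged.
        if set(residues) <= set("ACGTN"):
            return "FASTA"
        return "PEP"

    # GFF records have nine tab-separated columns; "#" lines are comments.
    records = [ln for ln in lines if not ln.startswith("#")]
    if records and all(len(ln.split("\t")) == 9 for ln in records):
        return "GFF"
    return "unknown"
```

In practice the file extension is usually enough; the point is simply that each format has a recognizable shape.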

Reminders

  • Even if the isoforms are identical, you should merge each file type for every isoform.

See the Annotation Files Merger User Guide for more information.

FlyBase

FlyBase is a bioinformatics database for all things Drosophila. This website is a great place to study the who, what, where, when, and why behind a particular gene in D. melanogaster. You can best use FlyBase by searching for individual genes to pull up detailed reports that summarize genomic location, function, orthologs, and much more.

Reminders

  • If you are looking for a deeper understanding of your assigned genes – look no further than FlyBase.
  • FlyBase reports can be incredibly dense, so be sure to have a general idea of what you are looking for.
  • When looking for a gene, remember that gene names are case-sensitive.


See the FlyBase Tools and Downloads Documentation for more information.

Gene Model Checker

The Gene Model Checker is a key “checkpoint” in the annotation process. It allows us to compare our annotation against the D. melanogaster gene. Here we can see the bigger picture of what may or may not be missing, so be sure to review your dot plot and protein alignment for possible errors or common mistakes.

Reminders

  • The Gene Model Checker is where you will find:

See the Gene Model Checker User Guide and Video Tutorial for more information.

Gene Record Finder

The Gene Record Finder can be used to break down the complex information from FlyBase into a quick and easy-to-read form. More importantly, the Gene Record Finder can provide transcription details, polypeptide details, and isoforms for the gene you are examining.

Reminders

  • If you are annotating the UnTranslated Regions (UTRs), be sure to use the Transcription Details.
  • If you are looking for gene CoDing Sequences (CDSs), be sure to use the Polypeptide Details.
  • Make sure you account for EACH isoform. Unique or identical, we need information on all isoforms!
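One way to keep track of every isoform while still spotting the identical ones is to group isoforms by the coding exons they use. The sketch below assumes you have copied each isoform's CDS list by hand into a dictionary; the isoform names, CDS identifiers, and the function `group_isoforms_by_cds` are all hypothetical.

```python
from collections import defaultdict


def group_isoforms_by_cds(isoform_cds):
    """Group isoform names that share exactly the same ordered list of CDSs."""
    groups = defaultdict(list)
    for isoform, cds_ids in isoform_cds.items():
        # A tuple of CDS identifiers serves as the grouping key.
        groups[tuple(cds_ids)].append(isoform)
    return [sorted(names) for names in groups.values()]


# Hypothetical example: PA and PB share all CDSs, PC has a different first CDS.
isoforms = {
    "mygene-PA": ["1_1", "2_1", "3_1"],
    "mygene-PB": ["1_1", "2_1", "3_1"],
    "mygene-PC": ["1_2", "2_1", "3_1"],
}
for group in group_isoforms_by_cds(isoforms):
    print(", ".join(group))
```

Isoforms that land in the same group have identical coding regions, but each one still needs its own entry in your records.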

See the Gene Record Finder User Guide for more information.

GEP UCSC Genome Browser

The GEP UCSC Genome Browser is a tool to visualize genetic information in an easy-to-use format. This makes it far easier for us to examine unique patterns or spot similarities across species.

Reminders

  • Be sure to check whether your gene is on the forward or reverse strand relative to your scaffold. Getting the orientation wrong is a common error!
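A quick sanity check on orientation can be sketched as follows. In BLAST reports, a hit on the reverse strand of the subject is typically listed with a start coordinate larger than its end coordinate. The function names here are hypothetical, and exons are assumed to be written as (start, end) scaffold coordinates with start no larger than end.

```python
def hit_strand(subject_start, subject_end):
    """Return '+' if the subject coordinates ascend, '-' if they descend."""
    return "+" if subject_start <= subject_end else "-"


def order_exons_for_transcript(exons, strand):
    """Order (start, end) exons in transcript order.

    On the forward strand the first exon has the smallest coordinates;
    on the reverse strand the first exon has the largest coordinates.
    """
    return sorted(exons, reverse=(strand == "-"))


strand = hit_strand(900, 700)
print(strand)
print(order_exons_for_transcript([(100, 200), (500, 650), (300, 400)], strand))
```

If the exon coordinates in your notes run in the opposite direction from what the strand implies, that is a sign the orientation deserves a second look.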

See the UEG | Genome Browser Video for a breakdown of how to use the GEP UCSC Genome Browser.

NCBI BLAST

BLAST is vital for success in genomic annotation as a whole. As mentioned in the introductory video for navigating and interpreting BLAST results, there are numerous steps that need to be done in order to make sure BLAST is working for you and you aren’t working for BLAST.

Reminders

  • BLAST is heuristic rather than exhaustive, which means it is not guaranteed to find the best possible alignment, and its results can change with different search parameters or database updates. It should only be used as a guide, along with other lines of evidence for your model.
  • Evidence gathered from BLAST is not always exact or precise. Be sure to check your coordinates carefully!
  • Within BLAST be sure to collect these key points from the top two hits:

| BLAST Type | Query (sequence to match) | Database/Subject (searching for match) | Function | Use Cases |
|---|---|---|---|---|
| blastn (nucleotide) | nucleotide | nucleotide | searching with shorter queries, cross-species comparison | map mRNAs against genomic assemblies |
| blastp (protein) | protein | protein | general sequence identification and similarity searches | search for proteins similar to predicted genes |
| blastx | nucleotide → protein | protein | identifying potential protein products encoded by a nucleotide query | map proteins/CDS against genomic sequence |
| tblastn | protein | nucleotide → protein | identifying database sequences encoding proteins similar to query | map proteins against genomic assemblies |
| tblastx | nucleotide → protein | nucleotide → protein | identifying nucleotide sequences similar to the query based on their coding potential | identify genes in unannotated sequences |

Arrows indicate the BLAST program translates the nucleotide sequence before performing the search.
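To make the "nucleotide → protein" step concrete, here is a minimal sketch of a six-frame conceptual translation using the standard genetic code. It illustrates the idea only and is not BLAST's implementation; the function names are hypothetical, and codons containing anything other than A, C, G, or T are shown as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the first base; '*' marks a stop."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate three forward frames and three reverse-complement frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, peptide in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(frame, peptide)
```

Translated searches compare these conceptual protein sequences rather than the raw nucleotides, which is why they can detect similarity that a nucleotide-level search misses.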

See the Introduction to NCBI BLAST and Introduction to BLAST using Human Leptin lessons for more information.

Pathways Project Genome Assemblies

The Pathways Project Genome Assemblies page is by far the quickest and most effective way to navigate to BLAST if you are looking to search against a specific assembly.

Reminders

  • Pathways Project Genome Assemblies searches against SPECIFIC species assemblies. If you are looking for something a bit broader, you can navigate to NCBI BLAST.

Sequence Updater

The Sequence Updater creates a VCF (Variant Call Format) file, which can be used to update an existing assembly. This is used whenever a student may have sufficient evidence to suggest that an assembly has an error causing an incorrect alignment.

Reminders

  • Before suggesting a consensus error, make sure to contact your instructor and/or a TA to confirm that you have looked at all other lines of evidence; using the Sequence Updater and proposing a consensus error should always be your last resort. Updating a sequence does not happen often and should not be done unless there is significant evidence to suggest it. More often than not, this tool will not be used.
  • You must have multiple sequences that suggest an error. Much like verifying your work, it’s better to have numerous pieces pointing to the same potential error than to only have one. Is the change you are seeking valid and accurate?
  • As outlined in the GEP Tools | Sequence Updater video tutorial, keep in mind once you apply these changes, that’s it. Be absolutely certain this is what you want.
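For a sense of what a VCF-style change describes, the sketch below applies a single variant to a sequence and formats it as one tab-separated record using the standard VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is only an illustration, not the Sequence Updater's actual output; the function names and the example contig name are hypothetical.

```python
def apply_variant(sequence, pos, ref, alt):
    """Replace ref with alt at a 1-based position, as VCF positions are 1-based."""
    start = pos - 1
    observed = sequence[start:start + len(ref)]
    # Refuse to edit if the sequence does not contain ref at that position.
    if observed.upper() != ref.upper():
        raise ValueError(f"expected {ref!r} at position {pos}, found {observed!r}")
    return sequence[:start] + alt + sequence[start + len(ref):]


def vcf_record(chrom, pos, ref, alt):
    """Format one VCF data line; '.' marks fields left unspecified."""
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."])


# Hypothetical example: insert an A after the G at position 3.
print(apply_variant("ACGTACGT", 3, "G", "GA"))
print(vcf_record("contig1", 3, "G", "GA"))
```

Checking that the reference base matches before editing mirrors the caution above: a proposed change should agree with what is actually in the assembly at that position.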

Small Exons Finder

Do you have an exon that is too small to find with BLAST? Does the result, no matter what you change, keep coming up as “No Significant Matches Found”? If so, the Small Exons Finder can help you locate small exons, a task that would otherwise be like finding a needle in a very large haystack.
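To show why constraints make this search tractable, the sketch below scans the forward strand of a sequence for short regions that could be internal exons: each is preceded by an AG splice acceptor, followed by a GT splice donor, and free of in-frame stop codons. This is a simplified illustration and not the Small Exons Finder's algorithm; the function name, the `phase` parameter (bases at the start of the exon before the first complete codon), and the forward-strand-only scan are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_internal_exons(scaffold, min_len, max_len, phase):
    """Return 1-based inclusive (start, end) coordinates of candidate exons."""
    seq = scaffold.upper()
    hits = []
    for start in range(2, len(seq)):
        # The two bases just upstream must be the AG splice acceptor.
        if seq[start - 2:start] != "AG":
            continue
        for length in range(min_len, max_len + 1):
            end = start + length  # 0-based, exclusive
            # The two bases just downstream must be the GT splice donor.
            if seq[end:end + 2] != "GT":
                continue
            exon = seq[start:end]
            codons = [exon[i:i + 3] for i in range(phase, len(exon) - 2, 3)]
            if not any(codon in STOP_CODONS for codon in codons):
                hits.append((start + 1, end))
    return hits


print(candidate_internal_exons("CCAGATGGCCGTAA", 3, 10, 0))
```

Even with these filters a whole scaffold can yield many candidates, so any hit still needs to be checked against the other evidence for your gene model.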

Reminders

  • When using the Small Exons Finder, you should use the whole scaffold from the GEP UCSC Genome Browser and not just the position of your predicted gene. You can do this by following the steps below:

Conclusion

The annotation process will take time for anyone who is just starting out, and that’s okay. The initial difficulty you face is not because you don’t understand the science but because you are learning to use these specific tools, which will become easier with time. Patience, time, and dedication to your work will not only give you a better sense of why annotation is an effective scientific process but also provide you with skills that you can use in a scientific or research career. We are not here to create unneeded difficulty; we simply strive to offer a genuine introduction to bioinformatics and genomic annotation. If at any time during your annotation you feel lost or unsure where to go, simply reach out to the GEP TAs, and we will be more than happy to provide a nudge in the right direction.