tRNAscan-SE Output Legend and Search Methods
Output Format
Tabular Output Format
Sequence              tRNA Bounds     tRNA   Anti    Intron Bounds    Cove
Name        tRNA #    Begin   End     Type   Codon   Begin   End      Score
--------    ------    -----   ---     ----   -----   -----   -----    -----
CELF22B7    1         12619   12738   Leu    CAA     12657   12692    60.01
CELF22B7    2         19480   19561   Ser    AGA     0       0        80.44
CELF22B7    3         26367   26439   Phe    GAA     0       0        80.32
CELF22B7    4         26992   26920   Phe    GAA     0       0        80.32
CELF22B7    5         23765   23694   Pro    CGG     0       0        75.76
Each new tRNA in a sequence is consecutively numbered in the "tRNA #" column. "tRNA Bounds" specify the starting (5') and ending (3') nucleotide bounds for the tRNA. tRNAs found on the reverse (lower) strand are indicated by having the Begin (5') bound greater than the End (3') bound (see tRNAs #4 & #5 in output above).
The "tRNA Type" is the amino acid predicted to be charged to the tRNA molecule, based on the predicted "Anticodon" (written 5'->3') shown in the next column. tRNAs that fit the criteria for potential pseudogenes (poor primary or secondary structure; see Pseudogene Detection) are marked with "Pseudo" in the "tRNA Type" column. If an intron is predicted in the tRNA, the next two columns give its nucleotide bounds; if no intron is predicted, both columns contain zero. The final column is the Cove score for the tRNA in bits. This score varies somewhat depending on the tRNA covariance model used in the analysis (the search mode selects the model: eukaryote-specific, prokaryote-specific, archaea-specific, or general). tRNAscan-SE counts any sequence that attains a score of >= 20.0 bits as a tRNA (based on empirical studies conducted by Eddy & Durbin, 1994).
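The column rules above can be sketched in Python. This is a minimal illustration, not part of tRNAscan-SE: the `TrnaHit` class and `parse_row` function are hypothetical names, and the sketch assumes one whitespace-separated data row with the nine columns shown in the sample output.

```python
from dataclasses import dataclass


@dataclass
class TrnaHit:
    """One data row of the tabular output (hypothetical helper class)."""
    seq_name: str
    number: int
    begin: int
    end: int
    trna_type: str
    anticodon: str
    intron_begin: int
    intron_end: int
    score: float

    @property
    def strand(self) -> str:
        # Begin (5') greater than End (3') means the reverse strand.
        return "-" if self.begin > self.end else "+"

    @property
    def has_intron(self) -> bool:
        # Both intron columns are zero when no intron is predicted.
        return not (self.intron_begin == 0 and self.intron_end == 0)

    @property
    def is_pseudo(self) -> bool:
        # Potential pseudogenes carry "Pseudo" in the tRNA Type column.
        return self.trna_type == "Pseudo"


def parse_row(line: str) -> TrnaHit:
    """Parse one whitespace-separated data row (header lines not handled)."""
    f = line.split()
    return TrnaHit(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                   int(f[6]), int(f[7]), float(f[8]))


hit1 = parse_row("CELF22B7 1 12619 12738 Leu CAA 12657 12692 60.01")
hit4 = parse_row("CELF22B7 4 26992 26920 Phe GAA 0 0 80.32")
print(hit1.strand, hit1.has_intron)  # forward strand, intron predicted
print(hit4.strand, hit4.has_intron)  # reverse strand, no intron
```

Deriving the strand from the bounds, rather than storing it separately, mirrors how the output itself encodes orientation.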
Secondary Structure Format
The first line contains the sequence name, tRNA #, tRNA bounds (in parentheses), and length of the tRNA. The next line contains the isoacceptor tRNA Type, the Anticodon (with tRNA-relative and sequence-absolute bounds), and the Cove Score. This is the same information shown in the tabular output format, with the addition of the anticodon bounds. The next line contains marks every 5 and 10 bp to ease position identification in the tRNA sequence that appears on the following line. On the sequence line, nucleotides matching the "consensus" tRNA model used in the Cove analysis appear in upper case, while introns and other nucleotides in non-conserved positions are printed in lower case. The last line contains the predicted secondary structure folding of the tRNA, with nested ">" and "<" symbols representing base pairings. The various tRNA features are labelled in this example.
CELF22B7.trna4 (26992-26920) Length: 73 bp
Type: Phe Anticodon: GAA at 34-36 (26959-26957) Score: 73.88
         *    |    *    |    *    |    *    |    *    |    *    |    *    |
Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |       D-stem/loop        Anticodon           TPC stem/loop     |
        |                          stem/loop                             |
        +----------------------------------------------------------------+
                                  Acceptor stem
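The nested ">" and "<" notation on the "Str:" line can be turned into explicit base pairs with a simple stack. The following is a minimal sketch, assuming ">" opens a pair and "<" closes the most recently opened one, as in the example above; the function name `base_pairs` is hypothetical and not part of tRNAscan-SE or Cove.

```python
# Structure line copied from the example above (tRNA-relative positions 1-73).
STRUCTURE = (">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<....."
             ">>>>>.......<<<<<<<<<<<<.")


def base_pairs(structure: str) -> list:
    """Return sorted (5' position, 3' position) pairs, 1-based."""
    stack = []   # positions of ">" symbols still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == ">":
            stack.append(pos)
        elif symbol == "<":
            if not stack:
                raise ValueError(f"unmatched '<' at position {pos}")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError(f"unmatched '>' at positions {stack}")
    return sorted(pairs)


pairs = base_pairs(STRUCTURE)
for five_prime, three_prime in pairs:
    print(five_prime, three_prime)
```

For the example tRNA, the outermost pairs of each stem reported by this sketch line up with the feature boxes drawn under the structure line.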
tRNAscan-SE Detection Algorithm
tRNAscan-SE does no tRNA detection itself, but instead combines the strengths of three independent tRNA prediction programs by negotiating the flow of information between them, performing a limited amount of post-processing, and outputting the results. The program works in three main phases. In the first phase, it runs two independent tRNA detection programs on the input DNA sequence. These relatively fast, first-pass detection programs are a modified, optimized version of tRNAscan 1.3 (1) and EufindtRNA, an implementation of a previously described tRNA search algorithm (3).
tRNAscan 1.3 detects tRNAs by initially looking for short, well-conserved intragenic promoter sequences (A & B boxes in eukaryotes) found in the TPC and D arm regions of prototypic tRNAs. Once a specific number of nucleotides in the sequence match the consensus promoter (defined by an arbitrary score threshold), the program progressively attempts to identify the various stem-loop structures found in the tRNA "cloverleaf". As each arm is identified by the presence of base-pairing in the stem, correct loop size, and several invariant and semi-invariant bases, a "general score" counter is incremented. If the final score exceeds an empirically determined threshold, the tRNA location, anticodon, and type are saved.
EufindtRNA, on the other hand, searches only for linear sequence signals. A step-wise algorithm uses newly developed log-odds score matrices to first identify A and B box promoter elements that exceed an empirically determined cutoff. The scores for these A and B boxes are then added to a log-odds score for the nucleotide distance between the two boxes to produce an intermediate score. Finally, a log-odds score for the distance to the nearest downstream poly-T pol III termination signal is added to the intermediate score to obtain a final score. If the final score is above the final score cutoff, the tRNA identity and location are saved.
tRNAscan-SE uses a less selective version of this algorithm that does not look for pol III termination signals and therefore applies its cutoff to the intermediate score. The intermediate score cutoff is also loosened slightly relative to the one described for the original algorithm (3). These modifications increase the algorithm's sensitivity but greatly reduce EufindtRNA's selectivity. This does not reduce the final selectivity of tRNAscan-SE, since a secondary filter (Cove) eliminates false positives. The sensitivity of EufindtRNA is roughly comparable to that of tRNAscan 1.3, but the two programs appear to be complementary: EufindtRNA tends to identify tRNAs missed by tRNAscan 1.3 and vice versa (3). tRNAscan-SE takes advantage of this by saving the results from both tRNAscan 1.3 and EufindtRNA and merging them into one list of non-redundant "candidate" tRNA identifications.
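The merging of first-pass results into a non-redundant candidate list can be illustrated as follows. The text does not spell out the exact merge rule, so this sketch assumes that two hits are the same candidate when they lie on the same strand and their coordinates overlap, and that the merged candidate spans both hits. The `Candidate` type, the `merge_candidates` function, and the coordinates used are all hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int    # leftmost coordinate on the input sequence
    end: int      # rightmost coordinate on the input sequence
    strand: str   # "+" or "-"
    source: str   # first-pass program(s) that reported the hit


def merge_candidates(hits):
    """Collapse overlapping same-strand hits (assumed rule, see lead-in)."""
    merged = []
    for hit in sorted(hits, key=lambda h: (h.strand, h.start, h.end)):
        last = merged[-1] if merged else None
        if last and last.strand == hit.strand and hit.start <= last.end:
            # Overlap: widen the bounds and record both sources.
            names = set(last.source.split("+")) | {hit.source}
            merged[-1] = Candidate(last.start, max(last.end, hit.end),
                                   last.strand, "+".join(sorted(names)))
        else:
            merged.append(hit)
    return merged


hits = [
    Candidate(100, 172, "+", "tRNAscan"),
    Candidate(102, 173, "+", "EufindtRNA"),   # overlaps the hit above
    Candidate(500, 571, "-", "EufindtRNA"),   # found by one program only
]
candidates = merge_candidates(hits)
for c in candidates:
    print(c)
```

Keeping hits found by only one program is what lets the combined list benefit from the complementary sensitivity of the two detectors; false positives are left for the Cove stage to remove.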
In the second phase, tRNAscan-SE extracts the DNA subsequences identified as possible tRNAs and passes only these segments to an RNA search program in the Cove program suite (covels) for analysis. Cove programs look for tRNAs in a very different way. A probabilistic model for tRNA has been developed by aligning known tRNAs and giving a base-specific probability score to every nucleotide in the tRNA model. Cove also captures RNA secondary structure information using a type of language referred to as a stochastic context-free grammar (SCFG). Cove applies this probabilistic model to the entire windowed sequence and produces a score, in bits, for how well the sequence matches the tRNA model. If the score exceeds 20.0 bits, the candidate is considered a true tRNA (based on empirical studies in ref. 2).
In the final phase, tRNAscan-SE takes the candidates confirmed as tRNAs
and runs another Cove program (coves) on them to display their RNA secondary
structure. The tRNA type is predicted by identifying the anticodon
within the structure output. Introns are also automatically
identified from the structure output as runs of five or more
consecutive non-consensus nucleotides within the anticodon loop.
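The intron rule just described can be sketched with a regular expression. This is an illustration only: it assumes the anticodon loop bounds (tRNA-relative, 1-based, and including any intron) are already known from the structure output, and it relies on non-consensus nucleotides being printed in lower case. The function name `find_intron` and the toy sequence are hypothetical.

```python
import re


def find_intron(seq, loop_start, loop_end, min_len=5):
    """Return (begin, end) of the first lower-case run of at least
    min_len nucleotides inside the anticodon loop, or None."""
    loop = seq[loop_start - 1:loop_end]
    match = re.search(r"[a-z]{%d,}" % min_len, loop)
    if match is None:
        return None
    begin = loop_start + match.start()
    end = loop_start + match.end() - 1
    return begin, end


# Toy sequence (not a real tRNA): loop spans positions 5-18 and holds
# a six-nucleotide lower-case run.
toy = "ACGTACTGAAgtcatgAACGT"
print(find_intron(toy, 5, 18))

# The example tRNA has a single lower-case base outside its anticodon
# loop (positions 32-38), so no intron is reported.
phe = ("GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGG"
       "tCACCAGTTCGATCCTGGTTCGGGGCA")
print(find_intron(phe, 32, 38))
```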
Pseudogene Detection
tRNAscan-SE uses heuristics to try to distinguish pseudogenes from
true tRNAs, based primarily on a lack of tRNA-like secondary structure. A
second tRNA covariance model was created from the original 1415-tRNA
alignment, under the constraint that no secondary structure is
conserved (this model is effectively just a sequence profile, or
hidden Markov model (HMM)). By subtracting a tRNA's similarity score
to the primary structure-only model ("HMM Score" column) from that
using the complete tRNA model, a secondary structure-only score
("2'Str Score" column) is obtained. We have observed that tRNAs with
low scores for either component of the total score were often
pseudogenes. Thus, tRNAs are marked as likely pseudogenes if they
have either a score of less than 10 bits for the primary sequence
component of the total score (HMM Score), or a score of less than 5
bits for the secondary structure component (2'Str Score) of the total
score. Selenocysteine tRNAs are not checked by these rules since they
have atypical primary and secondary structure. Also, use of the -O
option (search for organellar tRNAs) disables pseudogene checking
since these criteria are geared towards detecting cytoplasmic
pseudogenes (some true non-eukaryotic tRNAs are marked as pseudogenes
by this analysis).
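The pseudogene rule above reduces to a short calculation, sketched here under stated assumptions: the function name and its parameters are hypothetical, and the caller is assumed to know whether the tRNA is a selenocysteine tRNA and whether the organellar search option is in use.

```python
HMM_CUTOFF = 10.0      # bits, primary sequence component
STR2_CUTOFF = 5.0      # bits, secondary structure component


def is_likely_pseudogene(total_score, hmm_score,
                         is_selenocysteine=False, organellar_mode=False):
    """Apply the two-component score rule described in the text."""
    # Selenocysteine tRNAs and organellar searches skip the check.
    if is_selenocysteine or organellar_mode:
        return False
    # Secondary structure-only score = complete model - primary-only model.
    str2_score = total_score - hmm_score
    return hmm_score < HMM_CUTOFF or str2_score < STR2_CUTOFF


print(is_likely_pseudogene(60.0, 40.0))   # both components are high
print(is_likely_pseudogene(30.0, 8.0))    # weak primary sequence score
print(is_likely_pseudogene(30.0, 27.0))   # weak secondary structure score
```

The score values in the usage lines are made up for illustration; only the 10-bit and 5-bit cutoffs come from the text.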
tRNA with Anticodon CAT
The current version of tRNAscan-SE does not distinguish among initiator
tRNA-Met, elongator tRNA-Met, and tRNA-Ile with anticodon CAT. The
annotations presented in this database therefore do not distinguish these
tRNAs. This feature may be included in a future version of tRNAscan-SE.
For more details on the program algorithm & implementation, see
the Nucleic Acids Research paper (Lowe & Eddy, 1997).