How to use AliWABA

Step 1: Entering sequence data

Uploading sequences

AliWABA accepts nucleotide or amino acid sequence data in the form of Pearson (FASTA) files. Correctly formatted FASTA files will contain a >, followed immediately by the name of a sequence, followed by one or more lines of sequence data, followed by another > line, etc. For example:
>gene1 from Human
ATTATTGTATG....
TTCTGATCCCT....
>gene2 from Human
....
>gene3 from Mouse
....
It is important that the word immediately following each > sign be unique within the file. Thus, the following would probably not work, because all three sequences are named gene1:
>gene1 from Human
ATTATTGTATG....
TTCTGATCCCT....
>gene1 from Mouse
....
>gene1 from Rat
....
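The uniqueness rule above can be checked before uploading. The following is a minimal sketch in Python, not part of AliWABA itself; the function name `duplicate_fasta_names` and the sample records are made up for illustration, and the "name" is assumed to be the first whitespace-delimited word after the > sign.

```python
from collections import Counter


def duplicate_fasta_names(lines):
    """Return the sequence names (first word after '>') that occur more than once."""
    names = []
    for line in lines:
        if line.startswith(">"):
            # Assumption: the name is the first whitespace-delimited word after '>'.
            words = line[1:].split()
            names.append(words[0] if words else "")
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical records modelled on the two examples above.
good = [">gene1 from Human", "ATTATTGTATG", ">gene2 from Human", "TTCTGATCCCT",
        ">gene3 from Mouse", "ACGT"]
bad = [">gene1 from Human", "ATTATTGTATG", ">gene1 from Mouse", "TTCTGATCCCT",
       ">gene1 from Rat", "ACGT"]

print(duplicate_fasta_names(good))  # []
print(duplicate_fasta_names(bad))   # ['gene1']
```

To check a real file, pass an open file handle instead of a list, since iterating over a file yields its lines.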
For especially long sequences, or where you expect the number of domains to be extremely large, you may be better off downloading the ABA source code and running it locally. For very large tasks (e.g., aligning genomes), many of the steps are easily automated and parallelized, but we cannot currently provide those services through the web site.

Selecting sequence type

In the upper right-hand corner of the page is a selection box that lets you specify whether your sequence data comprises nucleotides or amino acids. If you have nucleotide sequences, you may further choose between two alignment algorithms, BLAST(N) and cross_match. Typically, the two return nearly identical results, but cross_match often works slightly better for shorter sequences.

Align

Clicking the Align button will trigger the appropriate alignment algorithm and create an A-Bruijn graph.

Limitations

Step 2: Viewing the alignment

If, after clicking Align in the previous step, a blank page loads, it is likely that your sequences were not alignable. For example, sequences that are too short for reasonable alignments or that are corrupted in some way may cause an error in the alignment program. In these cases, please mail your sequence file to aba AT aba DOT nbcr DOT net and we will diagnose the problem.

Upon success, AliWABA will display a (possibly large) image of a graph. The notation of this graph can be somewhat cryptic, so an explanation is in order. As with all ABA graphs, vertices do not represent subsequences of the alignment, but intersections of subsequences. Edges represent local alignments between one or more sequences. In the graphical representation of the ABA graph used by AliWABA, each vertex is numbered with an integer; this integer can be used to choose a path in the graph (see below).

Edges are labelled with three numbers, in the format a,b(c). The first number, a, denotes the "length" of the edge (that is, the length of the local alignment induced by the edge). The second number, b, is simply an edge identifier and can generally be ignored. The final number, c (in parentheses), represents the multiplicity of that edge, that is, the number of sequences participating in the local alignment. For example, the edge label 424,44(2) describes a local alignment of length 424 between two sequences. Recall that a local alignment between two sequences has its own length, which may differ from the length of either participating substring because of indels.
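If you copy edge labels out of the image or the DOT file, the a,b(c) format is easy to take apart programmatically. The sketch below is only an illustration: the names `EdgeLabel` and `parse_edge_label` are hypothetical and are not part of AliWABA.

```python
import re
from typing import NamedTuple


class EdgeLabel(NamedTuple):
    length: int        # a: length of the local alignment
    edge_id: int       # b: edge identifier (can usually be ignored)
    multiplicity: int  # c: number of sequences participating in the alignment


# Matches labels of the form a,b(c), tolerating stray spaces.
_LABEL = re.compile(r"^\s*(\d+)\s*,\s*(\d+)\s*\(\s*(\d+)\s*\)\s*$")


def parse_edge_label(text):
    """Split an 'a,b(c)' edge label into its three integer fields."""
    match = _LABEL.match(text)
    if match is None:
        raise ValueError(f"not an a,b(c) edge label: {text!r}")
    return EdgeLabel(*(int(group) for group in match.groups()))


label = parse_edge_label("424,44(2)")
print(label.length, label.edge_id, label.multiplicity)  # 424 44 2
```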

As a way to mark the sequences, the "source" vertex of each sequence is usually drawn with a red box around it and the name of the sequence written above the box. This can sometimes lead to confusion because the label for one sequence may be drawn in the box for another sequence. You can tell which source vertex belongs to which sequence by selecting the edges for a given query sequence (see below).

Usually the ABA graph will be very large. You can change the size of the image or the compactness of vertices in the image by adjusting view options, as described below.

The primary operation that can be done from this view is to select edges, whose sequence data can then be either viewed or annotated.

Navigating the alignment

On the left-hand side of the browser you will see a frame with several options. Clicking on any of these options will replace the current right-hand panel with the chosen content.
  1. Current A-Bruijn graph (returns to the current view of the A-Bruijn graph)
  2. New alignment (returns to the main ABA sequence entry form)
  3. User's Guide (this document)

At the bottom of the graph image are two sections with parameters that you may specify. The first section, Edge Selection, allows you to pick edges, paths, or sequences that you may be interested in exploring further. An ABA graph generally contains a few high multiplicity edges that may represent interesting alignments, and many low multiplicity (or unique) edges that act as a sort of sequence glue and are less interesting. Often, you want to identify the high multiplicity (and especially the long, high multiplicity) edges and scrutinize them further, by retrieving the sequences along those edges or by annotating them against known domain databases.

There are five ways to specify a set of edges for which you want more information. To activate any particular method, click on its radio button and fill in the necessary parameters; pressing the "Change Selection" button will update the image (selected paths are marked in red), while pressing the "Reset Selection" button will forget any selections you've made.

  1. All edges along one of the query sequences.
     The sequences are listed in a selection box according to the first word in the FASTA identification line. Nucleotide sequences will also include all reverse complements, denoted "-rc"; an input nucleotide file composed of ERVL, MLT2B3, and RICKSHA will also allow the selection of ERVL-rc, MLT2B3-rc, and RICKSHA-rc.
  2. Explicit path.
     Paths may also be specified explicitly, as sequences of vertices. Vertices are specified by the number inscribed in them in the image, and a path is written as (a,b,c,d,...) containing 2 or more vertices. Multiple paths can be activated at once as (a,b,c,d...)+(x,y,z...). Once an edge is selected, it stays selected, so (1,2,3,4,5)+(2,3,4) is the same as (1,2,3,4,5), not (1,2)+(4,5).
  3. Extend current selections by ____ edges on each end
     Selects all edges between vertices that are within a radius of ____ of the vertices currently in the selection. A large value here will likely select the entire graph, which is probably not very useful.
  4. Select edges with (multiplicity|length) (greater than|equal to|less than) ______
     Selects any edge with the given characteristics.
  5. Select edges near vertices that have (in|out)-degree (greater than|equal to|less than) _____
     Selects any edge adjacent to a vertex with the given characteristics. In some cases, edges impinging on vertices with high fan-in or fan-out may be interesting. This allows the selection of those edges.
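The explicit path notation can be illustrated with a short Python sketch. This is an illustration of the notation only, not AliWABA's own parser; the function name `parse_paths` is hypothetical, and edges are simplified to (from, to) vertex pairs, which ignores the possibility of parallel edges between the same two vertices.

```python
def parse_paths(spec):
    """Turn a path specification such as '(1,2,3)+(7,8)' into a set of edges.

    Each edge is represented as a (from_vertex, to_vertex) pair; this is a
    simplification that cannot distinguish parallel edges.
    """
    edges = set()
    for part in spec.split("+"):
        part = part.strip()
        if not (part.startswith("(") and part.endswith(")")):
            raise ValueError(f"path must be parenthesized: {part!r}")
        vertices = [int(v) for v in part[1:-1].split(",")]
        if len(vertices) < 2:
            raise ValueError(f"path needs 2 or more vertices: {part!r}")
        # Consecutive vertices along the path define the selected edges.
        edges.update(zip(vertices, vertices[1:]))
    return edges


# Selecting an edge twice changes nothing, so these two are the same selection.
print(parse_paths("(1,2,3,4,5)+(2,3,4)") == parse_paths("(1,2,3,4,5)"))  # True
# ...whereas (1,2)+(4,5) leaves out the edges (2,3) and (3,4).
print(parse_paths("(1,2,3,4,5)+(2,3,4)") == parse_paths("(1,2)+(4,5)"))  # False
```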

When a selection is already present on the graph, the natural operation to perform when adding a new selection is to combine the two (take their union), which is the default operation. You can, however, take the intersection, which will result in a smaller selection; this feature is primarily for completeness, as it is almost always easier to construct the intersection by specifying its exact parts. There are cases, however, in which you will need to remove edges from a selection. In these cases, you can choose Curr-New, which takes the existing selection, computes the selection that you have specified in the form, removes the new selection from the current selection, and returns the difference. (Symmetrically, you can do this the opposite way: select a small set of edges, then specify a bigger set and ask for New-Curr.)

This is useful, for example, when you want to investigate only the low multiplicity edges near two vertices, one with high fan-in and the other with high fan-out. Suppose you have vertices A and B, where A has in-degree 10 and B has out-degree 10, and the edge (A,B) has multiplicity 10. To get all the fan-in and fan-out edges adjacent to A or B, first select edge (A,B) by explicitly describing it. Then extend the current selection by 1 edge on each end. Finally, explicitly specify edge (A,B) again, choose Curr-New, and press Change Selection.
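The ways of combining a new selection with the current one behave like ordinary set operations, and the fan-in/fan-out recipe above can be traced on a toy graph. The sketch below makes several assumptions: the function names `combine` and `extend`, the mode strings "union" and "intersection", and the small example graph are all made up for illustration, and "extend by one edge" is interpreted as adding every edge that touches a vertex already in the selection.

```python
def combine(current, new, mode="union"):
    """Combine two edge selections (sets of (from, to) pairs)."""
    if mode == "union":         # default: everything in either selection
        return current | new
    if mode == "intersection":  # only edges present in both selections
        return current & new
    if mode == "curr-new":      # current selection minus the new one
        return current - new
    if mode == "new-curr":      # new selection minus the current one
        return new - current
    raise ValueError(f"unknown mode: {mode!r}")


def extend(selection, all_edges, radius=1):
    """Grow a selection by `radius` edges (assumed meaning: add every edge
    touching a vertex that is already part of the selection)."""
    selected = set(selection)
    for _ in range(radius):
        touched = {vertex for edge in selected for vertex in edge}
        selected |= {e for e in all_edges if e[0] in touched or e[1] in touched}
    return selected


# Toy graph: two edges fan in to A, two fan out of B, one edge lies further away.
all_edges = {("p1", "A"), ("p2", "A"), ("A", "B"),
             ("B", "q1"), ("B", "q2"), ("q1", "r")}

current = {("A", "B")}                               # select (A,B) explicitly
current = extend(current, all_edges, radius=1)       # extend by 1 edge
result = combine(current, {("A", "B")}, "curr-new")  # Curr-New with (A,B)
print(sorted(result))
# [('B', 'q1'), ('B', 'q2'), ('p1', 'A'), ('p2', 'A')]
```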

Changing the (potentially huge) graph viewing options

ABA graphs can be quite large. We provide two modes of drawing the graph:

  1. Expanded view, where sequences are labelled, drawn left to right, and stretched out
  2. Contracted view, where the graph is drawn in as little space as possible
Each of these modes controls only the layout of vertices within the available image size. The image size itself is controlled by setting its width and height in pixels. For desktop displays, the default (unrestricted) size is usually best. For smaller laptop displays, settings of 1200x1024 are not unreasonable. You can also set these values to 640x480 to get an overall understanding of the topology of the graph.

Additionally, clicking on the graph itself, regardless of size, will display the graph at full resolution. Note: Firefox, and other browsers, may choose to display a thumbnail of this image, but if you click the image again it will be displayed at full resolution.

Extracting information from the graph

Each edge in an ABA graph represents a local alignment. By selecting edges and pressing the Display FASTA for selected edges button, you will be presented with a new FASTA file that contains the segments of the sequences that aligned along those edges. This file can then be input into a multiple alignment tool of your choice to see a more detailed alignment. The alignments performed by ABA to construct the ABA graph may be coarser than the sequences actually warrant, so as to reduce spurious alignments that could cause a different graph topology. By displaying the FASTA for sequences along an edge, you can perform a more detailed or exact alignment.

Because ABA (and, by extension, AliWABA) is a tool for exploring the domain organization of biosequences, it is sensible to check domains that you might have found in the ABA graph against known domain databases. In particular, we have enabled searching the Conserved Domain Database (CDD, www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd) for selected edges. By performing an edge selection and then pressing the Annotate selected edges button, you will be presented with a BLAST report that describes any significant (E-value < 0.001) hits to CDD. Note: This feature may take several minutes to run.

Downloading data

The GraphViz DOT file used to construct the image can be downloaded from a link at the bottom of the ABA Graph viewing page. This may be helpful if an ABA graph is needed in a figure for publication or a presentation, or if you would like to simplify the ABA graph by removing any short edges that you have some a priori reason to remove.
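As a rough illustration of removing short edges, the following Python sketch drops edge lines whose label length falls below a threshold. It rests on an assumption about the file layout that should be checked against a real download: each edge is taken to sit on its own line of the form `1 -> 2 [label="424,44(2)"];`. The function name `drop_short_edges` and the sample graph are hypothetical.

```python
import re

# Assumed label layout: label="a,b(c)", where a is the edge length.
_EDGE_LABEL = re.compile(r'label\s*=\s*"(\d+),(\d+)\((\d+)\)"')


def drop_short_edges(dot_text, min_length):
    """Remove edge lines whose label length (the 'a' field) is below min_length."""
    kept = []
    for line in dot_text.splitlines():
        match = _EDGE_LABEL.search(line)
        is_short_edge = ("->" in line and match is not None
                         and int(match.group(1)) < min_length)
        if not is_short_edge:
            kept.append(line)  # vertices, attributes and long edges pass through
    return "\n".join(kept)


# Hypothetical sample; a real AliWABA DOT file may be laid out differently.
sample = '''digraph aba {
  1 -> 2 [label="424,44(2)"];
  2 -> 3 [label="12,45(1)"];
}'''

print(drop_short_edges(sample, min_length=50))
```

The filtered text can then be rendered with the GraphViz `dot` program as usual.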

Future directions

The following features are being considered. Your input would be extremely helpful in determining their relative priorities or suggesting new ones.


This page last updated 1/29/2006 by Neil Jones