IndelExtractor v2.2.1 - Generates alignment masks for a sequence alignment
How you run IndelExtractor depends on your operating system (OS), the type of program file you have, and whether you want to run it in graphical (GUI) or command line (CL) mode.
IndelExtractor.pl
IndelExtractor.exe [-a|-s] in_filename [-ao|-so|-mo|-uo] out_filename [options ...]
IndelExtractor.pl [-a|-s] in_filename [-ao|-so|-mo|-uo] out_filename [options ...]
See OPTIONS AND ARGUMENTS for details of all options
This is a quick start guide for those of you who are too lazy to read the rest of the documentation. This probably also means that you wish to use IndelExtractor in GUI mode; if not, you'll have to read the documentation on OPTIONS AND ARGUMENTS.
There are also two packages of IndelExtractor available:
At a minimum you must have the Perl interpreter, BioPerl v1.4 and BioPerl-run v1.4 installed.
Highlight the regions of the alignment where you want to convert gaps to ambiguous/missing characters (left-click and drag). Now right-click to open the Right-click menu and select Convert to Missing; this will convert any gaps ('-') in the highlighted region to ambiguous/missing characters ('X' for proteins and 'N' for nucleotides).
After you have checked any partial sequences that may be present, and done this conversion, you must check the box labelled Partial Sequences Checked? before you are allowed to create an alignment mask.
Select the Mask Type you wish to generate and modify any of the Mask Generation Options shown in the Mask Options Pane. For the moment select Indels and left-click ``Go'' to generate the mask. Now change the Percentage Threshold to 80% and right-click ``Go'' to generate the mask and add it to the display (this makes it easy to compare two or more types of mask).
Select the Save masked regions of alignment button (disk containing a black circle), type a name for the file if the default one is not appropriate, and select the format of the file you wish to save from the Save as Type: drop-down box. For the moment select NEXUS and click Save.
On inspection of this file you will notice that it contains the whole of the alignment, but also includes a few additional features: the actual mask above and below the data matrix, and a SETS block that contains the character definition of the mask, which is named after the Mask Type. A PAUP block is also present; it results in only the Masked regions of the alignment being analysed when run through PAUP (and vice versa for the Unmasked regions). Now try saving the whole alignment (disk icon with pencil) and the Unmasked regions (disk icon) in NEXUS format and compare their contents.
This behaviour of saving the entire alignment irrespective of whether you chose to save the Masked or Unmasked regions can be removed by changing the CHARSET for masked regions option in the Nexus Output tab of the preferences to ``No''.
IndelExtractor is an application whose primary role is to identify positions of interest in a sequence alignment based on a consensus sequence.
IndelExtractor's main feature set includes the following:
IndelExtractor was originally designed to identify and save indels and the surrounding ambiguous alignment for use in a systematic study of indels. However, the application has now expanded to allow the generation of various alignment masks including: consensus, variable/invariable sites as well as the original indel sites mask. Options have also been added to allow the user to BLAST sequences remotely at NCBI and generate multiple sequence alignments from unaligned sequences (via Clustalw).
IndelExtractor is written in Perl and requires BioPerl and UnivAln modules. The program is designed to be cross-platform and has been tested on both Microsoft Windows and GNU/Linux-based operating systems. However, see BUGS AND CHANGES for exceptions.
An alignment 'mask' is a string of characters positioned above or below a sequence alignment so that each character in the mask corresponds to a position in the alignment. With this in mind, a consensus sequence is a type of alignment mask that identifies positions that contain a residue that falls above a pre-defined percentage threshold. This consensus mask may contain many different characters (i.e. the character of the residue that fell above the threshold) or simply a single character which indicates that a particular position contained a residue that fell above the threshold. The former contains information regarding which residue fell above the threshold, but the latter only indicates whether the position contained such a residue, i.e. it records whether a position is 'masked' (represented by a character in the alignment mask sequence) or 'unmasked' (represented by the absence of a character in the alignment mask sequence).
Before an alignment mask is generated, an alignment is analysed to detect partial sequences. This is done by counting the number of gap (-), ambiguous (X for amino acids, N for nucleotides) and missing (?) characters each sequence contains. Any sequence whose count of these characters is more than 2 standard deviations from the mean is treated as a partial sequence.
When running in GUI mode, these partial sequences are highlighted for you to take a closer look at and, if necessary, convert gaps to ambiguity/missing characters. This is an important step, since the appearance of any number of gaps at a position in the alignment has an effect on the alignment mask that is subsequently calculated. As such, you must select the option in the Mask Options Pane that states you have checked the partial sequences before you are allowed to generate an alignment mask.
When running in CL mode, the -i option results in partial sequences being ignored during the calculation of the alignment mask.
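The partial-sequence check described above can be sketched roughly as follows. This is a minimal Python illustration, not IndelExtractor's own Perl code: the function names are hypothetical, and it assumes that only sequences with an unusually high count are flagged and that the population standard deviation is used, neither of which is stated in this documentation.

```python
from statistics import mean, pstdev


def count_uninformative(seq, ambiguous="N"):
    """Count gap (-), missing (?) and ambiguous (N or X) characters in one sequence."""
    chars = {"-", "?", ambiguous.upper()}
    return sum(1 for c in seq.upper() if c in chars)


def find_partial_sequences(alignment, ambiguous="N", n_sd=2.0):
    """Return the names of sequences whose count exceeds mean + n_sd * SD.

    `alignment` maps sequence names to aligned sequence strings.
    Assumption: only the upper side of the distribution is flagged.
    """
    counts = {name: count_uninformative(seq, ambiguous)
              for name, seq in alignment.items()}
    values = list(counts.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return [name for name, c in counts.items() if c > cutoff]


# Seven complete sequences and one that is missing its first six positions.
example = {f"s{i}": "ACGTACGTAC" for i in range(1, 8)}
example["s8"] = "------GTAC"
print(find_partial_sequences(example))  # ['s8']
```

Note that with very few sequences no count can lie more than 2 standard deviations from the mean, so a check of this kind only becomes informative for alignments of six or more sequences.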
The graphical user interface (GUI) has been designed to allow the user to maximise the view of a set of aligned or unaligned sequences as required, while having the most common options only a couple of clicks away and more advanced options available via 'Preferences'. I have tried to make the GUI as self-explanatory as possible, but pop-up help balloons are also used.
Basically, the GUI is divided into three horizontal panes, which can be hidden or made visible via the 'View' menu.
This is the uppermost pane, which by default is hidden from view. It contains a display for a set of unaligned sequences and several buttons used for manipulating the sequences (e.g. BLAST, for BLASTing the unaligned sequences against the NCBI database, and align, for aligning the sequences using a locally installed copy of Clustalw). The user is able to carry out basic sequence modifications prior to BLASTing or aligning the sequences.
This is the middle pane, which by default is visible to the user. It contains a display for a set of aligned sequences (either loaded directly from file or an alignment produced by the user from a set of unaligned sequences) and several buttons to save various parts of the alignment. Alignment modifications can be carried out in a similar manner to sequences in the Unaligned Sequences Pane.
This is the lowermost pane, which by default is visible to the user. It contains several options that the user can select/modify regarding the generation of an alignment mask.
If Partial Sequences have been detected, you must first select the option to state you have checked the partial sequences, before you are allowed to generate an alignment mask.
There are many preferences and modifications the user can make in addition to those that are available immediately via the Aligned Sequences Pane. The user can specify the location of various executable files, make specific alterations to how and what sequences are BLASTed against at NCBI, make alterations regarding the appearance of the GUI etc.
This page is used to specify the location of local files and executables, so that IndelExtractor may use them for carrying out certain tasks. The 'Home Directory' is used for saving your preferences to a file (indelextractor.prefs). If you wish to be able to generate multiple alignments from a sequence file, you must first have clustalw installed locally and then point IndelExtractor to the directory containing the executable, otherwise you will not be able to produce an alignment.
This page contains several of IndelExtractor's default startup settings. There are default mask settings and output file format settings.
This page contains settings that change the way NEXUS files are saved. Currently, there is only an option to save a NEXUS 'charset' to the file that defines the current alignment mask (if one is defined). An include/exclude command for that charset (dependent upon whether you are saving 'masked' or 'unmasked' positions respectively) may also be saved to the file.
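To make the idea of a mask 'charset' concrete, here is a small Python sketch that turns a mask string into a NEXUS SETS block. The function names are hypothetical and the exact layout IndelExtractor writes may differ; the sketch assumes that a space in the mask means 'unmasked' and that positions are numbered from 1, as is usual in NEXUS files.

```python
def mask_to_ranges(mask, unmasked=" "):
    """Collapse a mask string into 1-based (start, end) ranges of masked positions."""
    ranges = []
    start = None
    for pos, char in enumerate(mask, start=1):
        if char != unmasked:
            if start is None:
                start = pos          # a new masked run begins here
        elif start is not None:
            ranges.append((start, pos - 1))
            start = None
    if start is not None:            # mask ends inside a masked run
        ranges.append((start, len(mask)))
    return ranges


def nexus_sets_block(name, mask):
    """Format the masked positions as a CHARSET inside a NEXUS SETS block."""
    parts = [str(s) if s == e else f"{s}-{e}" for s, e in mask_to_ranges(mask)]
    return "BEGIN SETS;\n    CHARSET {} = {};\nEND;".format(name, " ".join(parts))


print(nexus_sets_block("Indels", "  ** *  ***"))
```

For the mask in the example, positions 3-4, 6 and 9-11 are masked, so the CHARSET line reads `CHARSET Indels = 3-4 6 9-11;`.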
This page is filled with options to modify the way in which your local copy of clustalw will run alignments from within IndelExtractor. All these settings are initially set to 'Default' so that clustalw will define the settings for the relevant parameters (so defaults will be the defaults of your clustalw program). These can be overridden by entering your own settings.
NOTE: Different default settings are used for DNA/protein sequences so be careful that the settings you use are compatible with the type of sequences you are aligning! (see USEFUL LINKS for more details).
This page is filled with options to modify the way in which BLAST searches are run from within IndelExtractor. Currently, only remote BLAST (submitting BLAST searches to NCBI) is supported. Choose the correct BLAST Algorithm based on the type of sequences you are using. You can also select a 'Limit by Entrez Query' from the drop-down box or type your own (see USEFUL LINKS for more details).
Settings for a proxy server (untested).
This page has options for modifying the way the GUI looks and feels. You can change the font/font size of the main program text and the font/font size of the sequences being displayed. You can modify how many lines per click a mouse wheel will scroll in BLAST results. You can also turn off the help bubbles and change which panes and colours you want to view on startup.
NOTE: It is probably best not to change the font of the sequences being displayed, as they require fixed width fonts so the columns are displayed in alignment with each other (by all means change the font size!).
IndelExtractor supports many sequence/alignment input/output file formats and uses file extensions to correctly identify the input file format or the required output format. Below is a list of supported file formats for sequence/alignment files and their associated extensions as recognised by IndelExtractor:
Supported input/output sequence file formats include the following:
NOTE: Not all formats are supported for output of sequences (not my fault)!
Several alignment files can be output from IndelExtractor, from the entire alignment to the masked/unmasked positions. All can be output in one of the supported alignment file formats:
Supported input/output alignment file formats include the following:
NOTE: Not all formats are supported for output of alignments (not my fault)!
One additional output format is IndelExtractor's own (.indel). It was initially developed for saving information pertaining to indel positions and remains for that reason.
This is a percentage value that is used when calculating some of the masks (e.g. Indel and Consensus).
This is the minimum number of residues that two indel regions must be separated by before being joined together to form one larger indel.
This option specifies whether you would like to use amino acid similarity groupings when calculating some of the masks (e.g. Indel and Consensus). Similarity groups are the 'strong' groups used in the clustalw alignment program and correspond to Dayhoff groupings. These groupings are:
If you elect to use similarity groupings, the mask calculations are modified in the following manner: If a particular residue does not break the threshold then the position is examined to see if any of the similarity groupings breaks the threshold.
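The fallback from single residues to similarity groupings can be sketched as below. This is a minimal Python illustration with a hypothetical function name. The group list in the code is the set of 'strong' groups as given in the clustalw documentation, included here as an assumption; check it against your own clustalw version. The sketch also assumes that the counts for all members of a group are summed and compared with the threshold over all sequences.

```python
from collections import Counter

# Assumption: clustalw 'strong' conservation groups, taken from the clustalw
# documentation rather than from IndelExtractor itself.
STRONG_GROUPS = ("STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW")


def column_passes(column, threshold, groups=STRONG_GROUPS):
    """Return the residue or group that reaches `threshold` percent, else None.

    `column` is a string holding one alignment position across all sequences.
    """
    counts = Counter(column)
    total = len(column)
    # First, look for a single residue that reaches the threshold.
    for residue, n in counts.most_common():
        if residue != "-" and 100.0 * n / total >= threshold:
            return residue
    # Otherwise, see whether the members of any similarity group do so together.
    for group in groups:
        n = sum(counts[r] for r in group)
        if 100.0 * n / total >= threshold:
            return group
    return None


print(column_passes("IIIII", 80))  # I
print(column_passes("IIVLM", 80))  # MILV
print(column_passes("IKDWG", 80))  # None
```

In the second example no single residue reaches 80%, but isoleucine, valine, leucine and methionine all belong to one group, so the position still passes.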
An indel mask identifies positions of the alignment that are thought to be part of an indel. Indels are often removed, together with any surrounding ambiguous alignment and 'jagged' alignment ends, prior to phylogenetic analysis. Deciding what constitutes 'surrounding ambiguous alignment' is often laborious and rather subjective; this mask therefore identifies the positions to remove in a quick and objective manner.
The identification of indels starts with the construction of a consensus sequence using a percentage threshold selected by the user, whereby a particular residue must be present in at least that percentage of the sequences to be included in the consensus sequence. Any position that contains a gap character automatically has a gap inserted into the consensus sequence. Indel boundaries are then identified by searching the consensus sequence for gap characters. When one is found, the consensus is scanned left and right to identify the nearest position that broke the threshold; this then constitutes an indel. Neighbouring indels are joined together if the user has selected to merge indels that are only separated by a particular number of residues or less. The user may also elect to use similarity groupings when calculating the indels mask.
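The indel-finding procedure described above can be sketched in Python as follows. The function names and the '.' placeholder for below-threshold positions are hypothetical, and two details are assumptions: the flanking positions that broke the threshold are not themselves counted as part of the indel, and regions are merged when the number of residues between them is less than or equal to `merge_distance`. Column numbers are 0-based.

```python
from collections import Counter


def consensus(seqs, threshold):
    """Consensus string: '-' where any gap occurs, '.' below the threshold."""
    cons = []
    for column in zip(*seqs):
        if "-" in column:
            cons.append("-")
            continue
        residue, count = Counter(column).most_common(1)[0]
        cons.append(residue if 100.0 * count / len(column) >= threshold else ".")
    return "".join(cons)


def indel_regions(cons, merge_distance=0):
    """Return (start, end) column ranges of indels found in a consensus string."""
    regions = []
    start, has_gap = None, False
    # A trailing sentinel residue closes any run that reaches the alignment end.
    for i, char in enumerate(cons + "A"):
        if char in "-.":
            if start is None:
                start, has_gap = i, False
            has_gap = has_gap or char == "-"
        else:
            # Reached a position that broke the threshold: close the run,
            # keeping it only if it contained at least one gap.
            if start is not None and has_gap:
                regions.append((start, i - 1))
            start = None
    merged = []
    for s, e in regions:
        if merged and s - merged[-1][1] - 1 <= merge_distance:
            merged[-1] = (merged[-1][0], e)   # join with the previous region
        else:
            merged.append((s, e))
    return merged


seqs = ["ACGT-ACGTA", "ACGT-ACGTA", "ACGTTACCTA", "ACGA-ACGTA"]
print(indel_regions(consensus(seqs, 75)))  # [(4, 4)]
print(indel_regions(consensus(seqs, 80)))  # [(3, 4)]
```

Raising the threshold from 75% to 80% widens the indel in this example, because the neighbouring column (where only three of four sequences agree) no longer breaks the threshold and is absorbed as surrounding ambiguous alignment.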
A Consensus mask identifies positions that break the percentage threshold set by the user. The user may also elect to use similarity groupings when calculating the consensus mask.
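As an illustration, a consensus mask of either kind (residue characters or a single marker character) might be computed like this in Python. The function name is hypothetical, and the sketch assumes that 'breaking' the threshold means reaching at least that percentage of all sequences, that gaps never count as the consensus residue, and that unmasked positions are shown as spaces.

```python
from collections import Counter


def consensus_mask(seqs, threshold, mark=None, unmasked=" "):
    """Build a consensus mask for a list of aligned sequence strings.

    With mark=None the mask shows the residue that reached the threshold;
    otherwise every masked position is shown as `mark`.
    """
    mask = []
    for column in zip(*seqs):
        counts = Counter(c for c in column if c != "-")
        residue, n = counts.most_common(1)[0] if counts else (None, 0)
        if n and 100.0 * n / len(column) >= threshold:
            mask.append(residue if mark is None else mark)
        else:
            mask.append(unmasked)
    return "".join(mask)


seqs = ["ACGT", "ACGA", "ACTA", "AGTA"]
print(repr(consensus_mask(seqs, 75)))            # 'AC A'
print(repr(consensus_mask(seqs, 75, mark="*")))  # '** *'
```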
A variable sites mask identifies positions that are variable.
An invariable sites mask identifies positions that are invariable.
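A rough Python sketch of the two masks is given below. The function names are hypothetical, and the treatment of gap and missing characters is an assumption: here they are ignored, so a position counts as variable only if it holds two or more different residues.

```python
def variable_sites_mask(seqs, mark="*", unmasked=" ", ignore="-?"):
    """Mark positions that contain more than one distinct residue."""
    return "".join(
        mark if len({c for c in column if c not in ignore}) > 1 else unmasked
        for column in zip(*seqs)
    )


def invariable_sites_mask(seqs, mark="*", unmasked=" ", ignore="-?"):
    """Mark positions that contain at most one distinct residue."""
    return "".join(
        mark if len({c for c in column if c not in ignore}) <= 1 else unmasked
        for column in zip(*seqs)
    )


seqs = ["ACGT", "ACGA", "AC-A"]
print(repr(variable_sites_mask(seqs)))    # '   *'
print(repr(invariable_sites_mask(seqs)))  # '*** '
```

The two masks are complements of each other: every position is marked in exactly one of them.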
The following options and arguments are for use in CL mode only.
Options available to IndelExtractor:
Use the value specified by integer in mask calculations. Possible values for integer are: 1-100
Minimum residue separation needed between two adjacent indel regions before being combined together. Possible values for integer are: 0-10
NOTE: Clustalw default values are used in the absence of these options. See your own clustalw documentation for their details and default values.
Options available to Clustalw:
Here is a list of useful links related to this program.
If ppm can't find some of the required packages, add the following command while in ppm
repository add "Winnipeg_server" http://theoryx5.uwinnipeg.ca/ppms
Another good URL for Windows installation is: http://bioperl.org/Core/Latest/Installing_Bioperl_on_Windows_perl5.8.htm
If you would like me to make any modifications to make IndelExtractor easier to use, or have suggestions for new features e-mail me at nathanhaigh@ukonline.co.uk
None to report yet
The latest release of IndelExtractor can be obtained from http://www.nathanhaigh.co.uk/university/lab/haigh/files/software/
Please check the latest version before reporting bugs. But please report bugs to me as soon as possible (nathanhaigh@ukonline.co.uk)
Copyright © 2003-2004 Nathan Haigh (nathanhaigh@ukonline.co.uk)
University of York, England.
Nathan Haigh (nathanhaigh@ukonline.co.uk)
Previous versions of IndelExtractor were based on code from Alignmenator by Konstantinos Thalassinos and were co-written with Bryony Mackenzie (basm101@york.ac.uk)