NAME

IndelExtractor v2.2.1 - Generates alignment masks for a sequence alignment


SYNOPSIS

How you run IndelExtractor depends on your operating system (OS), the type of program file you have, and whether you want to run it in graphical (GUI) or command line (CL) mode.

Graphical (GUI) Mode

IndelExtractor.exe

IndelExtractor.pl

Command Line (CL) Mode

IndelExtractor.exe [-a|-s] in_filename [-ao|-so|-mo|-uo] out_filename [options ...]

IndelExtractor.pl [-a|-s] in_filename [-ao|-so|-mo|-uo] out_filename [options ...]

See OPTIONS AND ARGUMENTS for details of all options.


QUICK START GUIDE

This is a quick start guide for those of you who are too lazy to read the rest of the documentation. This probably means that you also wish to use IndelExtractor in GUI mode; if not, you'll have to read the documentation on OPTIONS AND ARGUMENTS.

There are two packages of IndelExtractor available:

Perl Script File

As a minimum, you must have the Perl interpreter, BioPerl v1.4 and BioPerl-run v1.4 installed.

  1. Download the file IndelExtractor_pl_2.2.1.zip or IndelExtractor_pl_2.2.1.tgz from my website http://www.nathanhaigh.co.uk/university/lab/haigh/files/software/

  2. Unzip/unpack the file. The path to IndelExtractor should not contain any spaces, e.g. the c:\Program Files\IndelExtractor\ folder is not acceptable.

  3. Execute the Perl script in the usual manner for your operating system.

  4. Continue from step 4 in the following section (Windows Executable).

Windows Executable

  1. Download the file IndelExtractor_exe_2.2.1.zip from my website http://www.nathanhaigh.co.uk/university/lab/haigh/files/software/

  2. Unzip the file. The path to IndelExtractor should not contain any spaces, e.g. the c:\Program Files\IndelExtractor\ folder is not acceptable.

  3. Double-click the executable file IndelExtractor.exe. The first time you execute this file it will take approx. 30s to load (subsequent loads will take approx. half the time).

  4. Open IndelExtractor preferences, via the toolbar or the View menu.

    a. Under the General tab, the Home Directory is set to the location of the IndelExtractor preference file. This file stores many of the preferences you set.

    b. If you wish to use clustalw to generate alignments from inside IndelExtractor, you will have to install clustalw on your machine in a folder whose path doesn't contain any spaces (e.g. c:\clustalw\). Set the Clustalw Directory on the General tab to this directory (i.e. c:\clustalw\).

  5. From the File menu, select Open Alignment, then browse to and select the file to open (test_aa.aln and test_nt.fas are supplied for testing - only test_aa.aln contains true partial sequences). The alignment will be displayed in the Aligned Sequences Pane, and any sequences that are thought to be partial have their gapped regions ('-') highlighted in cyan to draw your attention to characters that may need to be converted to ambiguous/missing characters ('X' for proteins and 'N' for nucleotides) before an alignment mask is calculated. It is important to convert gaps to ambiguous/missing characters because gaps affect the way in which alignment masks are calculated (in particular the Indels mask).

    Highlight the regions of the alignment where you want to convert gaps to ambiguous/missing characters (left-click and drag). Now right click to get the Right-click menu and select Convert to Missing; this will convert any gaps ('-') in the highlighted region to ambiguous/missing characters ('X' for proteins and 'N' for nucleotides).

    After you have checked any partial sequences that may be present and done this conversion, you must check the box labelled Partial Sequences Checked? before you are allowed to create an alignment mask.

  6. Select the Mask Type you wish to generate and modify any of the Mask Generation Options shown in the Mask Options Pane. For the moment select Indels and left-click ``Go'' to generate the mask. Now change the Percentage Threshold to 80% and right-click ``Go'' to generate the mask and add it to the display (this makes it easy to compare two or more types of masks).

  7. Select the Save masked regions of alignment button (disk containing a black circle), type a name for the file if the default one is not appropriate, and select the format of the file you wish to save from the Save as Type: drop-down box. For the moment select NEXUS and click save. On inspection of this file you will notice that it contains the whole of the alignment, but also includes a few additional features. These include the actual mask above and below the data matrix, and a SETS block that contains the character definition of the mask, which is named after the Mask Type. A PAUP block also exists that will result in only the Masked regions of the alignment being analysed when run through PAUP (and vice versa for the Unmasked regions).

    Now try saving the whole alignment (disk icon with pencil) and Unmasked regions (disk icon) in NEXUS format and compare their contents.

    This behaviour of saving the entire alignment irrespective of whether you chose to save the Masked or Unmasked regions can be removed by changing the CHARSET for masked regions option in the Nexus Output tab of the preferences to ``No''.


DESCRIPTION

IndelExtractor is an application whose primary role is to identify positions of interest in a sequence alignment based on a consensus sequence.

IndelExtractor's main feature set includes the following:

Background

IndelExtractor was originally designed to identify and save indels and the surrounding ambiguous alignment for use in a systematic study of indels. However, the application has now expanded to allow the generation of various alignment masks, including consensus and variable/invariable sites as well as the original indel sites mask. Options have also been added to allow the user to BLAST sequences remotely at NCBI and to generate multiple sequence alignments from unaligned sequences (via Clustalw).

IndelExtractor is written in Perl and requires BioPerl and UnivAln modules. The program is designed to be cross-platform and has been tested on both Microsoft Windows and GNU/Linux-based operating systems. However, see BUGS AND CHANGES for exceptions.

Alignment Masks

An alignment 'mask' is a string of characters positioned above or below a sequence alignment so that each character in the mask corresponds to a position in the alignment. With this in mind, a consensus sequence is a type of alignment mask that identifies positions containing a residue whose frequency falls above a pre-defined percentage threshold. A consensus mask may contain many different characters (i.e. the character of the residue that fell above the threshold) or a single character that simply indicates that a position contained a residue above the threshold. The former records which residue fell above the threshold; the latter only records whether a position is 'masked' (represented by a character in the alignment mask sequence) or 'unmasked' (represented by the absence of a character in the alignment mask sequence).
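
To make the two forms of consensus mask concrete, here is a minimal Python sketch. It is not taken from IndelExtractor's source: the function name consensus_mask, the use of a space for unmasked positions and the choice to leave gap-dominated columns unmasked are all assumptions made for illustration.

```python
from collections import Counter

def consensus_mask(alignment, threshold=60, single_char=None):
    """Return a mask string with one character per alignment column.

    A column is treated as masked when its most common residue is present
    in at least `threshold` percent of the sequences. Masked columns show
    that residue, or `single_char` if one is given. Unmasked columns show
    a space (an assumption of this sketch).
    """
    n_seqs = len(alignment)
    mask = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        # Assumption: a column dominated by gaps is left unmasked.
        if residue != "-" and 100.0 * count / n_seqs >= threshold:
            mask.append(single_char or residue)
        else:
            mask.append(" ")
    return "".join(mask)

alignment = ["ACGT-A", "ACGTTA", "ACCT-A", "TCGTGA"]
for seq in alignment:
    print(seq)
print(consensus_mask(alignment))                   # residue form
print(consensus_mask(alignment, single_char="*"))  # single-character form
```

The residue form keeps the identity of the residue at each masked position, while the single-character form only records masked versus unmasked.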

Partial Sequences

Before an alignment mask is generated, the alignment is analysed to detect partial sequences. This is done by counting the number of gap (-), ambiguous (X for amino acids, N for nucleotides) and missing (?) characters each sequence contains. Any sequence whose count of these characters is more than 2 standard deviations from the mean count is treated as a partial sequence.

When running in GUI mode, these partial sequences are highlighted for you to take a closer look at and, if necessary, convert gaps to ambiguity/missing characters. This is an important step, since the appearance of any number of gaps at a position in the alignment has an effect on the alignment mask that is subsequently calculated. As such, you must select the option in the Mask Options Pane that states you have checked the partial sequences before you are allowed to generate an alignment mask.
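
The counting rule described above could be sketched in Python as follows. This is an illustration only, not IndelExtractor's code: the function name find_partial_sequences is hypothetical, the population standard deviation is an arbitrary choice, and the sketch assumes that only counts above the mean mark a sequence as partial.

```python
from statistics import mean, pstdev

def find_partial_sequences(alignment, ambiguous="X", n_sd=2.0):
    """Return names of sequences with unusually many gap/ambiguous/missing chars.

    `alignment` maps sequence names to aligned sequence strings.
    `ambiguous` is 'X' for amino acids or 'N' for nucleotides.
    """
    counted = {"-", "?", ambiguous}
    counts = {name: sum(ch in counted for ch in seq)
              for name, seq in alignment.items()}
    mu = mean(counts.values())
    sd = pstdev(counts.values())  # assumption: population standard deviation
    # Assumption: only sequences *above* the mean are reported as partial.
    return [name for name, c in counts.items()
            if sd > 0 and c - mu > n_sd * sd]

# Seven near-complete protein sequences and one that is mostly gaps.
alignment = {f"s{i}": "ACDEFGHIK-LM" for i in range(1, 8)}
alignment["s8"] = "----------LM"
print(find_partial_sequences(alignment))
```

Note that with very few sequences no count can lie more than 2 standard deviations from the mean, so a rule of this kind only becomes useful for larger alignments.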

When running in CL mode, the -i option results in partial sequences being ignored during the calculation of the alignment mask.

THE GRAPHICAL USER INTERFACE

Overview

The graphical user interface (GUI) has been designed to allow the user to maximise the view of a set of aligned or unaligned sequences as required, while having the most common options only a couple of clicks away and more advanced options available via 'Preferences'. I have tried to make the GUI as self-explanatory as possible, but pop-up help balloons are also used.

Basically, the GUI is divided into three horizontal panes, which can be hidden or made visible via the 'View' menu.

Unaligned Sequences Pane

This is the uppermost pane, which by default is hidden from view. It contains a display for a set of unaligned sequences and several buttons used for manipulating the sequences (e.g. BLAST, for BLASTing the unaligned sequences against the NCBI database, and align, for aligning the sequences using a locally installed copy of Clustalw). The user is able to carry out basic sequence modifications prior to BLASTing or aligning the sequences.

Aligned Sequences Pane

This is the middle pane, which by default is visible to the user. It contains a display for a set of aligned sequences (either loaded directly from file or an alignment produced by the user from a set of unaligned sequences) and several buttons to save various parts of the alignment. Alignment modifications can be carried out in a similar manner to sequences in the Unaligned Sequences Pane.

Mask Options Pane

This is the lowermost pane, which by default is visible to the user. It contains several options that the user can select/modify regarding the generation of an alignment mask.

If Partial Sequences have been detected, you must first select the option to state you have checked the partial sequences, before you are allowed to generate an alignment mask.

Preferences

There are many preferences and modifications the user can make in addition to those that are available immediately via the Aligned Sequences Pane. The user can specify the location of various executable files, make specific alterations to how and what sequences are BLASTed against at NCBI, make alterations regarding the appearance of the GUI etc.

General

This page is used to specify the location of local files and executables, so that IndelExtractor may use them for carrying out certain tasks. The 'Home Directory' is used for saving your preferences to a file (indelextractor.prefs). If you wish to be able to generate multiple alignments from a sequence file, you must first have clustalw installed locally and then point IndelExtractor to the directory containing the executable, otherwise you will not be able to produce an alignment.

Defaults

This page contains several of IndelExtractor's default startup settings. There are default mask settings and output file format settings.

Nexus Output

This page contains settings that change the way NEXUS files are saved. Currently, there is only an option to save a NEXUS 'charset' to the file that defines the current alignment mask (if one is defined). Also, an include/exclude command for that charset (dependent upon whether you are saving 'masked' or 'unmasked' positions respectively) may also be saved to the file.
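
As a rough illustration of what a charset plus include/exclude command might look like, the Python sketch below turns a mask string into NEXUS character ranges. The exact text IndelExtractor writes is not documented here, so the block layout, the helper names mask_to_ranges and nexus_mask_blocks, and the use of "exclude all" before "include" are assumptions.

```python
def mask_to_ranges(mask, unmasked=" "):
    """Return 1-based inclusive (start, end) runs of masked columns."""
    ranges = []
    start = None
    for pos, ch in enumerate(mask, start=1):
        if ch != unmasked and start is None:
            start = pos                      # a masked run begins
        elif ch == unmasked and start is not None:
            ranges.append((start, pos - 1))  # the run ended one column back
            start = None
    if start is not None:
        ranges.append((start, len(mask)))    # run extends to the last column
    return ranges

def nexus_mask_blocks(mask, name, save="masked"):
    """Build illustrative SETS and PAUP blocks for a mask named `name`."""
    ranges = " ".join(str(s) if s == e else f"{s}-{e}"
                      for s, e in mask_to_ranges(mask))
    lines = ["begin sets;", f"  charset {name} = {ranges};", "end;",
             "begin paup;"]
    if save == "masked":
        # Analyse only the masked positions.
        lines += ["  exclude all;", f"  include {name};"]
    else:
        # Analyse only the unmasked positions.
        lines.append(f"  exclude {name};")
    lines.append("end;")
    return "\n".join(lines)

print(nexus_mask_blocks("  ***   * ", "indels"))
```

Naming the charset after the mask type mirrors the behaviour described in the QUICK START GUIDE; everything else in the output should be read as a sketch.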

ClustalW

This page is filled with options to modify the way in which your local copy of clustalw will run alignments from within IndelExtractor. All these settings are initially set to 'Default' so that clustalw will define the settings for the relevant parameters (so defaults will be the defaults of your clustalw program). These can be over-ridden by entering your own settings.

NOTE: Different default settings are used for DNA/protein sequences, so be careful that the settings you use are compatible with the type of sequences you are aligning! See USEFUL LINKS for more details.

BLAST

This page is filled with options to modify the way in which BLAST searches are run from within IndelExtractor. Currently, only remote BLAST (submitting BLAST searches to NCBI) is supported. Choose the correct BLAST Algorithm based on the type of sequences you are using. You can also select a 'Limit by Entrez Query' from the drop-down box or type your own (see USEFUL LINKS for more details).

Connection

Settings for a proxy server (untested).

GUI

This page has options for modifying the way the GUI looks and feels. You can change the font/font size of the main program text and the font/font size of the sequences being displayed. You can modify how many lines per click a mouse wheel will scroll in BLAST results. You can also turn off the help bubbles, and change which panes and colours you want to view on startup.

NOTE: It is probably best not to change the font of the sequences being displayed, as they require fixed width fonts so the columns are displayed in alignment with each other (by all means change the font size!).

INPUT/OUTPUT FILE FORMATS

IndelExtractor supports many sequence/alignment input/output file formats and uses file extensions to correctly identify the input file format or the required output format. Below is a list of supported file formats for sequence/alignment files and their associated extensions as recognised by IndelExtractor:

Sequence Formats

Supported input/output sequence file formats include the following:

NOTE: Not all formats are supported for output of sequences (not my fault!).

fasta format
File extensions: fasta|fast|fas|seq|fa|fsa|nt|aa

genbank format
File extensions: gb|gbank|genbank|gbk|gbs

scf format
File extensions: scf

abi format
File extensions: abi|ab1

alf format
File extensions: alf

ctf format
File extensions: ctf

ztr format
File extensions: ztr

pln format
File extensions: pln

exp format
File extensions: exp

pir format
File extensions: pir

embl format
File extensions: embl|ebl|emb|dat

raw format
File extensions: txt

gcg format
File extensions: gcg

ace format
File extensions: ace

bsml format
File extensions: bsm|bsml

swiss format
File extensions: swiss|sp

phd format
File extensions: phd|phred

fastq format
File extensions: fastq

Alignment Formats

Several alignment files can be output from IndelExtractor, from the entire alignment to the masked/unmasked positions. All can be output in one of the supported alignment file formats.

Supported input/output alignment file formats include the following:

NOTE: Not all formats are supported for output of alignments (not my fault!).

fasta format
File extensions: fasta|fast|fas|seq|fa|fsa|nt|aa

maf format
File extensions: maf

msf format
File extensions: msf|pileup|gcg

pfam format
File extensions: pfam|pfm

selex format
File extensions: selex|slx|selx|slex|sx

phylip format
File extensions: phylip|phlp|phyl|phy|ph

NEXUS format
File extensions: nexus|nex

mega format
File extensions: meg|mega

clustalw format
File extensions: aln

meme format
File extensions: meme

emboss format
File extensions: water|needle

psi format
File extensions: psi

One additional output format is IndelExtractor's own (.indel). It was initially developed for saving information pertaining to indel positions and remains for that reason.
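
The extension-based lookup for alignment files can be pictured with a short Python sketch built from the extension lists above. The table name, the function name guess_alignment_format and the case-insensitive matching are assumptions for illustration; the .indel format is left out because it is output only.

```python
import os

# Alignment formats and their extensions, as listed in this section.
ALIGNMENT_FORMATS = {
    "fasta": "fasta fast fas seq fa fsa nt aa",
    "maf": "maf",
    "msf": "msf pileup gcg",
    "pfam": "pfam pfm",
    "selex": "selex slx selx slex sx",
    "phylip": "phylip phlp phyl phy ph",
    "nexus": "nexus nex",
    "mega": "meg mega",
    "clustalw": "aln",
    "meme": "meme",
    "emboss": "water needle",
    "psi": "psi",
}

# Invert the table so each extension points at its format name.
EXT_TO_FORMAT = {ext: fmt
                 for fmt, exts in ALIGNMENT_FORMATS.items()
                 for ext in exts.split()}

def guess_alignment_format(filename):
    """Return the format name for `filename`, or None if unrecognised."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return EXT_TO_FORMAT.get(ext)

for name in ("test_aa.aln", "test_nt.fas", "masked.nex", "notes.doc"):
    print(name, "->", guess_alignment_format(name))
```

Because the format is chosen from the extension alone, an output file must be given one of the listed extensions for the intended format to be used.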

Mask Generation Options

Percentage Threshold

This is a percentage value used when calculating some of the masks (e.g. Indel and Consensus).

Minimum Indel Separation

This is the separation, in residues, at or below which two neighbouring indel regions are joined together to form one larger indel.

Use Similarity Groups

This option specifies whether you would like to use amino acid similarity groupings when calculating some of the masks (e.g. Indel and Consensus). Similarity groups are the 'strong' groups used in the clustalw alignment program and correspond to Dayhoff groupings. These groupings are:

  1. STA
  2. NEQK
  3. NHQK
  4. NDEQ
  5. QHRK
  6. MILV
  7. MILF
  8. HY
  9. FYW

If you elect to use similarity groupings, the mask calculations are modified in the following manner: If a particular residue does not break the threshold then the position is examined to see if any of the similarity groupings breaks the threshold.
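
A minimal Python sketch of this two-stage test is shown below. The group list is taken from the text above; the function name column_breaks_threshold, the exclusion of gap characters from the first test and the "at least" comparison are assumptions for illustration rather than IndelExtractor's actual implementation.

```python
from collections import Counter

# The 'strong' similarity groups listed above.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]

def column_breaks_threshold(column, threshold=60, use_similarity=True):
    """Return True if one residue, or one similarity group, reaches the threshold."""
    n = len(column)
    counts = Counter(column)
    # Stage 1: does any single residue reach the threshold?
    if any(100.0 * c / n >= threshold
           for residue, c in counts.items() if residue != "-"):
        return True
    if not use_similarity:
        return False
    # Stage 2: does the combined count of any similarity group reach it?
    for group in STRONG_GROUPS:
        if 100.0 * sum(counts[r] for r in group) / n >= threshold:
            return True
    return False

print(column_breaks_threshold("ILVMK"))                        # MILV covers 4 of 5
print(column_breaks_threshold("ILVMK", use_similarity=False))  # no single residue does
print(column_breaks_threshold("ADKWG"))                        # no group is large enough
```

In the first call no single residue exceeds 20%, but I, L, V and M all belong to the MILV group, which accounts for 80% of the column.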

Mask Type

Indels

An indel mask identifies positions of the alignment that are thought to be part of an Indel. Indels are often removed together with any surrounding ambiguous alignment and 'jagged' alignment ends prior to phylogenetic analysis. This is often laborious, and what constitutes 'surrounding ambiguous alignment' is rather subjective; this mask therefore identifies the positions to remove in a quick and objective manner.

The identification of Indels starts with the construction of a consensus sequence using a percentage threshold selected by the user, whereby a particular residue must be present in at least that percentage of the sequences to be included in the consensus sequence. Any position that contains a gap character automatically has a gap inserted into the consensus sequence. Indel boundaries are then identified by searching the consensus sequence for gap characters. When one is found, the consensus is scanned left and right to identify the nearest position that broke the threshold; this then constitutes an indel. Neighbouring indels are joined together if the user has selected to merge indels that are separated by a particular number of residues or fewer. The user may also elect to use similarity groupings when calculating the indels mask.
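
The procedure can be sketched in Python as follows. This is an illustrative reading of the description, not IndelExtractor's code: the function names, the '?' placeholder for positions that did not break the threshold, and the choice to end each indel just inside the nearest threshold-breaking positions are all assumptions. Similarity groupings are left out to keep the sketch short.

```python
from collections import Counter

def consensus(alignment, threshold=60):
    """Build a consensus: '-' for gapped columns, '?' below the threshold."""
    n = len(alignment)
    cons = []
    for col in zip(*alignment):
        if "-" in col:
            cons.append("-")  # any gap in the column puts a gap in the consensus
            continue
        residue, count = Counter(col).most_common(1)[0]
        cons.append(residue if 100.0 * count / n >= threshold else "?")
    return "".join(cons)

def indel_regions(cons, min_sep=3):
    """Return 0-based inclusive (start, end) indel regions, merged by separation."""
    regions = []
    i = 0
    while i < len(cons):
        if cons[i] != "-":
            i += 1
            continue
        # Scan left and right until the nearest threshold-breaking position.
        left = i
        while left > 0 and cons[left - 1] in "-?":
            left -= 1
        right = i
        while right < len(cons) - 1 and cons[right + 1] in "-?":
            right += 1
        regions.append((left, right))
        i = right + 1
    merged = []
    for start, end in regions:
        # Join regions separated by min_sep residues or fewer.
        if merged and start - merged[-1][1] - 1 <= min_sep:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

alignment = ["ACG--ATG-C", "ACT--CTGAC", "ACCTGGTGAC", "ACATGTTGAC"]
cons = consensus(alignment)
print(cons)
print(indel_regions(cons, min_sep=3))  # the two indels are 2 residues apart: merged
print(indel_regions(cons, min_sep=1))  # kept as separate indels
```

In this example the ambiguous columns either side of the first gap are swept into the indel, which is the 'surrounding ambiguous alignment' the mask is meant to capture.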

Consensus

A Consensus mask identifies positions that break the percentage threshold set by the user. The user may also elect to use similarity groupings when calculating the consensus mask.

Variable Sites

A variable sites mask identifies positions that are variable.

Invariable Sites

An invariable sites mask identifies positions that are invariable.
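
The variable and invariable sites masks are complements of each other, as the small Python sketch below shows. How IndelExtractor treats gap and missing characters when deciding whether a site varies is not stated here, so ignoring '-' and '?' is an assumption, as are the function name site_masks and the '*' mask character.

```python
def site_masks(alignment, ignore="-?"):
    """Return (variable_mask, invariable_mask) for a list of aligned sequences."""
    variable, invariable = [], []
    for col in zip(*alignment):
        # Assumption: gap and missing characters do not count as states.
        states = set(col) - set(ignore)
        is_variable = len(states) > 1
        variable.append("*" if is_variable else " ")
        invariable.append(" " if is_variable else "*")
    return "".join(variable), "".join(invariable)

alignment = ["ACGT", "ACGA", "TCGT"]
variable, invariable = site_masks(alignment)
for seq in alignment:
    print(seq)
print(variable)    # first and last columns vary
print(invariable)  # the two middle columns do not
```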


OPTIONS AND ARGUMENTS

The following options and arguments are for use in CL mode only.

IndelExtractor

Options available to IndelExtractor:

-h|help|?
Prints synopsis and option details for IndelExtractor

-man
Prints the entire user manual for IndelExtractor

-v|version
IndelExtractor version info

-q|quiet
Quietens output from IndelExtractor

-a|alignment filename
filename as input alignment file

-s|sequences filename
filename as input sequence file

-ao|aln_out filename
Output alignment to filename

-so|sequence_out filename
Output sequences to filename

-mo|masked_out filename
Output masked regions to filename

-uo|unmasked_out filename
Output unmasked regions to filename

-mt|mask_type mask type
Generate an alignment mask of the type mask type. Valid options: indels|consensus|variable|invariable

-t|threshold integer
Default: 60

Use the value specified by integer in mask calculations. Possible values for integer are: 1-100

-is|indel_sep integer
Default: 3

Adjacent indel regions separated by this number of residues or fewer are combined into one. Possible values for integer are: 0-10

-i|ignore_partials
Ignore partial sequences while calculating masks

-nosim
Do not use amino acid similarity groupings when calculating masks

-nocs|nocharset
Do not print a NEXUS charset block for the masked/unmasked regions of the alignment when output is in NEXUS format
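
For scripted use, a command line can be assembled from the options above. The Python sketch below only builds and prints the command; it does not run it. The helper name build_command, the output file name masked.nex and the choice to call the script through "perl" are assumptions, while the flags and their value ranges are those documented in this section.

```python
import shlex

VALID_MASK_TYPES = ("indels", "consensus", "variable", "invariable")

def build_command(in_aln, masked_out, mask_type="indels", threshold=60,
                  indel_sep=3, ignore_partials=False,
                  program="IndelExtractor.pl"):
    """Return an argument list for running IndelExtractor in CL mode."""
    if mask_type not in VALID_MASK_TYPES:
        raise ValueError(f"unknown mask type: {mask_type}")
    if not 1 <= threshold <= 100:
        raise ValueError("threshold must be between 1 and 100")
    if not 0 <= indel_sep <= 10:
        raise ValueError("indel_sep must be between 0 and 10")
    cmd = ["perl", program,
           "-a", in_aln,          # input alignment file
           "-mo", masked_out,     # write masked regions here
           "-mt", mask_type,
           "-t", str(threshold),
           "-is", str(indel_sep)]
    if ignore_partials:
        cmd.append("-i")          # ignore partial sequences
    return cmd

cmd = build_command("test_aa.aln", "masked.nex", threshold=80,
                    ignore_partials=True)
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run once Perl, BioPerl and IndelExtractor are installed as described in the QUICK START GUIDE.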

Clustalw

NOTE: Clustalw default values are used in the absence of these options. See your own clustalw documentation for their details and default values.

Options available to Clustalw:

-quicktree
-pwmatrix BLOSUM|PAM|GONNET|ID
-pwdnamatrix IUB|CLUSTALW
-pwgapopen
-pwgapext
-ktuple
-topdiags
-window
-pairgap
-matrix BLOSUM|PAM|GONNET|ID
-dnamatrix IUB|CLUSTALW
-transweight
-outorder aligned|input
-gapopen
-gapext
-maxdiv


USEFUL LINKS

Here is a list of useful links related to this program.

http://www.perl.org
Perl website.

http://www.bioperl.org
BioPerl website.

http://www.bioperl.org/Core/Latest/INSTALL.WIN
BioPerl installation (windows).

If ppm can't find some of the required packages, add the following command while in ppm

repository add "Winnipeg_server" http://theoryx5.uwinnipeg.ca/ppms

Another good URL for windows installation is: http://bioperl.org/Core/Latest/Installing_Bioperl_on_Windows_perl5.8.htm

http://www.bioperl.org/Core/external.shtml
Bioperl external modules. These are required by some parts of BioPerl, however, they are not necessary for BioPerl to work.

http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/#univaln
Download site for UnivAln.

ftp://ftp.ebi.ac.uk/pub/software/
Clustalw Download site. Navigate through the link corresponding to your operating system.

http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/
Description of Clustalw options.

http://par.perl.org/index.cgi
PAR Homepage.

ftp://ftp.ncbi.nih.gov/blast/
local BLAST download files.


TO DO

If you would like me to make any modifications to make IndelExtractor easier to use, or have suggestions for new features e-mail me at nathanhaigh@ukonline.co.uk


BUGS AND CHANGES

New Bugs

None to report yet

Bugs Unrelated to IndelExtractor

RedHat Linux

Windows

Resolved Bugs and Changes

The latest release of IndelExtractor can be obtained from http://www.nathanhaigh.co.uk/university/lab/haigh/files/software/

Please check the latest version before reporting bugs, but do report bugs to me as soon as possible (nathanhaigh@ukonline.co.uk).


COPYRIGHT

Copyright © 2003-2004 Nathan Haigh (nathanhaigh@ukonline.co.uk)

University of York, England.


AUTHOR

Nathan Haigh (nathanhaigh@ukonline.co.uk)

Previous versions of IndelExtractor were based on code from Alignmenator by Konstantinos Thalassinos and were co-written with Bryony Mackenzie (basm101@york.ac.uk)