How you run IndelExtractor depends on which operating system (OS) you are using, what type of program file you have and whether you want to run it in graphical (GUI) or command line (CL) mode.

Graphical (GUI) Mode

IndelExtractor.exe

IndelExtractor.pl

Command Line (CL) Mode

IndelExtractor.exe [-a|-s] in_filename [-ao|-so|-mo|-uo] out_filename [options ...]

IndelExtractor.pl [-a|-s] in_filename [-ao|-so|-mo|-uo] out_filename [options ...]

See OPTIONS AND ARGUMENTS for details of all options.

QUICK START GUIDE

This is a quick start guide for those of you who are too lazy to read the rest of the documentation. This probably also means that you wish to use IndelExtractor in GUI mode; if not, you'll have to read the documentation on OPTIONS AND ARGUMENTS.

There are two packages of IndelExtractor available:

Perl script file

Windows executable

Perl Script File

As a minimum you must have the Perl interpreter, BioPerl v1.4 and BioPerl-run v1.4 installed.

IndelExtractor_pl_2.2.1.zip

IndelExtractor_pl_2.2.1.tgz

http://www.nathanhaigh.co.uk/university/lab/haigh/files/software/

IndelExtractor

c:\Program Files\IndelExtractor\

Windows Executable

IndelExtractor_exe_2.2.1.zip

http://www.nathanhaigh.co.uk/university/lab/haigh/files/software/

IndelExtractor

c:\Program Files\IndelExtractor\

IndelExtractor.exe

IndelExtractor

View

a: Under the General tab, the Home Directory is set to the location of the IndelExtractor preference file. This file stores most of the preferences you set.
b: If you wish to use clustalw to generate alignments from inside IndelExtractor, you will have to install clustalw on your machine in a folder that doesn't contain any spaces (e.g. c:\clustalw\). Set the Clustalw Directory on the General tab to this directory (i.e. c:\clustalw\).

File

Open Alignment

test_aa.aln

test_nt.fas

test_aa.aln

Aligned Sequences Pane

Gaps

Indels

Highlight the regions of the alignment where you want to convert gaps to ambiguous/missing characters (left-click and drag). Now right click to get the Right-click menu and select Convert to Missing; this will convert any gaps ('-') in the highlighted region to ambiguous/missing characters ('X' for proteins and 'N' for nucleotides).

After you have checked any partial sequences that may be present, and done this conversion, you must check the box labelled Partial Sequences Checked? before you are allowed to create an alignment mask.

Mask Type

Mask Generation Options

Mask Options Pane

Indels

Percentage Threshold

Save masked regions of alignment

Save as Type:

NEXUS

SETS

Mask Type

PAUP

Masked

Unmasked

Now try saving the whole alignment (disk icon with pencil) and Unmasked regions (disk icon) in NEXUS format and compare their contents.

This behaviour of saving the entire alignment irrespective of whether you chose to save the Masked or Unmasked regions can be removed by changing the CHARSET for masked regions option in the Nexus Output tab of the preferences to 'No'.

DESCRIPTION

IndelExtractor is an application whose primary role is to identify positions of interest in a sequence alignment based on a consensus sequence.

IndelExtractor's main feature set includes the following:

IndelExtractor

Background

IndelExtractor was originally designed to identify and save indels and the surrounding ambiguous alignment for use in a systematic study of indels. However, the application has now expanded to allow the generation of various alignment masks including: consensus, variable/invariable sites as well as the original indel sites mask. Options have also been added to allow the user to BLAST sequences remotely at NCBI and to generate multiple sequence alignments from unaligned sequences (via Clustalw).

IndelExtractor is written in Perl and requires BioPerl and UnivAln modules. The program is designed to be cross-platform and has been tested on both Microsoft Windows and GNU/Linux-based operating systems. However, see BUGS AND CHANGES for exceptions.

Alignment Masks

An alignment 'mask' is a string of characters positioned above or below a sequence alignment so that each character in the mask corresponds to a position in the alignment. With this in mind, a consensus sequence is a type of alignment mask that identifies positions containing a residue that falls above a pre-defined percentage threshold. This consensus mask may contain many different characters (i.e. the character of the residue that fell above the threshold) or simply a single character indicating that a particular position contained a residue that fell above the threshold. The former records which residue fell above the threshold. The latter only indicates whether the position contained such a residue, i.e. whether a position is 'masked' (represented by a character in the alignment mask sequence) or 'unmasked' (represented by the absence of a character in the alignment mask sequence).
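The two kinds of consensus mask described above can be illustrated with a short Python sketch. This is not IndelExtractor's own code (which is written in Perl); the function name consensus_mask, the use of a space for an unmasked position and the treatment of gaps are assumptions made for illustration only.

```python
from collections import Counter


def consensus_mask(alignment, threshold=60, mask_char=None):
    """Return a mask string with one character per alignment column.

    alignment : list of equal-length aligned sequences
    threshold : percentage of sequences a residue must reach
    mask_char : if given, mark masked columns with this single character;
                otherwise use the residue that reached the threshold
    Unmasked columns are shown as a space (an assumption of this sketch).
    """
    n_seqs = len(alignment)
    mask = []
    for column in zip(*alignment):
        # Count residues only; gap characters never become consensus residues.
        counts = Counter(ch for ch in column if ch != "-")
        if counts:
            residue, count = counts.most_common(1)[0]
            if 100.0 * count / n_seqs >= threshold:
                mask.append(mask_char or residue)
                continue
        mask.append(" ")
    return "".join(mask)


alignment = ["ACGT", "ACGA", "AC-C"]
residue_mask = consensus_mask(alignment)                 # residues kept
simple_mask = consensus_mask(alignment, mask_char="*")   # masked/unmasked only
```

The first call keeps the identity of each residue that reached the threshold, while the second only records whether each position is masked.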

Partial Sequences

Before an alignment mask is generated, an alignment is analysed to detect partial sequences. This is done by counting the number of gap (-), ambiguous (X for amino acids, N for nucleotides) and missing (?) characters each sequence contains. Any sequence whose count of these characters lies more than 2 standard deviations from the mean is treated as a partial sequence.
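A minimal Python sketch of this check is given below. It is an illustration rather than the program's actual implementation: the function name find_partial_sequences is hypothetical, the population standard deviation is assumed, and only sequences lying above the mean are flagged (the text does not say which standard deviation is used or whether both tails are considered).

```python
from statistics import mean, pstdev


def find_partial_sequences(alignment, is_protein=False, n_sd=2.0):
    """Return the names of sequences that look partial.

    alignment : dict mapping sequence name -> aligned sequence string
    """
    ambiguous = "X" if is_protein else "N"
    counted = {"-", "?", ambiguous}
    # Number of gap, ambiguous and missing characters in each sequence.
    counts = {
        name: sum(ch.upper() in counted for ch in seq)
        for name, seq in alignment.items()
    }
    mu = mean(counts.values())
    sd = pstdev(counts.values())  # assumption: population standard deviation
    # Assumption: flag only sequences with unusually MANY such characters.
    return [name for name, c in counts.items() if c - mu > n_sd * sd]


aln = {f"seq{i}": "ACGTACGTAC" for i in range(9)}
aln["short"] = "----ACGT--"
partials = find_partial_sequences(aln)
```

In this example only the sequence named short is returned, since its six gap characters lie well outside the range of the other sequences.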

When running in GUI mode, these partial sequences are highlighted for you to take a closer look at and, if necessary, convert gaps to ambiguity/missing characters. This is an important step, since the appearance of any number of gaps at a position in the alignment has an effect on the alignment mask that is subsequently calculated. As such, you must select the option in the Mask Options Pane that states you have checked the partial sequences before you are allowed to generate an alignment mask.

When running in CL mode, the -i option results in partial sequences being ignored during the calculation of the alignment mask.

THE GRAPHICAL USER INTERFACE

Overview

The graphical user interface (GUI) has been designed to allow the user to maximise the view of a set of aligned or unaligned sequences as required, while having the most common options only a couple of clicks away and more advanced options available via 'Preferences'. I have tried to make the GUI as self-explanatory as possible, but pop-up help balloons are also used.

Basically, the GUI is divided into three horizontal panes, which can be hidden or made visible via the 'View' menu.

Unaligned Sequences Pane

This is the uppermost pane, which by default is hidden from view. It contains a display for a set of unaligned sequences and several buttons used for manipulating the sequences (e.g. BLAST, for BLASTing the unaligned sequences against the NCBI database, and align, for aligning the sequences using a locally installed copy of Clustalw). The user is able to carry out basic sequence modifications prior to BLASTing or aligning the sequences.

Aligned Sequences Pane

This is the middle pane, which by default is visible to the user. It contains a display for a set of aligned sequences (either loaded directly from file or an alignment produced by the user from a set of unaligned sequences) and several buttons to save various parts of the alignment. Alignment modifications can be carried out in a similar manner to sequences in the Unaligned Sequences Pane.

Mask Options Pane

This is the lowermost pane, which by default is visible to the user. It contains several options that the user can select/modify regarding the generation of an alignment mask.

If Partial Sequences have been detected, you must first select the option to state you have checked the partial sequences, before you are allowed to generate an alignment mask.

Preferences

There are many preferences and modifications the user can make in addition to those that are available immediately via the Aligned Sequences Pane. The user can specify the location of various executable files, make specific alterations to how and what sequences are BLASTed against at NCBI, make alterations regarding the appearance of the GUI etc.

General

This page is used to specify the location of local files and executables, so that IndelExtractor may use them for carrying out certain tasks. The 'Home Directory' is used for saving your preferences to a file (indelextractor.prefs). If you wish to be able to generate multiple alignments from a sequence file, you must first have clustalw installed locally and then point IndelExtractor to the directory containing the executable, otherwise you will not be able to produce an alignment.

Defaults

This page contains several of IndelExtractor's default startup settings. There are default mask settings and output file format settings.

Nexus Output

This page contains settings that change the way NEXUS files are saved. Currently, there is only an option to save a NEXUS 'charset' to the file that defines the current alignment mask (if one is defined). Also, an include/exclude command for that charset (dependent upon whether you are saving 'masked' or 'unmasked' positions respectively) may also be saved to the file.
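As a rough illustration of what such a charset and include/exclude command might look like, the Python sketch below turns a mask string into a NEXUS SETS block followed by a PAUP block. The exact text IndelExtractor writes is not documented here, so the layout, the charset name mask and the function names are assumptions.

```python
def mask_to_ranges(mask):
    """Convert a mask string into 1-based (start, end) ranges of masked columns.

    Any non-space character counts as masked (an assumption of this sketch).
    """
    ranges = []
    start = None
    for pos, ch in enumerate(mask, start=1):
        if ch != " " and start is None:
            start = pos
        elif ch == " " and start is not None:
            ranges.append((start, pos - 1))
            start = None
    if start is not None:
        ranges.append((start, len(mask)))
    return ranges


def nexus_charset_blocks(mask, saving="masked", name="mask"):
    """Build a SETS block defining the charset and a PAUP include/exclude."""
    parts = " ".join(
        str(a) if a == b else f"{a}-{b}" for a, b in mask_to_ranges(mask)
    )
    # Include when saving masked positions, exclude when saving unmasked ones.
    command = "INCLUDE" if saving == "masked" else "EXCLUDE"
    return (
        "BEGIN SETS;\n"
        f"    CHARSET {name} = {parts};\n"
        "END;\n"
        "BEGIN PAUP;\n"
        f"    {command} {name};\n"
        "END;\n"
    )


print(nexus_charset_blocks("** *  ***", saving="unmasked"))
```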

ClustalW

This page is filled with options to modify the way in which your local copy of clustalw will run alignments from within IndelExtractor. All these settings are initially set to 'Default' so that clustalw will define the settings for the relevant parameters (so defaults will be the defaults of your clustalw program). These can be over-ridden by entering your own settings.

NOTE: Different default settings are used for DNA/protein sequences so be careful that the settings you use are compatible with the type of sequences you are aligning! (see USEFUL LINKS for more details).

BLAST

This page is filled with options to modify the way in which BLAST searches are run from within IndelExtractor. Currently, only remote BLAST (submitting BLAST searches to NCBI) is supported. Choose the correct BLAST Algorithm based on the type of sequences you are using. You can also select a 'Limit by Entrez Query' from the drop-down box or type your own (see USEFUL LINKS for more details).

Connection

Settings for a proxy server (untested).

GUI

This page has options for modifying the way the GUI looks and feels. You can change the font/font size of the main program text and the font/font size of the sequences being displayed. You can modify how many lines per click a mouse wheel will scroll in BLAST results. You can also turn off the help bubbles and change which panes and colours you want to view on startup.

NOTE: It is probably best not to change the font of the sequences being displayed, as they require fixed width fonts so the columns are displayed in alignment with each other (by all means change the font size!).

INPUT/OUTPUT FILE FORMATS

IndelExtractor supports many sequence/alignment input/output file formats and uses file extensions to correctly identify the input file format or the required output format. Below is a list of supported file formats for sequence/alignment files and their associated extensions as recognised by IndelExtractor:

Sequence Formats

Supported input/output sequence file formats include the following:

NOTE: Not all formats are supported for output of sequences (not my fault)!

fasta format: File extensions: fasta|fast|fas|seq|fa|fsa|nt|aa
genbank format: File extensions: gb|gbank|genbank|gbk|gbs
scf format: File extensions: scf
abi format: File extensions: abi|ab1
alf format: File extensions: alf
ctf format: File extensions: ctf
ztr format: File extensions: ztr
pln format: File extensions: pln
exp format: File extensions: exp
pir format: File extensions: pir
embl format: File extensions: embl|ebl|emb|dat
raw format: File extensions: txt
gcg format: File extensions: gcg
ace format: File extensions: ace
bsml format: File extensions: bsm|bsml
swiss format: File extensions: swiss|sp
phd format: File extensions: phd|phred
fastq format: File extensions: fastq

Alignment Formats

Several alignment files can be output from IndelExtractor, from the entire alignment to the masked/unmasked positions. All can be output in one of the supported alignment file formats.

Supported input/output alignment file formats include the following:

NOTE: Not all formats are supported for output of alignments (not my fault)!

fasta format: File extensions: fasta|fast|fas|seq|fa|fsa|nt|aa
maf format: File extensions: maf
msf format: File extensions: msf|pileup|gcg
pfam format: File extensions: pfam|pfm
selex format: File extensions: selex|slx|selx|slex|sx
phylip format: File extensions: phylip|phlp|phyl|phy|ph
NEXUS format: File extensions: nexus|nex
mega format: File extensions: meg|mega
clustalw format: File extensions: aln
meme format: File extensions: meme
emboss format: File extensions: water|needle
psi format: File extensions: psi

One additional output format is that of IndelExtractor (.indel). It was initially developed for saving information pertaining to indel positions and remains for that reason.
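The way a file extension selects an alignment format can be sketched in Python as a simple lookup built from the alignment extension list above. This is illustrative only: the function name guess_alignment_format is hypothetical, and ignoring the case of the extension is an assumption.

```python
import os

# Alignment formats and their extensions, copied from the list above.
ALIGNMENT_FORMATS = {
    "fasta": "fasta|fast|fas|seq|fa|fsa|nt|aa",
    "maf": "maf",
    "msf": "msf|pileup|gcg",
    "pfam": "pfam|pfm",
    "selex": "selex|slx|selx|slex|sx",
    "phylip": "phylip|phlp|phyl|phy|ph",
    "NEXUS": "nexus|nex",
    "mega": "meg|mega",
    "clustalw": "aln",
    "meme": "meme",
    "emboss": "water|needle",
    "psi": "psi",
}

# Invert the table: one entry per extension.
EXTENSION_TO_FORMAT = {
    ext: fmt
    for fmt, extensions in ALIGNMENT_FORMATS.items()
    for ext in extensions.split("|")
}


def guess_alignment_format(filename):
    """Return the alignment format for a filename, or None if unrecognised."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return EXTENSION_TO_FORMAT.get(ext)


fmt = guess_alignment_format("test_aa.aln")
```

Note that some extensions (e.g. the fasta ones, and gcg) also appear in the sequence format list, so a real implementation has to know whether it is reading sequences or an alignment.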

Mask Generation Options

Percentage Threshold

This is a percentage value used when calculating some of the masks (e.g. Indel and Consensus).

Minimum Indel Separation

This setting controls when neighbouring indel regions are merged: two indel regions that are separated by only this number of residues or fewer are joined together to form one larger indel.

Use Similarity Groups

This option specifies whether you would like to use amino acid similarity groupings when calculating some of the masks (e.g. Indel and Consensus). Similarity groups are the 'strong' groups used in the clustalw alignment program and correspond to Dayhoff groupings. These groupings are:

STA
NEQK
NHQK
NDEQ
QHRK
MILV
MILF
HY
FYW

If you elect to use similarity groupings, the mask calculations are modified in the following manner: If a particular residue does not break the threshold then the position is examined to see if any of the similarity groupings breaks the threshold.
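The fallback to similarity groupings can be sketched in Python as follows. The function name column_breaks_threshold is hypothetical, and summing the counts of all residues in a group is an assumption about how a grouping is judged to break the threshold.

```python
from collections import Counter

# The 'strong' similarity groups listed above.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]


def column_breaks_threshold(column, threshold=60, use_similarity=True):
    """Return the residue or group that breaks the threshold, else None.

    column : string holding one character per sequence at this position
    """
    n_seqs = len(column)
    counts = Counter(ch.upper() for ch in column if ch != "-")
    if not counts:
        return None
    # First test the single most frequent residue.
    residue, count = counts.most_common(1)[0]
    if 100.0 * count / n_seqs >= threshold:
        return residue
    # Only then fall back to the similarity groupings.
    if use_similarity:
        for group in STRONG_GROUPS:
            group_count = sum(counts[r] for r in group)
            if 100.0 * group_count / n_seqs >= threshold:
                return group
    return None


hit = column_breaks_threshold("MILVA")
```

Here no single residue reaches 60%, but the four residues M, I, L and V together do, so the MILV grouping is returned.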

Mask Type

Indels

An indel mask identifies positions of the alignment that are thought to be part of an Indel. Indels, together with any surrounding ambiguous alignment and 'jagged' alignment ends, are often removed prior to phylogenetic analysis. This is often laborious and rather subjective as to what constitutes 'surrounding ambiguous alignment'; therefore this mask identifies the positions to remove in a quick and objective manner.

The identification of Indels starts with the construction of a consensus sequence using a percentage threshold selected by the user, whereby a particular residue must be present in at least that percentage of the sequences to be included in the consensus sequence. Any position that contains a gap character automatically has a gap inserted into the consensus sequence. Indel boundaries are then identified by searching the consensus sequence for gap characters. When one is found, the consensus is scanned left and right to identify the nearest position that broke the threshold - this then constitutes an indel. Neighbouring indels are joined together if the user has selected to merge indels that are separated by only a particular number of residues or fewer. The user may also elect to use similarity groupings when calculating the indels mask.
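The procedure just described can be sketched in Python. This is a simplified illustration and not the program's own code: similarity groupings are left out, '.' is used to mark a position that did not break the threshold, and the flanking positions that did break the threshold are assumed to lie outside the indel. All function names are hypothetical.

```python
from collections import Counter


def consensus_with_gaps(alignment, threshold=60):
    """Build a consensus: residue, '-' (column has a gap) or '.' (below threshold)."""
    n_seqs = len(alignment)
    consensus = []
    for column in zip(*alignment):
        if "-" in column:
            consensus.append("-")
            continue
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue if 100.0 * count / n_seqs >= threshold else ".")
    return "".join(consensus)


def indel_regions(consensus, min_sep=3):
    """Return 0-based (start, end) indel regions found in a consensus string."""
    regions = []
    last = len(consensus) - 1
    for i, ch in enumerate(consensus):
        if ch != "-":
            continue
        # Scan outwards until the nearest position that broke the threshold.
        left = i
        while left > 0 and consensus[left - 1] in "-.":
            left -= 1
        right = i
        while right < last and consensus[right + 1] in "-.":
            right += 1
        # Join with the previous region if separated by min_sep residues or fewer.
        if regions and left - regions[-1][1] - 1 <= min_sep:
            regions[-1] = (regions[-1][0], max(right, regions[-1][1]))
        else:
            regions.append((left, right))
    return regions


consensus = consensus_with_gaps(["ACGT", "AC-T", "AGGA"], threshold=70)
regions = indel_regions(consensus)
```

Raising min_sep makes it more likely that two nearby indels are reported as a single, larger region.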

Consensus

A Consensus mask identifies positions that break the percentage threshold set by the user. The user may also elect to use similarity groupings when calculating the consensus mask.

Variable Sites

A variable sites mask identifies positions that are variable.

Invariable Sites

An invariable sites mask identifies positions that are invariable.
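A simple Python sketch of these two masks follows. The documentation does not state how gap and missing characters are treated, so ignoring them when deciding whether a position is variable is an assumption, as are the function name site_masks and the use of '*' for a masked position.

```python
def site_masks(alignment, ignore="-?"):
    """Return (variable_mask, invariable_mask) for a list of aligned sequences.

    A position is treated as variable when it holds more than one distinct
    residue, not counting the characters in `ignore` (an assumption).
    """
    variable, invariable = [], []
    for column in zip(*alignment):
        residues = {ch.upper() for ch in column if ch not in ignore}
        is_variable = len(residues) > 1
        variable.append("*" if is_variable else " ")
        invariable.append(" " if is_variable else "*")
    return "".join(variable), "".join(invariable)


variable_mask, invariable_mask = site_masks(["ACGT", "ACGA", "AC-T"])
```

The two masks are complementary: every position is marked in exactly one of them.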

OPTIONS AND ARGUMENTS

The following options and arguments are for use in CL mode only.

IndelExtractor

Options available to IndelExtractor:

-h|help|?: Prints synopsis and option details for IndelExtractor
-man: Prints the entire user manual for IndelExtractor
-v|version: IndelExtractor version info
-q|quiet: Quietens output from IndelExtractor
-a|alignment filename: filename as input alignment file
-s|sequences filename: filename as input sequence file
-ao|aln_out filename: Output alignment to filename
-so|sequence_out filename: Output sequences to filename
-mo|masked_out filename: Output masked regions to filename
-uo|unmasked_out filename: Output unmasked regions to filename
-mt|mask_type mask type: Generate an alignment mask of the type mask type. Valid options: indels|consensus|variable|invariable
-t|threshold integer: Default: 60; Use the value specified by integer in mask calculations. Possible values for integer are: 1-100
-is|indel_sep integer: Default: 3; Minimum residue separation needed between two adjacent indel regions before being combined together. Possible values for integer are: 0-10
-i|ignore_partials: Ignore partial sequences while calculating masks
-nosim: Do not use amino acid similarity groupings when calculating masks
-nocs|nocharset: Do not print a NEXUS charset block for the masked/unmasked regions of the alignment when output is in NEXUS format
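As an illustration of how these options fit together, the Python sketch below assembles a CL mode command from the options listed above and checks the documented value ranges. It only builds the argument list and does not run the program. The helper name build_command and the output filename masked.nex are hypothetical.

```python
import shlex

MASK_TYPES = ("indels", "consensus", "variable", "invariable")


def build_command(in_file, out_file, mask_type="indels", threshold=60,
                  indel_sep=3, ignore_partials=False,
                  program="IndelExtractor.pl"):
    """Return an argument list that saves the masked regions of an alignment."""
    if mask_type not in MASK_TYPES:
        raise ValueError(f"mask type must be one of {MASK_TYPES}")
    if not 1 <= threshold <= 100:
        raise ValueError("threshold must be in the range 1-100")
    if not 0 <= indel_sep <= 10:
        raise ValueError("indel separation must be in the range 0-10")
    cmd = [program, "-a", in_file, "-mo", out_file,
           "-mt", mask_type, "-t", str(threshold), "-is", str(indel_sep)]
    if ignore_partials:
        cmd.append("-i")  # ignore partial sequences while calculating masks
    return cmd


# Show the command line that would be typed at the prompt.
print(shlex.join(build_command("test_aa.aln", "masked.nex", ignore_partials=True)))
```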

Clustalw

NOTE: Clustalw default values are used in the absence of these options. See your own clustalw documentation for their details and default values.

Options available to Clustalw:

-quicktree
-pwmatrix BLOSUM|PAM|GONNET|ID
-pwdnamatrix IUB|CLUSTALW
-pwgapopen
-pwgapext
-ktuple
-topdiags
-window
-pairgap
-matrix BLOSUM|PAM|GONNET|ID
-dnamatrix IUB|CLUSTALW
-transweight
-outorder aligned|input
-gapopen
-gapext
-maxdiv

USEFUL LINKS

Here is a list of useful links related to this program.

http://www.perl.org: Perl website.
http://www.bioperl.org: BioPerl website.
http://www.bioperl.org/Core/Latest/INSTALL.WIN: BioPerl installation (Windows). If ppm can't find some of the required packages, enter the following command while in ppm: repository add "Winnipeg_server" http://theoryx5.uwinnipeg.ca/ppms (another good URL for Windows installation is http://bioperl.org/Core/Latest/Installing_Bioperl_on_Windows_perl5.8.htm)
http://www.bioperl.org/Core/external.shtml: Bioperl external modules. These are required by some parts of BioPerl, however, they are not necessary for BioPerl to work.
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/#univaln: Download site for UnivAln.
ftp://ftp.ebi.ac.uk/pub/software/: Clustalw Download site. Navigate through the link corresponding to your operating system.
http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/: Description of Clustalw options.
http://par.perl.org/index.cgi: PAR Homepage.
ftp://ftp.ncbi.nih.gov/blast/: local BLAST download files.

TO DO

If you would like me to make any modifications to make IndelExtractor easier to use, or have suggestions for new features, e-mail me at nathanhaigh@ukonline.co.uk

BUGS AND CHANGES

COPYRIGHT

University of York, England.

AUTHOR

Nathan Haigh (nathanhaigh@ukonline.co.uk)

Previous versions of IndelExtractor were based on code from Alignmenator by Konstantinos Thalassinos and were co-written with Bryony Mackenzie (basm101@york.ac.uk).