What tool can I use to align multiple protein sequences to one reference sequence?

What tool can I use to align multiple protein sequences to one reference sequence?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I have a protein of interest which is ~300 amino acids in length. I also have around 40 short sequences (all 9 amino acids in length); these are all very different from each other. I would like to perform multiple pairwise alignments to see if these sequences match up with (or have high sequence identity similarity) any areas in the protein of interest.

Since the short 9-a.a. sequences are very heterogeneous, they will have similarities in different regions of the protein of interest. I would like to know if it is possible to perform the alignments for all 40 sequences in one step, rather than using COBALT 40 times (and checking each of the short sequences against the reference sequence individually).

Please let me know if the description of my problem is not clear enough, I would appreciate any help in identifying a way to do this.

What you want to perform is commonly called a multiple sequence alignment. As @Wayne_Yux said, the first step is to put all of your protein sequences in a single fasta file. You can then use one of several online tools to apply different alignment algorithms to your protein sequence set.

A popular sequence alignment algorithm is Clustal, which progressively builds a multiple sequence alignment from all of the pairwise alignments. hosts a web-based alignment tool that allows you to choose from Clustal and other alignment types (1). For alignment of many small sequences to a single larger sequence, you should use the "SLOW/ACCURATE" option (2). After you upload your fasta (3) and execute multiple alignment (4), an aln file will be generated which you can download (5).

You can then take this aln file and upload it to a different tool that allows you to visualize alignments. Personally, I like ESPript, which gives color-coded alignments in pdf or image formats. Because you want to compare many sequences, it will probably be necessary to change the "Alignments output layout" parameters, e.g. change "Gap between blocks" to a smaller number to fit all of your sequences in one frame.

You can store the 40 sequences in a fasta file and then useblastpto align them all at once to your reference sequence.

After that, you can inspect the alignment hits and see wether they fit your quality expectations

In a multiple sequence alignment, the algorithms will attempt to align the sequences along their length (global alignment). What you need is a local alignment approach with a high penalty for base mismatch.

The E-INS-i algorithm in MAFFT might provide the desired functionality. Select the Advanced Settings and there:

  1. Strategy -> E-INS-i algorithm

  2. Align unrelated segments -> Leave gappy regions

Depending on how heterogeneous your sequences are, you might need to fine-tune other settings as well. Without the data, it is impossible to provide a more detailed guide.


SITES: A number of excellent sites exist all of which permit translation in all six reading frames. I would recommend "ORF Finder" because of its visuals and Pipeline or GeneMark if you are seriously interested in identifying genes within your sequence. The latter two programs permit the analysis of long sequences (submit by attachment not in the box).

Frameshift errors:

path :: protein back-translation and alignment - addresses the problem of finding distant protein homologies where the divergence is the result of frameshift mutations and substitutions. Given two input protein sequences, the method implicitly aligns all the possible pairs of DNA sequences that encode them, by manipulating memory-efficient graph representations of the complete set of putative DNA sequences for each protein. ( Reference: Gîrdea M et al. 2010. Algorithms for Molecular Biology 5:)

Simple translation tools - DNA to protein sequences:

Open Reading Frame Finder (NCBI) - searches for open reading frames (ORFs) in the DNA sequence you enter. The program returns the range of each ORF, along with its protein translation. Use ORF finder to search newly sequenced DNA for potential protein encoding segments, verify predicted protein using newly developed SMART BLAST or regular BLASTP.

Six-frame Translations can be done at Tuebingen, Russia, Bioline, and Science Launcher.

EMBOSS Sixpack (EMBL-EBI) - reads a DNA sequence and outputs the three forward and (optionally) three reverse translations in a visual manner. Alternatively use EMBOSS Transeq

MBS Translator (JustBio Tools) - An excellent new site since one can translate specifically from ATG and the results are presented with the nucleotide sequence overlaying the amino acid sequence. Ideal for Cut/Paste into a manuscript. You need to register to use this free tool. Other quick translation tools are here and here.

Translate (ExPASy, Switzerland) - is a tool which allows the translation of a nucleotide (DNA/RNA) sequence to a protein sequence.

DNA to protein translation (University of the Basque Country, Spain) and here.

Translation of multiple sequences:

Virtual Ribosome ( Reference: R. Wernersson. 2006. Nucl. Acids Res. 34 (web Server Issue): W385-388) - I find that the output from the first two sites is optimal for translating multiple DNA sequences.

RevTrans 1.4 Server (CBS, Danish Technical University)

TranslatorX - is a web server designed to align protein-coding nucleotide sequences based on their corresponding amino acid translations. TranslatorX novelties include: (i) use of all documented genetic codes and the possibility of assigning different genetic codes for each sequence (ii) a battery of different multiple alignment programs (iii) translation of ambiguous codons when possible (iv) an innovative criterion to clean nucleotide alignments with GBlocks based on protein information and (v) a rich output, including Jalview-powered graphical visualization of the alignments, codon-based alignments coloured according to the corresponding amino acids, measures of compositional bias and first, second and third codon position specific alignments. ( Reference: Abascal F, et al. (2010) Nucleic Acids Res. 38: W7-13).


Structural domain database and structural alignment databases

To identify homologs with known structures for target sequences, we used the ASTRAL SCOP40 structural domain database ( 14 , 15 ) (version 1.69, 7290 domains, with <40% sequence identity to each other). Structure-based sequence alignments were made between each pair of domains by three structural comparison programs: DaliLite ( 16 ), FAST ( 17 ) and TM-align ( 18 ). These alignment databases facilitate the use of structural information by allowing lookup of the structure-based sequence alignments for any domain pair, without running the structural comparison programs for them during the multiple sequence alignment process. For each structural domain, we also made PSI-BLAST ( 19 ) searches to retrieve homologs that can be used in profile–profile alignments with target sequences.

An overview of PROMALS algorithm

PROMALS ( 10 ) is a progressive method that clusters similar sequences and aligns them by a simple and fast algorithm, and applies more elaborate techniques to align the relatively divergent clusters to each other. In the first alignment stage, PROMALS aligns similar sequences using a scoring function of weighted sum-of-pairs of BLOSUM62 ( 20 ) scores. The first stage is fast and results in a number of pre-aligned groups (clusters) that are relatively distant from each other. In the second alignment stage, one representative sequence is selected for each pre-aligned group (subsequently referred to as ‘target sequence’). Target sequences are subject to PSI-BLAST searches for additional homologs from UNIREF90 ( 21 ) database and to PSIPRED ( 22 ) secondary structure prediction. Then a hidden Markov model (HMM) of profile–profile alignment with predicted secondary structures is applied to pairs of representatives to obtain posterior probabilities of residue matches. These probabilities serve as sequence-based constraints that are used to derive a probabilistic consistency scoring function. The representative target sequences are progressively aligned using such a consistency scoring function, and the pre-aligned groups obtained in the first stage are merged into the alignment of representatives to form the final multiple alignment of all sequences.

Incorporating 3D structural information in PROMALS

In PROMALS3D, structural constraints are derived for representative sequences with known structures and are combined with sequence-based constraints ( Figure 1 ). First, the program identifies homologs with 3D structures (homolog3D) for target sequences resulting from the first fast alignment stage. For each target sequence, the profile of PSI-BLAST search against the UNIREF90 database are used to initiate a new PSI-BLAST search (one iteration) against the SCOP40 domain database that contains protein domain sequences with known structures. Only structural domains that pass certain similarity criteria (default: e -value <0.001 and sequence identity no <20%) are kept. These structural domains are further filtered to remove redundancy in the following way: if two structural domains pass the criteria and their non-overlapping region is less than 30 residues, only the one with a better e -value is kept. Multiple homolog3Ds could be identified and used for one target sequence if it contains several distinct domains with known structures.

Flowchart of PROMALS3D method.

Flowchart of PROMALS3D method.

Pairwise residue match constraints for two target sequences are derived from sequence-based target-to-homolog3D alignments and structure-based homolog3D-to-homolog3D alignments. For example, if residue A in target S1 is aligned to residue B in homolog3D T1, residue B in homolog3D T1 is aligned with residue C in homolog3D T2 according to a structure comparison program, and residue C in homolog3D T2 is aligned with residue D in target S2, then we deduce that residue A in sequence S1 is aligned with residue D in sequence S2, and this pair is used as a structure-derived constraint (see Figure S1 in Supplementary Data). In the matrix representation, the matrix of structural constraints of two targets S1 and S2 is the product of the multiplication of three residue match matrices: MS1−T1 , MT1−T2 and MT2−S2 . The alignment between a target sequence and its homolog3D can be the PSI-BLAST alignment, or they can be re-aligned by the profile–profile comparison routine used in PROMALS. For PSI-BLAST alignment between a target and its homolog3D and the structural alignment between two homolog3D, if two residues are aligned, its entry in the residue match matrix is 1, otherwise it is 0. For profile–profile comparison using HMM ( 10 ), the entries of a residue match matrix are the posterior probabilities of two residues being aligned (determined by forward–backward algorithm). The structure constraints among target sequences are combined with those constraints derived from profile–profile comparisons in the original PROMALS to deduce a consistency-based scoring function that integrates database sequence profiles, predicted secondary structures and 3D structural information. We used an empirical weight ratio of 1.5 for structure constraints relative to the sequence constraints of profile–profile comparison in the original PROMALS.

Testing the performance of MSA or multiple structural alignment programs

We compared PROMALS3D to other common multiple sequence alignment programs. PROMALS3D, 3DCoffee and Expresso (a web server of 3DCoffee that automatically include 3D information) are multiple alignment programs that use both sequence and structural information. We implemented 3DCoffee by inputting structural alignment constraints to T-Coffee program. We could not obtain the stand-alone version of Expresso and thus manually submitted the 209 ‘twilight zone’ tests in SABmark database to the Expresso server with default options. PROMALS, SPEM ( 9 ), MUMMALS ( 5 ), ProbCons ( 4 ), MAFFT ( 13 ), MUSCLE ( 23 ), T-Coffee ( 3 ) and ClustalW ( 24 ) use only sequence information (PROMALS and SPEM also incorporate secondary structure information that is predicted from sequences). The availability of a known structure for every sequence in SABmark database allows us to benchmark MUSTANG, a multiple structural alignment program ( 25 ) that uses only 3D structural information. We also tested PROMALS3D performance on only structural information by making consistency measures solely from structural constraints of DaliLite alignments or reference alignments. For reference-dependent evaluation of alignment quality, the alignment quality score ( Q -score) is defined as the number of correctly aligned residue pairs in a test alignment divided by the total number of aligned residue pairs in a reference alignment (its value is between 0 and 1). For reference-independent evaluation of alignment quality, we used a number of structure-based scores such as GDT-TS ( 26 ) to reflect the similarities of two structures aligned according to a sequence alignment. The details of reference-independent evaluation are described in a previous article ( 5 ).

Algorithmic frameworks for MSA computation

Despite their wide diversity, MSAMs all share a major key property: their reliance on approximate and usually greedy heuristics, imposed by the NP-complete nature of the problem. These heuristics all depend, more or less explicitly, on specific data properties, such as size, nature of the homology, relatedness, length and so on. As a consequence, any change—even minor—on the kind of data being modeled requires the development of novel heuristic strategies. Such changes have recently included the need of upscaling under the high-throughput sequencing pressure and the need for more complex sequence descriptors, including non-coding RNA or non-transcribed genomic sequences. Shifting modeling needs can also drive the developments of novel heuristics, a fact well illustrated by the recent development of phylogeny-aware aligners. Another driving force behind the development of new heuristics has been the increasing availability of structural data that has fueled the development of hybrid methods able to simultaneously deal with sequences and secondary (RNA) or tertiary (RNA and proteins) structures. Likewise, the explosion of available genomic data has put a lot of pressure on the development of a new generation of non-coding/non-transcribed DNA aligners.

Commonly used algorithms

Main algorithmic components of the most widely used multiple aligners. On the heatmap, orange entries indicate a feature implemented in the considered method. Both the aligners and the components were clustered by similarity using the R-package. A colour version of this figure is available online at BIB online:

Main algorithmic components of the most widely used multiple aligners. On the heatmap, orange entries indicate a feature implemented in the considered method. Both the aligners and the components were clustered by similarity using the R-package. A colour version of this figure is available online at BIB online:

To build an MSA, one needs a scoring function (objective function) able to quantify the relative merits of any alternative alignment with respect to the modeled relationship. The MSA can then be estimated by computing an optimally scoring model. The objective function is a critical parameter, as it precisely defines the modeling accuracy of an MSA and its predictive capacity. When it comes to evolutionary reconstructions, the most commonly used objective functions involve maximizing weighted similarities (as provided by a PAM or BLOSUM substitution matrix) while using an affine gap penalty to estimate indels costs. The substitution cost can be adjusted using tree-based weighting schemes that reflect the independent information contribution of each sequence, and the score of columns is estimated by considering the total all-against-all (sums-of-pairs) substitution cost. It is well known that the sum-of-pairs functions are unlikely to be modeling biological relationships accurately enough [ 6], but they have been shown to provide a reasonable trade-off between structural correctness and computability, that is to say, the possibility to rapidly estimate a reasonable MSA.

Under their most common formulations, the optimization of sums-of-pairs evaluation schemes is NP-complete. One therefore needs to rely on heuristics, the most common one being the progressive alignment algorithm initially described by Hogeweg and Hesper [ 7]. This algorithm involves incorporating the input sequences one by one into the final model, following an inclusion order defined by a pre-computed guide tree. At each node, a pairwise alignment is carried out between either a pair of sequences, a sequence and a profile or two profiles. The pairwise alignments taking place at each node are estimated using more or less sophisticated adaptations of the Needlman and Wunsch global dynamic programming alignment algorithm [ 8]. The combination between a tree-based progressive strategy and a global pairwise alignment algorithm forms the backbone of most available methods ( Figure 1), including ClustalW [ 2], T-Coffee [ 9] and ProbCons [ 10]. It is also particularly well adapted for the design of iterative strategies ( Figure 1), involving reestimating trees and alignments until both converge [ 11], as implemented in MUSCLE [ 12], MAFFT [ 13] and Clustal Omega [ 14].

Aside from the objective function, the main algorithmic component of the progressive alignment is the guide tree estimation procedure. This tree, that decides in which order the sequences will be incorporated, can be obtained using a wide variety of methods, the most standard being Neighbor Joining (NJ) [ 15] and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [ 16]. The interaction between the objective function (substitution scheme and gap penalties), the weighting scheme and the tree is complex and was extensively explored by Wheeler [ 17] who showed how the proper tuning of these various components can take a standard method up to the level of the most accurate ones. It is therefore unsurprising to observe that the latest algorithmic developments have been focused on guide trees and objective function improvements.

The main caveat of the progressive alignment approach is the existence of local minima (high level of similarity between a subset of sequences resulting from an artifact). For instance, if the guide tree induces the alignment of two distantly related sequences, it often happens that the optimal alignment of these two sequences will not correspond to the pairwise projection one would get from the optimal MSA of the entire data set (i.e. within the MSA the alignment of the two sequences will be slightly suboptimal so as to allow global optimality at the MSA level). This situation is common when dealing with low-identity or low-complexity sequences. When this occurs, the early computation of the first pairwise alignment may prevent the computation of a globally optimal MSA.

The most common strategy to avoid local minima during a progressive alignment is the use of consistency, as originally described by [ 9]. The rationale of consistency is relatively straightforward: given a set of sequences and their associated pairwise alignments, treated as constraints, scores for matching pairs of residues are reestimated so as to deliver pairwise alignments more likely to be compatible with a globally optimal MSA. The first strategy involving such a reestimation of match costs was reported by Morgenstern as overlapping weights [ 18]. This scheme later inspired the T-Coffee scoring scheme that has become the archetypical progressive consistency-based aligner [ 9]. Optimizing an alignment against a set of predefined constraint is known as the Maximum Weight Trace problem. It is NP-complete under its most common formulations and can only be solved for small instances [ 19, 20]. The T-Coffee algorithm is a heuristic approach that involves reestimating the initial costs of every potential pairwise match by taking into account its compatibility with the rest of the pairwise alignment. The resulting scoring scheme makes it more likely to assemble consistent sub-alignments during the progressive MSA procedure. The main strength of this approach is to allow the computation of MSAs even when an objective function is only available to be optimized at the pairwise level. Consistency-based methods and their relationships have been extensively reviewed in [ 4]. Since then, the consistency-based approach has become one of the most popular algorithmic frameworks for the development of novel methods ( Figure 1).

In a consistency-based algorithm, the most critical parameter is the primary library. Given a set of sequences, the primary library is a collection of all possible pairwise sequence comparisons. This library is used to define the consistency-based objective function. In the original T-Coffee [ 9], the library was a compilation of all pairs of residues found aligned in the entire pairwise local and global alignments. These residue pairs were weighted according to the estimated reliability of their source alignments. Later on, a variation of the T-Coffee algorithm named ProbCons [ 10] established the superiority of pair-HMM-based libraries. In ProbCons, libraries are compiled using a pair-HMM to estimate the posterior probability of all possible pairs of residues (between distinct sequences) to be aligned. The use of a pair-HMM soon became popular among other alignment methods ( Figure 1). The main novel features of ProbCons over T-Coffee were the use of a more formal probabilistic framework, thanks to the HMM and the implementation of a biphasic gap penalty when estimating pairwise alignments. Algorithms relying on a similar combination are often referred to as probabilistic consistency algorithms they include the PECAN multiple genome aligner [ 21], which uses the Durbin [ 22] forward only divide and conquer pairwise alignment and MSAProbs [ 23], which relies on a partition function to achieve more informative posterior probabilities when compiling the library. Some degree of consistency was also incorporated in the MAFFT ‘linsi' algorithm.

When benchmarked on structure-based reference alignments, consistency-based aligners have long been shown to yield the most accurate MSAs [ 4, 23]. This accuracy comes, however, at a significant memory and CPU cost, with most implementations being cubic in CPU and quadratic in memory with the number of sequences. Three strategies have been proposed to address this problem. The simplest one involves faster library computation. For instance, FM-Coffee, the fast implementation of T-Coffee computes its library using three fast aligners, and eventually extracts the resulting pairwise projections. The high correlation between the various projections then makes it possible to band the consistency extension and significantly lower time and memory complexity at a near-quadratic level. Even though the resulting alignments are not as accurate as those obtained using the default procedure, they tend to be more accurate than those produced individually by the combined methods. The second strategy involves parallelization. Two such schemes have been recently published Cloud-Coffee [ 24] and MSAProbs [ 23], both which involve parallelizing library computation and the relaxation step during which pairwise costs are reestimated when the progressive alignment assembly is taking place. The last step, which involves splitting computation according to the tree topology, is highly dependent on the guide tree symmetry, best performances being achieved with perfectly balanced guide trees. The third strategy is more sophisticated and involves tuning the library granularity by considering sequence segments rather than single residues. This implementation, available in SeqAn [ 25], is especially well suited to long closely related sequences, in which long identical segments can be identified.

Large-scale multiple aligners

Even in their most optimized forms, consistency-based methods cannot deal with more than a few hundred sequences. This limit is rather severe in a context where the explosion of genomic sequence availability has resulted in unprecedented large homologous families that can require aligning up to 1.5 million members (number of ABC transporters in Pfam) and soon many more. While the biological relevance of large MSAs can be questioned, recent analysis indicates that important results can be established from such large models [ 26], thus making the accurate and efficient building of large MSAs one of the current grand challenge of modern biology. Three methods are currently able to deal with such large data sets: the PartTree mode of MAFFT [ 13], Clustal Omega [ 14] and PASTA [ 27], the newest version of SATé [ 28] ( Figure 1). These methods share a common characteristic: their reliance on a fast pre-clustering step (sub-quadratic in time) that makes it possible to rapidly determine the order in which sequences should be aligned.

In the original progressive methods, the guide tree was estimated by comparing all the sequences against one another to estimate a distance matrix. This comparison can be based on a slow Needleman and Wunsch [ 8] alignment or on a fast k-tuple vector comparison as implemented in MAFFT [ 13], MUSCLE [ 29] and T-Coffee [ 9]. The fast comparison does not, however, solve the issue of quadratic time and space requirements for the matrix computation followed by the cubic time complexity of tree estimation when using either UPGMA or NJ. These requirements become prohibitive when processing over 10 000 sequences. Recent clustering methods have been designed to address this issue. In Clustal Omega [ 14], the guide tree is estimated using the mBed method [ 30]. The principle of mBed is to first estimate the distance between each sequence and a tiny subset of sequences selected on the basis of their length. For each sequence, the result is a distance vector that can be used to run a hierarchical k-means clustering ( Figure 1), whose relatively low complexity (NlogN under the most common heuristic implementations) allows large data sets of 10 000 sequences or more to be aligned. PartTree relies on a slightly different procedure that also involves using a small set of seed sequences to rapidly pre-cluster the sequences. In both mBed and PartTree, the pre-cluster step is followed by the computation of sub-trees that are eventually combined together to form the guide tree. The PartTree approach was recently improved in the SATé algorithm, which involves an extra iterative tree-refinement step. The latest attempt at aligning large data sets is an adapted version of the T-Coffee algorithm that involves combining k-means clustering with consistency-based MSAs at a lower level [ 26]. These approaches scale well, but at the cost of significantly lower accuracies when aligning >1000 sequences, as shown in the Clustal Omega benchmark analysis [ 14]. A probable side effect of this decreased accuracy has been the report of high alignment inconsistencies between MAFFT, Clustal Omega and T-Coffee when dealing with large data sets of relatively similar orthologous mitochondrial sequences. When considering full data sets, the authors report average agreement levels as low as 60% [ 26].

Phylogeny-aware multiple aligners

A major milestone in the development of MSAMs has been the introduction of structure-based reference alignments that can be used to compare the relative capacities of various methods to reconstruct structurally correct alignments from sequence only. The choice of structure seems rather natural because 3D features are known to be more evolutionary resilient than the underlying sequences. On the other hand, this approach relies on the unproven rationale that structurally and evolutionary correct alignments are identical. No proof exists that this assumption may be correct, and a simple reasoning suggests it may not be the case. Indeed, while there can be only one correct way of matching homologous residues—the one that perfectly reflects the unique evolutionary history of the considered sequences and matches—there can be as many structurally correct alignments as there are ways to superpose the sequences with equivalent 3D compactness. Another major potential discrepancy between structural and evolutionary alignments results from convergent evolution. Whenever such a process has shaped some portions of a sequence data set, the resulting alignment matching convergent regions will be structurally correct and evolutionary false—and reciprocally.

This issue has recently been addressed by a series of works aiming at evaluating multiple aligners’ accuracy on the basis of the quality of the phylogenetic models they support. These aligners are referred to as phylogeny-aware aligners ( Figure 1). PRANK [ 31] was one of the first. It relies on the idea that correct MSAs must have indels patterns properly reflecting the underlying phylogenetic tree. PRANK was rapidly followed by SATé [ 28], an iterative multiple aligner derived from MAFFT that attempts to estimate the MSA supporting the highest-scoring maximum likelihood tree. An important merit of this approach is to depart from the long-held assumption that the best MSA is the one maximizing similarity between sequences. In the context of phylogeny-aware aligners, the best MSA is defined as the one yielding the best phylogenetic model [ 32]. The consequences on the resulting alignments are rather significant and were particularly well illustrated in a recent analysis by Blackburne and Whelan who found that ‘similarity-based' MSAMs (e.g. ClustalW, MUSCLE, ProbCons, MAFFT, T-Coffee) and ‘evolution-based' MSAMs (e.g. PRANK and BAliPhy) tend to form discrete clusters under the multidimensional scaling based on their own similarity measures between pairs of alternative MSAs [ 33]. These authors found that the selected aligners have a substantial impact on downstream phylogenetic inference and report the tree topologies and branch length to depend on the aligner category. Aligners also have a clear impact when quantifying positive selection, with different readouts associated with various aligners as reported on the analysis of several Drosophila genomes [ 34]. Morrison suggests that phylogeneticists are usually dissatisfied with similarity-based alignments and tend to manually edit their MSAs to produce alignments more likely to reflect homology from a true evolutionary stand-point [ 32]. This observation may also explain why results and method ranking achieved on evolutionarily simulated data sets significantly differ from those measured on structure-based empirical data [ 4]. However, a recent study by Chang [ 35] shows that the same reliability index can be used to select both the most phylogenetically informative positions and the positions most likely to contain structurally analogous residues. It is worth mentioning that this study gave contrasting results with respect to phylogeny-aware aligners, and while SATé appears to perform well both on structure and evolutionary benchmarks, PRANK was found to return poor structural alignments, while being able to produce alignments supporting trees with an accuracy comparable with the other MSAMs—phylogeny aware and similarity based.

Pairwise Sequence Alignment

Pairwise sequence alignment compares only two sequences at a time and provides best possible sequence alignments. Pairwise is easy to understand and exceptional to infer from the resulting sequence alignment.

Biopython provides a special module, Bio.pairwise2 to identify the alignment sequence using pairwise method. Biopython applies the best algorithm to find the alignment sequence and it is par with other software.

Let us write an example to find the sequence alignment of two simple and hypothetical sequences using pairwise module. This will help us understand the concept of sequence alignment and how to program it using Biopython.

Step 1

Import the module pairwise2 with the command given below &minus

Step 2

Create two sequences, seq1 and seq2 &minus

Step 3

Call method pairwise2.align.globalxx along with seq1 and seq2 to find the alignments using the below line of code &minus

Here, globalxx method performs the actual work and finds all the best possible alignments in the given sequences. Actually, Bio.pairwise2 provides quite a set of methods which follows the below convention to find alignments in different scenarios.

Here, the sequence alignment type refers to the alignment type which may be global or local. global type is finding sequence alignment by taking entire sequence into consideration. local type is finding sequence alignment by looking into the subset of the given sequences as well. This will be tedious but provides better idea about the similarity between the given sequences.

X refers to matching score. The possible values are x (exact match), m (score based on identical chars), d (user provided dictionary with character and match score) and finally c (user defined function to provide custom scoring algorithm).

Y refers to gap penalty. The possible values are x (no gap penalties), s (same penalties for both sequences), d (different penalties for each sequence) and finally c (user defined function to provide custom gap penalties)

So, localds is also a valid method, which finds the sequence alignment using local alignment technique, user provided dictionary for matches and user provided gap penalty for both sequences.

Here, blosum62 refers to a dictionary available in the pairwise2 module to provide match score. -10 refers to gap open penalty and -1 refers to gap extension penalty.

Step 4

Loop over the iterable alignments object and get each individual alignment object and print it.

Step 5

Bio.pairwise2 module provides a formatting method, format_alignment to better visualize the result &minus

Biopython also provides another module to do sequence alignment, Align. This module provides a different set of API to simply the setting of parameter like algorithm, mode, match score, gap penalties, etc., A simple look into the Align object is as follows &minus

For a metasite linked to a wide range of protein sequence analysis and structure predictions online programs, I recommend PredictProtein (ROSTLAB, Technische Universität München). Also see: SCRATCH Protein Predictor (Institute for Genomics & Bioinformatics, University of California, Irvine, U.S.A.)

Several great sites for online analysis of potential membrane spanning proteins are: (Test sequence see Orientation of Proteins in Membranes for 268 unique a-helical membrane protein structures)

TMpred - Prediction of trans-membrane regions and orientation - ISREC (Swiss Institute for Experimental Cancer Research)
TMHMM - Prediction of transmembrane helices in proteins (Center for Biological Sequence Analysis, The Technical University of Denmark)
DAS - Transmembrane Prediction Server (Stockholm University, Sweden)
SPLIT (D. Juretic, Univ. Split , Croatia) - the transmembrane protein topology prediction server provides clear and colourful output including beta preference and modified hydrophobic moment index.
OCTOPUS - Using a novel combination of hidden Markov models and artificial neural networks, OCTOPUS predicts the correct topology for 94% of the a dataset of 124 sequences with known structures. ( Reference: Viklund, H. &

Phobius - is a combined transmembrane topology and signal peptide predictor ( Reference: L. Käll et al. 2004. J. Mol. Biol. 338: 1027-1036) This tool can also be accessed here (EBI).

CCTOP (Consensus Constrained TOPology prediction) server - utilizes 10 different state-of-the-art topology prediction methods, the CCTOP server incorporates topology information from existing experimental and computational sources available in the PDBTM, TOPDB and TOPDOM databases using the probabilistic framework of hidden Markov model. The server provides the option to precede the topology prediction with signal peptide prediction and transmembrane-globular protein discrimination. ( Reference: Dobson L et al. (2015) Nucleic Acids Res 43(W1): W408&ndashW412).

TMFoldWeb - is the web server implementation of TMFoldRec, a transmembrane protein fold recognition algorithm. TMFoldRec uses statistical potentials and utilizes topology filtering and a gapless threading algorithm. It ranks template structures and selects the most likely candidates and estimates the reliability of the obtained lowest energy model. The statistical potential was developed in a maximum likelihood framework on a representative set of the PDBTM database. According to the benchmark test the performance of TMFoldRec is about 77 % in correctly predicting fold class for a given transmembrane protein sequence. ( Reference : Kozma D & Tusnády GE (2015) Biol Direct. 10: 54).

MEMSATSVM - is an improved transmembrane protein topology prediction using SVMs. This method is capable of differentiating signal peptides from transmembrane helices. ( Reference: Reeb J et al. (2015) Proteins 83(3): 473-84).

MEMEMBED - prediction of membrane protein orientation. is able to quickly and accurately orientate both alpha-helical and beta-barrel membrane proteins within the lipid bilayer, showing closer agreement with experimentally determined values than existing approaches. We also demonstrate both consistent and significant refinement of membrane protein models and the effective discrimination between native and decoy structures ( Reference: Nugent T & DT Jones (2013) BMC Bioinformatics 14: 276)

RHYTHM - predicts the orientation of transmembrane helices in channels and membrane-coils, specifically buried versus exposed residues. ( Reference: A. Rose et al. 2009. Nucl. Acids Res. 37(Web Server issue):W575-W580)

TMMOD - Hidden Markov Model for Transmembrane Protein Topology Prediction (Dept. Computer & Information Sciences, University of Delaware, U.S.A.) - on the results page click on " show posterior probabilities" to see a TMHMM-type diagram

PRED-TMR2 (C. Pasquier & S.J.Hamodrakas,Dept. Cell Biology and Biophysics, Univ. Athens, Greece) - when applied to several test sets of transmembrane proteins the system gives a perfect prediction rating of 100% by classifying all the sequences in the transmembrane class. Only 2.5% error rate with nontransmembrane proteins.

TOPCONS - computes consensus predictions of membrane protein topology using a Hidden Markov Model (HMM) and input from five state-of-the-art topology prediction methods. ( Reference: A. Bernsel et al. 2009. Nucleic Acids Res. 37(Webserver issue), W465-8) . For a batch server without BLAST runs use TOPCONS single.

MINNOU ( Membrane protein IdeNtificatioN withOUt explicit use of hydropathy profiles and alignments) - predicts alpha-helical as well as beta-sheet transmembrane (TM) domains based on a compact representation of an amino acid residue and its environment, which consists of predicted solvent accessibility and secondary structure of each amino acid. ( Reference: Cao et al. 2006. Bioinformatics 22: 303-309). A legend to help interpret the results in here.

SuperLooper - provides the first online interface for the automatic, quick and interactive search and placement of loops in proteins . ( Reference: P.W. Hildebrand et al. 2009. Nucl. Acids Res. 37(Web Server issue):W571-W574) )

Transmembrane Kink Predictor (TMKink) - A hallmark of membrane protein structure is the large number of distorted transmembrane helices. Because of the prevalence of bends, it is important to not only understand how they are generated but also to learn how to predict their occurrence. Here, we find that there are local sequence preferences in kinked helices, most notably a higher abundance of proline, which can be exploited to identify bends from local sequence information. A neural network predictor identifies over two-thirds of all bends (sensitivity 0.70) with high reliability (specificity 0.89). ( Reference: Meruelo AD et al. 2011. Protein Sci. 20:1256-64)

SCMMTP Scoring Card Method Membrane Transport Proteins : Identifying and characterizing membrane transport proteins using propensity scores of dipeptides. The training and test accuracies of SCMMTP are 83.81% and 76.11%, respectively. ( Reference: Vasylenko, T. et al. 2015. BMC Bioinformatics, 16 (Suppl 1):S8, 2015)

For drawing the structure of transmembrane proteins two sites are available:

Protter - an open-source tool for interactive integration and visualization of annotated and predicted protein sequence features together with experimental proteomic evidence. Protter supports numerous proteomic file formats and automatically integrates a variety of reference protein annotation sources, which can be readily extended via modular plug-ins. A built-in export function produces publication-quality customized protein illustrations, also for large datasets. ( Reference: U. Omasits et al. 2014. Bioinformatics. 30:884-886 ). Diagram of the holin from bacteriophage lambda generated with Protter:

TOPO2 (S. Johns, UCSF Sequence Analysis Consulting Service, U.S.A.) - this site provides considerable control over the presentation. Extensive documentation is provided here.

TMRPres2D ( T rans M embrane protein R e- Pres entation in 2 D imensions tool) - this Java tool takes data from a variety of protein folding servers and creates uniform, two-dimensional, high analysis graphical images/ models of alpha-helical or beta-barrel transmembrane proteins. ( Reference: I.C. Spyropoulos et al. 2004. Bioinformatics 20: 3258-3260).

Signal peptide recognition & subcellular localization:

A. Bacterial proteins

PSORTb (Brinkman Lab, Simon Fraser Univ., Canada) - provides probably the most accurate bacterial protein subcellular localization predictor. Alternatively use PSORT (Univ. Tokyo, Japan) - a series of programs for the prediction of protein localization sites in cells. Choose programs specific for for animal, yeast, plant or bacterial ( Gram-negative or Gram- positive) proteins.
PSLpred - is a SVM based method, predicts 5 major subcellular localization (cytoplasm, inner-membrane, outer- membrane, extracellular, periplasm) of Gram-negative bacteria. This method includes various SVM modules based on different features of the proteins. The hybrid approach achieved an overall accuracy of 91%, which is best among all the existing methods for the subcellular localization of prokaryotic proteins. ( Reference: 21: 2522-2524.)

CELLO subCELlular LOcalization predictive system - assigns Gram-negative proteins to the cytoplasm , inner membrane, periplasm, outer membrane or extracellular space with overall prediction accuracy of ca. 89% . Also analyzes eukaryotic and Gram-positive proteins. ( Reference: C.S. Yu et al. 2004 . Protein Sci. 13:1402-1406). The updated CELLO2GO (Protein subCELlular LOcalization Prediction with Functional Gene Ontology Annotation) - CELLO2GO should be a useful tool for research involving complex subcellular systems because it combines CELLO and BLAST into one platform and its output is easily manipulated such that the user-specific questions may be readily addressed ( Reference: Yu CS et al. 2014. PLoS ONE 9: e99368).

SignalP - predicts the presence and location of signal peptide cleavage sites in Gram-positive, Gram-negative and eukaryotic proteins (Center for Biological Sequence Analysis, The Technical University of Denmark). For an example of a periplasmic protein use test sequence MalE.
Phobius - is a combined transmembrane topology and signal peptide predictor ( Reference: L. Käll et al. 2004. J. Mol. Biol. 338: 1027-1036).
LipoP 1.0 (Center for Biological Sequence Analysis Technical University of Denmark) - allows prediction of where signal peptidases I & II cleavage sites from Gram negative bacteria will cleave a protein.

SecretomeP - produces ab initio predictions of non-classical i.e. not signal peptide triggered protein secretion. The method queries a large number of other feature prediction servers to obtain information on various post-translational and localizational aspects of the protein, which are integrated into the final secretion prediction ( Reference: J.D. Bendtsen et al. 2005. BMC Microbiology 5: 58).

SSPRED - Identification & classification of proteins involved in bacterial secretion systems. Do not submit more than four proteins at once. ( Reference: Pundhir, S., & Kumar, A. 2011. Bioinformation 6: 380-382).

Signal Find Server - includes (a) FlaFind which predicts archaeal class III (type IV pilin-like) signal peptides (class III signal peptides) and their prepilin peptidase cleavage sites (b) EppA-pilinFind which predicts class III signal peptides processed by a unique archaeal prepilin peptidase, EppA (c) TatFind which predicts archaeal AND bacterial Twin-Arginine Translocation (Tat) signal peptides (d) PilFind which predicts bacterial type IV pilin-like signal peptides and their prepilin peptidase cleavage sites and, (e) TatLipo which predictes haloarchaeal Tat signal peptides that contain a SPase II cleavage site (lipobox).

Signal-3L 2.0 - is an online server for predicting the N-terminal protein signal peptide, and the input is the amino acid sequence only. It is constructed with a hierarchical mixture model, which contains the following three layers: (1) Discrimination of SP (Signal Peptide) proteins and TMH (TransMembrane Helical) proteins from the other globular proteins (2) Recognizing SP proteins from TMH proteins and, (3) Identifying the cleavage sites of SP proteins. (Refer ence: Y-Z. Zhang & H-B. Shen. Journal of Chemical Information and Modeling, 2017, 57: 988-999)

PrediSi - PREDIction of SIgnal peptides (Karsten Hiller, Technical University of Braunschweig)

Signal Find Server - provides several distinct programs: (a) FlaFind predicts archaeal class III (type IV pilin-like) signal peptides (class III signal peptides) and their prepilin peptidase cleavage sites. (b) EppA-pilinFind predicts class III signal peptides processed by a unique archaeal prepilin peptidase, EppA. (c) TatFind predicts archaeal AND bacterial Twin-Arginine Translocation (Tat) signal peptides. (d) PilFind predicts bacterial type IV pilin-like signal peptides and their prepilin peptidase cleavage sites. (e) TatLipo predictes haloarchaeal Tat signal peptides that contain a SPase II cleavage site (lipobox).

B. Eukaryotic proteins

DeepLoc-1.0 predicts the subcellular localization of eukaryotic proteins. It can differentiate between 10 different localizations: Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell membrane, Endoplasmic reticulum, Chloroplast, Golgi apparatus, Lysosome/Vacuole and Peroxisome. Their model achieves a good accuracy (78% for 10 categories 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information. ( Reference: Almagro Armenteros JJ et al. 2017. Bioinnformatics 33(21): 3387-3395).

Protein Prowler Subcellular Localisation Predictor version 1.2 accepts amino acid sequences, presented in the FASTA format, and determines the localisation of the protein among the following categories: Secretory pathway (presence of a signal peptide SP) Mitochondrion (presence of a mitochondrial targeting peptide mTP) Chloroplast (presence of a chloroplast transit peptide cTP only applicable to plant proteins) and, Other (nucleus, cytoplasmic, or otherwise). ( Reference: Bodén, M. and Hawkins (2005) Bioinformatics. 21(10): 2279-2286).
WoLF PSORT - ( National Institute of Advanced Science and Technology , Japan) ( Reference: Horton P et al. (2007) Nucleic Acids Res. 35(Web Server issue): W585&ndashW587).

ngLOC - is an n-gram-based Bayesian classifier that predicts subcellular localization of proteins both in prokaryotes and eukaryotes. The overall prediction accuracy varies from 85.3% to 91.4% across species. This program can predict 11 distinct locations each in plant and animal species. ngLOC also predicts 4 and 5 distinct locations on gram-positive and gram-negative bacterial datasets, respectively. ( Reference: King BR et al BMC Res Notes. 5: 351 ).

ProtComp (Softberry, U.S.A.) can be used to predict the subcellular localization for animal/fungal and plant proteins.

SecretomeP - produces ab initio predictions of non-classical i.e. not signal peptide triggered protein secretion. The method queries a large number of other feature prediction servers to obtain information on various post-translational and localizational aspects of the protein, which are integrated into the final secretion prediction ( Reference: J.D. Bendtsen et al. 2005. BMC Microbiology 5: 58).

Other sites for secondary structure predictions include:

JPred4 - is the latest version of the popular JPred protein secondary structure prediction server which provides predictions by the JNet algorithm, one of the most accurate methods for secondary structure prediction. In addition to protein secondary structure, JPred also makes predictions of solvent accessibility and coiled-coil regions. JPred4 features higher accuracy, with a blind three-state (a-helix, ß-strand and coil) secondary structure prediction accuracy of 82.0% while solvent accessibility prediction accuracy has been raised to 90% for residues <5% accessible. ( Reference: A. Drozdetskiy et al. 2015.Nucl. Acids Res. 43 (W1): W389-W394).

Network Protein Sequence @nalysis at IBCP - (Institut de Biologie et Chemie des Proteines, Lyon, France) - has DSC, GORIV, Predator, SOPMA and Heirarchical Neural Network Method plus older programs.

PSIPRED Protein Sequence Analysis Workbench - includes PSIPRED v3.3 (Predict Secondary Structure) DISOPRED3 & DISOPRED2 (Disorder Prediction) pGenTHREADER (Profile Based Fold Recognition) MEMSAT3 & MEMSAT-SVM (Membrane Helix Prediction) BioSerf v2.0 (Automated Homology Modelling) DomPred (Protein Domain Prediction) FFPred 3 (Eukaryotic Function Prediction) GenTHREADER (Rapid Fold Recognition) MEMPACK (SVM Prediction of TM Topology and Helix Packing) pDomTHREADER (Fold Domain Recognition) and, DomSerf v2.0 (Automated Domain Modelling by Homology). ( Reference: Buchan DWA et al. 2013. Nucl. Acids Res. 41 (W1): W340-W348).

For a full range of properties of your protein including hydrophobicity, alpha helix, beta-sheet plots see ProScale (ExPASy, Switzerland).

Disordered states:

Many proteins containing regions that do not form well-defined structures and the following new programs help define these regions:

D2P2 (Database of Disordered Protein Predictions) - A battery of disorder predictors and their variants, VL-XT, VSL2b, PrDOS, PV2, Espritz and IUPred, were run on all vast number of protein sequences. Searches are provided against the database and links are provided to each of the disordered servers. ( Reference: Oates ME et al. (2013). Nucleic Acids Res 41(D1): D508-D516).

RONN (Regional Order Neural Network) - ( Reference: Z.R. Yang et al. 2005. Bioinformatics 21: 3369-3376). For explanation of native disorder see here.
IUPred - The underlying assumption is that globular proteins are composed of amino acids which have the potential to form a large number of favorable interactions, whereas intrinsically unstructured proteins (IUPs) adopt no stable structure because their amino acid composition does not allow sufficient favorable interactions to form. ( Reference: Z. 3433-3434).
DISOPRED3 ( Reference: J.J. Ward et al. 2004. J. Molec. Biol. 337: 635-645).

PrDOS is a server to predict natively disordered regions of a protein chain from its amino acid sequence. PrDOS returns disorder probability of each residue as prediction results.( Reference: Ishida T, & Kinoshita K (2007) Nucleic Acids Res. 35(Web Server issue): W460-4.).

MFDp (Multilayered Fusion-based Disorder predictor) - aims to improve over the current disorder predictors.( Reference: M.J. Mizianty et al. 2010. Bioinformatics 26: i489-i496)

MoRFpred - Molecular recognition features (MoRFs) are short binding regions located within longer intrinsically disordered regions that bind to protein partners via disorder-to-order transitions. MoRFs are implicated in important processes including signaling and regulation. MoRFpred is a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins which identifies all MoRF types (a, ß, coil and complex). ( Reference: F.M. Disfani et al. 2012. Bioinformatics 28: i75-i83).

Scooby-domain (Sequence hydrophobicity predicts domains) is a method to identify globular regions in protein sequence that are suitable for structural studies. The Scooby-domain JAVA applet can be used as a tool to visually identify 'foldable' regions in protein sequence. Interesting graphics. ( Reference: R.A. George et al. 2005. Nucl. Acids Res. 33: W160-W163).

For estimations on the antigenicity of regions of proteins see:

Antigenicity Plot (JaMBW module) - Given a sequence of amino acids, this program computes and plots the antigenicity along the polypeptide chain, as predicted by the algorithm of Hopp & Woods (1981).
SAbPred - is a structure-based antibody prediction server ( Reference: J. Dunbar et al. Nucleic Acids Res . 2016 44(Web Server issue): W474&ndashW478.

EMBOSS Antigenic (EMBOSS package) - this program predicts potentially antigenic regions of a protein sequence, using the method of Kolaskar & Tongaonkar (1990). Also accessible here.

OptimumAntigen&trade Design Tool (GenScript) - peptides are optimized using the industry's most advanced antigen design algorithm. Each peptide is measured against several protein databases to confirm the desired epitope specificity. Benefits of using the OptimumAntigen&trade Design Tool include avoidance of unexposed epitopes, ability to specify desired cross-reactivity, strong antigenicity of chosen peptide, identification of the best conjugation and presentation options for your desired assay(s), use of built in peptide tutorial for synthesis and solubility, and guaranteed immune response.

EpiC (The ProteomeBinders Epitope Choice Resource) collates and presents a structure-function summary and antigenicity prediction of your protein to help you design antibodies that are appropriate to your planned experiments. ( Reference: Haslam, N. & Gibson. T. Proteome Res., 2010, 9 (7): 3759&ndash3763).

SVMTriP - is a method to predict antigenic epitopes using support vector machine to integrate tri-peptide similarity and propensity. ( Reference: B. Yao et al. PLoS ONE (2012) 7(9):e45152).

To screen for coiled-coil regions in proteins use:

Coils - prediction of coiled coil regions in troteins (Swiss node of EMBnet, Switzerland) - ( Reference: A. Lupas et al. 1991 Science 252: 1162-1164). See also PCOILS and MARCOILS.
Paircoils (MIT Laboratory for Computer Science, U.S.A.) - ( Reference: B. Berger et al. 1995. Proc. Natl. Acad. Sci. USA, 92: 8259-8263) or MultiCoil - is based on the PairCoil algorithm and is used for locating dimeric and trimeric coiled coils. ( Reference: E. Wolf et al. 1997. Protein Sci. 6: 1179-1189). CoilP

REPPER (REPeats and their PERiodicities) - detects and analyzes regions with short gapless repeats in proteins. It finds periodicities by Fourier Transform (FTwin) and internal similarity analysis (REPwin). FTwin assigns numerical values to amino acids that reflect certain properties, for instance hydrophobicity, and gives information on corresponding periodicities. REPwin uses self-alignments and displays repeats that reveal significant internal similarities. They are complemented by PSIPRED and coiled coil prediction (COILS), making the server a useful analytical tool for fibrous proteins. ( Reference: M. Gruber et al. 2005. Nucl. Acids Res. 33: W239-W243).

Beta-barrel outer membrane proteins: (Test sequence)

PRED-TMßß (Bagos, P. G., et al. Dept Cell Biology & Biophysics, University of Athens, Greece) - employs a Hidden Markov Model method, capable of predicting and discriminating beta-barrel outer membrane proteins. Gives one the opportunity to download a custom image plot or a 2D representation (see below):

BetaTPred2 (Bioinformatics Center, Institute of Microbial Technology, India) - predict ß turns in proteins from multiple alignment by using neural network from the given amino acid sequence. For ß turn prediction, it uses the position specific score matrices generated by PSI-BLAST and secondary structure predicted by PSIPRED. For a classification of the ß turn type use BetaTurns.

BOMP - The ß-barrel outer membrane protein predictor ( Reference: Berven, F.S. et al. 2004. Nucl. Acids Res. 32 (Web Server issue):W394-9 ).

HHomp - detection of outer membrane proteins by HMM-HMM comparisons

ConBBPred - Consensus Prediction of Transmembrane Beta-Barrel Proteins - gives one a choice of eight prediction programs.

Aligning multiple genomic sequences with the threaded blockset aligner

We define a "threaded blockset," which is a novel generalization of the classic notion of a multiple alignment. A new computer program called TBA (for "threaded blockset aligner") builds a threaded blockset under the assumption that all matching segments occur in the same order and orientation in the given sequences inversions and duplications are not addressed. TBA is designed to be appropriate for aligning many, but by no means all, megabase-sized regions of multiple mammalian genomes. The output of TBA can be projected onto any genome chosen as a reference, thus guaranteeing that different projections present consistent predictions of which genomic positions are orthologous. This capability is illustrated using a new visualization tool to view TBA-generated alignments of vertebrate Hox clusters from both the mammalian and fish perspectives. Experimental evaluation of alignment quality, using a program that simulates evolutionary change in genomic sequences, indicates that TBA is more accurate than earlier programs. To perform the dynamic-programming alignment step, TBA runs a stand-alone program called MULTIZ, which can be used to align highly rearranged or incompletely sequenced genomes. We describe our use of MULTIZ to produce the whole-genome multiple alignments at the Santa Cruz Genome Browser.


( A ) Blocks (alignments)…

( A ) Blocks (alignments) of a hypothetical threaded blockset for sequences h…

( A ) Alignments between…

( A ) Alignments between the chloroplast genomes of Arabidopsis thaliana and Oenothera…

A threaded blockset for vertebrate…

A threaded blockset for vertebrate HoxA regions, displayed in our interactive blockset viewer…

( A ) Accuracy of the multiple alignments produced by different aligners on…

Pictorial representation of an application…

Pictorial representation of an application of MULTIZ. M is a human-ref blockset of…

UCSC Genome Browser display of…

UCSC Genome Browser display of HUMOR alignments. ( A ) Ribosomal protein RPL31.…

Step 2: Generate the Alignment

Select both items in the Project View.

Right-click the selected items, and then click the Run Tool.

In the Run Tool dialog, in the Alignment Creation section, select the ProSPLIGN tool.

Click Next.

ProSplign generates pairwise alignment between a protein and a genomic sequence. Here you can select multiple genomic ranges or transcripts to be aligned to one protein. The General Options tab allows to set various options. You can choose the genomic sequence strand that may be ‘Plus’, ‘Minus’ or ‘Both’. For sequences that have no introns, uncheck the ‘With introns’ checkbox. The genetic code is automatically determined from the organism associated to the sequence, or you can select it manually. You can also choose three of the scoring parameters: the frameshift and the gap opening cost as well as the gap extension cost for one amino acid.

The Refinement Options tab allows to set options for post-processing the alignment. By default, this option is set – to unset, uncheck the Refine the alignment checkbox. The other checkboxes are responsible for removing only the flank regions and for removing Ns from the end of good regions from the full alignment.

The flank positives and the total positives are the minimum percentage of positives the final refined alignment will have. If the percentage of positives is less than the total positives, more bad pieces will be removed. Any flank with percentage of positives less than the flank positives will be trimmed. Good regions shorter than the minimum length of good region will also be trimmed.

The minimum exon identity/positives represent the smallest percentage of exon identity/positives that may appear in the refined alignment for either a full or partial exon. The number of bases in the first and the last exon that will appear in the refined alignment will be at least the minimum flanking exon length.

When you are finished choosing your settings, click Next. To restore both the general and the refinement parameters to their default values, click Defaults.

Select ‘Add to an existing Project’ option and click Finish. The generated alignment is added to the Project.

Meanwhile, observe that the new alignment is displayed in the Graphical Sequence View in the Alignments track. A tooltip appears when you hover over it.


Department of Life Science, Shiv Nadar University, Greater Noida, UP, 201314, India

Basharat Bhat & Ashutosh Singh

Department of Animal Biotechnology, Sher-e-Kashmir University of Agricultural Sciences and Technology, Shuhama, Jammu and Kashmir, 190016, India

Nazir A. Ganai, Syed Mudasir Andrabi & Riaz A. Shah

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar


A.S. conceptualized the problem, B.B. conducted the experiment along with S.M.A. and R.A.S., A.S. and N.A.G. plan the work flow and A.S. N.A.G. and B.B. wrote the Manuscript. All authors reviewed the manuscript.

Corresponding author


BACKGROUND INFORMATION : The three BLAST programs that one will commonly use are BLASTN, BLASTP and BLASTX. BLASTN will compare your DNA sequence with all the DNA sequences in the nonredundant database (nr). BLASTP will compare your protein sequence with all the protein sequences in nr. In BLASTX your nucleotide sequence will be translated in all six reading frames and the products compared with the nr protein database. Several online tutorial are available including BLAST QuickStart and Basic Web BLAST from NCBI and a YouTube video.

Nucleotide BLAST ( BLASTN ) N.B. the default database is the "nucleotide collection (nt/nr)." Comparatively recently NCBI offers the ability to conduct Batch BLAST searches.
Protein BLAST ( BLASTP ) N.B. This program is also coupled with a motif search. If you suspect that your pprotein may only show weak sequence similarity to other proteins, I would suggest clicking on the PSI-BLAST (Position-Specific Iterated BLAST) feature. NCBI provides a PSI-BLAST tutorial, and a YouTube video. Since interpretation of the results requires scientific tact you might want to check the potential homologs using SIB-BLAST. SimpleIsBeautiful is a novel methodology that aims to improve the statistical assessment of hits returned from a PSI-BLAST search to yield better delineation of true and false positives.( Reference: Lee, M.M. et al. 2008. Bioinformatics 24:1339-1343). SIB-BLAST can also be accessed here.
Translated BLAST ( BLASTX ) - runs an input DNA sequence againnst the protein databases.
TBLASTX searches translated nucleotide databases using a translated nucleotide query while TBLASTN searches translated nucleotide databases using a protein query. These are useful resources if you are interested in homologs in unfinished genomes. Under "Databases" select "genomic survey sequences", "High throughput genomic sequences" or "whole-genome shotgun reads"
Microbial Nucleotide BLAST (BLASTN, TBLASTN, TBLASTX etc.). Permits one to compare a nucleic acid or protein sequence against finished archaeal and bacterial genomes .

1. Depending upon the time of day your results may appear almost immediately or your search may be delayed or not accepted at all. Be prepared for plenty of results. You may only want to print the first few pages (e.g.1-5). Alternatively under "Algorithm Parameters" change the "Maximum targets" from 100 (default) to 10 or 50 .
2. For PSI-BLAST, and other searches I frequently enter information in the "Entrez Query" section e.g. Escherichia coli[organism] or Viruses[organism] to see "hits" specifically to E. coli or viruses/bacteriophages (see here for details)
3. It is adviseable to always select " Show results in a new window"

EMB BLAST - (European Molecular Biology network - Swiss node) . Very convenient since it permits one to specifically search databases such as prokaryote, bacteriophage, fungal, & 16S rRNA using BLASTN, and specific bacterial genomes or SwissProt using BLASTX or BLASTN.

BlastStation-Free in the Cloud (TM Software, Inc. founder: Takashi Miyajima) is a web-based 64-bit local BlastStation running on the Cloud computer. It supports megablast, blastn, blastp, and blastx searches allows easy database creation from your FASTA or FASTQ file, which can be compressed in .gz, .Z, or .zip format. A graphical display of search results and a summary table display of search results. The latter can be exported in CSV format, while the hit sequences can be exported in FASTA format. Also available for download in Mac or PC format.

GTOP Sequence Homology Search (Laboratory for Gene-Product Informatics, National Institute of Genetics, Japan) - offers BLASTP search capability against individual Archaea, Bacteria, Eukaryota, and viruses.

Actinobacteriophage Database (Graham Hatfull, U.S.A.) - allows BLASTN and BLASTP analyses against a growing list of phages that infect bacterial hosts within the phylum Actinobacteria.

HHPred Homology detection & structure prediction by HMM-HMM comparison - is a method for database searching and structure prediction that is as easy to use as BLAST but is much more sensitive in finding remote homologs. HHpred is the first server that is based on the pairwise comparison of profile hidden Markov models (HMMs). Whereas most conventional sequence search methods search sequence databases such as UniProt or the NR, HHpred searches alignment databases, like Pfam or SMART. This greatly simplifies the list of hits to a number of sequence families instead of a clutter of single sequences. HHpred accepts a single query sequence or a multiple alignment as input. ( Reference: Söding J et al. 2005. Nucl. Acids Res. 33, W244-W248 (Web Server issue)

IGBLAST - can be used to analyze immunoglobulin (Ig) sequences and T cell receptor (TR) sequences

Primer-BLAST: Finding primers specific to your PCR template (using Primer3 and BLAST).

VecScreen is a system that quickly finds segments of a nucleic acid sequence that may be of vector origin. It helps researchers identify and remove any segments of vector origin before they analyze or submit sequences.

For more sophisticated studies you might want to employ:

PSI-BLAST or PHI-BLAST or DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST) search - (NCBI) Position-Specific Iterative BLAST creates a profile after the initial search.
Genome vs Genome BLAST - (NCBI) BLAST two sequences against one another. N.B. This utilizes BLASTN, P, X as well as TBLASTN and TBLASTX.

Gene Context Tool NG - is an incredible tool for visualizing the genome context of a gene or group of genes (synteny). In the following diagram an RpoN (Sigma54) protein was analyzed. ( Reference: R. Ciria et al. (2004) Bioinformatics 20: 2307-2308).

Cinteny - Server for Synteny Identification and Analysis of Genome Rearrangement using reversal distance as a measure. You may create a project and upload your own data by following the links below or work with pre-loaded data by selecting the genomes below ( Reference: Sinha, A.U. & Meller, J. 2007. BMC Bioinformatics 8: 82)

Other search engines include:

Fasta Protein Similarity Search - (EBI) This tool provides sequence similarity searching against protein databases using the FASTA suite of programs. FASTA provides a heuristic search with a protein query. FASTX and FASTY translate a DNA query. Optimal searches are available with SSEARCH (local), GGSEARCH (global) and GLSEARCH (global query, local database).

TC-BLAST (Saier Laboratory Bioinformatics Grp, Univ. San Diego, U.S.A.) - Scans the transport protein database (TC-DB) producing alignments and phylogenetic trees. The TC-DB details a comprehensive classification system for membrane transport proteins known as the Transport Commission (TC) system.

MEROPS BLAST - permits one to screen protein sequences against an extensive database of characterized peptidases ( Reference: Rawlings, N.D et al. 2002. Nucleic Acids Res. 30: 343-346).

COMPASS - is a profile-based method for the detection of remote sequence similarity and the prediction of protein structure. The server features three major developments: (i) improved statistical accuracy (ii) increased speed from parallel implementation and (iii) new functional features facilitating structure prediction. These features include visualization tools that allow the user to quickly and effectively analyze specific local structural region predictions suggested by COMPASS alignments.( Reference: R.I. Sadreyev et al. 2009. Nucl. Acids Res. 37(Web Server issue:W90-W94)

SANSparallel: interactive homology search against Uniprot - the webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. ( Reference: P. Somervuo & L. Holm. 2015. Nucl. Acids Res. 43 (W1): W24-W29).

Detect bacterial toxins through text and homology searches:

DBETH Database of Bacterial ExoToxins for Human - is a database of sequences, structures, interaction networks and analytical results for 229 exotoxins from 26 different human pathogenic bacterial genera. All toxins are classified into 24 different Toxin classes. The aim of DBETH is to provide a comprehensive database for human pathogenic bacterial exotoxins. ( Reference: Chakraborty, A. et al. 2012. Nucl. Acids Res. 40(Database issue): D615-620).

Orthologous genes/proteins

COG analysis - Clusters of Orthologous Groups - COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages and likely corresponds to an ancient conserved domain (CloVR) . Sites which offer this analysis include:

WebMGA ( Reference: S. Wu et al. 2011. BMC Genomics 12:444), RAST ( Reference: Aziz RK et al. 2008. BMC Genomics 9:75), and BASys (Bacterial Annotation System Reference: Van Domselaar GH et al. 2005. Nucleic Acids Res. 33(Web Server issue):W455-459.) and JGI IMG (Integrated Microbial Genomes Reference : Markowitz VM et al. 2014. Nucl. Acids Res. 42: D560-D567. )

Other sites:

EggNOG - A database of orthologous groups and functional annotation that derives Nonsupervised Orthologous Groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. ( Reference: Powell S et al. 2014. Nucleic Acids Res. 42 (D1): D231-D239

OrthoMCL - is another algorithm for grouping proteins into ortholog groups based on their sequence similarity. The process usually takes between 6 and 72 hours.( Reference: Fischer S et al. 2011. Curr Protoc Bioinformatics Chapter 6:Unit 6.12.1-19).

KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways. ( Reference: Moriya Y et al. 2007. Nucleic Acids Res. 35(Web Server issue):W182-185).

Unique search engine:

MineBlast - performs BLASTP searches in UniProt to identify names and synonyms based on homologous proteins and subsequently queries PubMed, using combined search terms in order to find and present relevant literature. This tool only allows max. 100 queries per user per day. ( Reference: G. Dieterich et al. 2005. Bioinformatics 21: 3450-3451).

Comparison of homology between two small genomes:

Kablammo helps you create interactive visualizations of BLAST results from your web browser. Find your most interesting alignments, list detailed parametersfor each, and export a publication-ready vector image. Incredibly easy to use - here are the results for a BLASTN comparison to Escherichia phages T1 (query)and ADB-2. ( Reference: Wintersinger JA et al. Bioinformatics 31:1305-1306).

SCAN2 ( provides one with a colour-coded graphical alignment of genome length DNAs in Java. In the top panel regions of high sequence identity are presented in red. By highlighting the gray, yellow, green, black boxes one can select specific regions for examination of the sequence alignment. For additional information on the output see here. This site appears to work best with Internet Explorer.
Advanced PipMaker ( Schwartz et al. Genome Research Vol. 10, Issue 4, 577-586, April 2000 ) aligns two DNA sequences and returns a percent identity plot of that alignment, together with a traditional textual form of the alignment. You might want to download Laj ( Penn State - Bioinformatics Group, U.S.A. ) for viewing and manipulating the output from pairwise alignment programs such as PipMaker representations of the alignments.

JDotter - is a Java Dot Plot Viewer ( Viral Bioinformatics Resource Center, University of Victoria, Canada) - a dot matrix plotter for Java. Produces similar diagrams to the above mentioned programs, but with better control on output. Also available here .

zPicture: multiple sequence alignment tool (Comparative Genomics Center, Lawrence Livermore National Laboratory, U.S.A.) - provides nice dotplot graphs and dynamic visualizations. If simple gene locations are provided in the form (e.g. > 2000 5000 RNA_polymerase indicates the the RNA polymerase gene is found on the plus strand between bases 2000 and 5000) this data will be added to the dynamic visualization. zPicture alignments can be automatically submitted to rVista to identify conserved transcription factor binding sites. For more than two genomes go here.