ebi_fasta

 

Function

fastA search of query sequence against sequence databank using EBI Web Services

Description

ebi_fasta uses the Web services of the EBI to submit a sequence for a similarity search by fastA against one of the databanks available at the EBI.

The EMBL-EBI hosts a collection of databanks that differs from the one at the BEN site. Among other things, you can search the Alternative Splicing Database, the Ligand-Gated Ion Channel database and a large collection of complete genomes and proteomes.

Algorithm

ebi_fasta relies on SOAP-based interaction between the Perl client fasta.pl and the Web service described at http://www.ebi.ac.uk/Tools/webservices/wsdl/WSFasta.wsdl, which provides access to programs from Pearson's fastA package version 3.

For an explanation of the fastA algorithm, see the on-line help for fasta.

The submission and retrieval procedure

ebi_fasta submits the request in asynchronous mode and receives a job ID from the server at the EBI. On the server side the job is placed in a queue; once it has completed, the results are kept for some time. ebi_fasta waits 1 minute and then asks the server for the status of the submitted job (RUNNING/ERROR/DONE). As long as the job is still "RUNNING", ebi_fasta repeats the status request at increasing intervals (for up to 29 hours and 55 min). Once the status is "DONE", ebi_fasta sends a request to retrieve the result.
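
The submit/wait/check/retrieve cycle described above can be sketched as follows. This is only an illustration, not the code of ebi_fasta or fasta.pl: the doubling of the interval, the reading of "29 hours and 55 min" as a total waiting budget, and the names wait_intervals, poll_job, check_status and fetch_result are all assumptions made for the example. The two callables stand in for the real Web service requests.

```python
import time
from typing import Callable, Iterator

# Waiting budget quoted in the text: 29 hours and 55 minutes, in seconds.
# Treating it as a limit on the TOTAL waiting time is an assumption.
MAX_TOTAL_WAIT = 29 * 3600 + 55 * 60


def wait_intervals(first: float = 60.0, factor: float = 2.0,
                   total: float = MAX_TOTAL_WAIT) -> Iterator[float]:
    """Yield increasing delays whose sum never exceeds `total`.

    The first delay is the documented 1 minute; the doubling factor is an
    assumption, since the actual schedule is not documented.
    """
    waited = 0.0
    delay = first
    while waited < total:
        step = min(delay, total - waited)
        yield step
        waited += step
        delay *= factor


def poll_job(job_id: str,
             check_status: Callable[[str], str],
             fetch_result: Callable[[str], str],
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Wait, ask for the job status, and fetch the result once it is DONE."""
    for delay in wait_intervals():
        sleep(delay)
        status = check_status(job_id)
        if status == "DONE":
            return fetch_result(job_id)
        if status == "ERROR":
            raise RuntimeError(f"job {job_id} failed on the server")
        # Any other status (normally RUNNING): wait longer and ask again.
    raise TimeoutError(f"job {job_id} not finished after {MAX_TOTAL_WAIT} s")
```

Passing the sleep function as a parameter makes it possible to try the loop without actually waiting, for instance with sleep=lambda seconds: None.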

Usage

Here is a sample session with ebi_fasta

> ebi_fasta
fastA search of query sequence against sequence databank using EBI Web
Services
         1 : fasta nuc against nuc
         2 : fasta prot against prot
         3 : fastx (nuc against prot with codon/aa alignment)
         4 : fasty (idem + allowing intracodon gaps)
         5 : tfastx (prot against nuc with codon/aa alignment)
         6 : tfasty (idem + allowing intracodon gaps)
         7 : ssearch (using SW instead) nuc against nuc
         8 : ssearch (using SW instead) prot against prot
         9 : ggsearch (NW global/global alignment) nuc against nuc
        10 : ggsearch (NW global/global alignment) prot against prot
        11 : glsearch (global/local alignment) nuc against nuc
        12 : glsearch (global/local alignment) prot against prot
Select type of search you want to run [1]:
Query sequence: embl:x15320
   general : general databank
    genome : genome/proteome databank
       ASD : ASD or LGIC databank
Databank type [general]: genome
  180454-g : (Euk) Anopheles gambiae str. PEST
    3702-g : (Euk) Arabidopsis thaliana
  284811-g : (Euk) Ashbya gossypii ATCC 10895
  330879-g : (Euk) Aspergillus fumigatus Af293
    5061-g : (Euk) Aspergillus niger
    9913-g : (Euk) Bos taurus
    6238-g : (Euk) Caenorhabditis briggsae

  [some lines have been deleted for brevity]

  414004-g : (Arc) Cenarchaeum symbiosum A
  490899-g : (Arc) Desulfurococcus kamchatkensis 1221n
  333146-g : (Arc) Ferroplasma acidarmanus fer1
  272569-g : (Arc) Haloarcula marismortui ATCC 43049
  478009-g : (Arc) Halobacterium salinarum R1
   64091-g : (Arc) Halobacterium sp. NRC-1
  293091-g : (Arc) Haloquadratum walsbyi

  [some lines have been deleted for brevity]

  585056-g : (Bac) Escherichia coli UMN026
  364106-g : (Bac) Escherichia coli UTI89
  316385-g : (Bac) Escherichia coli str. K-12 substr. DH10B
  511145-g : (Bac) Escherichia coli str. K-12 substr. MG1655
  316407-g : (Bac) Escherichia coli str. K-12 substr. W3110
  585054-g : (Bac) Escherichia fergusonii ATCC 35469
  262543-g : (Bac) Exiguobacterium sibiricum 255-1

  [some lines have been deleted for brevity]

  329254-g : (Pha) Xanthomonas phage OP1
  331627-g : (Pha) Xanthomonas phage OP2
  470314-g : (Pha) Xanthomonas phage Xop411
  322855-g : (Pha) Xanthomonas phage Xp15
  369257-g : (Pha) Yersinia phage Berlin
  532078-g : (Pha) Yersinia phage Yepe2
  110457-g : (Pha) Yersinia phage phiYeO3-12
Genome databank [511145-g]: 272569-g
Output file [x15320.ebi_fasta]:


Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
   -program            menu       [1] Search type : fastA or optimal
                                  alignment, nuc. or prot. (Values: 1 (fasta
                                  nuc against nuc); 2 (fasta prot against
                                  prot); 3 (fastx (nuc against prot with
                                  codon/aa alignment)); 4 (fasty (idem +
                                  allowing intracodon gaps)); 5 (tfastx (prot
                                  against nuc with codon/aa alignment)); 6
                                  (tfasty (idem + allowing intracodon gaps));
                                  7 (ssearch (using SW instead) nuc against
                                  nuc); 8 (ssearch (using SW instead) prot
                                  against prot); 9 (ggsearch (NW global/global
                                  alignment) nuc against nuc); 10 (ggsearch
                                  (NW global/global alignment) prot against
                                  prot); 11 (glsearch (global/local alignment)
                                  nuc against nuc); 12 (glsearch
                                  (global/local alignment) prot against prot))
  [-sequence]          sequence   Query sequence
   -dbtype             menu       [general] Databank type (Values: general
                                  (general databank); genome (genome/proteome
                                  databank); ASD (ASD or LGIC databank))
*  -gennucdb           menu       [em_rel] General nucleic acid sequence
                                  databank (Values: em_rel (EMBL databank last
                                  release without EST, GSS, HTG, HTC);
                                  em_rel_hum (EMBL human sequences);
                                  em_rel_mus (EMBL mouse sequences);
                                  em_rel_rod (EMBL other rodents sequences);
                                  em_rel_mam (EMBL other mammal sequences);
                                  em_rel_vrt (EMBL other vertebrate
                                  sequences); em_rel_inv (EMBL invertebrate
                                  sequences); em_rel_pln (EMBL plant
                                  sequences); em_rel_fun (EMBL fungi
                                  sequences); em_rel_pro (EMBL prokaryote
                                  sequences); em_rel_phg (EMBL bacteriophage
                                  sequences); em_rel_vrl (EMBL other viral
                                  sequences); em_rel_env (EMBL environmental
                                  sample sequences); em_rel_tgn (EMBL
                                  transgenic sequences); em_rel_syn (EMBL
                                  synthetic sequences); em_rel_unc (EMBL
                                  unclassified sequences); em_rel_std (EMBL
                                  standard sequences); em_rel_htg (EMBL High
                                  Throughput Genome sequences); em_rel_htc
                                  (EMBL High Throughput cDNA sequences);
                                  em_rel_pat (EMBL patent sequences);
                                  em_rel_sts (EMBL Sequence Tagged Site
                                  sequences); em_rel_est (EMBL Expressed
                                  Sequence Tags); em_rel_gss (EMBL Genome
                                  Survey Sequences); em_rel_tsa (EMBL shotgun
                                  Transcriptome Assembly); em_rel_tpa (EMBL
                                  Third Party Annotations); em_rel_std_hum
                                  (EMBL human standard sequences);
                                  em_rel_std_mus (EMBL mouse standard
                                  sequences); em_rel_std_rod (EMBL other

  [some lines have been deleted for brevity]

                                  em_rel_tpa_syn (EMBL synthetic Third Party
                                  Annotations); em_rel_tpa_unc (EMBL
                                  unclassified Third Party Annotations); emcds
                                  (EMBL coding sequences); emvec (EMBL vector
                                  subset); imgtligm (IMGT/LIGM databank);
                                  imgthla (IMGT/HLA databank); hgvbase
                                  (HGVBASE (European SNP databank)))
*  -genprotdb          menu       [uniprot] General protein sequence databank
                                  (Values: uniprot (UniProt databank);
                                  uniref100 (UniRef100); uniref100_seg
                                  (UniRef100 SEG filtered); uniref90
                                  (UniRef90); uniref50 (UniRef50); uniparc
                                  (UniParc); swissprot (UniProt/SwissProt);
                                  ipi (IPI (International Protein Index));
                                  prints (sequences from PRINTS); pdb
                                  (sequences from PDB); sgt (MSD structural
                                  genomics targets); intact (sequences from
                                  IntAct); imgthlap (IMGT/HLA proteins); epop
                                  (European patent sequences); jpop (Japanese
                                  patent sequences); kpop (Korean patent
                                  sequences); uspop (U.S.A. patent sequences))
*  -genonucdb          menu       [511145-g] Genome databank (Values: 180454-g
                                  ((Euk) Anopheles gambiae str. PEST); 3702-g
                                  ((Euk) Arabidopsis thaliana); 284811-g
                                  ((Euk) Ashbya gossypii ATCC 10895); 330879-g

  [some lines have been deleted for brevity]

                                  Xanthomonas phage Xp15); 369257-g ((Pha)
                                  Yersinia phage Berlin); 532078-g ((Pha)
                                  Yersinia phage Yepe2); 110457-g ((Pha)
                                  Yersinia phage phiYeO3-12))
*  -genoprotdb         menu       [25.H_sapiens-p] Proteome databank (Values:
                                  31436.A_aegypti-p ((Euk) Aedes aegypti);
                                  22426.A_gambiae-p ((Euk) Anopheles gambiae);
                                  3.A_thaliana-p ((Euk) Arabidopsis

  [some lines have been deleted for brevity]

                                  28205.Y_pestis_phage_phiA1122-p ((Pha)
                                  Yersinia pestis phage phiA1122);
                                  31357.Y_phage-p ((Pha) Yersinia phage
                                  Yepe2))
*  -asdnucdb           menu       [altsgen] Alternatively spliced nucleic acid
                                  databank or Ligand-Gated Ion Channel
                                  databank (Values: altsgen (AltSplice
                                  confirmed genes); altsiso (AltSplice
                                  confirmed isoforms); aedb (AEDB alternative
                                  exons); lgicn (Ligand-Gated Ion Channel
                                  database))
*  -asdprotdb          menu       [apdb] Alternatively spliced protein
                                  databank or Ligand-Gated Ion Channel
                                  databank (Values: apdb (ASD peptides); lgicp
                                  (Ligand-Gated Ion Channel database))
  [-outfile]           outfile    [*.ebi_fasta] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
*  -strand             menu       [1] Query sequence strand to search. By
                                  default fastA searches both strands of a
                                  nucleic acid query sequence, but you can
                                  choose to search only the top or bottom
                                  strand. (Values: 1 (both); 2 (top); 3
                                  (bottom))
   -wordsize           integer    [6 for fasta nucleic, 2 for other search
                                  types] Word (ktup) size (1 to 6 for fasta
                                  nucleic, 1 or 2 for other search types)
   -expect             float      [2.0 for nucleic, 10.0 for protein, 5.0 for
                                  mixed] E() value = number of databank
                                  sequences with same or higher Z-score that
                                  you expect to find by chance. fastA lists
                                  sequences with an E() value lower than the
                                  cutoff. (Number 0.000 or more)
*  -matrix             menu       [BL50] Amino acid comparison matrix (Values:
                                  BL50 (BLOSUM50); BL62 (BLOSUM62); BL80
                                  (BLOSUM80); P120 (PAM120); P250 (PAM250);
                                  M10 (Jones, Taylor, Thornton PAM10); M20
                                  (Jones, Taylor, Thornton PAM20); M40 (Jones,
                                  Taylor, Thornton PAM40))
   -gappenalty         integer    [14 for nucleic, 10 for protein, 12 for
                                  fastx/y, 14 for tfastx/y] Gap penalty
                                  (Integer from 2 to 14)
   -gaplength          integer    [4 for nucleic, 2 for protein, 2 for mixed]
                                  Gap length penalty. fastA subtracts from the
                                  similarity score for each gap a penalty of
                                  type <Gap penalty> + <Gap length penalty> *
                                  n (Integer from 0 to 8)
*  -nucfilter          selection  [none] Algorithm for removing low complexity
                                  segments out of nucleic acid query sequence
*  -protfilter         selection  [none] Algorithm for removing low complexity
                                  segments out of protein query sequence
                                  (eventually after translation)
   -hide               float      [0.000] Do not show sequences with E() value
                                  lower than f. (Number 0.000 or more)
   -listsize           integer    [500] Show only the n best scoring sequences
                                  that satisfy E() cutoff (Integer 0 or more)
   -align              integer    [250] Show only alignments for the n first
                                  sequences (Integer 0 or more, but not >
                                  listsize)

   Advanced (Unprompted) qualifiers:
   -[no]stat           boolean    [Y] Compute statistics (is default). If you
                                  switch this off, fastA will sort on opt
                                  instead off bit score and report no more
                                  than 20 best scoring sequences.
   -shuffle            boolean    Use shuffled databank sequences for
                                  statistics. Useful if search set contains no
                                  sequences unrelated to query sequence.
   -histogram          boolean    Show histogram
   -xml                boolean    Use XML formatting

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of the sequence to be used
   -send1              integer    End of the sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
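
For scripted use, the qualifiers listed above can be assembled into a non-interactive command line. The sketch below only builds the argument list and does not run anything. The helper name build_command is hypothetical, and it is an assumption that an executable named ebi_fasta is on the PATH and that -auto (listed above as "Turn off prompts") is enough to suppress all prompts for a given combination of qualifiers.

```python
from typing import List

DBTYPES = ("general", "genome", "ASD")


def build_command(sequence: str, outfile: str, program: int = 1,
                  dbtype: str = "general", **qualifiers: object) -> List[str]:
    """Build an argument list for ebi_fasta from the documented qualifiers.

    Extra keyword arguments are passed through as "-name value" pairs,
    for example genonucdb="272569-g" or expect=2.0.
    """
    if not 1 <= program <= 12:
        raise ValueError("program must be a menu number from 1 to 12")
    if dbtype not in DBTYPES:
        raise ValueError(f"dbtype must be one of {DBTYPES}")
    argv = ["ebi_fasta", "-program", str(program),
            "-sequence", sequence, "-dbtype", dbtype]
    for name, value in qualifiers.items():
        argv += [f"-{name}", str(value)]
    argv += ["-outfile", outfile, "-auto"]
    return argv


# The choices made in the sample session, expressed as one command line.
command = build_command("embl:x15320", "x15320.ebi_fasta",
                        program=1, dbtype="genome", genonucdb="272569-g")
print(" ".join(command))
```

The resulting list could be handed to subprocess.run(command, check=True) on a system where the program is installed.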


Standard (Mandatory) qualifiers

-program
Description: Search type: fastA or optimal alignment, nuc. or prot.
Allowed values:
1 (fasta nuc against nuc)
2 (fasta prot against prot)
3 (fastx (nuc against prot with codon/aa alignment))
4 (fasty (idem + allowing intracodon gaps))
5 (tfastx (prot against nuc with codon/aa alignment))
6 (tfasty (idem + allowing intracodon gaps))
7 (ssearch (using SW instead) nuc against nuc)
8 (ssearch (using SW instead) prot against prot)
9 (ggsearch (NW global/global alignment) nuc against nuc)
10 (ggsearch (NW global/global alignment) prot against prot)
11 (glsearch (global/local alignment) nuc against nuc)
12 (glsearch (global/local alignment) prot against prot)
Default: 1

[-sequence] (Parameter 1)
Description: Query sequence
Allowed values: Readable sequence
Default: Required

-dbtype
Description: Databank type
Allowed values:
general (general databank)
genome (genome/proteome databank)
ASD (ASD or LGIC databank)
Default: general

-gennucdb
Description: General nucleic acid sequence databank
Allowed values:
em_rel (EMBL databank last release without EST, GSS, HTG, HTC)
em_rel_hum (EMBL human sequences)
em_rel_mus (EMBL mouse sequences)
em_rel_rod (EMBL other rodents sequences)
em_rel_mam (EMBL other mammal sequences)
em_rel_vrt (EMBL other vertebrate sequences)
em_rel_inv (EMBL invertebrate sequences)
em_rel_pln (EMBL plant sequences)
em_rel_fun (EMBL fungi sequences)
em_rel_pro (EMBL prokaryote sequences)
em_rel_phg (EMBL bacteriophage sequences)
em_rel_vrl (EMBL other viral sequences)
em_rel_env (EMBL environmental sample sequences)
em_rel_tgn (EMBL transgenic sequences)
em_rel_syn (EMBL synthetic sequences)
em_rel_unc (EMBL unclassified sequences)
em_rel_std (EMBL standard sequences)
em_rel_htg (EMBL High Throughput Genome sequences)
em_rel_htc (EMBL High Throughput cDNA sequences)
em_rel_pat (EMBL patent sequences)
em_rel_sts (EMBL Sequence Tagged Site sequences)
em_rel_est (EMBL Expressed Sequence Tags)
em_rel_gss (EMBL Genome Survey Sequences)
em_rel_tsa (EMBL shotgun Transcriptome Assembly)
em_rel_tpa (EMBL Third Party Annotations)
em_rel_std_hum (EMBL human standard sequences)
em_rel_std_mus (EMBL mouse standard sequences)
em_rel_std_rod (EMBL other rodents standard sequences)
em_rel_std_mam (EMBL other mammal standard sequences)
em_rel_std_vrt (EMBL other vertebrate standard sequences)
em_rel_std_inv (EMBL invertebrate standard sequences)
em_rel_std_pln (EMBL plant standard sequences)
em_rel_std_fun (EMBL fungi standard sequences)
em_rel_std_pro (EMBL prokaryote standard sequences)
em_rel_std_phg (EMBL bacteriophage standard sequences)
em_rel_std_vrl (EMBL other viral standard sequences)
em_rel_std_env (EMBL environmental sample sequences)
em_rel_std_tgn (EMBL transgenic standard sequences)
em_rel_std_syn (EMBL synthetic standard sequences)
em_rel_std_unc (EMBL unclassified standard sequences)
em_rel_htg_hum (EMBL human HTG sequences)
em_rel_htg_mus (EMBL mouse HTG sequences)
em_rel_htg_rod (EMBL other rodent HTG sequences)
em_rel_htg_mam (EMBL other mammal HTG sequences)
em_rel_htg_vrt (EMBL other vertebrate HTG sequences)
em_rel_htg_inv (EMBL invertebrate HTG sequences)
em_rel_htg_pln (EMBL plant HTG sequences)
em_rel_htg_fun (EMBL fungi HTG sequences)
em_rel_htg_pro (EMBL prokaryote HTG sequences)
em_rel_htg_phg (EMBL bacteriophage HTG sequences)
em_rel_htg_vrl (EMBL other viral HTG sequences)
em_rel_htg_env (EMBL environmental sample HTG sequences)
em_rel_htc_hum (EMBL human HTC sequences)
em_rel_htc_mus (EMBL mouse HTC sequences)
em_rel_htc_rod (EMBL other rodent HTC sequences)
em_rel_htc_mam (EMBL other mammal HTC sequences)
em_rel_htc_vrt (EMBL other vertebrate HTC sequences)
em_rel_htc_inv (EMBL invertebrate HTC sequences)
em_rel_htc_pln (EMBL plant HTC sequences)
em_rel_htc_fun (EMBL fungi HTC sequences)
em_rel_htc_pro (EMBL prokaryote HTC sequences)
em_rel_pat_hum (EMBL human patent sequences)
em_rel_pat_mus (EMBL mouse patent sequences)
em_rel_pat_rod (EMBL other rodents patent sequences)
em_rel_pat_mam (EMBL other mammal patent sequences)
em_rel_pat_vrt (EMBL other vertebrate patent sequences)
em_rel_pat_inv (EMBL invertebrate patent sequences)
em_rel_pat_pln (EMBL plant patent sequences)
em_rel_pat_fun (EMBL fungi patent sequences)
em_rel_pat_pro (EMBL prokaryote patent sequences)
em_rel_pat_phg (EMBL bacteriophage patent sequences)
em_rel_pat_vrl (EMBL other viral patent sequences)
em_rel_pat_env (EMBL environmental sample patent sequences)
em_rel_pat_syn (EMBL synthetic patent sequences)
em_rel_pat_unc (EMBL unclassified patent sequences)
em_rel_sts_hum (EMBL human STS sequences)
em_rel_sts_mus (EMBL mouse STS sequences)
em_rel_sts_rod (EMBL other rodents STS sequences)
em_rel_sts_mam (EMBL other mammal STS sequences)
em_rel_sts_vrt (EMBL other vertebrate STS sequences)
em_rel_sts_inv (EMBL invertebrate STS sequences)
em_rel_sts_pln (EMBL plant STS sequences)
em_rel_sts_fun (EMBL fungi STS sequences)
em_rel_sts_pro (EMBL prokaryote STS sequences)
em_rel_est_hum (EMBL human EST)
em_rel_est_mus (EMBL mouse EST)
em_rel_est_rod (EMBL other rodents EST)
em_rel_est_mam (EMBL other mammals EST)
em_rel_est_vrt (EMBL other vertebrate EST)
em_rel_est_inv (EMBL invertebrate EST)
em_rel_est_pln (EMBL plant EST)
em_rel_est_fun (EMBL fungi EST)
em_rel_est_pro (EMBL prokaryote EST)
em_rel_est_env (EMBL environmental sample EST)
em_rel_gss_hum (EMBL human GSS)
em_rel_gss_mus (EMBL mouse GSS)
em_rel_gss_rod (EMBL other rodents GSS)
em_rel_gss_mam (EMBL other mammals GSS)
em_rel_gss_vrt (EMBL other vertebrate GSS)
em_rel_gss_inv (EMBL invertebrate GSS)
em_rel_gss_pln (EMBL plant GSS)
em_rel_gss_fun (EMBL fungi GSS)
em_rel_gss_pro (EMBL prokaryote GSS)
em_rel_gss_phg (EMBL bacteriophage GSS)
em_rel_gss_vrl (EMBL other viral GSS)
em_rel_gss_env (EMBL environmental sample GSS)
em_rel_gss_tgn (EMBL transgenic GSS)
em_rel_tsa_vrt (EMBL other vertebrate Transcriptome Assembly)
em_rel_tsa_inv (EMBL invertebrate Transcriptome Assembly)
em_rel_tsa_pln (EMBL plant Transcriptome Assembly)
em_rel_tsa_fun (EMBL fungi Transcriptome Assembly)
em_rel_tpa_hum (EMBL human Third Party Annotations)
em_rel_tpa_mus (EMBL mouse Third Party Annotations)
em_rel_tpa_rod (EMBL other rodents Third Party Annotations)
em_rel_tpa_mam (EMBL other mammal Third Party Annotations)
em_rel_tpa_vrt (EMBL other vertebrate Third Party Annotations)
em_rel_tpa_inv (EMBL invertebrate Third Party Annotations)
em_rel_tpa_pln (EMBL plant Third Party Annotations)
em_rel_tpa_fun (EMBL fungi Third Party Annotations)
em_rel_tpa_pro (EMBL prokaryote Third Party Annotations)
em_rel_tpa_phg (EMBL bacteriophage Third Party Annotations)
em_rel_tpa_vrl (EMBL other viral Third Party Annotations)
em_rel_tpa_syn (EMBL synthetic Third Party Annotations)
em_rel_tpa_unc (EMBL unclassified Third Party Annotations)
emcds (EMBL coding sequences)
emvec (EMBL vector subset)
imgtligm (IMGT/LIGM databank)
imgthla (IMGT/HLA databank)
hgvbase (HGVBASE (European SNP databank))
Default: em_rel

-genprotdb
Description: General protein sequence databank
Allowed values:
uniprot (UniProt databank)
uniref100 (UniRef100)
uniref100_seg (UniRef100 SEG filtered)
uniref90 (UniRef90)
uniref50 (UniRef50)
uniparc (UniParc)
swissprot (UniProt/SwissProt)
ipi (IPI (International Protein Index))
prints (sequences from PRINTS)
pdb (sequences from PDB)
sgt (MSD structural genomics targets)
intact (sequences from IntAct)
imgthlap (IMGT/HLA proteins)
epop (European patent sequences)
jpop (Japanese patent sequences)
kpop (Korean patent sequences)
uspop (U.S.A. patent sequences)
Default: uniprot

-genonucdb
Description: Genome databank
Allowed values:
180454-g ((Euk) Anopheles gambiae str. PEST)
3702-g ((Euk) Arabidopsis thaliana)
284811-g ((Euk) Ashbya gossypii ATCC 10895)
...
316385-g ((Bac) Escherichia coli str. K-12 substr. DH10B)
511145-g ((Bac) Escherichia coli str. K-12 substr. MG1655)
316407-g ((Bac) Escherichia coli str. K-12 substr. W3110)
...
369257-g ((Pha) Yersinia phage Berlin)
532078-g ((Pha) Yersinia phage Yepe2)
110457-g ((Pha) Yersinia phage phiYeO3-12)
Default: 511145-g

-genoprotdb
Description: Proteome databank
Allowed values:
31436.A_aegypti-p ((Euk) Aedes aegypti)
22426.A_gambiae-p ((Euk) Anopheles gambiae)
3.A_thaliana-p ((Euk) Arabidopsis thaliana)
...
55.G_theta-p ((Euk) Guillardia theta)
25.H_sapiens-p ((Euk) Homo sapiens)
20004.K_lactis-p ((Euk) Kluyveromyces lactis)
...
29301.X_phage_phiXo411-p ((Pha) Xanthomonas phage phiXo411)
28205.Y_pestis_phage_phiA1122-p ((Pha) Yersinia pestis phage phiA1122)
31357.Y_phage-p ((Pha) Yersinia phage Yepe2)
Default: 25.H_sapiens-p

-asdnucdb
Description: Alternatively spliced nucleic acid databank or Ligand-Gated Ion Channel databank
Allowed values:
altsgen (AltSplice confirmed genes)
altsiso (AltSplice confirmed isoforms)
aedb (AEDB alternative exons)
lgicn (Ligand-Gated Ion Channel database)
Default: altsgen

-asdprotdb
Description: Alternatively spliced protein databank or Ligand-Gated Ion Channel databank
Allowed values:
apdb (ASD peptides)
lgicp (Ligand-Gated Ion Channel database)
Default: apdb

[-outfile] (Parameter 2)
Description: Output file name
Allowed values: Output file
Default: <sequence>.ebi_<program>

Additional (Optional) qualifiers

-strand
Description: Query sequence strand to search. By default fastA searches both strands of a nucleic acid query sequence, but you can choose to search only the top or bottom strand.
Allowed values:
1 (both)
2 (top)
3 (bottom)
Default: 1

-wordsize
Description: Word (ktup) size
Allowed values: 1 to 6 for fasta nucleic, 1 or 2 for other search types
Default: 6 for fasta nucleic, 2 for other search types

-expect
Description: E() value = number of databank sequences with same or higher Z-score that you expect to find by chance. fastA lists sequences with an E() value lower than the cutoff.
Allowed values: Number 0.000 or more
Default: 2.0 for nucleic, 10.0 for protein, 5.0 for mixed

-matrix
Description: Amino acid comparison matrix
Allowed values:
BL50 (BLOSUM50)
BL62 (BLOSUM62)
BL80 (BLOSUM80)
P120 (PAM120)
P250 (PAM250)
M10 (Jones, Taylor, Thornton PAM10)
M20 (Jones, Taylor, Thornton PAM20)
M40 (Jones, Taylor, Thornton PAM40)
Default: BL50

-gappenalty
Description: Gap penalty
Allowed values: Integer from 2 to 14
Default: 14 for nucleic, 10 for protein, 12 for fastx/y, 14 for tfastx/y

-gaplength
Description: Gap length penalty. fastA subtracts from the similarity score for each gap a penalty of type <Gap penalty> + <Gap length penalty> * n
Allowed values: Integer from 0 to 8
Default: 4 for nucleic, 2 for protein, 2 for mixed

-nucfilter
Description: Algorithm for removing low complexity segments from the nucleic acid query sequence
Allowed values: Choose from selection list of values
Default: none

-protfilter
Description: Algorithm for removing low complexity segments from the protein query sequence (after translation, where applicable)
Allowed values: Choose from selection list of values
Default: none

-hide
Description: Do not show sequences with E() value lower than f.
Allowed values: Number 0.000 or more
Default: 0.000

-listsize
Description: Show only the n best scoring sequences that satisfy E() cutoff
Allowed values: Integer 0 or more
Default: 500

-align
Description: Show only alignments for the n first sequences
Allowed values: Integer 0 or more, but not > listsize
Default: 250

Advanced (Unprompted) qualifiers

-[no]stat
Description: Compute statistics (is default). If you switch this off, fastA will sort on opt instead of bit score and report no more than 20 best scoring sequences.
Allowed values: Boolean value Yes/No
Default: Yes

-shuffle
Description: Use shuffled databank sequences for statistics. Useful if search set contains no sequences unrelated to query sequence.
Allowed values: Boolean value Yes/No
Default: No

-histogram
Description: Show histogram
Allowed values: Boolean value Yes/No
Default: No

-xml
Description: Use XML formatting
Allowed values: Boolean value Yes/No
Default: No
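
Several defaults above depend on the search type. The following sketch collects them in one lookup and evaluates the gap penalty formula <Gap penalty> + <Gap length penalty> * n. It is an illustration only: the function names are hypothetical, the grouping of menu numbers into "nucleic", "protein" and "mixed" is an assumption derived from the program descriptions, and n is taken to be the length of the gap, which the text above does not define.

```python
# Menu numbers of -program, grouped by type of comparison (assumed grouping).
NUCLEIC = {1, 7, 9, 11}    # nuc against nuc
PROTEIN = {2, 8, 10, 12}   # prot against prot
FASTXY = {3, 4}            # fastx / fasty (mixed)
TFASTXY = {5, 6}           # tfastx / tfasty (mixed)


def default_parameters(program: int) -> dict:
    """Return the documented defaults for one -program menu number."""
    if program in NUCLEIC:
        expect, gappenalty, gaplength = 2.0, 14, 4
    elif program in PROTEIN:
        expect, gappenalty, gaplength = 10.0, 10, 2
    elif program in FASTXY:
        expect, gappenalty, gaplength = 5.0, 12, 2
    elif program in TFASTXY:
        expect, gappenalty, gaplength = 5.0, 14, 2
    else:
        raise ValueError("program must be a menu number from 1 to 12")
    # Only the fasta nucleic search (menu number 1) defaults to ktup 6.
    wordsize = 6 if program == 1 else 2
    return {"wordsize": wordsize, "expect": expect,
            "gappenalty": gappenalty, "gaplength": gaplength}


def gap_cost(n: int, gappenalty: int, gaplength: int) -> int:
    """Penalty subtracted for one gap: <Gap penalty> + <Gap length penalty> * n."""
    return gappenalty + gaplength * n


defaults = default_parameters(1)
print(defaults)
print(gap_cost(3, defaults["gappenalty"], defaults["gaplength"]))
```

For the fasta nucleic search the lookup gives a word size of 6, an E() cutoff of 2.0 and penalties of 14 and 4, so a gap with n = 3 costs 14 + 4 * 3 = 26.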

Input file format

ebi_fasta searches a query sequence against a search set of sequences. To avoid overloading the EBI Web server, you can submit only one query sequence at a time. You can use any normal sequence USA. Note that fastA has a built-in limit of 20,000 bases or amino acids. If the query sequence is longer than that, the program cuts it into slices of 20,000 residues, handles them as separate sequences and writes the results one after the other to the output file.
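
The slicing of long queries can be pictured with the short sketch below. It assumes consecutive, non-overlapping slices; the text does not say whether the program lets slices overlap, and the function name slice_query is hypothetical.

```python
from typing import List

MAX_QUERY_LENGTH = 20_000  # built-in fastA limit quoted in the text


def slice_query(sequence: str, size: int = MAX_QUERY_LENGTH) -> List[str]:
    """Cut a sequence into consecutive slices of at most `size` residues."""
    return [sequence[start:start + size]
            for start in range(0, len(sequence), size)]


# A 45,000 nt query yields three slices, each searched as a separate sequence.
slices = slice_query("A" * 45_000)
print([len(part) for part in slices])
```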

You can select the search set from the list of databanks available at the EBI. Because of the great number of choices available the list has been split into three submenus.
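
Which databank qualifier applies can be derived from the search type and the databank type. The mapping below is a sketch based on the qualifier descriptions and on the sample session (search type 1 with databank type genome prompts for the genome databank). The rule that tfastx and tfasty search a nucleic acid databank while fastx and fasty search a protein databank follows from their descriptions, but the function itself is hypothetical.

```python
# -program menu numbers whose search set is nucleic acid or protein.
NUC_TARGET = {1, 5, 6, 7, 9, 11}    # includes tfastx/tfasty (prot against nuc)
PROT_TARGET = {2, 3, 4, 8, 10, 12}  # includes fastx/fasty (nuc against prot)

QUALIFIERS = {
    ("general", "nuc"): "gennucdb",
    ("general", "prot"): "genprotdb",
    ("genome", "nuc"): "genonucdb",
    ("genome", "prot"): "genoprotdb",
    ("ASD", "nuc"): "asdnucdb",
    ("ASD", "prot"): "asdprotdb",
}


def databank_qualifier(program: int, dbtype: str) -> str:
    """Name of the qualifier that selects the search set."""
    if program in NUC_TARGET:
        target = "nuc"
    elif program in PROT_TARGET:
        target = "prot"
    else:
        raise ValueError("program must be a menu number from 1 to 12")
    try:
        return QUALIFIERS[(dbtype, target)]
    except KeyError:
        raise ValueError("dbtype must be general, genome or ASD") from None


print(databank_qualifier(1, "genome"))
```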

Output file format

By default the output is in standard fastA format. With the -xml qualifier you obtain the output in XML format instead.
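
For further processing, the list of best scores in a standard fastA output can be read with a small parser such as the sketch below. It is not part of ebi_fasta. The regular expression is an assumption modelled on hit lines of the form "identifier description (length) [f or r] opt bits E()" that follow the "The best scores are:" header; other fastA versions or options may lay out these lines differently.

```python
import re
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    identifier: str
    description: str
    length: int
    strand: Optional[str]  # "f" or "r" for nucleic acid searches, else None
    opt: int
    bits: float
    expect: float


HIT_LINE = re.compile(
    r"^(?P<id>\S+)\s+(?P<desc>.*?)\s+\(\s*(?P<len>\d+)\)\s+"
    r"(?:\[(?P<strand>[fr])\]\s+)?"
    r"(?P<opt>\d+)\s+(?P<bits>[\d.]+)\s+(?P<e>[0-9.eE+-]+)\s*$"
)


def parse_best_scores(text: str) -> List[Hit]:
    """Extract the hit list that follows the 'The best scores are:' header."""
    hits: List[Hit] = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("The best scores are:"):
            in_table = True
            continue
        if not in_table:
            continue
        if not line.strip():
            if hits:
                break      # a blank line after the hits ends the list
            continue       # tolerate blank lines right after the header
        match = HIT_LINE.match(line)
        if match is None:
            break          # anything that is not a hit line ends the list
        hits.append(Hit(match["id"], match["desc"], int(match["len"]),
                        match["strand"], int(match["opt"]),
                        float(match["bits"]), float(match["e"])))
    return hits
```

Calling parse_best_scores(open("x15320.ebi_fasta").read()) on an output file returns one Hit per listed sequence, which can then be filtered on the expect field.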

Output files for usage example

File: x15320.ebi_fasta

# /ebi/extserv/bin/fasta-35.4.4/fasta35_t -l /ebi/services/idata/v2422/fastacfg/fasta3db -Q -H -n -b 500 -d 250 -E 2.0 -f -14 -g -4 -z 1 @:1- +272569-g+ 6
FASTA searches a protein or DNA sequence data bank
 version 35.04 Jan. 15, 2009
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Query: @
  1>>>Sequence - 2372 nt
Library: EBI Haloarcula marismortui ATCC 43049 4274642 residues in     9 sequences

4329198 residues in    32 sequences
Statistics: (shuffled [500]) Expectation_n fit: rho(ln(x))= 5.1233+/-0.0134; mu= 26.6307+/- 1.615
 mean_var=88.6374+/-71.074, 0's: 0 Z-trim: 0  B-trim: 0 in 0/7
 Lambda= 0.136228
Algorithm: FASTA (3.5 Sept 2006) [optimized]
Parameters: +5/-4 matrix (5:-4) ktup: 6
 join: 74, opt: 59, open/ext: -14/-4, width:  16
 Scan time: 86.340

The best scores are:                                      opt bits E(32)
EM_PRO:AY596297 AY596297.1 STD:Haloarcula mari (3131724) [r]  283 67.6 4.4e-11
EM_PRO:AY596296 AY596296.1 STD:Haloarcula mari (410554) [f]  116 34.6    0.38
EM_PRO:AY596297 AY596297.1 STD:Haloarcula mari (3131724) [f]  109 33.4    0.85
EM_PRO:AY596297 AY596297.1 STD:Haloarcula mari (3131724) [r]  109 33.4    0.85
EM_PRO:AY596297 AY596297.1 STD:Haloarcula mari (3131724) [f]  108 33.3    0.97
EM_PRO:AY596297 AY596297.1 STD:Haloarcula mari (3131724) [r]  105 32.7     1.5

>>EM_PRO:AY596297 AY596297.1 STD:Haloarcula marismortui   (3131724 nt)
rev-comp initn: 310 init1: 192 opt: 283  Z-score: 257.2  bits: 67.6 E(): 4.4e-11
banded Smith-Waterman score: 331; 55.3% identity (55.3% similar) in 483 nt overlap (1876-1414:2800891-2801365)

          1900      1890      1880      1870      1860      1850   
Seque- TTTAGACGGCTGTTACGCACTTCTTCGTTTTCTGCGCTGAGGATCGGGCAGTGCTCGTAG
                                     :: ::   ::::: :::::: : :  ::::
EM_PRO AGCGCCAGCCGCGCAGCGCGGGTCTCTGGGTCGGCATCGAGGACCGGGCACTCCCGGTAG
          2800870   2800880   2800890   2800900   2800910   2800920

          1840      1830      1820      1810      1800             
Seque- AAGCCAGAGAACAGACCGGCCAGATCGTACAGGTAAGCACACATTACATGCGGCGT---G
       ::: :   ::::    :::: :: :::     ::: :     :     :: :::::   :
EM_PRO AAGGCGTTGAACGTCTCGGCGAGGTCGCGGGTGTAGGTGGCGACGGTGTGGGGCGTCAGG
          2800930   2800940   2800950   2800960   2800970   2800980

   1790      1780      1770      1760      1750       1740         
Seque- CCTTCACGGGCAACCACGGTGAGGGTTTCTTCAAACTGCAGCAG-GCGAGCTGCCAGTTG
        : ::   :::  :: :: ::: ::   :    ::: :   :::  :: :  ::  :: :
EM_PRO TCGTCCGCGGCCGCCTCGATGACGG---CGGGGAACCGGGCCAGTTCGCGGAGCAGGTCG
          2800990   2801000      2801010   2801020   2801030       

    1730       1720      1710      1700      1690      1680        
Seque- CGCTTC-ACGATCTTCACGGATGATAACCGGAGCTGCAGCCAGTTGCTCTTCGTCAATTT
       :::: :  :: :    :: :  : : :  :  :     : :::  : :    :::  :::
EM_PRO CGCTCCTCCGGTTCCGACAGCGGGTCAAGGTCGGGTTCGTCAG-GGAT----GTCGGTTT
   2801040   2801050   2801060   2801070   2801080        2801090  

     1670      1660      1650      1640      1630      1620        
Seque- CTGCTTTACGGAACACGGACAATACACGCGTGTATGCATACTGCATGTATGGCGCGGTAT
       :  : :  :  :  : :    :   ::::: ::   : ::::: : :::::::::::  :
EM_PRO CCACGTCGCCCAGGATGCCACAGCAACGCGCGTGGACGTACTGAACGTATGGCGCGGACT
        2801100   2801110   2801120   2801130   2801140   2801150  

     1610      1600      1590      1580      1570      1560        
Seque- TACCCTCAAACGCCAGCATGTTGTCCCAGTCGAAGATGTAGTCCGTGGTGCGGTTTTTGG
          :::: ::  :::::  :  :::::: :::::: ::  : :  ::::  : : ::: :
EM_PRO GGGCCTCGAAGTCCAGCGCGCGGTCCCACTCGAAGGTGATGCCTTTGGTCGGCTGTTTCG
        2801160   2801170   2801180   2801190   2801200   2801210  

     1550      1540      1530      1520      1510      1500        
Seque- AGAGATCCGCATATTTCACCGCACCAATACCAACCGCGTTAGCCAGTTTTTCCAGCTCGT
       : :      : ::    ::::: :::::::: :::      :: :     :: :  ::::
EM_PRO AAACGATGTCGTAGCGTACCGCGCCAATACCGACCTGACGGGCGATGCGGTCGATGTCGT
        2801220   2801230   2801240   2801250   2801260   2801270  

     1490          1480         1470         1460           1450   
Seque- CGGC----TGGCATATCCGGGTTC---TTTTCTGCC---ACCAGAC-----GGCGTGCAC
       :  :     ::  :  ::::::::     : : :::   ::  :::       ::::: :
EM_PRO CTTCATCAAGGTCTCCCCGGGTTCGGTCGTCCAGCCGGGACTCGACCTCCTCCCGTGCGC
        2801280   2801290   2801300   2801310   2801320   2801330  

          1440      1430      1420      1410      1400      1390   
Seque- GTTCCAGGGCTTCATCCAGCAGATCGGCCAGTTTCACTGTACCACCCGCGCGGGTTTTGA
       : :: :  :::::::::::::: ::: : :: :                           
EM_PRO GGTCGATCGCTTCATCCAGCAGGTCGTCGAGGTCGATGCCGGTCCCCTCACGCGTGCTCA
        2801340   2801350   2801360   2801370   2801380   2801390  

          1380      1370      1360      1350      1340      1330   
Seque- ACGGTTTGCCGTCTTTACCCAGCATCATGCCGAACATGTGGTGTTCCAGCGGTACGGATT
                                                                   
EM_PRO TCCCGCCTTCCGGGAGGTTCACCCAGGAGTAAAACACCTGCCGGAGCTGGTCGGTGTCGT
        2801400   2801410   2801420   2801430   2801440   2801450  

  [some lines have been deleted for brevity]

>>EM_PRO:AY596297 AY596297.1 STD:Haloarcula marismortui   (3131724 nt)
rev-comp initn:  88 init1:  88 opt: 105  Z-score: 68.1  bits: 32.7 E():  1.5
banded Smith-Waterman score: 105; 64.8% identity (64.8% similar) in 88 nt overlap (1039-954:1119581-1119665)

   1070      1060      1050      1040      1030      1020          
Seque- GCTTTGAGATCCGCCACAATTCCTGGCAGCATCGGGTTGTAGAGGCTTTCGCCCATCACG
                                     ::: ::: : ::: :   :::   : :  :
EM_PRO CGTAGCGAAGTCACTTCCGGGAACTTCGCGATCCGGTCGGAGATGTCGTCG--TAGCTGG
          1119560   1119570   1119580   1119590   1119600          

   1010      1000       990        980        970       960        
Seque- TCATCACGGGTCAGCGTCACGTTGAG-ACGATCGTAG-GTGATCTGGTTCTGCGTCATGG
       :: ::::::  :::::: ::::::::  :::  :::: : :  :  : ::  :::: :  
EM_PRO TC-TCACGGTCCAGCGTGACGTTGAGTTCGACGGTAGCGCGGACGCGCTCCTCGTCGTTC
  1119610    1119620   1119630   1119640   1119650   1119660       

      950       940       930       920       910       900        
Seque- TGATGTCGACCAGTTTGCGCCACATCTCGCGGAAATATTCGTCACCGCTTTGCAGTTTTA
                                                                   
EM_PRO TCGACCGCGTTCCAGTCGACGACTGCCTGATAGCCGCGAATGATGCCTGCCTCCTCGAAT
   1119670   1119680   1119690   1119700   1119710   1119720       



2372 residues in 1 query   sequences
4274642 residues in 9 library sequences
 Tcomplib [35.04] (8 proc)
 start: Wed Feb 18 16:19:48 2009 done: Wed Feb 18 16:21:57 2009
 Total Scan time: 86.340 Total Display time:  0.060

Function used was FASTA [version 35.04 Jan. 15, 2009]
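
The score lines in the hit list of the sample output can be post-processed by a script. The following is a minimal Python sketch, assuming each such line has the layout shown above: identifier, description, sequence length in parentheses, an optional [f] or [r] strand flag, then the opt score, the bit score and the E-value. The names HIT_LINE and parse_hit_line and the dictionary keys are illustrative and are not part of ebi_fasta.

```python
import re

# One line of the hit list, for example:
# EM_PRO:AY596297 AY596297.1 STD:Haloarcula mari (3131724) [r]  283 67.6 4.4e-11
# The layout is inferred from the sample output; other fastA versions may differ.
HIT_LINE = re.compile(
    r"^(?P<id>\S+)\s+(?P<desc>.*?)\s*"              # identifier and description
    r"\(\s*(?P<length>\d+)\)\s+"                    # library sequence length
    r"(?:\[(?P<strand>[fr])\]\s+)?"                 # optional strand flag
    r"(?P<opt>\d+)\s+(?P<bits>[\d.]+)\s+(?P<evalue>[\d.eE+-]+)\s*$"
)


def parse_hit_line(line):
    """Return a dict for a hit-list line, or None if the line does not match."""
    m = HIT_LINE.match(line)
    if m is None:
        return None
    return {
        "id": m.group("id"),
        "description": m.group("desc"),
        "length": int(m.group("length")),
        "strand": m.group("strand"),
        "opt": int(m.group("opt")),
        "bits": float(m.group("bits")),
        "evalue": float(m.group("evalue")),
    }
```

Lines that do not fit this layout, such as the ">>" alignment headers or the alignment rows themselves, yield None and can simply be skipped.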

Data files

None.

Notes

Because the EBI server may take some time to process the job and return the result, it is preferable to run ebi_fasta "in batch". Under wEMBOSS you can do this by entering your e-mail address in the box at the bottom of the page.

References

  1. Pillai S., Silventoinen V., Kallio K., Senger M., Sobhany S., Tate J., Velankar S., Golovin A., Henrick K., Rice P., Stoehr P., Lopez R. SOAP-based services provided by the European Bioinformatics Institute. Nucleic Acids Res. 33(1):W25-W28 (2005)

Warnings

The configuration file of ebi_fasta was last updated on 17 September 2007. Because the collection of databanks available at the EBI changes over time, the list of databanks in the menus may differ from the list of databanks actually available.

Diagnostic Error Messages

The submission of the job can fail, or the job ID may not be successfully retrieved and stored. In that case ebi_fasta gives up and issues the message:
  ERROR !!  EBI Web Server failed to return job ID

Several error messages relate to the retrieval of the result of a submitted job:

  EBI Web Server failed to respond on <date+time>
  you can later try manual check with command :
  /opt/sw/EBIWS/fasta.pl  --status --jobid <jobid>

  ERROR !!  some error occurred on the EBI Web Server

  ERROR !!  EBI Web Server could not retrieve job result

  ERROR !!  EBI Web Server executed job but failed to retrieve output

  job still not finished after more than 30 h. I QUIT.
  you can try manual check with command :
  /opt/sw/EBIWS/fasta.pl  --status --jobid <jobid>
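
The manual status check quoted in these messages can also be issued from a script. Below is a minimal Python sketch, assuming the client is installed at the path shown in the messages (this path is site-specific) and that it prints the job status on standard output. The function names and the timeout value are illustrative, not part of ebi_fasta.

```python
import subprocess

# Path as printed in the error messages above; adjust it for your installation.
FASTA_CLIENT = "/opt/sw/EBIWS/fasta.pl"


def status_command(job_id, client=FASTA_CLIENT):
    """Build the argument list for the manual status check."""
    return [client, "--status", "--jobid", job_id]


def check_status(job_id, client=FASTA_CLIENT, timeout=60):
    """Run the status check and return whatever the client printed.

    Assumption: the status is written to standard output.
    """
    result = subprocess.run(
        status_command(job_id, client),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems if a job ID contains unusual characters.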

Exit status

It always exits with status 0.
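
Because the exit status is 0 even when the search failed, a calling script cannot rely on it and has to inspect the text that ebi_fasta produced. The following minimal Python sketch looks for fragments of the messages listed under "Diagnostic Error Messages". It assumes that the captured output of ebi_fasta is available as a string; whether the messages appear on standard output or standard error is not specified here. The names FAILURE_MARKERS and find_failures are illustrative.

```python
# Fragments copied from the messages in "Diagnostic Error Messages".
FAILURE_MARKERS = (
    "ERROR !!",
    "EBI Web Server failed to respond",
    "job still not finished",
)


def find_failures(output_text):
    """Return the lines of output_text that contain a known failure marker."""
    return [
        line.strip()
        for line in output_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]
```

An empty list means that none of the known messages was found; it does not by itself prove that the result is complete.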

Known bugs

None.

See also

Program name   Description
blast          BLAST search of query sequence(s) against sequence search set
ebi_blast      WU-BLAST search of query sequence against sequence databank using EBI Web Services
fasta          fastA search of query sequence(s) against sequence search set
fasts          Protein identification from peptides using fastA algorithm
phiblast       Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match
psiblast       Iterative BLAST search with generation of profile of protein sequence against protein sequence set
ebi_tmhmm      Reports membrane spanning regions using EBI Web Services

Author(s)

The application ebi_fasta was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The programs of the fastA package themselves were written by William R. Pearson (University of Virginia). The SOAP-based Web services client and server were developed at the EMBL-EBI (Hinxton, UK).

History

Completed 2 May 2006
Modified 23 May 2007 - adapted to changes in EBI Web Services
Modified 9 April 2009 - adapted to changes in EBI Web Services

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.