indexsearch

 

Function

Search sequence databanks using MRS

Description

indexsearch is an interface to MRS. It uses a Perl script that accesses the internal search engine of MRS, so it should produce the same results as the MRS WWW interface. It is however less flexible : it can only search the sequence databanks, and only one databank at a time. Its advantage is that it saves its output directly in the personal directories of the user, either as an EMBOSS List File or as a sequence databank in fastA format, both of which can be given as input to other programs. indexsearch emulates the functionality of the GCG program lookup.

Usage

You must first select the sequence databank you want to search. Then you can either type in search terms for the different indexed fields or type in a query directly in MRS query language.

Interactively at the command line

You can work interactively in a terminal. The current version of indexsearch does not need a VT100 terminal and does not let you navigate around using the arrow keys. Instead it relies heavily on "hot keys", that is keys that take effect immediately, without you needing to press the <return> key.

You can start the program by typing :
> indexsearch<enter>
search in which databanks ?     (output format : List File with descriptions)

 a) embl                general nucleic acid databank
 b) uniprot             general protein databank (= SwissProt + TrEMBL)
 c) swissprot           manually annotated part of UniProt
 d) trembl              EMBL ORF translations not yet in SwissProt
 e) remtrembl           other EMBL ORF translations till October 2003
 f) uniprot_varsplic    SwissProt splice variants
 g) uniref100           UniProt subset without redundant fragments
 h) uniref90            UniProt subset with no more than 90% identity
 i) uniref50            UniProt subset with no more than 50% identity
 j) pir                 old general protein databank (December 2004)
 k) genpept             GenBank ORF translations
 l) refseq              NCBI "reference" genes and transcripts
 m) refseqp             NCBI "reference" proteins
 n) vector              Intelligenetics vector databank (January 1996)
 o) emvec               EMBL vector subset
 p) imgt                LIGM databank of Igg. and TcR genes
 q) hla                 databank of human MHC genes
 r) genomereviews       complete microbial genomes from EMBL
 s) gpcrdb              UniProt G protein coupled receptors subset
 t) epd                 Eukaryotic Promoter Database

Select databank by typing lowercase letter. Type F to toggle output format or Q to quit.
Please make your choice : a

The first thing you will see is the DatabaseScreen. You must choose the databank you want to search by typing a lowercase letter. Note that the letters can change when the collection of databanks available at the BEN site changes.

search in embl  (output format : List File with descriptions)

 a) all text fields :   globin AND duplication
 b) entry name (ID) :
 c) accession number :
 d) organism (species) :
 e) organism classification (taxon) :
 f) organelle :
 g) description :
 h) keywords :
 i) comments :
 j) references :
 k) features :
 l) sequence length :
 m) entry creation date :

        interfield operator : AND       append * after text fields : NO

        press lowercase letter to edit search field content,
        M to type instead a complete query in MRS language,
        L to toggle interfield logic, W to toggle wildcarding, R to reset query,
        Ctrl-D to start search or Q to quit :

Once you have selected the databank, you get into the QueryScreen. Note at the top of the screen the name of the databank you have selected. You must now compose your query. You can press M to get a prompt where you can type in a query in MRS query language, as you would in the query box of the MRS WWW interface (see the on-line help). To make your work easier, indexsearch lets you type in query words for specific fields, without the need to type in the field name. For each field you can type in several words and connect them by the logical operators AND/OR/NOT. Note that indexsearch will to a certain extent reformat your query ; e.g. if you press a, type globin & duplication and press <enter>, you will see that globin AND duplication is now written after a) all text fields :. If you have made a mistake, you can reset a field by typing the appropriate hot key and then just pressing <enter>, and you can reset all fields by typing uppercase R. You can start the search by holding the control key down while typing d.

 50 entries were found.

 Do you want to :

   S) save the entries (and quit)
   Z) save the entries but do not quit
   P) preview the result on the screen
   R) refine the query
   C) change the set of selected databanks

   Q) quit

 make your choice by pressing key : 
 How should I call the output List File (* search.list *) : globin.list

When the search is finished, you get into the OutputScreen. Note at the top of the screen how many entries were found. Here you can select what you want to do using a menu with hot keys. Before saving, you can preview the result on your screen (option P). If you are not satisfied you can go back to the DatabaseScreen to change the databank to be searched (option C) or to the QueryScreen to change the query (option R). In case you intend to perform several complex queries that each differ only slightly, you can save the result without quitting the program (option Z).

Through a graphical interface

In the section QUERY/DATABANKCHOICE (usually on top of the page for the program indexsearch) you can find a selector where you can choose the databank to be searched.

In the section QUERY/FIELDSQUERY you can find a list of text boxes for the various fields, where you must type in the search terms. If you want to type in a query in MRS query language instead, you can do this in the "Do instead this query in MRS language" box in the QUERY/FIELDSGENERAL section (usually at the bottom of the page).

Logical operators and wild cards

You can type in several keywords for the same field and connect them by the logical operators AND/OR/NOT (or &/|/!). If you type several words without an operator in between, the interface will assume a logical "and".
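The rewriting of search terms described above can be illustrated with a short Python sketch. This is not the code of indexsearch (which is a Perl script) ; the function name normalize_terms is made up for the illustration, and treating lowercase and/or/not as operators is an assumption.

```python
import re

# Symbol forms of the operators and their word forms, as listed in the text.
SYMBOLS = {"&": "AND", "|": "OR", "!": "NOT"}
OPERATORS = set(SYMBOLS.values())


def normalize_terms(text: str) -> str:
    """Rewrite the search terms of one field with explicit operators.

    Hypothetical helper, only meant to illustrate the behaviour
    described in the documentation.
    """
    # Put spaces around the symbols so that "globin&duplication" splits too.
    tokens = re.sub(r"([&|!])", r" \1 ", text).split()
    out = []
    for token in tokens:
        word = SYMBOLS.get(token, token)
        if word.upper() in OPERATORS:
            # Assumption: operators are recognised regardless of case.
            word = word.upper()
        elif out and out[-1] not in OPERATORS:
            # Two search words in a row : assume a logical "and".
            out.append("AND")
        out.append(word)
    return " ".join(out)


print(normalize_terms("globin & duplication"))  # globin AND duplication
print(normalize_terms("homo sapiens"))          # homo AND sapiens
```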

If you type search terms in more than one field, the logic between the fields is by default "and". You can change this : working interactively at the command line you can use the "hot key" L in the QueryScreen to toggle between AND, OR and NOT ; under a graphical interface you will find an "Interfield logic" check box or selector. Note that for "not" the query is not symmetric : it is the query_term_on_top but_not the_query_term_below. Often you will not be able to express the query you want in this way. In this case you must resort to typing a query in MRS language.
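The asymmetry of "not" is easy to see when the hits of each field are thought of as sets. The following Python sketch only models the logic for two fields ; the function combine and the entry names E1 to E4 are invented for the illustration and say nothing about how MRS evaluates a query internally.

```python
def combine(top: set, below: set, operator: str) -> set:
    """Combine the hits of two fields with the interfield operator.

    "top" holds the entries matched by the upper field on the screen,
    "below" those matched by the lower field (hypothetical model).
    """
    if operator == "AND":
        return top & below   # entries matching both fields
    if operator == "OR":
        return top | below   # entries matching at least one field
    if operator == "NOT":
        return top - below   # top but_not below : order matters
    raise ValueError(f"unknown operator: {operator}")


upper_hits = {"E1", "E2", "E3"}   # made-up entry names
lower_hits = {"E2", "E4"}

print(sorted(combine(upper_hits, lower_hits, "NOT")))  # ['E1', 'E3']
print(sorted(combine(lower_hits, upper_hits, "NOT")))  # ['E4']
```

Swapping the two fields leaves the result of AND and OR unchanged but gives a different result for NOT.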

You can use the wild cards ? and *. ? stands for any character, * for any string of characters (including nothing). You can let MRS/indexsearch append automatically a wild card * at the end of every query term : you can do this at the command line with the "hot key" W in the QueryScreen and under a graphical user interface by setting "Append wild card * after each query string?" to "y".
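The wild cards ? and * behave like those of the Python standard module fnmatch, so the effect of a pattern can be tried out on a handful of words. This is only a sketch : the list of index terms is made up, matching_terms is a hypothetical helper, and fnmatch additionally gives a special meaning to square brackets, which is not described for MRS.

```python
from fnmatch import fnmatchcase


def matching_terms(pattern, index_terms, append_star=False):
    """Return the index terms matched by a query term with wild cards.

    With append_star=True a * is added at the end of the query term,
    mimicking the wildcarding option (hypothetical helper).
    """
    if append_star and not pattern.endswith("*"):
        pattern += "*"
    return [term for term in index_terms if fnmatchcase(term, pattern)]


# Made-up index terms, for illustration only.
terms = ["globin", "globins", "myoglobin", "globulin", "hemoglobin"]

print(matching_terms("globin", terms))                    # ['globin']
print(matching_terms("globin", terms, append_star=True))  # ['globin', 'globins']
print(matching_terms("glob?lin", terms))                  # ['globulin']
print(matching_terms("*globin", terms))                   # ['globin', 'myoglobin', 'hemoglobin']
```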

Indexed Fields

MRS generates a number of indexes on the different fields of the databank. Note that, contrary to SRS, the current version of MRS does not handle indexed terms containing spaces. For most fields the index contains words that correspond to strings of characters delimited by spaces, excluding strings composed only of numbers. indexsearch presents a list of fields that correspond to those you can find in most sequence databanks :
  1. entry name (ID) : each sequence in a databank has a unique entry name or identifier. Originally, the entry name was supposed to be mnemonic and indicate the nature of the sequence, but this is no longer true for most databanks, the most notable exception being UniProt/SwissProt.
  2. accession number : like entry names, accession numbers are unique. A sequence can however have more than one accession number (a primary and possibly one or more secondary accession numbers). Accession numbers remain stable between successive releases of a databank. Furthermore the same sequence in EMBL, GenBank and DDBJ has the same accession number. Accession numbers are therefore very useful for easily retrieving a sequence.
  3. organism (species) : the name of the organism that is the source of the sequence. Usually you find here the systematic Latin name, e.g. homo sapiens, and a common English name, e.g. human. Beware, you can find e.g. mouse in one databank and house mouse in another. The NCBI maintains a taxonomy databank which is used to standardize the classification in EMBL/GenBank/DDBJ and which is followed nearly perfectly by UniProt. Remember that when you type homo sapiens the interface in its current form will actually search for homo AND sapiens, which is fine because the indexing engine in its current form stores homo and sapiens as separate keywords.
  4. organism classification (taxon) : the different taxa of the classification.
  5. organelle : not present in all databanks. Contains words such as mitochondrion, chloroplast, plasmid, etc.
  6. description : the description or definition line is present in all databanks and briefly explains the nature of the sequence. It is this line that also appears in the MRS/indexsearch output.
  7. keywords : many databanks have a keyword field. Note that the level of standardization of the keywords can vary from databank to databank.
  8. comments : some databanks have a comments field, usually giving extra information about the origin of the sequence, about its known properties, etc.
  9. references : databank entries usually contain one or more literature references describing the sequencing. In the current version of MRS, you will find in this index words from the title, author names and the name of the journal, but not page numbers and year of publication.
  10. features : a feature is a region of a sequence about which some specific information has been added. Examples of feature words are : domain, cds (= coding sequence), source, cell line, chromosome, protein_id.
  11. sequence length : length of sequence, numeric index. You can use = < > <= >= to indicate ranges. If you want e.g. all the sequences from 200 to 300 bases/amino acids long, you must specify >=200 AND <=300.
  12. entry creation date : date when sequence was first entered in databank (works only for databanks where the "native" MRS has a "cd_date" field). Date fields contain keywords in the format YYYY-MM-DD and support ranges the same way as numeric fields. If you want e.g. all the sequences entered after 31 January 2007, you must specify >2007-01-31.
  13. all text fields : searches all alphabetic indexes, also those that are not shown on the indexsearch QueryScreen.
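To check what a range expression for the sequence length or the entry creation date selects, the comparison can be reproduced in a few lines of Python. This is a sketch under assumptions : the functions parse_condition and in_range are hypothetical, only conditions joined by AND are handled, and nothing here reflects how MRS evaluates ranges internally.

```python
import operator
from datetime import date

COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}


def parse_condition(text, convert):
    """Split a condition such as '>=200' into a comparison and a bound."""
    text = text.strip()
    # Test the two-character symbols first so '>=' is not read as '>'.
    for symbol in (">=", "<=", ">", "<", "="):
        if text.startswith(symbol):
            return COMPARATORS[symbol], convert(text[len(symbol):].strip())
    raise ValueError(f"no comparison operator in: {text!r}")


def in_range(value, expression, convert=int):
    """Tell whether value satisfies all conditions joined by AND."""
    conditions = [parse_condition(part, convert)
                  for part in expression.split(" AND ")]
    return all(compare(value, bound) for compare, bound in conditions)


print(in_range(250, ">=200 AND <=300"))   # True
print(in_range(301, ">=200 AND <=300"))   # False
# Dates in the format YYYY-MM-DD are handled the same way.
print(in_range(date(2007, 2, 1), ">2007-01-31", convert=date.fromisoformat))  # True
```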

Output file format

indexsearch can write the sequences found in three different output formats :
  1. List file with descriptions : the default. It is a List file usable as input by other EMBOSS programs. The descriptions are present to make identification of the found sequences easy.

    Output files for usage example

    File: globin.list

    #  indexsearch output (EMBOSS List File)
    #  Mon Oct  9 17:36:14 2006
    #  MRS database(s) searched : embl_release|embl_updates
    #  query : globin AND duplication
    #  50 entries found
    embl:AL590842
      # Yersinia pestis CO92 complete genome
    embl:CP000075
      # Pseudomonas syringae pv. syringae B728a, complete genome.
    embl:J00153
      # Homo sapiens HBAP1 pseudogene, complete cds; and hemoglobin alpha 2 (HBA2) and hemoglobin alpha 1 (HBA1) genes, complete cds.
    embl:J00176
      # Human A-gamma-globin gene on chromosome 11, allele B.
    embl:K01898
      # Human beta globin deletion mutation promoting Indian thalassemia.
    embl:M91036
      # Homo sapiens G-gamma globin (G-gamma globin) and A-gamma globin (A-gamma globin) genes, complete cds.
    embl:U01317
      # Human beta globin region on chromosome 11.
    embl:V00489
      # Human alpha-globin gene with flanks.
    embl:V00490
      # Human pseudogene for alpha-2 globin.
    embl:V00576
      # Human repetitive sequence fragment located approximately 1300 base pairs 5' to the capping site of the human beta globin gene.
    embl:CR932181
      # Paramecium tetraurelia, globin, putative, complete gene.
    embl:CR932184
      # Paramecium tetraurelia, globin, putative, complete gene.
    embl:CR932185
      # Paramecium tetraurelia, globin, putative, complete gene.
    embl:CR932197
      # Paramecium tetraurelia, globin, putative, complete gene.
    embl:CR932204
      # Paramecium tetraurelia, globin, putative, complete gene.
    embl:AY450927
      # Macropus eugenii epsilon globin gene, complete cds.
    embl:AY450928
      # Macropus eugenii beta globin gene, complete cds.
    embl:AY459589
      # Macropus eugenii alpha globin gene, complete cds.
    embl:AY459590
      # Macropus eugenii theta globin gene, complete cds.
    embl:J00047
      # Goat beta-x-globin (psi-beta-x) pseudogene, complete cds with 3'flank.
    embl:J00048
      # Goat beta-x-globin (psi-beta-x) pseudogene 3' flanking region.
    embl:J05174
      # Gibbon gamma-1 and gamma-2 globin genes, complete cds.
    embl:K01671
      # Goat germline-like beta-globin gene epsilon III, 5' end and flank.
    embl:K01672
      # Goat germline beta-globin gene epsilon IV, 5' end and flank.
    embl:K02437
      # Goat embryonic beta-globin epsilon-V pseudogene, complete sequence.
    embl:M15844
      # Rabbit alpha-like globin gene cluster, zeta-1 region.
    embl:M15845
      # Rabbit alpha-like globin gene cluster, theta-1 region.
    embl:M15846
      # Rabbit alpha-like globin gene cluster, zeta-3 region.
    embl:M15847
      # Rabbit alpha-like globin gene cluster alpha-1 globin gene, partial cds.
    embl:M91454
      # Orangutan alpha-globin gene duplicate region.
    embl:M94631
      # Hylobates lar (clone LambdaGialphaG1) 3'alpha1Alu1 D, 3'alpha1Alu1 E and 3'alpha1Alu1 F Alu repeat regions.
    embl:M94634
      # Hylobates lar alpha2-globin and alpha1-globin genes, complete cds.
    embl:V00154
      # Goat pseudogene psi-beta-Z for a beta-globin.
    embl:X53419
      # M.mulatta gamma-globin-1(G), gamma-globin-2(A) genes and L1 LINE element
    embl:X53420
      # A.geoffroyi gamma-globin gene and L1 LINE element
    embl:AE017042
      # Yersinia pestis biovar Microtus str. 91001, complete genome.
    embl:AE017282
      # Methylococcus capsulatus str. Bath, complete genome.
    embl:AL939104
      # Streptomyces coelicolor A3(2) complete genome; segment 1/29
    embl:AM286690
      # Alcanivorax borkumensis SK2, complete genome
    embl:BX640428
      # Bordetella parapertussis strain 12822, complete genome; segment 6/14
    embl:CP000031
      # Silicibacter pomeroyi DSS-3, complete genome.
    embl:CP000089
      # Dechloromonas aromatica RCB, complete genome.
    embl:CP000090
      # Ralstonia eutropha JMP134 chromosome 1, complete sequence.
    embl:CP000113
      # Myxococcus xanthus DK 1622, complete genome.
    embl:CP000352
      # Ralstonia metallidurans CH34, complete genome.
    embl:CP000377
      # Silicibacter sp. TM1040, complete genome.
    embl:CP000378
      # Burkholderia cenocepacia AU 1054 chromosome 1, complete sequence.
    embl:L44128
      # Mus caroli L1Mc1, LINE-1 interspersed repetitive DNA, complete sequence.
    embl:L44129
      # Mus caroli L1Mc1, LINE-1 interspersed repetitive DNA, complete sequence.
    embl:L44130
      # Mus caroli L1Mc3, LINE-1 interspersed repetitive DNA, complete sequence.
    

  2. List file, names only : the same, but the descriptions are omitted to keep the file short.
  3. Sequence databank in fastA format : usable as input to many other programs.
Working interactively at the command line you can toggle between output formats with the "hot key" F in the DatabaseScreen or in the QueryScreen. Under the graphical interfaces you will find an "Output format" selector in the OUTPUT SECTION.

Note that the internal logic of indexsearch requires that you select the output format before you start the search. Sometimes you might want to preview a List with descriptions but save a List without descriptions or a fastA format databank. The only way to do this is to change the output format and then repeat the search ; note that when working interactively at the command line you can return from the OutputScreen to the DatabaseScreen or the QueryScreen while retaining your query.
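If you have already saved a List file with descriptions, the names can also be extracted afterwards with a small script. The Python sketch below assumes the layout shown in the globin.list example (header lines starting with #, one entry per line, each followed by an indented # description line) ; the functions read_list_file and names_only are hypothetical and not part of indexsearch.

```python
def read_list_file(lines):
    """Return (entry, description) pairs from a List file with descriptions."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A comment right after an entry is taken as its description ;
            # comments before the first entry are header lines and skipped.
            if entries and entries[-1][1] == "":
                entries[-1] = (entries[-1][0], line.lstrip("#").strip())
            continue
        entries.append((line, ""))
    return entries


def names_only(lines):
    """Return only the entry names, without the descriptions."""
    return [name for name, _ in read_list_file(lines)]


sample = [
    "#  indexsearch output (EMBOSS List File)",
    "#  query : globin AND duplication",
    "embl:J00153",
    "  # Homo sapiens HBAP1 pseudogene, complete cds",
    "embl:V00489",
    "  # Human alpha-globin gene with flanks.",
]
print(names_only(sample))  # ['embl:J00153', 'embl:V00489']
```

To process a saved file, the lines can be read with open("globin.list") and passed to the same functions.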

Data files

None.

Notes

Since indexsearch and the native MRS give the same result, you can do the following : use the MRS WWW interface to experiment until you have found how to perform the optimal query (as many as possible of the sequences you want and as few as possible of the sequences you do not want) and then repeat the query with indexsearch.

When you have performed a search with indexsearch you can give the output as input to textsearch for further refinement. textsearch (which searches the description/definition line for strings of characters) is too slow to search a big databank such as EMBL but can be useful for smaller sets. You can also combine the results of different indexsearch runs using listor.
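What combining two result lists amounts to can be sketched in Python as a union of the entry names of two List files. This is only an illustration of the idea : the functions entry_names and merge_lists are hypothetical, the comparison is done on entry names alone, and how listor itself compares sequences is not described here.

```python
def entry_names(lines):
    """Return the entry names of a List file, skipping # comment lines."""
    names = []
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def merge_lists(first, second):
    """Return the entries of both lists, each once, in order of appearance."""
    # dict.fromkeys keeps the first occurrence of every name and its order.
    return list(dict.fromkeys(entry_names(first) + entry_names(second)))


run_1 = ["embl:J00153", "  # first description", "embl:V00489"]
run_2 = ["embl:V00489", "embl:U01317"]
print(merge_lists(run_1, run_2))
# ['embl:J00153', 'embl:V00489', 'embl:U01317']
```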

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

To preview the output one screenful at a time, indexsearch first counts the number of lines on the screen and then starts displaying the output. If the output contains lines that are longer than the width of the screen, the output will run beyond the screen.
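One possible way to avoid this problem is to count how many screen rows each line occupies once it is wrapped. The Python sketch below illustrates the idea ; it is not the code of indexsearch, the function names are made up, and it assumes that every character takes one column (tabs and wide characters are ignored).

```python
import math
import shutil


def rows_needed(line, width):
    """Number of screen rows a line occupies when wrapped at width columns."""
    return max(1, math.ceil(len(line) / width))


def paginate(lines, height, width):
    """Group lines into pages that fit in height rows of width columns."""
    pages, page, used = [], [], 0
    for line in lines:
        need = rows_needed(line, width)
        # Start a new page when this line would not fit on the current one.
        if page and used + need > height:
            pages.append(page)
            page, used = [], 0
        page.append(line)
        used += need
    if page:
        pages.append(page)
    return pages


def pages_for_terminal(lines):
    """Paginate for the current terminal, keeping one row for a prompt."""
    size = shutil.get_terminal_size()
    return paginate(lines, max(1, size.lines - 1), size.columns)
```

A line longer than a whole screen still gets a page of its own, so in that case the output can still scroll.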

See also

Program name   Description
textsearch     Search sequence documentation. Slow, use SRS and Entrez!

Author(s)

The interactive Perl script and the EMBOSS wrapper application indexsearch were written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

MRS is being developed by Maarten Hekkelman (CMBI, Radboud University, Toernooiveld 1, 6525 ED NIJMEGEN, The Netherlands).

MRS is currently distributed via berliOS. Questions and remarks should be mailed to the list mrs-user@lists.berlios.de.

History

The current version of indexsearch.pl (running under EMBOSS and using MRS) has a long "prehistory" ; earlier versions ran under GCG using SRS (completed 7 October 2000) and under EMBOSS using SRS (completed 19 September 2002).

Completed 20 December 2006

Target users

This program is intended to be used by end users. If you want to write a script, you are better off using the program mrsget.