EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany
Introduction
Information in molecular biology is available in a growing number of databanks that each specialize in a single aspect, such as promoters of eukaryotic genes (EPD), known tertiary structures of proteins (PDB) or all known enzymatic activities (ENZYME), to name only a few. Each databank defines an information domain. Its boundaries to other databanks are outlined by the object it wants to characterize but also by practical judgment. An important goal is to have as little overlap as possible with other databanks and, instead of duplicating information, to provide links to corresponding entries in other databanks. In general, the domain defined by a databank often does not fully accommodate real-life objects such as a gene, a promoter or an enzyme. To get a full picture of these objects one must often consult many different databanks and carefully extract relevant information from each. For instance, to collect all available information about a specific eukaryotic promoter one could start with an entry in EPD, then move on to the cross-referenced entry in the EMBL databank of nucleotide sequences to obtain the location of the promoter within the gene and its DNA sequence, and finally follow links to the TRANSFAC databank of transcription factors and binding sites in order to find out whether any transcription factor binding sites are known. Today almost every databank can be accessed with retrieval systems of various kinds that allow fast searching and inspection of the data found. Since most of these systems are available through the World Wide Web (WWW), the data can be enriched before display with hypertext links that lead directly to corresponding information in other databanks. This is already an enormous improvement and helps to raise individual databanks out of their isolation, but on the other hand it introduces the danger of 'drowning' in data, since often only a fraction of the displayed information is of interest.
Only a few retrieval systems allow hiding of undesired information, and none of them can automatically extract and compile selected information from more than one databank source. The retrieval system SRS allows installation of several dozen databanks at a single site and provides homogeneous access to their contents. Until now the retrieved information was always a set of entries, each from a single databank. We added the specification of views so that the user can define virtual entries that may include information from many sources. This paper demonstrates that views can be used to treat the sum of all installed databanks as a single structure in which user-defined entities can freely span databank boundaries.
Links
The implementation of the views relies heavily on the link indices, which are an important and unique feature of the SRS system. To create a link index, all cross-references in a databank are read and compiled into a list of all pairs of referencing and referenced entries. All link indices are duplicated so that referenced as well as referencing entries can be looked up. In fact, once the link index is compiled it no longer matters which of the two linked entries is referenced and which is referencing, and thus the link becomes bi-directional. Link indices can be used by two link operators of the SRS query language, '>' (link right) and '<' (link left). The operands can be sets of entries created by previous queries or entire databanks. The expression embl > swissprot retrieves all SWISS-PROT entries that are linked to EMBL entries, and embl < swissprot retrieves all EMBL entries that are linked to SWISS-PROT. The link operator can be seen as an arrow that points to the set from which the result will be taken. In other words, the link operation yields a subset of the left ('<') or the right ('>') operand. Typically an SRS installation has several dozen databanks, most of which have links to others. Within such an installation all databanks and all pairwise links define a graph where the databanks are the nodes and the links are the edges. In such a graph, nodes can be connected through paths consisting of one or more edges, which means that indirect links can also be carried out. The expression embl > pdb is such an example where two operands are only indirectly linked, in this case through SWISS-PROT, which maintains links to both EMBL and PDB. The link operation is carried out by first finding the shortest path through the graph from EMBL to PDB, which is the succession of the two links EMBL to SWISS-PROT and then SWISS-PROT to PDB. A recent extension allows internal links, that is, links among entries of the same databank.
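The bi-directional link index and the shortest-path lookup for indirect links can be illustrated with a small Python sketch. This is not the SRS implementation or its API: the class `LinkGraph`, its method names and the entry identifiers (`SP_A`, `EM_1`, `PDB_X` and so on) are hypothetical, and breadth-first search is assumed here as one simple way to find a shortest path between databanks.

```python
from collections import defaultdict, deque


class LinkGraph:
    """Toy model of link indices between databanks (hypothetical, not the SRS API)."""

    def __init__(self):
        # (source databank, target databank) -> {source entry: set of target entries}
        self.index = defaultdict(lambda: defaultdict(set))
        # databank -> set of directly linked databanks
        self.neighbours = defaultdict(set)

    def add_link(self, db_a, entry_a, db_b, entry_b):
        # Store each pair in both directions so the link becomes bi-directional.
        self.index[(db_a, db_b)][entry_a].add(entry_b)
        self.index[(db_b, db_a)][entry_b].add(entry_a)
        self.neighbours[db_a].add(db_b)
        self.neighbours[db_b].add(db_a)

    def shortest_path(self, start, goal):
        # Breadth-first search over databanks (nodes) and links (edges).
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in sorted(self.neighbours[path[-1]]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    def link_right(self, src_db, entries, dst_db):
        # Model of "src > dst": the entries of dst_db reachable from the given
        # src_db entries along the shortest path of links.
        path = self.shortest_path(src_db, dst_db)
        if path is None:
            return set()
        current = set(entries)
        for a, b in zip(path, path[1:]):
            table = self.index[(a, b)]
            current = {t for e in current for t in table.get(e, ())}
        return current


# Hypothetical cross-references: SWISS-PROT entries point to EMBL and PDB.
g = LinkGraph()
g.add_link("swissprot", "SP_A", "embl", "EM_1")
g.add_link("swissprot", "SP_A", "pdb", "PDB_X")
g.add_link("swissprot", "SP_B", "embl", "EM_2")

print(g.shortest_path("embl", "pdb"))
print(sorted(g.link_right("embl", {"EM_1", "EM_2"}, "pdb")))
```

In this toy model a left link such as embl < swissprot corresponds to calling `link_right` with the operands swapped, since the index stores every pair in both directions.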
Examples of such internal links are references to the parent node, or higher-level taxon, in entries of the TAXONOMY databank, which contains a taxonomic tree spanning all organisms represented by entries in the nucleotide databank GENBANK. The query [taxonomy-name:bos taurus] > taxonomy_up > taxonomy_down returns all species which are of the same family as "bos taurus". The artificial names "taxonomy_up" and "taxonomy_down" need to be used instead of the databank name since the operation returns entries of TAXONOMY. Here, the link is used to model hierarchic structures, which can be further exploited by recent extensions of the SRS query language. With the operator ">_", the query [taxonomy-name:bovinae] >_ taxonomy_down > genbank first collects all species, the leaves in the taxonomic tree, within the family "bovinae" and, by linking these to GENBANK, finds all nucleotide sequences which exist for these species. The information used to build the link indices originates from the databanks themselves and is thus under the control of the databank provider. However, it is possible to create a user-defined link simply by adding a new databank in which each entry associates entries from two other databanks, so that it can be used as an intermediate step for linking. Such databanks of links already exist. For instance, HSSP lists all proteins in SWISS-PROT that are similar to protein chains in PDB. So the query pdb > hssp > swissprot gives all SWISS-PROT entries that are similar to chains in PDB as defined by the information in HSSP. Note that the direct link between PDB and SWISS-PROT has a different meaning; the query pdb > swissprot returns all SWISS-PROT entries for which the tertiary structure has been solved. Another example of such a databank is VIRGIL, which links GENBANK and the human genome databank, GDB. In theory one could make these new links even more powerful by adding information about the quality of the links.
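The hierarchical use of internal links can be sketched in Python as follows. This is only an illustration under stated assumptions: the taxonomy below is a heavily simplified toy tree, the sequence identifiers are invented, the function names are hypothetical, and the recursive operator is assumed to follow the downward link repeatedly until the leaves are reached, as the description of the "bovinae" query suggests.

```python
# Simplified, illustrative taxonomy: child -> parent reference.
PARENT = {
    "bovinae": "bovidae",
    "bos": "bovinae",
    "bison": "bovinae",
    "bos taurus": "bos",
    "bos indicus": "bos",
    "bison bison": "bison",
}

# Invert the parent references to obtain the downward links.
CHILDREN = {}
for child, parent in PARENT.items():
    CHILDREN.setdefault(parent, set()).add(child)

# Hypothetical sequence identifiers per species (not real accession numbers).
SEQUENCES = {"bos taurus": {"SEQ1", "SEQ2"}, "bison bison": {"SEQ3"}}


def up(taxa):
    # One step towards the root (the role played by "taxonomy_up").
    return {PARENT[t] for t in taxa if t in PARENT}


def down(taxa):
    # One step towards the leaves (the role played by "taxonomy_down").
    return {c for t in taxa for c in CHILDREN.get(t, ())}


def down_to_leaves(taxa):
    # Follow the downward link repeatedly and keep only the leaves.
    leaves = set()
    stack = list(taxa)
    while stack:
        taxon = stack.pop()
        kids = CHILDREN.get(taxon)
        if kids:
            stack.extend(kids)
        else:
            leaves.add(taxon)
    return leaves


def sequences_for(taxa):
    # Link a set of taxa to the sequence databank.
    return {s for t in taxa for s in SEQUENCES.get(t, ())}


# Siblings under the same parent node as "bos taurus".
print(sorted(down(up({"bos taurus"}))))
# All sequences for the leaves below "bovinae".
print(sorted(sequences_for(down_to_leaves({"bovinae"}))))
```

Going one step up and one step down yields the nodes that share a parent with the starting entry, while the recursive descent collects every leaf before the final link to the sequence data is applied.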
Each link in such a databank of links could have a score so that, for instance, a > [linkab-score#0.5:] > b returns all entries in "b" that are linked to "a" by information in "linkab", where the score of a link must be 0.5 or higher. Databanks of links are presently scarce, but they will become more abundant once their importance and usefulness are better understood.
Views
SRS regards an entry as a list of data-fields. The simplest form of a view acts as a filter and allows only selected fields to be displayed. For instance, a view could be defined to display only the ID, the description and the line containing the sequence length for SWISS-PROT entries, as shown in Figure 1.
Figure 1. A SWISS-PROT entry displayed by a list-view.
This type of view, the list-view, displays the selected data-fields in the order in which they occur within the entry and in their original format, which includes line codes or field labels. It is mostly useful for 'turning off' undesired information while browsing large sets of entries. The table-view provides more sophistication and possibilities. An entry in the table-view is displayed as a single row. The information is displayed in a different format in the sense that it is shown without the field identification (line codes or field labels). In addition, other conversions can take place to render the information into a more standardized format, e.g., all author names from all databanks can be converted into the form surname-comma-initials (e.g., "Jones, D.G."). The extraction and conversion of the information is accomplished by a parser that is tightly linked to the SRS system and which can be programmed in the Icarus language. Alternative productions can be specified for data-fields displayed in list- and table-views. Figure 2 shows the same information as in Figure 1 using a table-view.
SWISSPROT | Description | SeqLength
ACHA_BOVIN | ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR. | 457
ACHA_CHICK | ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR. | 456
ACHA_ELEEL | ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN (FRAGMENT). | 24
Figure 2. A list of SWISS-PROT entries shown in a table-view
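The difference between the two view types, and the kind of conversion a table-view may apply, can be sketched in Python. In SRS this work is done by the parser programmed in Icarus; the code below is only a stand-in under assumptions of its own. The record layout in `LIST_VIEW` is simplified (a line code followed by the value), and the functions `strip_line_codes` and `normalize_author` are hypothetical names.

```python
import re

# Simplified list-view output: a line code followed by the field value.
LIST_VIEW = [
    "ID   ACHA_BOVIN",
    "DE   ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR.",
    "SQ   457",
]

# Matches a run of initials such as "D.G." or "T.".
INITIALS = re.compile(r"^(?:[A-Z]\.)+$")


def strip_line_codes(lines):
    # A table-view shows the values without the field identification.
    return [line.split(None, 1)[1] for line in lines]


def normalize_author(name):
    # Convert "D.G. Jones", "Jones D.G." or "Jones, D.G." to "Jones, D.G.".
    parts = name.replace(",", " ").split()
    initials = [p for p in parts if INITIALS.match(p)]
    surname = [p for p in parts if not INITIALS.match(p)]
    return f"{' '.join(surname)}, {''.join(initials)}"


# One entry becomes one tab-separated row.
print("\t".join(strip_line_codes(LIST_VIEW)))
print(normalize_author("D.G. Jones"))
```

The name conversion deliberately handles only the simple patterns shown in the comment; real author fields vary more widely between databanks.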
The output format of the table-view can be HTML code using the HTML3 table specification or plain ASCII, which can be loaded directly into spreadsheet programs or relational database systems. Other formats can easily be added since the table is constructed by using a template. The HTML implementation makes use of the various options to display text within cells, which can be left or right aligned, centered or preformatted. A view in SRS is independent of queries and can be applied to any set of entries provided they are from the databank for which the view was specified. It is possible to specify a view for multiple databanks so that a single view can be applied to sets generated by querying several databanks at the same time. However, one needs to be careful since particular data-fields may not be shared by all databanks for which the view is defined. Views become particularly useful when combined with links. Each entry displayed can be individually linked with entries from other databanks, from which again only selected information is displayed. Since the linked entries in turn can be displayed by a view that specifies further links, the overall structure can be seen as a tree in which the databanks for which the view is defined are the root databanks and the databanks supplying the linked entries are the leaf databanks. Figure 3 shows SWISS-PROT entries together with linked entries from ENZYME.
SWISSPROT | Description | ENZYME | CatalyticActivity
MDH_HALMA | MALATE DEHYDROGENASE (EC 1.1.1.37). | 1.1.1.37 | (S)-MALATE + NAD(+) = OXALOACETATE + NADH.
ODPA_ASCSU | PYRUVATE DEHYDROGENASE E1 COMPONENT, ALPHA SUBUNIT, TYPE I PRECURSOR (EC 1.2.4.1) (PDHE1-A). | 1.2.4.1 | PYRUVATE + LIPOAMIDE = S-ACETYLDIHYDROLIPOAMIDE + CO(2).
Figure 3. Two SWISS-PROT entries shown with the Description field together with their linked entries from ENZYME, for which the "CatalyticActivity" field is included.
An individual data-field may have a list of format options. The Sequence field of protein sequence databanks, for example, can be displayed as the plain sequence of characters, in GCG, PIR or FASTA format, or as a Java applet which shows the sequence along with a plot for various amino acid characteristics such as the Kyte-Doolittle hydropathy values. Figure 4 shows a SWISS-PROT entry with the sequence displayed by a Java applet. Displaying Java applets within views is an elegant way of linking data with analysis methods and is straightforward since HTML provides tags for calling applets and supplying their input.
Figure 4. A SWISS-PROT entry with its sequence shown by a Java applet as a hydropathicity plot. The applet can be controlled to display other amino acid properties or to show the curve with a larger scale.
Since a single root entry may be linked to many leaf entries, it is sometimes useful to have only the number of linked entries displayed. The SRSWWW server shows this number as a hypertext link that, when clicked, retrieves the list of leaf entries. The number is of value in itself and can be used for statistics. For instance, using the previous example of the link from TAXONOMY to GENBANK, one could obtain a table of taxa together with the number of GENBANK entries that exist for all organisms each taxon represents. By default SRS uses the shortest path to link root and leaf entries. If another link is desired it must be explicitly specified as a query expression in which the root entry to be linked is represented by the reserved word "entry". For instance, to obtain the view with the number of GENBANK entries for each taxon in a list of taxa, one must specify the query entry >_ taxonomy_down > genbank.
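A view that combines root entries with their linked leaf entries, or with only the number of linked entries, can be modelled in a few lines of Python. The sketch reuses the data of Figure 3; the dictionaries `ROOT`, `LEAF` and `LINKS` and the function `apply_view` are hypothetical and merely stand in for the databanks, the link index and the view mechanism.

```python
# Root entries (SWISS-PROT) and leaf entries (ENZYME), taken from Figure 3.
ROOT = {
    "MDH_HALMA": {"Description": "MALATE DEHYDROGENASE (EC 1.1.1.37)."},
    "ODPA_ASCSU": {
        "Description": "PYRUVATE DEHYDROGENASE E1 COMPONENT, ALPHA SUBUNIT, "
                       "TYPE I PRECURSOR (EC 1.2.4.1) (PDHE1-A)."
    },
}
LEAF = {
    "1.1.1.37": {"CatalyticActivity": "(S)-MALATE + NAD(+) = OXALOACETATE + NADH."},
    "1.2.4.1": {
        "CatalyticActivity": "PYRUVATE + LIPOAMIDE = S-ACETYLDIHYDROLIPOAMIDE + CO(2)."
    },
}
# Stand-in for the link index: root entry -> linked leaf entries.
LINKS = {"MDH_HALMA": ["1.1.1.37"], "ODPA_ASCSU": ["1.2.4.1"]}


def apply_view(root_ids, root_fields, leaf_fields, count_only=False):
    # Build one table row per root entry.
    rows = []
    for rid in root_ids:
        row = [rid] + [ROOT[rid].get(f, "") for f in root_fields]
        linked = LINKS.get(rid, [])
        if count_only:
            # Show only how many leaf entries are linked to this root entry.
            row.append(str(len(linked)))
        else:
            for lid in linked:
                row.append(lid)
                row += [LEAF[lid].get(f, "") for f in leaf_fields]
        rows.append(row)
    return rows


# Plain ASCII output, one tab-separated line per root entry.
for row in apply_view(["MDH_HALMA", "ODPA_ASCSU"], ["Description"], ["CatalyticActivity"]):
    print("\t".join(row))
for row in apply_view(["MDH_HALMA"], ["Description"], [], count_only=True):
    print("\t".join(row))
```

Because the view is a function of the entry set rather than of a query, the same definition can be applied to any list of root entries.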
Working with views
SRS offers different ways of working with views. The SRSWWW server provides two pages, the first for selecting root and leaf databanks and the second for selecting data-fields for each. An additional page displays all defined views. For each databank a number of predefined views exist that can be used or modified by the user. Views can be applied to entire sets returned by queries or to single entries selected in an entry list. The command line program "getz" allows definition of simple views on the command line, as in % getz -vf description '[embl-description:kinase]' which prints the Description field for all kinase sequences in the EMBL databank. The entry name itself will always be displayed and need not be specified. The predefined views of the SRSWWW server are specified in the Icarus language and are available to all SRS programs. More views can easily be added and selected for display by the associated name, as in the "getz" command line % getz -view myview '[embl-description:kinase]' which uses the view with the name "myview" to display the result of the query.
Discussion
Views or external schemas are commonly used in databank systems, but until now have not been applied within retrieval systems. The views described in this paper are restricted since they act on entire data-fields, which can be large and complex entities themselves. To get a finer granularity in selecting information, a system must be used that supports full data schemas (relational databank systems) or object class specifications (object-oriented databank systems). However, the simplicity of viewing on the level of data-fields has the advantage that implementation and maintenance are much easier. Databanks can be incorporated into the system without the need for full knowledge of their organization, which allows the integration of extremely heterogeneous databanks. Also, most databanks have a very simple structure, and most often selection from a list of data-fields is sufficient. SRS accesses the databanks directly on the flat file level. Its parser makes it completely independent of the format of these files, which can range from relational tables to, for instance, PDB files. The table-view gives further independence since it can convert the data into a standard format, e.g., the sequence length will always be displayed as a single number within a table column, and more complex items such as a date or an author can share a common format across all databanks. If views are accepted by the community, the immediate effect will be that databank links or cross-references play an even more important role than they do now, which will improve and extend the cross-reference information available.
Acknowledgements
The authors are grateful for financial support from the European Union (Grant Gene-CT-93-0043) under the Biomed I program.
References
i. P. Bucher and E.N. Trifonov, Nucleic Acids Res. 14, 10009 (1986).
ii. The Protein Databank, http://pdb.pdb.bnl.gov/.
iii. A. Bairoch, Nucleic Acids Res. 24, 221 (1996).
iv. The EMBL Nucleotide Databank, http://www.ebi.ac.uk.
v. E. Wingender, P. Dietze, H. Karas, R. Knueppel, Nucleic Acids Res. 24, 238 (1996).
vi. T. Etzold and P. Argos, Comput. Appl. Biosci. 9, 59 (1993).
vii. A. Bairoch and R. Apweiler, Nucleic Acids Res. 24, 21 (1996).
viii. D. Leipe and V. Soussov, http://www3.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html.
ix. D.A. Benson, M.S. Boguski, D.J. Lipman, J. Ostell, Nucleic Acids Res. 24, 1 (1996).
x. R. Schneider and C. Sander, Nucleic Acids Res. 24, 201 (1996).
xi. F. Achard, http://www.infobiogen.fr/virgil/home.html.
xii. K.H. Fasman, S.I. Letovsky, R.W. Cottingham and D.T. Kingsbury, Nucleic Acids Res. 24, 57 (1996).
xiii. Interpreter of Commands and RecUrsive Syntax, http://kappa.embl-heidelberg:8000/srs/.
xiv. J. Kyte and R.F. Doolittle, J. Mol. Biol. 157, 105 (1982).
xv. T. Etzold and P. Argos, Comput. Appl. Biosci. 9, 49 (1993).
xvi. T. Etzold, A. Ulyanov and P. Argos, Methods in Enzymology 266, 114 (1996).
Abstract
Information in molecular biology, or biology in general, is contained in a multitude of different databanks, each specializing in a certain area. This specialization is useful since it improves the maintainability of the data and the flexibility of the overall structure, which until now has managed to exist almost without any standards. However, gathering information from these extremely heterogeneous sources is difficult since the desired data may be scattered over many databanks. This paper presents the concept of views implemented in the retrieval system SRS, which uses links between databanks and a sophisticated parsing engine for information extraction to provide a flexible way of searching and obtaining data across databank boundaries. The implementation provides homogeneous access to all databanks using two types of views: the list-view, which displays selected data-fields in their original format, and the table-view, available for HTML and plain ASCII text, which tries to represent the data in a format independent of that of the source databank.
Name: Thure Matthias Etzold
Citizenship: German/American
Born: 7. 6. 1960 in Waltham/Mass., USA
University: November 1981 - February 1983, Biology (Vordiplom), Universität Erlangen/Nürnberg; April 1983 - December 1987, Biology (Diplom), Genetics, Botany, Physical Chemistry, Universität zu Köln
Diplom Thesis: November 1986 - November 1987, Max-Planck-Institut für Züchtungsforschung, Köln, Prof. Dr. J. Schell and Dr. P. Schreier, "Untersuchungen zur Transformation von Chloroplasten"
Dissertation: March 1989 - September 1992, Max-Planck-Institut für Züchtungsforschung, Köln, Prof. Dr. J. Schell and Dr. K. Stüber, "A Retrieval System for Molecular Biological Databanks: Novel Methods for Intertwining Databanks and Defining Internal Data"; from November 1990, completion with Dr. P. Argos, EMBL, Heidelberg
Current: Since November 1992, staff scientist at the EMBL, Heidelberg, in the group of Dr. P. Argos.