/usr/share/EMBOSS/doc/manuals/domainatrix.txt

Domainatrix






Contents

1. Introduction
2. Background
3. Overview of method for predicting domain complement
4. Implementation
5. Application dependencies

Figure 1  Overview of method for predicting domain complement
Table 1 - New software

Table 2 - New data files and databases



1.0  Introduction
This document describes the  use  of  EMBOSS  applications  in  the  EMBASSY
package domainatrix.  Most of these directly or indirectly make use  of  the
protein structure databases pdb and scop.   All  of  the  applications  here
were written by Jon Ison, Ranjeeva Ranasinghe, Matt Blades  and  Waqas  Awan
and coordinated by  Jon  Ison  to  whom  enquiries  should  be  sent  (email
jison@hgmp.mrc.ac.uk).  This software is part of  an  experimental  analysis
pipeline and we provide it  in  the  hope  that  it  will  be  useful.   The
applications were designed for very specific research purposes and  may  not
be useful or reliable outside that framework.  Please report  bugs  to  Jon.
A detailed description of each application is given with the application
source code (.c files).  A summary of  how  the  applications  can  be  used
together follows.



2.0  Background

Thousands of genes will be identified from the  completed  sequence  of  the
human genome, many of which will lack experimentally  determined  functional
data.  The ability to predict structural and functional  characteristics  of
novel protein sequences would enhance the  functional  annotation  of  these
genes and provide evidence to direct experimental  work.   Indeed,  homology
search tools such as BLAST  are  the  most  widely  used  of  bioinformatics
applications.  Weak homologies remain difficult to detect however and it  is
not always easy to infer specific functional  or  structural  properties  by
inspecting the results of BLAST searches.

We have developed a  novel  approach  whereby  discriminating  elements  for
protein structural and functional features  are  generated  from  the  known
protein structures held in PDB.  The discriminating  elements  are  suitable
for  screening  (scoring)  against  protein  sequences.    Tools   for   the
development of a library of discriminating elements for SCOP  families  (see
Section 3.0 and Figure 1) are described here while tools for the  prediction
of  ligand-binding  properties  are  in  preparation.    Our   methods   are
encapsulated in a software architecture developed under EMBOSS and this  has
required considerable extensions to the AJAX library for  protein  structure
(see Section 4.0).   Several  bespoke  EMBOSS  applications  (Table  1)  and
secondary (derived) protein databases (Table  2)  have  been  generated  and
made available.




3.0  Overview of method for predicting domain complement

The method (Figure 1) uses data-sets derived  from  the  SCOP  database,  in
which protein structural domains are extracted from PDB and organised  in  a
hierarchical classification of class, fold, super-family and family  on  the
basis  of  structural  and  evolutionary  relatedness.   We  have  developed
software to (i) generate various discriminating elements for SCOP  families,
(ii) scan the discriminating elements against a  protein  sequence  database
such as SWISSPROT and  analyse  their  performance  (validation)  and  (iii)
construct a library of discriminating elements suitable for  screening  with
protein sequences (prediction).

The discriminating  elements  include  sparse  sequence  signatures,  hidden
Markov models, simple residue  frequency  matrices,  Gribskov  profiles  and
Henikoff profiles.  They  are  generated  from  a  structure-based  sequence
alignment (seed alignment hereon) of SCOP domains  that  has  been  extended
with sequence relatives (of unknown structure) to the  family  in  question.
The sequence relatives are found by using PSIBLAST to search SWISSPROT  with
the seed alignment for each SCOP family.  Hits  that  can  be  unambiguously
assigned as a relative to a SCOP family are collated into  a  SCOP  families
file and will be used  in  the  extended  alignments.   For  validating  the
discriminating elements, a SCOP validation  file  containing  all  the  hits
from the SCOP families file plus hits  of  ambiguous  family  assignment  is
also prepared.

The method is  described  in  eight  steps  below  as  follows  (numbers  in
parentheses correspond to numbers in Figure 1): i. Prepare clean  coordinate
files.  ii. Prepare SCOP classification file.  iii. Generate seed  alignment
for each SCOP family.  iv.  Prepare SCOP families and validation files.   v.
Generate extended sequence alignment for each  SCOP  family.   vi.  Generate
library of discriminating elements  for  SCOP  families.   vii.  Screen  the
discriminating elements against  SWISSPROT  and  analyse  their  performance
(validation).  viii. Screen novel  protein  sequences  against  the  library
(prediction).


Figure 1  Overview of method for predicting domain complement



The method is described in eight steps numbered in the  figure  as  follows:
(1) Prepare clean coordinate files.  (2) Prepare SCOP  classification  file.
(3) Generate  seed  alignment  for  each  SCOP  family.   (4)  Prepare  SCOP
families and validation files.  (5)  Generate  extended  sequence  alignment
for each SCOP family.  (6) Generate library of discriminating  elements  for
SCOP families.  (7) Screen the  discriminating  elements  against  SWISSPROT
and analyse  their  performance  (validation).   (8)  Screen  novel  protein
sequences against the library (prediction).  Each step is explained  in  the
text.
3.1  Prepare clean coordinate files
We required direct access to the co-ordinate data  held  in  PDB.   However,
the text files provided are notoriously difficult  to  parse  reliably,  the
problems arising from errors in individual PDB  files  and  an  awkward  and
inconsistent file format, which has evolved over some 30 years in  a  rather
ad hoc manner.  A difficult aspect of parsing  is  determining  the  residue
sequence and ensuring that the  atomic  co-ordinates  are  assigned  to  the
correct position in the sequence in the  relevant  data  structure;  residue
numbers must be treated as strings  and  a  sequential  numerical  numbering
scheme  is  not  consistently  used.   While  extensive  validation  is  now
performed on deposited data, including comparisons  of  PDB  SEQRES  records
(used to hold the protein sequence) and the sequence derived  from  the  co-
ordinate records, there is a legacy of PDB files that predate these  quality
control measures.

We required a source of protein  co-ordinate  data  that  allowed  fast  and
convenient access, correctly employed a consistent residue numbering  scheme
and which incorporated information  on  known  structural  domain  in  SCOP.
Software was developed to  parse  the  PDB  files  and,  where  protein  co-
ordinates are  present,  generate  cleaned  up  files  of  co-ordinate  data
corresponding to whole PDB files and individual SCOP domains.   These  files
contain corrections to  some  of  the  errors  and  inconsistencies  in  the
original PDB file, contain minimal bibliographic  data,  and  use  a  highly
parsable and self-consistent format.  Clean coordinate files for  all  known
protein structures in PDB were generated  by  using  pdbparse  and  for  all
known SCOP domains by using domainer.

3.2  Prepare SCOP classification file
The parsable files provided by the SCOP  authors  are  inconvenient  because
the descriptive text is given in a  different  file  to  the  classification
itself and they are not suitable  for  extending  with  other  records.   We
required a  single  source  of  classification  data  that  was  extendable.
scoparse was developed to convert the raw SCOP parsable files  to  a  single
file in EMBL-like format (the SCOP classification file).

The SCOP classification file contains the domains that will be  structurally
aligned.  It is  desirable  to  include  only  high  quality,  non-redundant
domains.  Also, a better alignment will result if the alignment begins  with
a domain that in structural terms is  representative  of  the  family  as  a
whole.  New software was developed to process the SCOP  classification  file
accordingly as  follows.   scopreso  removes  low  resolution  domains.   To
remove redundant domains, knowledge of the domain  sequences  is  necessary.
scopseqs  will  add  PDB  and  SWISSPROT  sequence   records   to   a   SCOP
classification file, but requires  a  file  (provided  by  pdbtosp)  listing
accession numbers for different PDB identifier codes.   scopnr  will  remove
redundant domains from a  SCOP  classification  file  that  has  first  been
processed by scopseqs.  Finally, scoprep will reorder the file so  that  the
representative structure of each family is given first.

3.3  Generate structure-based sequence alignment for each SCOP family
A seed alignment was generated for each family in  the  SCOP  classification
file by using STAMP.  An application wrapper to STAMP called  scopalign  was
developed for this purpose.  The domain  given  first  for  each  family  is
taken to be the structural representative; all other domains in  the  family
are aligned to this one.

3.4  Prepare SCOP families and validation files
The SCOP families file contains sequence  relatives  (hits)  for  each  SCOP
family that will be used to generate the  extended  alignments.   Such  hits
can easily be found by using PSIBLAST however it is likely that the  results
of searches for  two  related  families  would  be  overlapping,  and  would
include redundant as well as fragmentary sequences.   To  avoid  introducing
bias into the extended alignments, a hit should be  listed  under  a  single
family only and redundant hits or hits corresponding  to  sequences  in  the
seed alignments should be excluded.  Fragment sequences should  be  excluded
as they lack biological significance.   In  contrast,  the  SCOP  validation
file must include all of the known  relatives  (excluding  fragments),  i.e.
all the hits from the SCOP families file, plus any redundant hits,  hits  of
ambiguous family assignment and  hits  corresponding  to  sequences  in  the
alignment.

The following software was  developed  to  prepare  the  SCOP  families  and
validation files.  seqsearch generates a file of hits for each  SCOP  family
by  using  PSIBLAST  to  search  SWISSPROT  with  the   corresponding   seed
alignment.  fraggle removes fragment  sequences  from  the  files  of  hits.
seqsort reads multiple files of hits and write (i) a SCOP families file,  of
hits  that  could  be  uniquely  assigned  to  a  family  and  (ii)  a  SCOP
ambiguities  file,  for  hits  of  ambiguous  family  assignment  which  are
assigned as relatives to a SCOP superfamily or fold  instead.   seqnr  reads
the SCOP families and ambiguities files and write (i) a  non-redundant  SCOP
families  file  and  (ii)  a  SCOP  validation  file.   seqnr  ensures  that
sequences from the seed alignments are (i) excluded from the  SCOP  families
file (so they  do  not  appear  twice  in  the  extended  alignments),  (ii)
included in the validation file (if they do not  already  correspond  to  an
existing  hit)  and  (iii)  are  considered  for  purposes  of   calculating
redundancy.

3.5  Generate extended sequence alignment for each SCOP family.
An extended alignment was generated for each family  in  the  SCOP  families
file by using CLUSTALW.  An application wrapper to CLUSTALW called  seqalign
was developed for this purpose.  Hits for a family are added one by  one  to
the appropriate seed alignment.

3.6  Generate library of discriminating elements for SCOP families.
One of each type of discriminating element was generated from  the  extended
alignment of each SCOP family and compiled into a  library.   The  following
software was developed for this purpose:  profgen  generate  simple  residue
frequency  matrices,  Gribskov  profile  and  Henikoff   profiles,    hmmgen
generates a hidden Markov model  by  using  the  HMMER  package  and  siggen
generates  sparse  protein  signatures.   Alternatively  a  sparse   protein
signature can be generated from a  seed  alignment  and  files  of  residue-
residue contact data for the domains in the alignment.   Various  algorithms
for scoring an alignment on  sequence  and  /  or  structural  criteria  and
generating a signature are available but are not described here.

3.7  Screen the discriminating elements against SWISSPROT and analyse  their
performance (validation).
Discriminating elements of the  various  types  can  be  scanned  against  a
sequence database by using the libscan application that  was  developed  for
the purpose.  libscan invokes pattern matching and scoring  algorithms  that
have been implemented in the AJAX programming library.   The  results  of  a
search are returned to the user in a discriminator hits  file  containing  a
list  of  top-scoring  hits  (sequence  regions)  rank-ordered   by   score.
Typically, a database scan would be done by an end-user seeking to  identify
relatives of a SCOP family of interest, but in our case  was  done  to  test
and validate our methods.  We  required  an  assessment  of  the  predictive
power  of  the  discriminating  elements  however  manually  inspecting  the
results  of  database  searches  is  unreliable  and  inconvenient,  and  is
unfeasible for large-scale applications such as this one.

We have  implemented  methods  (incorporated  in  libscan)  to  provide,  by
reference to the validation file, a classification of each hit  returned  by
a search as follows: SEED if the hit was included in the original  alignment
from which the discriminating element was generated. HIT if the hit  is  not
a SEED but was unambiguously assigned as a relative to  the  family  of  the
discriminating element when the validation file was  constructed.  CROSS  if
the hit is a relative of a family that is different from the  discriminating
element but is of the same fold.  FALSE if  the  hit  is  a  relative  of  a
different family with a different fold or UNKNOWN if the hit  is  not  known
to be a FALSE, CROSS or a true hit (a SEED  or  HIT).   This  classification
greatly accelerates interpretation of  the  results  of  database  searches.
libscan also provides statistical estimates of the  significance  of  scored
matches in the  form  of  p-values  that  are  calculated  from  empirically
derived distributions of scores.  The p-values are a very  powerful  aid  to
interpretation and our implementation represents a major enhancement to  the
discriminator methods.

We  developed  the  sigplot  application  to  provide  a  further  level  of
interpretation and graphical display of discriminator performance.   sigplot
reads a discriminator hits file and generates two different types  of  graph
of discriminator performance.   The  first  gives  the  proportion  of  hits
detected that are true hits (a SEED or HIT), CROSS, UNKNOWN or FALSE  versus
the number of hits.   The  second  uses  Receiver  Operating  Characteristic
(ROC)  curves  to  display  the   sensitivity   and   specificity   of   the
discriminating elements.  ROC analysis has  been  used  for  many  years  in
clinical studies to evaluate the usefulness of diagnostic  tests.   The  ROC
curves generated by sigplot plot sensitivity  or  "true  positive  fraction"
versus (1-specificity), the "true negative fraction"  and  are  provided  as
data files suitable for display using  the  popular  UNIX  utility  gnuplot.
They are a powerful aid to interpretation, in particular of the  results  of
multiple discriminator scans side-by-side.

3.8  Screen novel protein sequences against the library (prediction).
The advantage of screening a relatively small library  with  a  sequence  is
that  it  is  sufficient  for  the  sequence  to  detect  its  true   family
(discriminator) in the first rank for an effective prediction.  This  is  in
contrast to searching a  larger  sequence  database  to  identify  homology,
where biologically significant hits may achieve statistically  insignificant
scores and therefore be missed.  A library might help the detection of  such
proteins because they may still score their true discriminator  higher  than
the others in the library  regardless  of  statistical  estimates.   Further
improvements to predictions are gained when multiple  sources  of  evidence,
in this case the different types of discriminating element, are  considered.

The libscan application  allows  a  protein  sequence  or  sequences  to  be
screened against the library of discriminating elements.  The results  of  a
screen are returned to the user in a library scan file containing a list  of
top-scoring SCOP domains rank-ordered by p-value for  each  individual  type
of  discriminator,  and  also  for  all  of  the  discriminator   types   in
combination (combined prediction).  For  the  combined  prediction,  the  p-
value is derived from an empirically derived distribution of the product  of
the p-values of the individual methods.


4.0  Implementation

Unless otherwise stated, all software was developed in C using the AJAX
programming libraries and are included in the EMBOSS distribution.  EMBOSS
has traditionally catered for molecular sequence analysis and considerable
new extensions to the libraries for protein structure and for processing
the results of database searches were required.  A total of 22 new
applications (Table 1) have been added to EMBOSS.  A couple of these are
alpha releases (i.e. have not been thoroughly debugged or some
functionality may be unavailable) but mostly these are beta releases (more
stable code with near full functionality which has been more thoroughly
tested).  The software has been used to construct several new data files
and secondary (derived) databases for protein structure (Table 2) and these
have been validated by manual inspection to ensure integrity of the data.




5.0  Application dependencies

The applications below rely on other programs that are not part of EMBOSS
itself.  These support programs have to be installed for the applications
to work.  When running the applications at the HGMP it is essential that
the appropriate script is run (e.g. by typing use hmmer) to set up the
support programs as follows:

|Application         |Command to run script   |Script                   |
|hmmgen              |'use hmmer'             |/packages/menu/USE/hmmer |
|seqsearch           |'use blast_v2'          |/packages/menu/USE/blast_|
|                    |                        |v2                       |
|seqalign            |'use clustal'           |/packages/menu/USE/clusta|
|                    |                        |l                        |
|scoprep             |'use stamp2'            |/packages/menu/USE/stamp2|
|scopalign           |'use stamp2'            |/packages/menu/USE/stamp2|


scoprep and scopalign will only run with a version of stamp which has been
modified so that pdb identifier codes of length greater than 4 characters
are acceptable.  This involves a trivial change to the stamp module
getdomain.c (around line number 155), a 4 must be changed to a 7 as
follows:
temp=getfile(domain[0].id,dirfile,4,OUTPUT);
temp=getfile(domain[0].id,dirfile,7,OUTPUT);
The modified code is kept on the HGMP file system in /packages/stamp/src2.
The command 'use stamp2' will ensure that the modified version of stamp is
used.


Adaption of STAMP for larger datasets

STAMP was failing to align a large dataset of all the available V set Ig
domains. The ver2hor module generated the following error:
Transforming coordinates...
  ...done.
 ver2hor -f ./scopalign-1022069396.11280.76.post > ./scopalign-
1022069396.11280.out
 error: something wrong with STAMP file
          STAMP length is 370, Alignment length is 422
          STAMP nseq is 155, Alignment nseq is 155

 This was fixed by changing #define MAXtlen 200 to #define MAXtlen 2000 in
alignfit.h.

At the same time the following were changed as a safety measure:
gstamp.c  : #define MAX_SEQ_LEN 10000    (was 2000)
pdbseq.c  : #define MAX_SEQ_LEN 10000    (was 3000)
defaults.h: #define MAX_SEQ_LEN 10000    (was 8000)
defaults.h: #define MAX_NSEQ 10000       (was 1000)
defaults.h: #define MAX_BLOC_SEQ 5000    (was 500)
dstamp.h  : #define MAX_N_SEQ 10000      (was 1000)
ver2hor.h : #define MAX_N_SEQ 10000      (was 1000)

The modified code is kept on the HGMP file system in /packages/stamp/src2.
The command 'use stamp2' (which runs the script /packages/menu/USE/stamp2)
will ensure that the modified version of stamp is used.



Table 1 - New software

|Application |Description                                                |
|pdbparse    |Parses PDB files and writes cleaned-up protein coordinate  |
|            |files.                                                     |
|domainer    |Reads protein coordinate files and writes domain coordinate|
|            |files.                                                     |
|contacts    |Reads coordinate files and writes files of intra-chain     |
|            |residue-residue contact data.                              |
|interface   |Reads coordinate files and writes files of inter-chain     |
|            |residue-residue contact data.                              |
|scopparse   |Converts raw SCOP classification files to a file in        |
|            |EMBL-like format.                                          |
|scopreso    |Removes low resolution domains from a SCOP classification  |
|            |file.                                                      |
|pdbtosp     |Convert raw SWISSPROT-PDB equivalence file to EMBL-like    |
|            |format.                                                    |
|scopseqs    |Adds PDB and SWISSPROT sequence records to a SCOP          |
|            |classification file.                                       |
|scopnr      |Removes redundant domains from a SCOP classification file. |
|scoprep     |Reorder SCOP classification file so that the representative|
|            |structure of each family is given first.                   |
|scopalign   |Generate alignments for families in a SCOP classification  |
|            |file by using STAMP.                                       |
|seqsearch   |Generate files of hits for SCOP family alignments by using |
|            |PSI-BLAST.                                                 |
|fraggle     |Removes fragment sequences from files of hits for SCOP     |
|            |families.                                                  |
|seqsort     |Reads multiple files of hits and writes (i) a SCOP families|
|            |file and (ii) a SCOP ambiguities file.                     |
|seqnr       |Reads a SCOP families file and a SCOP ambiguities file and |
|            |writes (i) a non-redundant SCOP families file and (ii) a   |
|            |SCOP validation file.                                      |
|seqalign    |Generate extended alignments for families in a SCOP        |
|            |families file by using CLUSTALW with seed alignments.      |
|siggen      |Generates a sparse protein signature from an alignment and |
|            |residue contact data.                                      |
|profgen     |Generates various profiles for each alignment in a         |
|            |directory.                                                 |
|hmmgen      |Generates a hidden Markov model for each alignment in a    |
|            |directory by using the HMMER package.                      |
|sigscan     |Scans a signature against SWISSPROT and writes a           |
|            |discriminator hits file.                                   |
|libscan     |Scans each signature, profile or HMM in a directory against|
|            |swissprot and writes a signature hits file for each one. Or|
|            |scans sequences against such a library of discriminating   |
|            |elements and writes a library scan file for each one.      |
|sigplot     |Reads a discriminator hits file and a validation file and  |
|            |generates gnuplot data files of signature performance.     |
|hetparse    |Converts raw dictionary of heterogen groups to a file in   |
|            |embl-like format.  The use of this application is not      |
|            |explained in this document.                                |
|seqwords    |Generate files of hits for scop families by searching      |
|            |swissprot with keywords.  The use of this application is   |
|            |not explained in this document.                            |


All software was developed in C using the  AJAX  programming  libraries  and
are  included  in  the  EMBOSS   distribution.    Documentation   for   each
application                 is                 available                  at
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/index.html.



Table 2 - New data files and databases

|Data file or |Description                                   |Availabili|
|database     |                                              |ty        |
|SCOP         |Classification and other data including       |HGMP      |
|classificatio|sequences for domains from the SCOP database. |          |
|n file       |Several files processed for different levels  |          |
|             |of redundancy removal are available.          |          |
|Protein      |Protein coordinate and other data extracted   |Public    |
|coordinate   |from each available PDB file.  The files      |          |
|files        |contain cleaned-up data that is               |          |
|             |self-consistent and error-corrected.          |          |
|Domain       |Coordinate and other data for each single SCOP|Public    |
|coordinate   |domain.  Files in EMBL-like and PDB formats   |          |
|files        |are available.                                |          |
|Contacts     |Intra-chain residue-residue contact data.     |Public    |
|files        |Files for whole protein structures and        |          |
|             |individual SCOP domains are available.        |          |
|Seed         |A structure-based sequence alignment of       |HGMP      |
|alignments   |domains of known structure only and belonging |          |
|             |to the same SCOP family.  The files are       |          |
|             |annotated with records describing the SCOP    |          |
|             |classification of the family.  Seed alignments|          |
|             |have been generated from SCOP classification  |          |
|             |files generated at a threshold of 50% and 90% |          |
|             |sequence similarity for redundancy removal and|          |
|             |are available for each SCOP family containing |          |
|             |two or more non-redundant domains.            |          |
|SCOP hits    |Sequence relatives (hits) to individual SCOP  |HGMP      |
|files        |families found from searching SWISSPROT with a|          |
|             |seed alignment.  They have been generated for |          |
|             |each family in SCOP classification files      |          |
|             |generated at a threshold of 50% and 90%       |          |
|             |sequence similarity for redundancy removal.   |          |
|SCOP families|Sequence relatives (hits) to SCOP families.   |HGMP      |
|file         |The file contains the collated results of     |          |
|             |searching SWISSPROT with a seed alignment for |          |
|             |the individual SCOP families; only those hits |          |
|             |of unambiguous family assignment are included.|          |
|SCOP         |Sequence relatives (hits) to each of every    |HGMP      |
|validation   |SCOP family and various superfamilies and     |          |
|file         |folds.  Similar to the SCOP families file     |          |
|             |except that hits of ambiguous family          |          |
|             |assignment (assigned as a relative to a       |          |
|             |superfamily or fold) are also included.       |          |
|Extended     |Seed alignments that have been extended with  |HGMP      |
|alignments   |sequence relatives from a SCOP families file. |          |
|             |Extended alignments have been generated from  |          |
|             |SCOP families files generated at a threshold  |          |
|             |of 40% and 90% sequence similarity for        |          |
|             |redundancy removal and are available for each |          |
|             |SCOP family containing two or more            |          |
|             |non-redundant domains.                        |          |
|Discriminatin|One each of a sparse protein signature, hidden|HGMP      |
|g elements   |Markov model, Gribskov profile and Hennikoff  |          |
|             |profile is available for each SCOP family for |          |
|             |which there is an extended alignment.         |          |


The headings under Availability have the following meaning:  i.  Public,  to
the      general       public       via       anonymous       ftp       from
ftp://ftp.uk.embnet.org/pub/databases.  ii. HGMP,  to  registered  users  of
the HGMP from /data/structure/.  An EMBL-like format has been used  for  the
files wherever possible.
emboss-doc 6.6.0-1 / usr / share / EMBOSS / doc / manuals / domainatrix.txt