/usr/share/doc/blast2/rpsblast.html is in blast2 1:2.2.26.20120620-7.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
<title></title>
</head>
<body>
<pre>
RPS Blast: Reversed Position Specific Blast
RPS-BLAST (Reverse PSI-BLAST) searches a query sequence against a database
of profiles. This is the opposite of PSI-BLAST that searches a profile
against a database of sequences, hence the 'Reverse'. RPS-BLAST
uses a BLAST-like algorithm, finding single- or double-word hits
and then performing an ungapped extension on these candidate matches.
If a sufficiently high-scoring ungapped alignment is produced, a gapped
extension is performed and those (gapped) alignments with sufficiently
low expect value are reported. This procedure is in contrast to IMPALA
that performs a Smith-Waterman calculation between the query and
each profile, rather than using a word-hit approach to identify
matches that should be extended.
RPS-BLAST uses a BLAST database, but also has some other files that
contain a precomputed lookup table for the profiles to allow the search
to proceed faster. Unfortunately it was not possible to make this
lookup table architecture independent (like the BLAST databases themselves)
and one cannot take a RPS-BLAST databases prepared on a big-endian
system (e.g., Solaris Sparc) and run it on a small-endian system
(e.g., NT). The RPS-BLAST database must be prepared again for the small-endian
system.
The CD-Search databases for RPS-BLAST can be found at:
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
It is necessary to untar the archive and run copymat and formatdb.
It is not necessary to run makemat on the databases from this
directory.
RPS-BLAST was coded by Sergei Shavirin with some help from Tom Madden.
RPS-BLAST reuses some of the IMPALA code for precomputing the lookup tables
and all of the IMPALA code for evaluating the statistical significance of a match.
1. Binary files used in RPS Blast:
The following binary files are used to setup and run RPS Blast:
makemat : primary profile preprocessor
(converts a collection of binary profiles, created by the -C option
of PSI-BLAST, into portable ASCII form);
copymat : secondary profile preprocessor
(converts ASCII matrices, produced by the primary preprocessor,
into database that can be read into memory quickly);
formatdb : general BLAST database formatter.
rpsblast : search program (searches a database of score
matrices, prepared by copymat, producing BLAST-like output).
2. Conversion of profiles into searchable database
*Note*: if you are starting with *.mtx files obtained from the NCBI FTP site or
another source you should skip the steps listed in 2.1.
2.1. Primary preprocessing
Prepare the following files:
i. a collection of PSI-BLAST-generated profiles with arbitrary
names and suffix .chk;
ii. a collection of "profile master sequences", associated with
the profiles, each in a separate file with arbitrary name and a 3 character
suffix starting with c;
the sequences can have deflines; they need not be sequences in nr or
in any other sequence database; if the sequences have deflines, then
the deflines must be unique.
iii. a list of profile file names, one per line, named
<database_name>.pn;
iv. a list of master sequence file names, one per line, in the same
order as a list of profile names, named
<database_name>.sn;
The following files will be created:
a. a collection of ASCII files, corresponding to each of the
original profiles, named
<profile_name>.mtx;
b. a list of ASCII matrix files, named
<database_name>.mn;
c. ASCII file with auxiliary information, named
<database_name>.aux;
Arguments to makemat:
-P database name (required)
-G Cost to open a gap (optional)
default = 11
-E Cost to extend a gap (optional)
default = 1
-U Underlying amino acid scoring matrix (optional)
default = BLOSUM62
-d Underlying sequence database used to create profiles (optional)
default = nr
-z Effective size of sequence database given by -d
default = current size of -d option
Note: It may make sense to use -z without -d when the
profiles were created with an older, smaller version of an
existing database
-S Scaling factor for matrix outputs to avoid round-off problems
default = PRO_DEFAULT_SCALING_UP (currently defined as 100)
Use 1.0 to have no scaling
Output scores will be scaled back down to a unit scale to make
them look more like BLAST scores, but we found working with a larger
scale to help with roundoff problems.
-H get help (overrides all other arguments)
Note: It is not enforced that the values of -G and -E passed to makemat
were actually used in making the checkpoints. However, the values fed
in to makemat are propagated to copymat and rpsblast.
ATTENTION: It is strongly recommended to use -S 1 - the scaling factor
should be set to 1 for rpsblast at this point in time.
2.2. Secondary preprocessing
Prepare the following files:
i. a collection of ASCII files, corresponding to each of the
original profiles, named
<profile_name>.mtx
(created by makemat);
ii. a collection of "profile master sequences", associated with
the profiles, each in a separate file with arbitrary name and a 3 character
suffix starting with c.
iii. a list of ASCII_matrix files, named
<database_name>.mn
(created by makemat);
iv. a list of master sequence file names, one per
line, in the same order as a list of matrix names, named
<database_name>.sn;
v. ASCII file with auxiliary information, named
<database_name>.aux
(created by makemat);
The files input to copymatices are in ASCII format and thus portable
between machines with different encodings for machine-readable files
The following files will be created:
a. a huge binary file, containing all profile matrices, named
<database_name>.rps;
b. a huge binary file, containing lookup table for the Blast search
corresponding to matrixes named <database_name>.loo
c. File containing concatenation of all FASTA "profile master sequences".
named <database_name> (without extention)
Arguments to copymat
-P database name (required)
-H get help (overrides all other arguments)
-r format data for RPS Blast
ATTENTION: "-r" parameter have to be set to TRUE to format data for
RPS Blast at this step.
NOTE: copymat requires a fair amount of memory as it first constructs
the the lookup table in memory before writing it to disk. Users have
found that they require a machine with at least 500 Meg of memory for this
task.
2.3 Creating of BLAST database from <database_name> file containing
all "profile master sequences".
"formatdb" program should be run to create regular BLAST database of all
"profile master sequences":
formatdb -i <database_name> -o T
3. Search
Arguments to RPS Blast
-i query sequence file (required)
-d RPS BLAST Database [File In]
-p if query sequence protein (if FALSE 6 frame franslation will be
conducted as in blastx program)
-P 0 for multiple hits 1-pass, 1 for single hit 1-pass [Integer]
default = 0
-o output file (optional)
default = stdout
-e Expectation value threshold (E), (optional, same as for BLAST)
default = 10
-m alignment view (optional, same as for BLAST)
-z effective length of database (optional)
-1 = length given via -z option to makemat
default (0) implies length is actual length of profile library
adjusted for end effects
APPENDIX:
A. Documentation of the .mtx file format
Format of the .mtx file:
L = Length of SEQ
SEQ = Sequence
ka#-* = Karlin/Altschul parameters, block #. There are three blocks,
each containing four floating point numbers on separate lines.
pX-Y = The position specific scores as integers.
The first element of this file format is [L]. This is the sequence
length. The second line contains the sequence itself, in NCBI AA
notation. After this, there are three KA blocks (four lines of
floating point numbers each), then the positional scores.
The positional scores are arranged in a grid. Each line contains 26
elements, corresponding to the 26 elements in the NCBI AA encoding,
and there are L lines where L is the previously mentioned sequence
length.
Using the symbols mentioned above, it looks something like this:
<L>
<SEQ>
<ka1-1>
<ka1-2>
<ka1-3>
<ka1-4>
<ka2-1>
<ka2-2>
<ka2-3>
<ka2-4>
<ka3-1>
<ka3-2>
<ka3-3>
<ka3-4>
<p1-1> <p1-2> <p1-3> ... <p1-26>
<p2-1> <p2-2> <p2-3> ... <p2-26>
...
<pL-1> <pL-2> <pL-3> ... <pL-26>
One can find the explanation for the three blocks of KA-parameters in
makemat's source code, lines 188-190:
putMatrixKbp(checkFile, compactSearch->kbp_gap_std[0], scaleScores, 1/scalingFactor);
putMatrixKbp(checkFile, compactSearch->kbp_gap_psi[0], scaleScores, 1/scalingFactor);
putMatrixKbp(checkFile, sbp->kbp_ideal, scaleScores, 1/scalingFactor);
Thus, the first KA block is the standard score, the second is for
PSI-Blast, and the third is the ideal score.
</pre>
</body>
</html>
|