/usr/share/RDKit/Contrib/mmpa/README.txt is in rdkit-data 201403-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
This directory contains the scripts used to generate matched molecular pairs (MMPs) from an input list of SMILES.
The fragment indexing algorithm used in the scripts is described in the following publications:
Hussain, J., & Rea, C. (2010). "Computationally efficient algorithm to identify matched molecular pairs (MMPs)
in large data sets." Journal of chemical information and modeling, 50(3), 339-348.
http://dx.doi.org/10.1021/ci900450m
Wagener, M., & Lommerse, J. P. (2006). "The quest for bioisosteric replacements."
Journal of chemical information and modeling, 46(2), 677-685.
The scripts require RDKit (www.rdkit.org) to be installed and properly configured.
Help is available for all the scripts using the -h option.
To find all the MMPs in your set
--------------------------------
The program to generate the MMPs from a set is divided into two parts; fragmentation and indexing.
Before running the programs, make sure your input set of SMILES:
- does not contain mixtures (salts etc.)
- does not contain "*" atoms
- has been canonicalised using RDKit.
If your SMILES set does not satisfy the conditions above, the programs are likely to fail or, in the case of
canonicalisation, will not identify MMPs involving H atom substitution.
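As a rough pre-check of the conditions above, a sketch along the following lines could screen an input file before fragmentation. It uses only the Python standard library, treats a "." in the SMILES as the sign of a mixture, and leaves canonicalisation to an optional callable. The function names are hypothetical and are not part of the scripts in this directory.

```python
import re


def parse_smiles_line(line):
    """Split a 'SMILES ID' line on whitespace or commas.

    Returns (smiles, id), or None if either field is missing.
    """
    fields = [f for f in re.split(r"[\s,]+", line.strip()) if f]
    if len(fields) < 2:
        return None
    return fields[0], fields[1]


def screen_smiles(lines, canonicalise=None):
    """Return (kept, rejected) for an iterable of input lines.

    kept     -> list of (smiles, id) tuples that pass the checks
    rejected -> list of (line, reason) tuples
    canonicalise is an optional callable (an assumption of this sketch),
    e.g. a wrapper around RDKit such as
    lambda s: Chem.MolToSmiles(Chem.MolFromSmiles(s))
    """
    kept, rejected = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        parsed = parse_smiles_line(line)
        if parsed is None:
            rejected.append((line, "missing SMILES or ID"))
            continue
        smiles, mol_id = parsed
        if "." in smiles:
            # disconnected components, e.g. a salt
            rejected.append((line, "mixture"))
        elif "*" in smiles:
            rejected.append((line, "contains * atom"))
        else:
            if canonicalise is not None:
                smiles = canonicalise(smiles)
            kept.append((smiles, mol_id))
    return kept, rejected
```

The kept entries can then be written back out as "SMILES ID" lines and passed to the fragmentation step.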
1) Fragmentation command:
python rfrag.py <SMILES_FILE >FRAGMENT_OUTPUT
Example command:
python rfrag.py <data/sample.smi >data/sample_fragmented.txt
Format of SMILES_FILE is: SMILES ID <space or comma separated>
See data/sample.smi for an example input file
Format of output: WHOLE_MOL_SMILES,ID,SMILES_OF_CORE,SMILES_OF_CONTEXT
See data/sample_fragmented.txt for an example output file
2) Index command:
python indexing.py <FRAGMENT_OUTPUT >MMP_OUTPUT.CSV
Format of output:
SMILES_OF_LEFT_MMP,SMILES_OF_RIGHT_MMP,ID_OF_LEFT_MMP,ID_OF_RIGHT_MMP,SMIRKS_OF_TRANSFORMATION,SMILES_OF_CONTEXT
This program has several options (see help from program below):
Usage: indexing.py [options]
Program to generate MMPs
Options:
-h, --help show this help message and exit
-s, --symmetric Output symmetrically equivalent MMPs, i.e output both
cmpd1,cmpd2, SMIRKS:A>>B and cmpd2,cmpd1, SMIRKS:B>>A
-m MAXSIZE, --maxsize=MAXSIZE
Maximum size of change (in heavy atoms) allowed in
matched molecular pairs identified. DEFAULT=10.
Note: This option overrides the ratio option if both
are specified.
-r RATIO, --ratio=RATIO
Maximum ratio of change allowed in matched molecular
pairs identified. The ratio is: size of change /
size of cmpd (in terms of heavy atoms). DEFAULT=0.3.
Note: If this option is used with the maxsize option,
the maxsize option will be used.
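The interaction of the maxsize and ratio options described in the help text can be illustrated with a small sketch. This is not the code used by indexing.py; the function name is hypothetical, and it assumes that the limits are inclusive and that the default maxsize of 10 applies when neither option is given.

```python
def change_allowed(change_size, cmpd_size, maxsize=None, ratio=None):
    """Decide whether a change passes the size filter.

    change_size and cmpd_size are heavy-atom counts.
    maxsize takes precedence over ratio when both are given.
    """
    default_maxsize = 10  # assumed fallback when no option is specified
    if maxsize is not None:
        return change_size <= maxsize
    if ratio is not None:
        # ratio = size of change / size of compound
        return change_size / float(cmpd_size) <= ratio
    return change_size <= default_maxsize
```

For example, a change of 3 heavy atoms in a 20 heavy atom compound has a ratio of 0.15, so it passes with -m 3 but not with -r 0.1.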
Example commands (with sample outputs):
Default settings:
python indexing.py <data/sample_fragmented.txt >data/sample_mmps_default.csv
Output symmetrically equivalent MMPs (ie forward and reverse transforms):
python indexing.py -s <data/sample_fragmented.txt >data/sample_mmps_sym.csv
Output MMPs where maximum size of change is 3 heavy atoms:
python indexing.py -m 3 <data/sample_fragmented.txt >data/sample_mmps_maxheavy.csv
Output MMPs where no more than 10% of the compound has changed:
python indexing.py -r 0.1 <data/sample_fragmented.txt >data/sample_mmps_maxratio.csv
Output symmetrically equivalent MMPs where maximum size of change is 3 heavy atoms:
python indexing.py -s -m 3 <data/sample_fragmented.txt >data/sample_mmps_sym_maxheavy.csv
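The MMP output can be post-processed with standard tools. The sketch below counts how often each transformation occurs, using the column order given in the format description above. It assumes the file has no header row; the field names and the function name are hypothetical.

```python
import csv
from collections import Counter

# Column order as given in the output format description
MMP_FIELDS = [
    "smiles_left",
    "smiles_right",
    "id_left",
    "id_right",
    "smirks",
    "context",
]


def count_transforms(lines):
    """Count occurrences of each SMIRKS in an iterable of MMP output lines."""
    reader = csv.DictReader(lines, fieldnames=MMP_FIELDS)
    return Counter(row["smirks"] for row in reader)
```

An open file handle (for example on data/sample_mmps_default.csv) can be passed directly as the lines argument, and most_common() on the result lists the most frequent transformations first.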
SMIRKS canonicalisation
-----------------------
The MMP identification script uses a SMIRKS canonicalisation routine so the same change always has the same output SMIRKS.
To canonicalise a SMIRKS (generated elsewhere) so that it is in the same format as the MMP identification scripts produce, use the command:
python cansmirk.py <SMIRKS_FILE >SMIRKS_OUTPUT_FILE
Example command:
python cansmirk.py <data/sample_smirks.txt >data/sample_cansmirks.txt
Format of SMIRKS_FILE: SMIRKS ID <space or comma separated>
See data/sample_smirks.txt for an example input file
Format of output: CANONICALISED_SMIRKS ID
See data/sample_cansmirks.txt for an example output file
Note: The script will NOT deal with SMARTS characters, so the SMIRKS must contain valid SMILES for left and right hand sides.
The algorithm used to canonicalise SMIRKS is as follows:
1) Canonicalise the LHS.
2) For the LHS the 1st asterisk (attachment point) in the SMILES will have label 1, 2nd asterisk will have label 2 and so on
3) For the RHS, if you have a choice (ie. two attachment points are symmetrically equivalent), always put the label
with lower numerical value on the earlier attachment point in the canonicalised SMILES
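Step 2 of the algorithm above can be sketched with the standard library alone. The sketch assumes attachment points are written as [*:n] and that the LHS has already been canonicalised (step 1, which needs RDKit, is not shown). It also assumes the RHS labels are renumbered with the same mapping so the transform is preserved. The symmetry handling of step 3 is not covered, and the function name is hypothetical.

```python
import re

# Matches a labelled attachment point such as [*:1]
LABEL = re.compile(r"\[\*:(\d+)\]")


def relabel_by_lhs_order(lhs, rhs):
    """Renumber attachment points by order of appearance in the LHS.

    The first asterisk in the LHS gets label 1, the second label 2, and
    so on; the same old->new mapping is applied to the RHS.
    """
    mapping = {}
    for old in LABEL.findall(lhs):
        mapping.setdefault(old, str(len(mapping) + 1))

    def renumber(match):
        # raises KeyError if the RHS uses a label absent from the LHS
        return "[*:%s]" % mapping[match.group(1)]

    return LABEL.sub(renumber, lhs), LABEL.sub(renumber, rhs)
```

With this mapping, an LHS of [*:2]CC[*:1] becomes [*:1]CC[*:2], and the labels on the RHS are swapped in the same way.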
Applying SMIRKS to input compounds
----------------------------------
If you want to apply a SMIRKS/transform generated by the programs above to a compound, use the mol_transform.py program
with the following command:
python mol_transform.py -f TRANSFORM_FILE <SMILES_FILE >OUTPUT_FILE
If you want to use a set of SMIRKS generated elsewhere, please make sure they have been canonicalised
using the cansmirk.py program.
Example command:
python mol_transform.py -f data/sample_smirks_mol_trans.txt <data/sample_smiles_mol_trans.smi >data/sample_mol_trans_output.txt
Format of SMILES_FILE: SMILES ID <space or comma separated>
See data/sample_smiles_mol_trans.smi for an example input file
Format of transform file: transform <one per line>
See data/sample_smirks_mol_trans.txt for an example transform file
Format of output: SMILES,ID,Transform,Modified_SMILES
See data/sample_mol_trans_output.txt for an example output file
Generating and searching an MMP database
----------------------------------------
The pair index used in the MMP identification algorithm can be written to a relational database. For the indexing.py
program described above, the index is written to memory and the program will identify all the MMPs in the dataset.
However, if you just want to ask a (series of) specific questions on a dataset, a relational database containing the
pair index (MMP db) can be used to do that.
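The idea of a pair index can be illustrated with a simplified, in-memory sketch: fragments are grouped by their context, and any two entries that share a context but differ in the variable part form a pair. This is only an illustration of the indexing idea from the publication cited above, not the code in indexing.py or the schema of the MMP db. The input tuples and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_pairs(fragments):
    """Find pairs from (mol_id, context, variable_part) tuples.

    Returns a list of (id_a, id_b, transform, context) tuples.
    """
    # Build the index: context -> list of (mol_id, variable_part)
    index = defaultdict(list)
    for mol_id, context, variable in fragments:
        index[context].append((mol_id, variable))

    pairs = []
    for context, entries in index.items():
        # Every two entries under the same context are candidates
        for (id_a, var_a), (id_b, var_b) in combinations(entries, 2):
            if id_a != id_b and var_a != var_b:
                pairs.append((id_a, id_b, "%s>>%s" % (var_a, var_b), context))
    return pairs
```

Because lookups are by context, the same index can answer a single query (for example, all pairs for one compound) without enumerating every pair in the dataset, which is the use case the MMP db addresses.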
The program create_mmp_db.py will build a MMP db for a given dataset and the program search_mmp_db.py can be used to
search the MMP db. The types of searching that can be performed on the db are as follows:
1) Find all MMPs of an input/query compound to the compounds in the db
2) Find all MMPs in the db where the LHS of the transform matches an input substructure
3) Find all MMPs that match the input transform/SMIRKS
4) Find all MMPs in the db where the LHS of the transform matches an input SMARTS
5) Find all MMPs that match the LHS and RHS SMARTS of the input transform
The SMARTS searching utilises the DbCLI tools (http://code.google.com/p/rdkit/wiki/UsingTheDbCLI) that are part
of the RDKit distribution.
Generating the db
-----------------
To generate an MMP db use the following command:
python create_mmp_db.py <FRAGMENT_OUTPUT
The program takes a FRAGMENT_OUTPUT generated by the rfrag.py command (described above) as input.
This program has several options (see help from program below):
Usage: create_mmp_db.py [options]
Program to create an MMP db.
Options:
-h, --help show this help message and exit
-p PREFIX, --prefix=PREFIX
Prefix to use for the db file (and directory for
SMARTS index). DEFAULT=mmp
-m MAXSIZE, --maxsize=MAXSIZE
Maximum size of change (in heavy atoms) that is stored
in the database. DEFAULT=15.
Note: Any MMPs that involve a change greater than this
value will not be stored in the database and hence not
be identified in the searching.
-s, --smarts Build SMARTS db so can perform SMARTS searching
against db. Note: Will make the build process somewhat
slower.
Example commands:
Default Settings:
python create_mmp_db.py <data/sample_fragmented.txt
An SQLite3 db file called mmp.db will be created
Generate a db with the prefix "my_MMP_db" and SMARTS searching capability:
python create_mmp_db.py -p my_MMP_db -s <data/sample_fragmented.txt
An SQLite3 db file called my_MMP_db.db will be created, and the DbCLI files will be created in a directory called my_MMP_db_smarts
Generate a db with SMARTS searching capability and where only changes up to (and including) 10 heavy atoms are stored:
python create_mmp_db.py -m 10 -s <data/sample_fragmented.txt
Searching the db
----------------
To search the MMP db use the following command:
python search_mmp_db.py [options] <INPUT_FILE
This program has several options (see help from program below):
Options:
-h, --help show this help message and exit
-t TYPE, --type=TYPE Type of search required. Options are: mmp, subs,
trans, subs_smarts, trans_smarts
-m MAXSIZE, --maxsize=MAXSIZE
Maximum size of change (in heavy atoms) allowed in
matched molecular pairs identified. DEFAULT=10.
Note: This option overrides the ratio option if both
are specified.
-r RATIO, --ratio=RATIO
Only applicable with the mmp search type. Maximum
ratio of change allowed in matched molecular pairs
identified. The ratio is: size of change /
size of cmpd (in terms of heavy atoms) for the QUERY
MOLECULE. DEFAULT=0.3. Note: If this option is used
with the maxsize option, the maxsize option will be
used.
-p PREFIX, --prefix=PREFIX
Prefix for the db file. DEFAULT=mmp
The different search options are described below:
a) mmp: Find all MMPs of an input/query compound to the compounds in the db
b) subs: Find all MMPs in the db where the LHS of the transform matches an input
substructure. Make sure the attachment points are denoted by an asterisk and the
input substructure has been canonicalised (eg. [*]c1ccccc1). Note: Up to 3 attachment
points are allowed.
c) trans: Find all MMPs that match the input transform/SMIRKS. Make sure the input
SMIRKS has been canonicalised using the cansmirk.py program.
d) subs_smarts: Find all MMPs in the db where the LHS of the transform matches an
input SMARTS. The attachment points in the SMARTS can be denoted by [#0] (eg.
[#0]c1ccccc1).
e) trans_smarts: Find all MMPs that match the LHS and RHS SMARTS of the input transform.
The transform SMARTS are input as LHS_SMARTS>>RHS_SMARTS (eg.
[#0]c1ccccc1>>[#0]c1ccncc1). Note: This search can take a long time to run if a
very general SMARTS expression is used.
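When the searches are driven from another Python program, the command line can be assembled with a small helper such as the sketch below. It only uses the options listed in the help text above. The function name is hypothetical, and rejecting the ratio option for search types other than mmp is a choice made in this sketch, not documented behaviour of search_mmp_db.py.

```python
SEARCH_TYPES = ("mmp", "subs", "trans", "subs_smarts", "trans_smarts")


def build_search_command(search_type, maxsize=None, ratio=None, prefix=None):
    """Build the argument list for a search_mmp_db.py run."""
    if search_type not in SEARCH_TYPES:
        raise ValueError("unknown search type: %r" % (search_type,))
    if ratio is not None and search_type != "mmp":
        # the help text says ratio is only applicable to the mmp search type
        raise ValueError("the ratio option only applies to the mmp search type")

    cmd = ["python", "search_mmp_db.py", "-t", search_type]
    if maxsize is not None:
        cmd += ["-m", str(maxsize)]
    if ratio is not None:
        cmd += ["-r", str(ratio)]
    if prefix is not None:
        cmd += ["-p", prefix]
    return cmd
```

The returned list can be passed to subprocess.run together with open file handles for stdin and stdout, since the program reads its input from standard input and writes results to standard output.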
Example commands to search a db:
The db was created using command: python create_mmp_db.py -m 10 -s <data/sample_fragmented.txt
a) To carry out a mmp search:
python search_mmp_db.py -t mmp <data/sample_db_input_smi.txt >data/sample_db_search_smi_output.txt
Format of input file: SMILES ID <space or comma separated. The ID field is optional>
See data/sample_db_input_smi.txt for an example input file
Format of output: SMILES_QUERY,SMILES_OF_MMP,QUERY_ID,RETRIEVED_ID,CHANGED_SMILES,CONTEXT_SMILES
See data/sample_db_search_smi_output.txt for an example output file
b) To carry out a LHS transform substructure search:
python search_mmp_db.py -t subs <data/sample_db_input_subs.txt >data/sample_db_search_subs_output.txt
Format of input file: Substructure_SMILES ID <space or comma separated. The ID field is optional>
See data/sample_db_input_subs.txt for an example input file
Format of output: Input_substructure[,input_id],SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
See data/sample_db_search_subs_output.txt for an example output file
c) To carry out a transform search:
python search_mmp_db.py -t trans <data/sample_db_input_trans.txt >data/sample_db_search_trans_output.txt
Format of input file: SMIRKS ID <space or comma separated. The ID field is optional>
See data/sample_db_input_trans.txt for an example input file
Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
See data/sample_db_search_trans_output.txt for an example output file
d) To carry out a LHS transform substructure SMARTS search:
python search_mmp_db.py -t subs_smarts <data/sample_db_input_subs_smarts.txt >data/sample_db_search_subs_smart_output.txt
Format of input file: SMARTS ID <space or comma separated. The ID field is optional>
See data/sample_db_input_subs_smarts.txt for an example input file
Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
See data/sample_db_search_subs_smart_output.txt for an example output file
e) To carry out a transform SMARTS search (with max size change of 6 heavy atoms):
python search_mmp_db.py -t trans_smarts -m 6 <data/sample_db_input_trans_smarts.txt >data/sample_db_search_trans_smarts_output.txt
Format of input file: SMARTS ID <space or comma separated. The ID field is optional>
See data/sample_db_input_trans_smarts.txt for an example input file
Format of output: input_transform_SMARTS,[input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
See data/sample_db_search_trans_smarts_output.txt for an example output file
In the event you use the scripts for publication please reference the original publication:
Hussain, J., & Rea, C. (2010). "Computationally efficient algorithm to identify matched molecular pairs (MMPs)
in large data sets." Journal of chemical information and modeling, 50(3), 339-348.
http://dx.doi.org/10.1021/ci900450m