.. _couchdb:
CouchDB
*******
.. currentmodule:: apsw
`CouchDB <http://couchdb.apache.org>`__ is an increasingly popular
document-oriented database, also known as a schema-less database. It
is web based, using `HTTP
<http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol>`__ for
access and `JSON <http://json.org/>`__ for data representation.
The APSW source distribution now includes a :ref:`virtual table
implementation <virtualtables>` that lets you access CouchDB databases
from SQLite, including both read and write access.
Some suggested uses:
* Loading data from SQLite into CouchDB and vice versa. For example
you can import CSV data into SQLite and then upload it to CouchDB.
* Being able to use SQLite based tools for working on CouchDB data.
* Being able to do `joins
<http://en.wikipedia.org/wiki/Join_%28SQL%29>`__ between data in
SQLite and CouchDB.
* Using the SQLite :ref:`FTS full text search extension <ext-fts3>`
for data in CouchDB.
* Using the SQLite :ref:`RTree extension <ext-rtree>` to do spatial
queries.
Getting it
==========
You can find the code in the :ref:`source distribution
<source_and_binaries>` named :file:`apswcouchdb.py`, or you can get a
copy directly from `source control
<http://code.google.com/p/apsw/source/browse/tools/apswcouchdb.py>`__
(choose "View Raw File").
You will need to have the `Python couchdb module
<http://code.google.com/p/couchdb-python/>`__ installed as well as its
prerequisites.
Usage
=====
To use with your code, place the :file:`apswcouchdb.py` file anywhere
you can import it. To use it with the APSW :ref:`shell`, use the
``.read`` command specifying the filename. The virtual table will be
automatically installed via the :attr:`connection_hooks` mechanism and
registered with the shell if appropriate.
A virtual table maps to a CouchDB database. The database needs to
exist already. Use the following SQL to create the virtual table::
    create virtual table mytable using couchdb('http://localhost:5984', 'dbname', col1, col2, col3);
From that point on you can do regular SQL operations against *mytable*
where each row will correspond to a document from *dbname*. When you
drop the table it does not delete the CouchDB database. If you need
usernames and passwords then specify them as part of the url::
    http://username:password@localhost:5984
Examples
========
Importing CSV data
------------------
We want to import some CSV data to a CouchDB database. The first step
is to import it into a temporary SQLite table. It could be imported
directly to CouchDB but then we wouldn't have the ability to specify
`column affinity <http://www.sqlite.org/datatype3.html>`__ and all
data would end up as strings in CouchDB. I am using a real estate CSV
file in this example against the :ref:`SQLite shell <shell>`::
    -- Specify affinities in temporary table so numbers from
    -- the CSV end up as numbers and not strings
    create temporary table refixup(street char, city char,
        zip char, state char, beds int, baths int, sqft real,
        type char, sale_date char, price int, latitude real,
        longitude real);

    -- Do the actual import
    .mode csv
    .import realestatetransactions.csv refixup

    -- Create the CouchDB virtual table with the same column names
    create virtual table realestate using couchdb('http://localhost:5984',
        realestate, street, city, zip, state, beds, baths,
        sqft, type, sale_date, price, latitude, longitude);

    -- Copy the data from the temporary table to CouchDB
    insert into realestate select * from refixup;

    -- No longer need the temporary table
    drop table refixup;
Use ``.help import`` for more hints on importing data.
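The affinity point above can be demonstrated with the standard library
``sqlite3`` module (the table and values here are made up for
illustration): a column declared ``int`` converts numeric-looking
strings, which is exactly why the CSV data is routed through a typed
temporary table first.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# "price int" gives the column INTEGER affinity, so the string a CSV
# import would supply is stored as a number, not as text.
con.execute("create temporary table refixup(street char, price int)")
con.execute("insert into refixup values(?, ?)", ("12 Oak St", "249000"))

value, storage = con.execute(
    "select price, typeof(price) from refixup").fetchone()
print(value, storage)  # 249000 integer
```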
Using FTS
---------
In this example we have CouchDB documents that are recipes. The
CouchDB _id field is the recipe name::
    -- Virtual table
    create virtual table recipes using couchdb('http://localhost:5984',
        recipes, _id, ingredients);

    -- FTS table
    create virtual table recipesearch using fts3(name, ingredients);

    -- Copy the data from CouchDB to FTS3
    insert into recipesearch(name, ingredients)
        select _id, ingredients from recipes;

    -- Which ones have these ingredients
    select name as _id from recipesearch where ingredients
        MATCH 'onions cheese';
Implementation Notes
====================
Automatic column names
If you don't want to manually specify all columns then you can use
'+' as a column and the first 1,000 documents in the database will
be examined to see what keys they have. Any that you have not
already specified as a column will be added to the column list in
alphabetical order. Only keys without a leading underscore are
considered::
    create virtual table mytable using couchdb('http://localhost:5984', 'dbname', '+');
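A minimal sketch of that scan (the function name and sample documents
are hypothetical, not the module's actual code): collect keys without
a leading underscore from the first 1,000 documents and append the
undeclared ones in alphabetical order.

```python
def derive_columns(docs, declared, sample=1000):
    # Gather keys from the sampled documents, skipping _id, _rev etc.
    keys = set()
    for doc in docs[:sample]:
        keys.update(k for k in doc if not k.startswith("_"))
    # Declared columns keep their order; new ones are sorted.
    return list(declared) + sorted(keys - set(declared))

docs = [{"_id": "1", "city": "Rome", "zip": "00100"},
        {"_id": "2", "city": "Oslo", "beds": 3}]
print(derive_columns(docs, ["city"]))  # ['city', 'beds', 'zip']
```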
Document Ids and Rowids
CouchDB uses the document id (key _id) as the unique identifier.
SQLite uses a 64 bit integer rowid. In order to map the two a
`temporary table <http://www.sqlite.org/lang_createtable.html>`__ is
used behind the scenes. The revision (_rev) is also stored. SQLite
stores temporary tables separately and discards them when the
corresponding database is closed.
By default the temporary tables are stored in a file. You can use a
`pragma <http://www.sqlite.org/pragma.html#pragma_temp_store>`__ to
change that. For example ``pragma temp_store=memory`` will use
memory instead. My 200,000 document test database resulted in a
temporary mapping table of 50MB if I accessed all rows/documents.
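The mapping can be pictured with the standard library ``sqlite3``
module (the table and column names here are illustrative, not the
ones the module actually uses): a temporary table ties SQLite's 64
bit integer rowid to the document's _id and _rev.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create temporary table docmap(docid text unique, rev text)")

# Each document gets a row; SQLite assigns the integer rowid itself.
con.execute("insert into docmap values(?, ?)", ("abc123", "1-deadbeef"))

rowid, = con.execute(
    "select rowid from docmap where docid=?", ("abc123",)).fetchone()
docid, rev = con.execute(
    "select docid, rev from docmap where rowid=?", (rowid,)).fetchone()
print(rowid, docid, rev)  # 1 abc123 1-deadbeef
```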
Scalability
It is intended that you can use the virtual table with large
databases. For example development and profiling were done with a
200,000 document database using over 2GB of storage.
This means that behind the scenes the CouchDB `bulk API
<http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API>`__ is used.
(Doing an HTTP request per row/document read or written would be far
too slow.) Pending updates (adds, changes, deletions) are batched
up and sent as a group which means there is a delay between when
your SQL executes and the CouchDB server taking the appropriate
action. This can also lead to a delay in error reporting.
By default documents are read and written in batches of 5,000. The
larger the number the fewer HTTP requests are made but more memory
is consumed. You can change this number like this::
    select couchdb_config('read-batch', 2000);
    select couchdb_config('write-batch', 10000);
If you are debugging code then setting them to 1 will cause an
immediate HTTP request per row/document read or written rather than
waiting till the batch is full or for a transaction boundary.
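The write side of that batching can be sketched like this
(``BatchWriter`` and ``send_bulk`` are hypothetical stand-ins, with
``send_bulk`` playing the role of one bulk API HTTP request):

```python
class BatchWriter:
    """Accumulate pending writes and flush them as one bulk request."""

    def __init__(self, send_bulk, batch_size=5000):
        self.send_bulk = send_bulk
        self.batch_size = batch_size
        self.pending = []

    def write(self, doc):
        self.pending.append(doc)
        # One HTTP request per batch_size documents, not one per document.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # The real module also flushes on transaction boundaries.
        if self.pending:
            self.send_bulk(self.pending)
            self.pending = []

requests = []
writer = BatchWriter(requests.append, batch_size=2)
for i in range(3):
    writer.write({"_id": str(i)})
writer.flush()
print(len(requests))  # 2: one full batch, then the flushed remainder
```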
Updates
SQLite (and SQL) define a fixed list of columns for each row while
CouchDB can have zero or more keys per document. In normal use of
this module you would have listed a subset of the possible keys as
columns for SQLite. CouchDB does an update by supplying a complete
new document. If an update specified only the keys/columns declared
at the SQLite level then other keys would be lost. Consequently on
each update this module has to obtain the document and update the
fields specified via SQL. Obtaining these documents as needed one
at a time slows down updates, but does prevent them from losing any
fields not known to SQL. If you have specified all the fields in
SQL then you can turn off this behaviour, saving one HTTP request per
document updated::
    select couchdb_config('deep-update', 0);
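The read-modify-write just described amounts to overlaying the
SQL-visible columns onto the fetched document (the document contents
here are invented for illustration):

```python
# Document as fetched from CouchDB; agent_notes was never declared as
# a SQL column, so writing only the SQL columns would drop it.
stored = {"_id": "doc1", "_rev": "1-abc",
          "street": "12 Oak St", "agent_notes": "keep me"}

# The columns actually assigned in the SQL update statement.
sql_changes = {"street": "14 Oak St"}

merged = dict(stored)        # start from the full document
merged.update(sql_changes)   # overlay just the changed columns
print(merged["street"], "/", merged["agent_notes"])
```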
None/null/undefined
In SQL null means a value is not present. No two nulls are equal to
each other plus `other quirks <http://www.sqlite.org/nulls.html>`__.
In Python None is a value more like a non-type specific zero
although it is a singleton. Javascript has both undefined with a
SQL null like meaning and null with Python None like meaning. JSON
can represent null but not undefined.
Whenever a key is not present in a document but a value is required
by SQLite then null is returned. When creating or updating
documents, a value of null from SQLite is treated as meaning you do
not want the key in the document. (Reading it back will still get
you a null in SQLite even though technically the document value is
undefined.) This means that you cannot use this module to set a
value in a CouchDB document to null - the key will not be present.
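The SQL side of those null rules is easy to see with the standard
library ``sqlite3`` module: null never compares equal to anything,
including another null, and must be tested with ``is null``.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (returned to Python as None), not true.
equal, = con.execute("select null = null").fetchone()

# "is null" is the way to actually test for it.
is_null, = con.execute("select null is null").fetchone()

print(equal, is_null)  # None 1
```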
Transactions
SQLite supports transactions with the usual SQL boundaries. CouchDB
operations are atomic, but CouchDB does not support transactions.
Any outstanding batched updates are also flushed on SQLite
transaction boundaries as well as other points such as when a cursor
starts from the beginning again. This means that a rollback can only
discard unflushed pending updates but not undo earlier updates
within the SQL transaction.
Types
Communication between this module and CouchDB uses JSON over HTTP.
All the JSON types map to Python fundamental types, but only to a
subset of SQLite types.
+------------+----------------+--------------+
| JSON       | Python         | SQLite       |
+============+================+==============+
| Float      | double         | Real         |
+------------+----------------+--------------+
| Integer    | int (unlimited)| int (max 64  |
| (Actually  |                | bit)         |
| stored as  |                |              |
| a float)   |                |              |
+------------+----------------+--------------+
| String     | unicode        | char         |
+------------+----------------+--------------+
| null       | None           | null         |
+------------+----------------+--------------+
| No         | buffer / bytes | blob         |
| equivalent |                |              |
+------------+----------------+--------------+
| List       | list           | No           |
|            |                | equivalent   |
+------------+----------------+--------------+
| Object     | dict           | No           |
|            |                | equivalent   |
+------------+----------------+--------------+
Values in JSON that have no SQLite equivalent such as a list are
`pickled <http://docs.python.org/library/pickle>`__ and supplied to
SQLite as a blob. Similarly any blob supplied to SQLite intended
for CouchDB must be pickled Python data. Use ``-1`` as the pickle
version (selects binary encoding).
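For example, a list of ingredients would travel as a pickled blob;
pickle version ``-1`` selects the newest binary protocol. The table
here is a throwaway ``sqlite3`` stand-in for the virtual table.

```python
import pickle
import sqlite3

ingredients = ["onions", "cheese"]
blob = pickle.dumps(ingredients, -1)   # -1 = highest (binary) protocol

con = sqlite3.connect(":memory:")
con.execute("create temporary table t(v blob)")
con.execute("insert into t values(?)", (blob,))

stored, = con.execute("select v from t").fetchone()
print(pickle.loads(stored))  # ['onions', 'cheese']
```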
Note that this module has no affinity rules. Whatever type is
supplied at the SQL level is then sent as the same type to CouchDB.
You cannot specify types when creating the virtual table. If you
need affinity then use an intermediary temporary table as the
example in the next section shows.
.. note::
Although JSON has separate Integer and Float types, Javascript
itself does not and stores everything as a floating point number
which has about 15 digits of precision. Python, Erlang and CouchDB
support arbitrarily large numbers, but data that passes through a
Javascript view server will lose precision. For example
9223372036854775807 will come back as 9223372036854776000
(different last 4 digits).
For Python 2 note that Pickle encodes strings and unicode strings
differently even when they have same ASCII contents. If you are
trying to do an equality check then ensure all strings including
dictionary keys are unicode before pickling.
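The precision loss is ordinary IEEE double rounding and can be
reproduced directly in Python:

```python
big = 9223372036854775807      # 2**63 - 1, fine as JSON and in Python

as_double = float(big)         # what a Javascript number becomes
assert int(as_double) != big   # the low digits are gone

# The nearest representable double is 2**63 exactly.
assert int(as_double) == 9223372036854775808
```

Javascript prints that double as 9223372036854776000 and Python as
9223372036854775808; both are renderings of the same value.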
Expressions
SQL is accelerated by using indices. These are
precomputed/presorted views of the data and used when evaluating
queries like ``select * from items where price > 74.99 and
quantity<=10 and customer='Acme Widgets'`` in order to avoid
visiting every row in the table.
If you have constraints like the above then this module uses a
CouchDB `temporary view
<http://wiki.apache.org/couchdb/HTTP_view_API>`__ so that CouchDB
chooses an appropriate subset of documents rather than having to
transfer all of them and have SQLite do the filtering.
With a large number of documents the view can take quite a while for
CouchDB to calculate, but in most cases that is still quicker than
transferring all the data for SQLite to filter. Because the views
are temporary they will eventually be discarded by CouchDB.
This also means that Javascript's rules are used for evaluation in
circumstances such as comparing strings to integers. However SQL
rules are used for nulls - they are not equal to each other or any
other value.
SQLite only supports one index per query. If all the constraints
are joined by **and** then this module can tell SQLite it has one
index covering all of them. If you use **or** then the index/view
can only be used for one side of the **or** expression with SQLite
having to evaluate the other.
You can get SQLite to do the row/document filtering (which means
retrieving all documents) like this::
    select couchdb_config('server-eval', 0);
Configuration summary
You can change behaviour of the module by using the
``couchdb_config`` SQL function. If you call with one argument then
it returns the current value and with two sets the value. The
various options are described in more detail in the relevant
sections above.
+-----------------+---------+-------------------------------------------+
| Option          | Default | Description                               |
+=================+=========+===========================================+
| read-batch      | 5,000   | How many documents are retrieved at a time|
|                 |         | from the server                           |
+-----------------+---------+-------------------------------------------+
| write-batch     | 5,000   | How many documents being created/changed  |
|                 |         | are saved up before sending in one request|
|                 |         | to the bulk API on the server             |
+-----------------+---------+-------------------------------------------+
| deep-update     | 1 (True)| If a change is made to a row/document then|
|                 |         | the original is retrieved from the server |
|                 |         | (one at a time) so that the keys not      |
|                 |         | specified as SQL level columns aren't lost|
+-----------------+---------+-------------------------------------------+
| server-eval     | 1 (True)| SQL column expressions are evaluated in   |
|                 |         | the server (by claiming there is an index)|
|                 |         | rather than downloading all documents     |
|                 |         | and having SQLite do the evaluation.      |
+-----------------+---------+-------------------------------------------+