A typical system will consist of three server computers:
a document repository (Documentum)
a crawler (Document to Database)
a database server (for instance Oracle with the JChem cartridge)
The document repository is assumed to be already installed and working. It is not needed when indexing documents from a filesystem.
The database server should also be installed and running.
This documentation covers the installation and setup of the crawler server. It is recommended to dedicate a Linux machine to this task.
Download d2db.zip
Log on to the crawler machine as the desired user.
unzip d2db.zip
cd d2db
cp -a conf.sample conf
You are now ready to start the configuration.
The conf directory contains all configuration files. You need to edit at least the d2db.conf file, which contains an example configuration and comments for each option.
If you are using Documentum as a document repository, you also need to edit documentum/dfc.properties to configure access to the Documentum server (host, username and password).
Document to Database command-line actions all have the following form:
./d2db <command> <parameters...>
At this point you should be ready to run the first d2db command to initialize the database. This will create the necessary tables. Note that if you created the database schema yourself and only want d2db to populate it, you should have set the d2db.fixedSchema = true option (in the configuration file schema.conf) and can skip this section.
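A script can check whether this step should be skipped. The following is a minimal sketch, assuming schema.conf uses the same "key = value" line syntax as the option shown above. The function name schema_is_fixed is hypothetical and not part of d2db.

```shell
#!/bin/sh
# Hypothetical helper, not part of d2db: succeeds if the given file
# contains an uncommented line that sets d2db.fixedSchema to true.
# Assumes "key = value" lines, with optional spaces around "=".
schema_is_fixed() {
    grep -Eq '^[[:space:]]*d2db\.fixedSchema[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Example use from the d2db directory:
#   if schema_is_fixed conf/schema.conf; then
#       echo "fixed schema: skipping table creation"
#   else
#       ./d2db create
#   fi
```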
To create the database tables, run this command once:
./d2db create
Any time after using d2db create, you can use the stats command to query some basic statistics about the number of documents, chemical structures and hits in the d2db database. This is also a good way to check that the database has been properly created and is accessible.
For instance, running it just after create should give this output:
$ ./d2db stats
[logging information]
Documents : 0
Unique structures : 0
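Because stats prints one counter per line, its output can be used in a scripted health check. The following is a minimal sketch, assuming the "Label : number" format shown in the sample output above and that the counters are written to standard output. The helper name stat_value is hypothetical and not part of d2db.

```shell
#!/bin/sh
# Hypothetical helper, not part of d2db: reads stats output on stdin and
# prints the number that follows the label given as $1.
# Assumes "Label : number" lines, as in the sample output.
stat_value() {
    awk -F':' -v label="$1" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim spaces around the label
            if (key == label) {
                val = $2
                gsub(/[ \t]/, "", val)          # strip spaces around the number
                print val
                exit
            }
        }'
}

# Example use from the d2db directory:
#   docs=$(./d2db stats | stat_value "Documents")
#   echo "indexed documents: $docs"
```

Lines that do not match the requested label, such as logging output, are ignored.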
The index command tells d2db which documents to index. For indexing a folder in the Documentum repository, use:
./d2db index documentum:<folder>
For indexing a directory on a local or shared filesystem, use:
./d2db index <folder>
Note that d2db will automatically detect documents that have already been indexed in a previous run and have not been modified, in which case it will skip over them quickly. This means that the index command can be used both once for an initial indexing of a set of documents, and also later to update the index (add new documents, remove deleted documents, refresh modified documents). You can use the reindex command to force reindexing all documents even when they have not changed.
Once indexing has been done successfully, you might want to set up a cron job to run the index command regularly.
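One possible way to do this is a small wrapper script called from cron. The following is a sketch only: the installation path /opt/d2db, the script and log file names, the schedule, and the function name run_index_locked are all assumptions, not part of d2db. It uses mkdir as a simple lock so that a long-running index job is not overlapped by the next scheduled run.

```shell
#!/bin/sh
# Hypothetical cron wrapper, not shipped with d2db.
# Usage: run_index_locked <d2db directory> <index target>
# mkdir is atomic, so it serves as a lock that prevents overlapping runs.
run_index_locked() {
    d2db_dir=$1
    target=$2
    lock_dir="$d2db_dir/.index.lock"

    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous index run still in progress, skipping" >&2
        return 0
    fi

    # Run in a subshell so that the cd does not affect the caller.
    ( cd "$d2db_dir" && ./d2db index "$target" )
    status=$?

    rmdir "$lock_dir"
    return "$status"
}

# Example crontab entry (assumed path and schedule: every night at 02:00),
# where index-cron.sh defines this function and then calls
#   run_index_locked /opt/d2db "documentum:<folder>"
#
#   0 2 * * * /opt/d2db/index-cron.sh >> /opt/d2db/index-cron.log 2>&1
```

If a run is killed before it removes the lock directory, the directory has to be deleted by hand before the next run can start.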