Document to Structure Migration Guide

21.4: DocumentExtractor has been removed

In version 21.4, the chemaxon.naming.DocumentExtractor class has been removed. This guide describes how to migrate to its replacements, MolImporter and DocumentToStructure.

Creating an instance

Instead of calling the constructors of DocumentExtractor or the readPDF method, use one of the following, depending on where the input comes from:

If the input is stored in a String, call the process method of DocumentToStructure to receive a MolImporter instance.

String text = ...; // Content of the input document.
String params = ...; // D2S options. Optional parameter.
try (MolImporter importer = DocumentToStructure.process(text, params)) {
   // ...
}

If the input is an external file, pass a String (file name), File or InputStream object to the constructor of MolImporter.

File file = ...;
String format = ...; // D2S format and options. Optional parameter.
String encoding = ...; // Character encoding. Optional parameter.
try (MolImporter importer = new MolImporter(file, format, encoding)) {
   // ...
}

When using the constructor of MolImporter, a format that is specified explicitly must be d2s, or d2s: followed by the required format options. If the format is omitted entirely, it is detected automatically from the type of the file.

The constructors of DocumentExtractor that accepted a URL or URLConnection parameter have no counterpart in MolImporter or DocumentToStructure. In these cases, the input must be converted to one of the supported input types.
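One way to do that conversion is to open the URL or URLConnection as an InputStream using only the standard library. The following is a minimal sketch; the UrlInput class and its open methods are hypothetical names introduced for illustration and are not part of the ChemAxon API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical helper (not part of the ChemAxon API). */
final class UrlInput {
    private UrlInput() {}

    /** Opens the content behind a URL as an InputStream. */
    static InputStream open(URL url) throws IOException {
        return url.openStream();
    }

    /** Opens the content behind an existing URLConnection as an InputStream. */
    static InputStream open(URLConnection connection) throws IOException {
        return connection.getInputStream();
    }
}

// Possible usage, assuming the InputStream constructor form described above:
//
// try (InputStream in = UrlInput.open(url);
//      MolImporter importer = new MolImporter(in, "d2s")) {
//     // ...
// }
```

Declaring the stream in the same try-with-resources statement as the importer ensures that it is closed even if creating the importer fails.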

Options

Instead of the configuration methods of DocumentExtractor, MolImporter has format options that can be passed at creation time, separated by commas.

setCasNumberLookup(boolean value) → +cas or -cas
acceptElements(boolean on) → +elements or -elements
acceptIons(boolean on) → +ions or -ions
acceptGroups(boolean on) → +groups or -groups
acceptGenericNames(boolean on) → +vernacular or -vernacular
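Code that previously called these setters with boolean values may find it convenient to assemble the format string from the same booleans. The sketch below assumes the d2s: prefix and the comma-separated option syntax described in this guide; the D2sFormat class and its build method are hypothetical and not part of the ChemAxon API.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not part of the ChemAxon API). */
final class D2sFormat {
    private D2sFormat() {}

    /** Builds a format string such as "d2s:+cas,-elements,...". */
    static String build(boolean cas, boolean elements, boolean ions,
                        boolean groups, boolean vernacular) {
        // "d2s:" is the prefix; options are joined with commas.
        StringJoiner options = new StringJoiner(",", "d2s:", "");
        options.add(flag(cas, "cas"));
        options.add(flag(elements, "elements"));
        options.add(flag(ions, "ions"));
        options.add(flag(groups, "groups"));
        options.add(flag(vernacular, "vernacular"));
        return options.toString();
    }

    /** Maps a former boolean setter value to its +name or -name option. */
    private static String flag(boolean on, String name) {
        return (on ? "+" : "-") + name;
    }
}
```

The resulting string can then be passed as the format argument when the importer is created. Spelling out every option is a design choice of this sketch; options left out of the string presumably keep their default values.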

Processing

The processPlainText() and processHTML() methods of DocumentExtractor have no direct counterpart on MolImporter, as the results of MolImporter can be read immediately, and the content type is automatically detected.

The ProgressListener support of DocumentExtractor has been removed and has no alternative in MolImporter.

Reading results

To collect the results in a list, similarly to getHits():

try (MolImporter importer = new MolImporter(file)) {
    List<Molecule> molecules = importer.getMolStream()
            .collect(Collectors.toList());

    // ...
}

The returned Molecules are the same objects that were previously stored in the structure field of the returned Hits. The information previously held in the other fields of a Hit is now stored as properties of the Molecule:

hit.text → (String) mol.getPropertyObject(DocumentToStructure.SOURCE_TEXT)
hit.position → (Integer) mol.getPropertyObject(DocumentToStructure.CHARACTER)
hit.getPageNumber() → (Integer) mol.getPropertyObject(DocumentToStructure.PAGE)
hit.getAllPositions() → no alternative
hit.getPositionsString() → no alternative

Note that any of these properties can be null if the information is not available for the current input type.
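Because of this, assigning a cast result directly to a primitive, for example int page = (Integer) value, throws a NullPointerException when the property is missing. The following sketch shows one defensive way to handle the values returned by getPropertyObject; the HitProperties class, its describe method and the placeholder strings are hypothetical and chosen only for illustration.

```java
/** Hypothetical helper (not part of the ChemAxon API). */
final class HitProperties {
    private HitProperties() {}

    /**
     * Formats the three property values read from a Molecule,
     * tolerating null (or unexpectedly typed) values.
     */
    static String describe(Object sourceText, Object position, Object page) {
        // instanceof is false for null, so missing values fall back to a placeholder.
        String text = sourceText instanceof String ? (String) sourceText : "(no source text)";
        String pos = position instanceof Integer ? position.toString() : "unknown";
        String pg = page instanceof Integer ? page.toString() : "unknown";
        return text + " [position=" + pos + ", page=" + pg + "]";
    }
}

// Possible usage with the property keys listed above:
//
// String summary = HitProperties.describe(
//         mol.getPropertyObject(DocumentToStructure.SOURCE_TEXT),
//         mol.getPropertyObject(DocumentToStructure.CHARACTER),
//         mol.getPropertyObject(DocumentToStructure.PAGE));
```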

Main method

The main method of DocumentExtractor has no direct alternative, but its results can be reproduced with MolImporter and DocumentToStructure.