Overview

This API performs extraction on an input PDF or TIF/TIFF file, or on a ZIP file containing single-page or multipage TIF/TIFF or PDF files. Extraction plugins are fetched from the batch class that matches the input batch class identifier, and extraction follows the plugin configurations and rules set up for that batch class.

If the document type is supplied as an input parameter, document classification is skipped and extraction is performed for the specified document type. Otherwise, both classification and extraction are performed on the input to generate the results.

The output of this web service API is a ZIP file containing a batch.xml file and a searchable PDF for each document classified during the web service execution.

Classification Types Supported by the API

  1. SearchClassification
  2. MultidimensionClassification
  3. ImageClassification
  4. KeywordClassification
  5. AutomaticClassification
  6. BarcodeClassification

Input Parameters

The web service API accepts the following input parameters:

  1. Input file: a PDF or TIF/TIFF file (single-page or multipage), or a ZIP file containing single-page or multipage TIF/TIFF or PDF files.
  2. batchClassIdentifier: String parameter holding the batch class identifier.
  3. docType (optional): if a docType is supplied, document classification is skipped; otherwise the document is classified before extraction.
  4. downloadHocr: if set to true, the response ZIP file contains the HOCR files in addition to batch.xml and the searchable PDF files.
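As a rough illustration of how the string parameters above combine, the sketch below collects them into an ordered map that a client could turn into multipart string parts. The class name ExtractRequestFields and the document type value "US-Invoice" are hypothetical. Omitting downloadHocr when it is false is an assumption based on the sample client, which does not send it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: gathers the string form fields for the request. */
final class ExtractRequestFields {

    static Map<String, String> build(String batchClassIdentifier,
                                     String docType,
                                     boolean downloadHocr) {
        if (batchClassIdentifier == null || batchClassIdentifier.isEmpty()) {
            throw new IllegalArgumentException("batchClassIdentifier is required");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("batchClassIdentifier", batchClassIdentifier);
        // docType is optional: leave it out so the service classifies the document.
        if (docType != null && !docType.isEmpty()) {
            fields.put("docType", docType);
        }
        // Assumption: the field is only sent when HOCR output is wanted.
        if (downloadHocr) {
            fields.put("downloadHocr", "true");
        }
        return fields;
    }

    public static void main(String[] args) {
        // Classification + extraction, no HOCR files in the response
        System.out.println(build("BC5", null, false));
        // Extraction only for a known (hypothetical) document type, with HOCR files
        System.out.println(build("BC5", "US-Invoice", true));
    }
}
```

Each map entry would then become one string part of the multipart request, alongside the file part.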

Output Parameters

A ZIP file containing the output batch.xml and the searchable PDF files generated for the classified documents. If downloadHocr is set to true in the input parameters, the output ZIP also contains the HOCR XML files generated for each input file.
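To check what a response actually contains, a client can list the entries of the returned ZIP. The following is a minimal sketch using only java.util.zip; the class name OutputZipInspector is hypothetical, and the exact entry names inside the archive are not specified here, so the check only looks at name suffixes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for inspecting the ZIP returned by the web service. */
final class OutputZipInspector {

    /** Returns the entry names found in the ZIP stream, in archive order. */
    static List<String> listEntries(InputStream zipStream) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** True if any entry name ends with the given suffix, ignoring case. */
    static boolean hasEntryEndingWith(List<String> names, String suffix) {
        String wanted = suffix.toLowerCase();
        for (String name : names) {
            if (name.toLowerCase().endsWith(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OutputZipInspector <path-to-output.zip>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            List<String> names = listEntries(in);
            names.forEach(System.out::println);
            System.out.println("batch.xml present: " + hasEntryEndingWith(names, "batch.xml"));
            System.out.println("PDF present: " + hasEntryEndingWith(names, ".pdf"));
        }
    }
}
```

Run it against the saved response, for example the Output.zip written by the sample client.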

Web Service URL

http://<HOSTNAME>:8080/dcma/rest/OcrClassifyExtractSearchablePDF

Example:

http://localhost:8080/dcma/rest/OcrClassifyExtractSearchablePDF
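The URL pattern above can be assembled from a host name and port rather than hard-coded. This is a small sketch using java.net.URI; the class name ServiceUrl is hypothetical, and the http scheme and port 8080 are taken from the example, so adjust them if the server is deployed differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper that builds the web service URL for a given server. */
final class ServiceUrl {

    static final String PATH = "/dcma/rest/OcrClassifyExtractSearchablePDF";

    static URI forHost(String host, int port) throws URISyntaxException {
        // scheme, userInfo, host, port, path, query, fragment
        return new URI("http", null, host, port, PATH, null, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints http://localhost:8080/dcma/rest/OcrClassifyExtractSearchablePDF
        System.out.println(forHost("localhost", 8080));
    }
}
```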

Checklist:

  1. Extraction is performed only if the Extraction module is configured for the batch class.
  2. Extraction is performed only by the plugins whose extraction switch is ON in the batch class configuration.

Sample client code using Apache Commons HttpClient (3.x API):

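import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.multipart.FilePart;
import org.apache.commons.httpclient.methods.multipart.MultipartRequestEntity;
import org.apache.commons.httpclient.methods.multipart.Part;
import org.apache.commons.httpclient.methods.multipart.StringPart;

private static void OcrClassifyExtractSearchablePDF() {
    // Set up the client with preemptive basic authentication
    HttpClient client = new HttpClient();
    Credentials defaultcreds = new UsernamePasswordCredentials("ephesoft", "demo");
    client.getState().setCredentials(new AuthScope("localhost", 8080), defaultcreds);
    client.getParams().setAuthenticationPreemptive(true);

    String url = "http://localhost:8080/dcma/rest/OcrClassifyExtractSearchablePDF";
    PostMethod mPost = new PostMethod(url);

    // Input file to process
    File file1 = new File("C:\\sample\\US-Invoice.tif");
    Part[] parts = new Part[2];
    try {
        parts[0] = new FilePart(file1.getName(), file1);
        // Parameter for batchClassIdentifier
        parts[1] = new StringPart("batchClassIdentifier", "BC5");

        MultipartRequestEntity entity = new MultipartRequestEntity(parts, mPost.getParams());
        mPost.setRequestEntity(entity);

        int statusCode = client.executeMethod(mPost);
        if (statusCode == 200) {
            System.out.println("Web service executed successfully.");
            // Save the response ZIP to disk
            InputStream in = mPost.getResponseBodyAsStream();
            File f = new File("C:\\sample\\Output.zip");
            FileOutputStream fos = new FileOutputStream(f);
            try {
                byte[] buf = new byte[1024];
                int len = in.read(buf);
                while (len > 0) {
                    fos.write(buf, 0, len);
                    len = in.read(buf);
                }
            } finally {
                fos.close();
            }
        } else if (statusCode == 403) {
            System.out.println("Invalid username/password.");
        } else {
            System.out.println(mPost.getResponseBodyAsString());
        }
    } catch (FileNotFoundException e) {
        System.out.println("File not found for processing.");
    } catch (HttpException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Always release the connection back to the connection manager
        mPost.releaseConnection();
    }
}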
