
DCL Reformer

Automated content structure and data reassembly

Bring New Life to Content in an Automated Way

Many organizations have extensive content buried in image-based PDFs (and even paper!) that cannot be digitized using standard OCR tools due to complex tables, charts, figures, foreign characters, chemical formulae, etc. DCL Reformer is an automated solution that transforms static content into structured formats, improving the content's utility for downstream systems.

 

DCL uses computer vision techniques to detect and remove content that OCRs poorly, retaining the text for high-accuracy OCR processing and conversion to unstructured text. Algorithms, NLP engines, and other techniques are then applied to that unstructured text, which comes from documents that vary widely in format and quality, to structure the data accurately.

The importance of having a plan and process for content QA

Assuring content quality is vitally important in today's business environment. Most content accumulates from various sources over time. Periodic review and analysis focus effort on identifying problems and improving content quality, consistency, and accuracy.


The first step in DCL Markup Check is content analysis, which quickly identifies areas for investigation. Organizations recognize immediate benefits including

Conversion alone does not improve content functionality

DCL Reformer amends and improves the structure, and hence the quality, of your content. Reformer is a fully automated workflow system that receives and classifies documents, OCRs TIFF images, extracts free-form text from textual and form-based documents, and generates XML in the target schema with image attachments.
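The workflow stages named above can be pictured as a chain of functions applied to each document. The following is only an illustrative sketch, not DCL's implementation: the `Document` type, the stage names, and the keyword-based classifier are all hypothetical, and the OCR stage is a stub standing in for a real engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical record carried through the workflow."""
    name: str
    raw_text: str = ""          # stands in for page images in this sketch
    doc_type: str = "unknown"   # set by classify()
    text: str = ""              # set by ocr()
    log: List[str] = field(default_factory=list)


def classify(doc: Document) -> Document:
    # Assumption: a trivial rule; a real classifier would inspect page layout.
    doc.doc_type = "form" if "Field:" in doc.raw_text else "textual"
    doc.log.append("classify")
    return doc


def ocr(doc: Document) -> Document:
    # Stub: a real system would run an OCR engine on TIFF images here.
    doc.text = doc.raw_text.strip()
    doc.log.append("ocr")
    return doc


def run_pipeline(
    doc: Document, stages: List[Callable[[Document], Document]]
) -> Document:
    """Apply each stage in order and return the processed document."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(Document("claim-001", " Field: Name  "), [classify, ocr])
print(result.doc_type, result.log)
```

Keeping each stage as a plain function with the same signature makes it easy to insert, reorder, or replace steps without touching the runner.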

Artifacts are Auto-Identified & Removed Prior to OCR

Artifact Removal

Computer vision techniques detect and remove content that OCRs poorly, retaining the text for high-accuracy OCR processing and conversion to unstructured text.

Post Extraction

DCL Reformer separates free-form or image-like content from true textual content. The removed artifacts are transformed into image files, and the remaining content is ready for OCR.
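As a rough illustration of this separation step, the sketch below splits the regions of one page into text destined for OCR and artifacts destined for image files. It assumes a prior layout-detection step has already labeled each region; the `Region` type, the set of artifact kinds, and the file-naming scheme are hypothetical and not taken from DCL Reformer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical page region produced by a layout-detection step."""
    kind: str                         # e.g. "text", "table", "figure", "formula"
    bbox: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


# Assumption: region kinds that a standard OCR pass handles poorly.
ARTIFACT_KINDS = {"table", "figure", "chart", "formula"}


def split_regions(regions: List[Region]) -> Tuple[List[Region], List[Region]]:
    """Return (text_regions, artifact_regions), preserving page order."""
    text = [r for r in regions if r.kind not in ARTIFACT_KINDS]
    artifacts = [r for r in regions if r.kind in ARTIFACT_KINDS]
    return text, artifacts


def artifact_filename(page: int, index: int) -> str:
    """Hypothetical naming scheme for the image file cut from each artifact."""
    return f"page{page:04d}_artifact{index:02d}.svg"


page_regions = [
    Region("text", (50, 40, 800, 200)),
    Region("table", (50, 220, 800, 600)),
    Region("text", (50, 620, 800, 900)),
    Region("formula", (100, 920, 700, 980)),
]
text_regions, artifact_regions = split_regions(page_regions)
files = [artifact_filename(1, i) for i, _ in enumerate(artifact_regions, start=1)]
print(len(text_regions), files)
```

Only the text regions would be passed on to OCR; each artifact keeps its bounding box so it can be cropped and saved under the generated file name.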

Removed Artifacts Transformed to Image Files
Artifacts Referenced as Image Files (SVG) in XML

Target XML

Complex algorithms, NLP engines, and other techniques are applied to analyze the unstructured text from documents. The automated system references the extracted artifacts as images in the resulting XML. 
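To make the idea of referencing artifacts from the XML concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`document`, `body`, `p`, `graphic`) and the `href` attribute are assumptions chosen for illustration; they do not represent DCL's target schema or any specific standard.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_xml(title: str, blocks: List[Tuple[str, str]]) -> str:
    """Build a small XML document from ordered (kind, value) blocks.

    kind "p" holds OCR text; kind "graphic" holds an artifact image path.
    Element and attribute names are illustrative, not a real target schema.
    """
    root = ET.Element("document")
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    for kind, value in blocks:
        if kind == "graphic":
            # Reference the extracted artifact by file name, not inline data.
            ET.SubElement(body, "graphic", {"href": value})
        else:
            ET.SubElement(body, "p").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_xml(
    "Sample record",
    [
        ("p", "Text recovered by OCR."),
        ("graphic", "page0001_artifact01.svg"),
    ],
)
print(xml_text)
```

Because the blocks are emitted in reading order, each image reference sits where the original table, figure, or formula appeared in the source page.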

Expertise Across all Formats

  • DITA

  • XML

  • HTML, HTML5

  • PubMed JATS

  • MathML

  • NLM XML

  • NISO STS

  • Bookshelf

  • EPUB/MOBI

  • S1000D

  • SGML

  • MS Word

  • and more

reform verb (1)

1a: to put or change into an improved form or condition

  b: to amend or improve by change of form or removal of faults or abuses

intransitive verb

: to become changed for the better

Markets Served

DCL Reformer is useful for any organization with a high volume of incoming content, or with legacy content that is complex and comes in many variants.

  • Non-Profit

  • Defense

  • Financial Institutions

  • Local Government

  • Pharma

  • Legal System

  • Museums

  • Manufacturers

  • Book Publishers

RELATED WHITE PAPER

This paper describes the implementation of DCL Reformer at the United States Patent and Trademark Office (USPTO). The system is processing millions of pages each month with turnaround measured in minutes.

USPTO White Paper