Content Conversion: Building the Structure for Digital Transformation

Marianne Calilhanna

Mar 4, 2022 · 7 min read

Reprinted with permission from "Content Transformation: Breathing New Life into Legacy Content" by Val Swisher and Regina Lynn Preciado from Content Rules

“Content conversion” is a loaded term. It means different things depending on your technical acumen and where you are in the content supply chain. Content conversion is transforming content from one format to another. By this definition, saving a Word file as a PDF file can be considered content conversion. For the purpose of this article, “content conversion” means transforming content from a print or digital format into XML.

Simply having information in a digital format, such as a Microsoft Word file or a PDF, is often not enough for today’s consumers and systems. For content to be discovered easily and used to deliver reflowable mobile experiences, it must be converted into a multidimensional XML format from which systems (and people) can extract pertinent information.

A Brief History Lesson in OCR

Before we jump into converting content into XML, let’s look at a little bit of history as it relates to optical character recognition.

Optical character recognition (OCR) is the process of converting images of typed, handwritten, or printed text into machine-encoded text. The genesis of OCR is older than you might think! The concept traces back to the late 1920s, when the Austrian engineer Gustav Tauschek obtained the first patent for his “Reading Machine.”

When it originated, OCR was a mechanical process. Tauschek’s Reading Machine was engineered with gears, mechanisms, and photodetectors that took printed text as input and printed that same text on paper as output.

Today, OCR is taken for granted. OCR services are now commoditized and software such as Adobe Acrobat can automatically apply OCR to a document. However, early OCR systems needed to be trained using images of each character and worked on one font at a time. For this reason, we like to say that at its core, OCR was a precursor to machine learning.

Not All OCR Is Equal

A surge of digitization and scanning took place in the mid-1990s. In the early 2000s, great strides were taken to achieve the dream of the “paperless office.” Preserving analog documents by transforming them into digital replicas allowed users to share documents online, archive them electronically, and get rid of some extra filing cabinets. These scans were often rendered as image-based PDFs.

Image-based PDF

An image-based PDF can be thought of as a photocopy. Just as a copy is a facsimile of an original document, so too is an image-based PDF. What appears as text to humans is just a series of pixels to computers and mobile phones. In an image-based PDF, text is no different from images or graphical elements that may also be in the document. You cannot search for or retrieve keywords and concepts within the content. It isn’t possible to reflow text for displays of different sizes. There is no capacity for assistive screen readers to read content aloud for users who may be visually impaired or have learning disabilities. The image-based PDF is an image. Nothing more.

Searchable PDF

A searchable PDF is the result of a PDF that has undergone an OCR process. OCR creates a text layer on top of the image, enabling machines to ‘see’ words and sentences like we do. This type of PDF is technically searchable, but searches rely on exact word matches. This means that hyphens, line breaks, special characters, and inconsistent spacing hinder search performance. Furthermore, while a machine may recognize the words Camus, Albert, and The Stranger in a searchable PDF, it won’t recognize that Camus is the author and “The Stranger” is a book title.
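The effect of exact-match searching on an OCR text layer can be illustrated with a minimal Python sketch. The sample string and the `normalize` helper are assumptions made up for this illustration; they are not the output of any particular OCR product.

```python
import re

# Hypothetical OCR text layer: uneven spacing and a word hyphenated
# across a line break, as often happens in scanned pages.
ocr_text = "Albert Camus wrote The  Stran-\nger."

def normalize(text: str) -> str:
    """Rejoin words split by end-of-line hyphens and collapse whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()

# An exact match on the raw layer misses the title ...
print("The Stranger" in ocr_text)             # False
# ... while the cleaned-up text contains it.
print("The Stranger" in normalize(ocr_text))  # True
```

Note that this kind of clean-up is itself a guess: it would also join a legitimately hyphenated word that happens to break at the end of a line, which is one reason a text layer alone is a fragile basis for search.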

The searchable PDF is still a replica of a printed page. Despite having a text layer, this text does not reflow, which creates a suboptimal experience when read on mobile phones and e-reading devices. Furthermore, in an effort to navigate content in sidebars, multi-column text, and footnotes, assistive technologies often read content to users in the wrong order.

True PDF

A true PDF (also called a “digitally created PDF” or “PDF Normal”) is typically created as an export from another desktop publishing format, such as saving a Microsoft Office file to PDF or saving another Adobe format as a PDF. True PDFs natively have electronic character designation for both the text and the corresponding metadata. This means that true PDFs provide a more robust search functionality. However, complex content elements, such as special characters, math, and chemical formulae, are still often “digitized” as images. Therefore, you cannot use complex content elements for filtered search and data analysis. Furthermore, like all PDFs, the final output is not reflowable. Mobile users are left to pinch, zoom, and swipe to read the content on smaller screens. Assistive screen readers still struggle with reading order unless special care is taken by the author to avoid these issues.

Content Structure: The Essential Foundation

Oftentimes, organizations have digitized information, but that information is locked in these “flat” PDF formats rather than being structured for filtered search, responsive design, component reuse, accessibility, and personalization. While PDFs often have document-level metadata, unlocking the multi-dimensional capabilities of the content requires identifying and tagging individual elements within the document at far more granular levels. Properly structuring data and content means applying layers of semantic intelligence to benefit the downstream delivery and consumption of that information. This means moving from PDF formats to XML formats.
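To make element-level tagging concrete, here is a small Python sketch that returns to the Camus example from the searchable PDF discussion. The element names (`book`, `title`, `author`, `given-name`, `surname`) are hypothetical and chosen only for illustration; they are not taken from any published XML standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tags say what each piece of text *is*,
# not just how it looks on the page.
record = """
<book>
  <title>The Stranger</title>
  <author>
    <given-name>Albert</given-name>
    <surname>Camus</surname>
  </author>
</book>
"""

root = ET.fromstring(record)
title = root.findtext("title")
surname = root.findtext("author/surname")

# A system can now tell the author apart from the title.
print(f"{surname} is the author of {title!r}")
```

Because the roles are encoded in the markup, a search system could filter on author or title directly instead of matching strings in a page image or text layer.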

Most readers don’t “see” the layers of semantic intelligence embedded in XML formats. However, they easily recognize the differences in discoverability, in reading text on different size screens, and in the personalized content that comes their way.

In addition to the advantages to content consumers, XML supports many mission-critical business drivers that are not available with flat formats like PDF:

Enabling the dream of multichannel publishing, making it possible and affordable to create content once and use it many times in print and multiple digital formats
Ensuring editorial quality and consistency
Harnessing content in new ways to hook it into digital products
Harvesting data to extract portions of content from across large collections to create new, targeted, and meaningful products
Breaking long-form information into nimble, re-usable chunks for dynamic, on-demand, personalized publishing
Fostering accessible and inclusive digital content experiences
Future-proofing content against changes in technologies and reading devices

XML provides an anchor for coping with a diverse and expanding number of content distribution channels and products. Organizations that embed XML into the content production process also discover that issues with editorial inconsistencies and version control become a thing of the past.
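The “create once, use many times” idea can be sketched in a few lines of Python. The `topic` markup and the two rendering functions are assumptions invented for this example; a production pipeline would more likely rely on an established standard and dedicated transformation tooling.

```python
import xml.etree.ElementTree as ET
from html import escape

# One hypothetical XML source ...
source = (
    "<topic>"
    "<title>Care instructions</title>"
    "<p>Wipe with a dry cloth.</p>"
    "<p>Do not immerse in water.</p>"
    "</topic>"
)
topic = ET.fromstring(source)

def to_html(topic: ET.Element) -> str:
    """Render the topic for a web channel."""
    parts = [f"<h1>{escape(topic.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in topic.findall("p")]
    return "\n".join(parts)

def to_plain_text(topic: ET.Element) -> str:
    """Render the same topic for a plain-text channel."""
    title = topic.findtext("title")
    lines = [title, "=" * len(title)]
    lines += [p.text for p in topic.findall("p")]
    return "\n".join(lines)

# ... and two outputs produced from it.
print(to_html(topic))
print(to_plain_text(topic))
```

The content is edited in one place, and each channel gets its own rendering, which is the property a flat PDF cannot offer.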

Conversion in the Twenty-First Century

Many organizations have extensive and valuable data and content buried in paper, PDF files, Word files, and other flat formats. In most cases, the complexity of the content and layouts, inconsistencies in content creation practices, and the sheer volume of documents make conversion extremely difficult. In the past, unraveling this mess, especially if content includes complex tables, charts, figures, foreign characters, chemical formulae, etc., has been considered impractical. However, advances in technology and in artificial intelligence (AI) have made intelligent data and automated content structure possible where it simply wasn’t feasible before.

For example, new levels of conversion accuracy are now possible with computer vision. Tracing its origins back to the 1960s, the field of computer vision is a precursor to AI image recognition. While there were developments in AI throughout the 20th century, and even earlier, computer vision has truly come of age just in the past 15 to 20 years due to the confluence of Big Data and a tremendous increase in processing power.

AI, and in particular machine learning (ML), requires a large amount of training data, and fast computers to process it all. Today we have both of these things.

Working with a Content Conversion Service Provider

When you speak with a conversion service provider about your conversion project, it’s important to go beyond the hype of the latest tech terms to ensure your conversion partner is implementing the right technology and truly providing a return on your investment. A good conversion service provider approaches a complex conversion project as a well-planned technology project and not a cheap transactional service. With the right tools and expertise, a quality content conversion service provider can unlock the value of your content and support new ways to use it. Implementing automated processes and AI-related technology during a conversion project improves the consistency of the resulting XML as well as the speed at which large-scale conversion projects can be completed.

Conversion: Not All XML is Created Equal

Many tools and software packages provide conversion to XML. In 2003, Microsoft introduced a simple XML-based format for Word documents called WordprocessingML, or WordML. Office Open XML, the format behind .docx files, was introduced in Microsoft Office 2007. However, “valid” XML is not always the same as “useful” XML. Today, specific industries and communities have developed a variety of XML models to further enhance the power of the format and facilitate its use in standard ways. The following list comprises just some of the common XML standards used across industries.

XML Standard	Industry	About
DITA v1.3	Manufacturing, Educational Publishing, Trade Publishing, Medical Devices, and more	The Darwin Information Typing Architecture (DITA) specification defines a set of document types for authoring and organizing topic-oriented information. It is often used to create technical documentation and training materials where hierarchy and content re-use are prevalent. https://www.oasis-open.org/standard/ditav1-3/
DocBook v5.0	Technical Documentation	DocBook is an alternative semantic markup language for technical documentation. It was originally intended for writing technical documents related to computer hardware and software, but it can be used for any other type of documentation. https://docbook.org
JATS v1.3	Scholarly Journal Publishing and STEM Publishing	The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online. It is a technical standard developed by the National Information Standards Organization and approved by the American National Standards Institute, first with the code Z39.96-2012; version 1.3 carries the code Z39.96-2021. https://www.niso.org/publications/z3996-2021-jats
S1000D	Aviation, Aerospace, & Defense	An international specification most often used to produce technical publications related to defense systems and civil aviation products. https://s1000d.org/
SPL	Pharmaceutical Labeling	Structured Product Labeling is a document markup standard approved by Health Level Seven (HL7) and adopted by the FDA in the United States as a mechanism for exchanging drug product information. https://www.fda.gov/industry/fda-resources-data-standards/structured-product-labeling-resources
TEI	Museums, Libraries, Humanities, and Social Science archives	The Text Encoding Initiative (TEI) provides guidelines to specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences, and linguistics to present historical texts for online research, teaching, and preservation. https://tei-c.org/

Valid XML vs Well-Formed Valid XML

Any content conversion service provider can provide valid XML, and automated XML is a standard output from many systems that manage content. However, when converting content there is an important distinction to understand between a valid XML file and a well-formed valid XML file. In strict XML terms, a well-formed file follows the syntax rules and a valid file also conforms to its DTD or schema; neither property guarantees that the structure is useful. You can have a file that parses and doesn’t technically have an XML error, but that also doesn’t achieve the goal of what the content is supposed to do. Too often, budget pressures push conversion teams to do whatever is easiest (i.e., cheapest) instead of doing the right thing: converting content to the proper version of XML and structuring it to the most useful level of granularity.
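The gap between XML that parses and XML that is useful can be shown with a short Python sketch. The three sample documents and their element names are hypothetical. The standard-library parser used here only checks syntax; it does not validate against a DTD or schema.

```python
import xml.etree.ElementTree as ET

# Three hypothetical conversion outputs for the same citation.
flat = "<doc><p>Camus, Albert. The Stranger.</p></doc>"
structured = (
    "<doc><citation>"
    "<author>Camus, Albert</author>"
    "<title>The Stranger</title>"
    "</citation></doc>"
)
broken = "<doc><p>Camus, Albert. The Stranger.</doc>"  # unclosed <p>

def parses(xml_text: str) -> bool:
    """Return True if the text is syntactically correct XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def authors(xml_text: str) -> list[str]:
    """Collect the text of every <author> element."""
    return [a.text for a in ET.fromstring(xml_text).iter("author")]

print(parses(broken))       # False: a real XML error
print(parses(flat))         # True: no XML error ...
print(authors(flat))        # []: ... but nothing a system can use
print(authors(structured))  # ['Camus, Albert']
```

Both `flat` and `structured` pass the parser, yet only the second lets a downstream system answer the question “who is the author?” without guessing.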

The Evolution of Content Conversion

Standards, technology, and systems continuously evolve. The goals of content interchange, intelligent search and discovery, and content reuse improve with each iteration of an industry standard. The systems that work with structured content continue to evolve. Both off-the-shelf tools and the proprietary tools and technology that service providers create continue to improve and work smarter.

Getting your conversion done correctly is one of the primary determining factors of the success of any large content transformation project. For this reason, it is critical to engage with consultants and trusted content conversion service providers at the beginning to understand what makes the most sense for your organization and content goals. Working with experts to plan your conversion helps inform your content strategy, reuse, technology choices, and delivery models. It also saves you from the massive content clean-up efforts that can often derail content transformation projects. Getting the conversion piece right is critical, so be sure to identify the right partners in the very early stages.