Case Studies
Delivering customer success since 1981
Technologies Used
- DOM
- XPath
- Regular Expressions
- Custom Algorithms
- Exif tools
- PDF tools
- Ghostscript
- XML
Project highlights
- Analysis of content structure across entire corpus
- Smooth migration to Silverchair platform
“Thank you for your flexibility, forbearance, and tenacity, especially as we worked through the dual feed and content synchronization challenges! Your efforts are greatly appreciated.
The system launch was completed yesterday around 11:00 AM. Things are looking very good and there’s a lot of excitement.”
Mark Jacobson
Director of Digital Publishing
AACR
American Association for Cancer Research
Content Clarity Analysis of Journal Backfile
Keywords: content structure analysis, platform migration, Silverchair, XML
Background
The American Association for Cancer Research (AACR) is the first and largest cancer research organization dedicated to accelerating the conquest of cancer. AACR publishes 10 peer-reviewed journals that cover the full spectrum of cancer science and medicine.
AACR transitioned all journal content to the Silverchair platform in February 2022 to achieve greater flexibility in the way that AACR journals present content to the cancer research community. Prior to that migration, AACR needed to ensure its entire corpus of content was up to date with the latest XML constructs that exist in NISO JATS 1.3. AACR recognized that analyzing its entire collection prior to platform migration would contribute to a smoother migration process as well as identify content structure issues that could impact interoperability and discoverability.
Solution
DCL’s Content Clarity service was employed to analyze, report on, and update content structure across AACR’s journal collection. AACR gathered all content files (XML and PDF) and delivered them to DCL. Content Clarity analyzes a publisher’s entire corpus of XML files to reveal issues in the JATS XML that do not necessarily invalidate the files but do contribute to interoperability and other problems. Processing also inventories digital assets (e.g., .jpg, .gif, .xlsx, .zip, .tif, .mov).
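As an illustration of what an asset inventory involves, the following is a minimal Python sketch that counts delivered files by extension. It is not DCL's implementation; the function name `inventory_assets` and the sample file paths are hypothetical, and the sketch assumes the inventory is keyed on the lowercased file extension.

```python
from collections import Counter
from pathlib import PurePosixPath


def inventory_assets(paths):
    """Count files by lowercased extension (hypothetical helper, for illustration only)."""
    # PurePosixPath(...).suffix returns the final extension, e.g. ".JPG";
    # lowercasing groups ".JPG" and ".jpg" together.
    return Counter(PurePosixPath(p).suffix.lower() for p in paths)


# Hypothetical delivery listing; real corpora would be read from disk.
files = [
    "journal-a/fig1.jpg",
    "journal-a/fig2.JPG",
    "journal-a/supp-data.XLSX",
    "journal-a/article.xml",
    "journal-a/video1.mov",
]

inventory = inventory_assets(files)
for extension, count in sorted(inventory.items()):
    print(extension, count)
```

Normalizing the extension case before counting keeps variants such as .XLSX and .xlsx from being reported as separate asset types.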
The next phase validates the XML files and health-checks the corresponding digital assets (e.g., Content Clarity checks that for every image there is at least one callout in the XML and that for every callout there is an image). The third phase performs a variety of semantic checks on the individual XML files at the article level, at the journal title level, and across the entire corpus.
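The image and callout check described above can be sketched as a two-way set comparison. This is a hedged illustration rather than Content Clarity's actual logic: the function name `check_callouts` and the sample article are hypothetical, and the sketch assumes callouts appear as `xlink:href` attributes on JATS `graphic`, `inline-graphic`, or `media` elements and that each href matches an asset file name exactly (real content may omit extensions or use paths).

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute used by JATS graphic elements.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
CALLOUT_TAGS = ("graphic", "inline-graphic", "media")


def check_callouts(xml_text, asset_names):
    """Compare asset references in one article with the files delivered (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    callouts = {
        el.get(XLINK_HREF)
        for el in root.iter()
        if el.tag in CALLOUT_TAGS and el.get(XLINK_HREF)
    }
    assets = set(asset_names)
    return {
        # Referenced in the XML, but no matching file was delivered.
        "missing_assets": sorted(callouts - assets),
        # Delivered, but never referenced in the XML.
        "orphan_assets": sorted(assets - callouts),
    }


# Hypothetical article fragment and asset list.
sample = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="f1"><graphic xlink:href="fig1.tif"/></fig>
    <fig id="f2"><graphic xlink:href="fig2.tif"/></fig>
  </body>
</article>"""

report = check_callouts(sample, ["fig1.tif", "fig3.tif"])
print(report)
```

In this sample, `fig2.tif` is called out without a file and `fig3.tif` is a file without a callout, so each appears in one of the two lists.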
Findings from the analysis are grouped into two categories: Summary Analytics, and Errors and Warnings. Results were presented in multiple spreadsheet reports.
Result
AACR now provides a seamless, modern user experience for all of its content with the knowledge that even legacy content is structured for optimal performance and discovery. Articles published 10 or 15 years ago are now fully updated to JATS XML and follow AACR’s current best practices (e.g., fully tagged references and affiliations, funding information, ORCIDs). Publishers see value in creating uniform, well-polished “atoms” from legacy materials for use in rapid development of new product offerings. Items such as equations, tables, funding information, and bibliographies are easily captured when structured in updated JATS format and provide flexibility for new uses.