While "conversion" is Data Conversion Laboratory's (DCL) middle name, we offer much more than one-off transactional conversions from PDF to XML (or InDesign, FrameMaker, Word, HTML, etc.). DCL employs some serious technology wizards who are skilled at mining, extracting, structuring, enriching, and otherwise manipulating content and data in almost any way you can imagine.
We address intricate content obstacles with skilled teams who specialize in solving puzzles. The following is one service your organization might find useful.
Web Scraping and Structured Content Creation
Vast amounts of business-critical information appear only on public websites that are constantly updated to present both new and modified content (e.g., financial regulatory information). While the information on many of these websites is extremely valuable, no standards exist today for the way content is organized, presented, and formatted, or for how individual websites are constructed or accessed. Some content might exist in Excel files or PDFs. This creates a significant challenge for companies that need to download data from these websites in a timely manner to support business practices and downstream systems.
While web crawling sounds somewhat nefarious, there is an important white-hat side to it. Much original source material today appears only on the Web. For many government agencies and various NGOs, the web version is the “document of record,” the most current version available, and where you are referred when you make inquiries regarding reports, articles, white papers, etc.
Although numerous tools exist for managing the fundamental crawling and scraping of websites, they typically operate on a single website at a time. Analyzing and traversing volumes of complex websites, somewhat like developing autonomous vehicles, requires the ability to adapt to changing conditions, across websites and over time.
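To make the challenge concrete, here is a minimal Python sketch of two ideas from the paragraph above: handling websites that structure their content differently, and noticing when a page's content has changed over time. It is an illustration only, not DCL's implementation. The site names, the `SITE_RULES` table, and the helper functions are all hypothetical, and the sketch assumes the page HTML has already been downloaded.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found inside one target tag (e.g. <article> or <main>)."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0      # greater than 0 while inside the target tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical per-site rules: each site keeps its content in a different tag.
SITE_RULES = {
    "regulator.example": "article",
    "agency.example": "main",
}


def extract_text(site, html):
    """Apply the site's rule and return only the text of its content area."""
    parser = TextExtractor(SITE_RULES[site])
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fingerprint(text):
    """Return a stable hash of the extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(site, html, seen):
    """Return True if the site's content is new or modified since the last
    call, and record the new fingerprint in the `seen` dictionary."""
    digest = fingerprint(extract_text(site, html))
    changed = seen.get(site) != digest
    seen[site] = digest
    return changed
```

Because the fingerprint is computed from the extracted content area rather than the whole page, a change to a navigation menu or footer does not register as new content in this sketch. A production harvester would need far more than this, including fetching, scheduling, error handling, and support for non-HTML sources such as Excel files and PDFs.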
DCL created a set of best practices for web crawling and harvesting technologies, enabling complete automation when dealing with a variety of different, intricate, and frequently disorganized websites. Our approach has been continuously improved to adapt to the constantly evolving web content environment and support an ongoing enhancement model.
DCL differentiates its data harvesting from simple web scraping by also incorporating machine learning and natural language processing to ensure the final output is well structured and ready for reuse.