DCL Data Harvester
Harvest new and modified data from public websites
Website Harvesting and AI Transformations That Deliver Structured Data to Your Systems
Organizations need to harvest and structure data and content posted and maintained on public websites. Websites are often the version of record for policy, procedure, legal, and regulatory content. Many businesses benefit from daily robotic scans of updated website content with structured XML feeds back into internal systems.
The volume and complexity of this type of information means that manual approaches are slow, error-prone, and cost-prohibitive. DCL Data Harvester provides automated website scraping configured to your business needs with customized XML feeds back to your organization.
Benefits
- Daily robotic scans of websites important to your business
- Harvest new and modified content from a variety of sources: PDF, HTML, XML, RTF, Word
- Analyze, cleanse, and harmonize data
- Provide cross-reference linking
- Convert to XML schema for delivery
A deeper solution beyond simple website scraping
DCL provides a solution that goes beyond simple web scraping: website harvesting combined with AI-based transformation of content into structured, usable formats.
For updates, some sites provide RSS feeds. But feeds are limited to what a website administrator chooses to expose: metadata may be missing or inaccurate, and there are often filtering, normalization, and format or publishing requirements that a feed alone cannot satisfy.
Sites are global and multilingual and contain information in multiple formats, such as HTML, PDF, XML, RTF, and DOCX. This necessitates a deeper solution in which data is downloaded, normalized, structured, and converted into a common XML format with defined metadata, and related content is linked. It is also critical that website crawling does not look like an attack on the system, which would trigger distributed denial-of-service (DDoS) alarms.
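As a sketch of the polite-crawling point above, the minimal Python throttle below enforces a per-host delay between requests so a crawl does not resemble a denial-of-service attack. The class name, default delay, and overall structure are illustrative assumptions, not DCL's actual implementation.

```python
import time
from urllib.parse import urlparse

class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical sketch: a real crawler would also consult robots.txt
    (e.g. via urllib.robotparser) and send an identifying User-Agent.
    """

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self._last = {}  # host -> monotonic timestamp of last request

    def wait(self, url):
        # Sleep just long enough that consecutive requests to one host
        # are at least `delay_seconds` apart; other hosts are unaffected.
        host = urlparse(url).netloc
        now = time.monotonic()
        last = self._last.get(host)
        if last is not None:
            remaining = self.delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)
        self._last[host] = time.monotonic()
```

A crawler would call `throttle.wait(url)` immediately before each fetch; requests to different hosts proceed without delay, which is why the timestamps are tracked per host.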
DCL has developed methods and bots to facilitate high-volume data retrieval from hundreds of websites, in a variety of source formats (HTML, RTF, DOCX, TXT, XML, etc.), in both European and Asian languages. We produce a unified data stream that is converted to XML for ingestion into derivative databases, data analytics platforms, and other downstream systems. This process of normalization and transformation of content to automate import into a customer’s business system maximizes business value. A key to successful projects is the depth and quality of up-front analysis to ensure complete and accurate results.
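The unified-XML step described above can be sketched in Python with the standard library. The element names (`harvest`, `document`, `metadata`) and record fields are hypothetical stand-ins, not DCL's actual schema.

```python
import xml.etree.ElementTree as ET

def to_unified_xml(records):
    """Convert harvested records into one common XML stream.

    `records` is assumed to be a list of dicts with `title`, `url`,
    `format`, and `body` keys; the tag names below are illustrative.
    """
    root = ET.Element("harvest")
    for rec in records:
        doc = ET.SubElement(root, "document")
        # Defined metadata travels with every document regardless of
        # the original source format (HTML, PDF, RTF, DOCX, ...).
        meta = ET.SubElement(doc, "metadata")
        ET.SubElement(meta, "title").text = rec["title"]
        ET.SubElement(meta, "source-url").text = rec["url"]
        ET.SubElement(meta, "source-format").text = rec["format"]
        ET.SubElement(doc, "body").text = rec["body"]
    return ET.tostring(root, encoding="unicode")
```

Downstream systems then ingest one predictable structure instead of a mix of source formats, which is what makes automated import into a business system feasible.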
DCL Data Harvester comprises
- Filtering programs
- Downloading handler
- Metadata gatherer
- File differencing programs
- Natural Language Processing programs
- Data and content transformation programs
- Secure repository
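The file differencing component listed above can be illustrated with a simple content-hash comparison between two crawl snapshots. The function name and the snapshot shape (a URL-to-text mapping) are assumptions for illustration; a production differ would likely work at a finer granularity.

```python
import hashlib

def diff_snapshot(previous, current):
    """Report which pages are new or modified since the last crawl.

    `previous` and `current` map URL -> page text. Hashing the full
    text is one simple way to detect change without storing diffs.
    """
    def digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    prev_hashes = {url: digest(text) for url, text in previous.items()}
    new, modified = [], []
    for url, text in current.items():
        if url not in prev_hashes:
            new.append(url)           # page did not exist last time
        elif prev_hashes[url] != digest(text):
            modified.append(url)      # page content has changed
    return new, modified
```

Only the new and modified pages then proceed to the transformation and delivery stages, which is what keeps daily scans of hundreds of sites tractable.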
DCL’s solution harnesses Natural Language Processing and Machine Learning to enable solutions powered by Artificial Intelligence. With sophisticated automated processes, DCL optimizes content to collect information, streamline compliance, facilitate migration to new systems and databases, maximize reuse potential, and ready content for delivery to every output.
Mark Gross, President, DCL
DCL Data Harvester is an ideal website scraping solution for industries that rely on regulatory and compliance data and need to keep up to date with constantly changing website content. DCL conducts upfront human analysis of target websites and content to ensure your content and metadata are captured, structured, and complete.