In a rather strange move, Yahoo today open sourced Anthelion on GitHub. Anthelion is a focused web crawler for semantic annotations in Web pages: it steers toward HTML pages that are annotated with markup languages such as RDFa, Microformats, and Microdata.
Anthelion can be targeted to crawl for specific pages. The system includes a ready-to-run extension for the Apache Nutch Crawler, which can be run on a single machine as well as a Hadoop cluster.
Notably, search is the core element of Yahoo’s wide portfolio of web services, and Anthelion is key to all of Yahoo’s search-based services.
Yahoo detailed Anthelion in a paper presented last year at the Conference on Information and Knowledge Management in Shanghai. Microdata and RDFa are syntax formats for embedding structured data about different topics in Web pages. Both are compatible with the schema.org vocabulary for structured data, a project that the Google, Yahoo, and Bing search engines all work on.
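To make the markup formats concrete, here is a minimal sketch of how a crawler might check whether a fetched page carries any of these annotations. It is not taken from Anthelion: the `detect_markup` function, the `MarkupDetector` class, and the attribute and class lists are illustrative assumptions, and a real detector would need to cover far more cases.

```python
from html.parser import HTMLParser

# Illustrative, incomplete lists (assumptions, not Anthelion's own rules).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "vocab", "property"}
MICROFORMAT_CLASSES = {"vcard", "hentry", "vevent", "hreview",
                       "h-card", "h-entry", "h-event"}


class MarkupDetector(HTMLParser):
    """Collects the kinds of semantic markup seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; valueless ones map to None.
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.found.add("microdata")
        if names & RDFA_ATTRS:
            self.found.add("rdfa")
        for name, value in attrs:
            if name == "class" and value:
                if set(value.split()) & MICROFORMAT_CLASSES:
                    self.found.add("microformats")


def detect_markup(html):
    """Return the set of markup kinds detected in an HTML string."""
    parser = MarkupDetector()
    parser.feed(html)
    parser.close()
    return parser.found
```

A check like this could serve as the kind of target function described in the paper: a page counts as relevant when the returned set is non-empty.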
As described in technical terms by Mika, Blanco, Meusel and Petar Ristoski, the Anthelion system combines the benefits of online learning and a bandit-based selection strategy to adapt to the current crawling environment.
Each newly discovered URL is classified against a given target function. The page currently being crawled is then analyzed with respect to that target function, and the result is passed to the learner as feedback to further improve its quality. Experiments have shown that this strategy increases the number of relevant pages retrieved by a factor of 3.
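The interplay between the learner and the bandit can be illustrated with a small Python sketch. This is not Anthelion's code, and the article does not say which bandit strategy or features the system uses: the sketch assumes an epsilon-greedy bandit whose arms are hosts, and a tiny online logistic regression over URL tokens. All class and function names here (`OnlineUrlClassifier`, `EpsilonGreedyFrontier`, `crawl`) are hypothetical.

```python
import math
import random
import re
from collections import defaultdict
from urllib.parse import urlparse


class OnlineUrlClassifier:
    """Online logistic regression over URL tokens (illustrative only)."""

    def __init__(self, learning_rate=0.5):
        self.weights = defaultdict(float)
        self.learning_rate = learning_rate

    @staticmethod
    def features(url):
        return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}

    def score(self, url):
        z = sum(self.weights.get(t, 0.0) for t in self.features(url))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, url, relevant):
        # One gradient step per crawled page: this is the online learning part.
        error = (1.0 if relevant else 0.0) - self.score(url)
        for t in self.features(url):
            self.weights[t] += self.learning_rate * error


class EpsilonGreedyFrontier:
    """Crawl frontier where each host is a bandit arm (assumed design)."""

    def __init__(self, classifier, epsilon=0.1, rng=None):
        self.classifier = classifier
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.queues = defaultdict(list)    # host -> pending URLs
        self.pulls = defaultdict(int)      # host -> pages crawled
        self.rewards = defaultdict(float)  # host -> relevant pages found

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def _mean_reward(self, host):
        pulls = self.pulls.get(host, 0)
        # Hosts not yet crawled get an optimistic estimate of 1.0.
        return self.rewards.get(host, 0.0) / pulls if pulls else 1.0

    def next_url(self):
        hosts = [h for h, q in self.queues.items() if q]
        if not hosts:
            return None
        if self.rng.random() < self.epsilon:
            host = self.rng.choice(hosts)             # explore
        else:
            host = max(hosts, key=self._mean_reward)  # exploit
        queue = self.queues[host]
        url = max(queue, key=self.classifier.score)   # best-scored URL first
        queue.remove(url)
        return url

    def feedback(self, url, relevant):
        host = urlparse(url).netloc
        self.pulls[host] += 1
        self.rewards[host] += 1.0 if relevant else 0.0
        self.classifier.update(url, relevant)


def crawl(frontier, target_function, budget):
    """Crawl up to `budget` URLs, feeding each result back to the learner."""
    crawled = []
    for _ in range(budget):
        url = frontier.next_url()
        if url is None:
            break
        frontier.feedback(url, target_function(url))
        crawled.append(url)
    return crawled
```

In this sketch `target_function` takes a URL and returns whether the page behind it is relevant; a real crawler would fetch the page and inspect its contents for markup before reporting back. Setting `epsilon` to 0 makes the frontier purely greedy, while larger values spend more of the crawl budget exploring hosts with poor track records.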
The complete code, which is released under Apache License 2.0, can be found at the Yahoo GitHub repository: https://github.com/yahoo/anthelion