Just two days after Yahoo open-sourced Anthelion, its focused web crawler for parsing structured data from HTML pages, the company has announced that it is open-sourcing another tool, or rather a set of algorithms, called Data Sketches.
Data Sketches is an open source library of core algorithms designed for fast calculation and analysis on large systems where 100 percent accuracy is not required. These algorithms can speed up counting in many jobs and can perform quick approximate calculations on a stream of data, touching each item only once.
For example, if one wants to count the exact number of visitors to a site on a particular day, one needs plenty of disk space, memory and time.
With these algorithms, one can quickly count an approximate number of visitors using around 100KB of memory and no disk space. Their accuracy is in the range of plus or minus 1.5 percent and depends on the amount of input data.
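To make the idea concrete, here is a minimal Python sketch of one well-known approach to approximate distinct counting, the K-Minimum-Values (KMV) estimator. It is an illustration only, not the Data Sketches library's API: the class name `KMVSketch`, the parameter `k` and the visitor IDs are all hypothetical. The sketch hashes every item to a number between 0 and 1, keeps only the `k` smallest hashes, and estimates the number of distinct items from how small the k-th smallest hash is.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values sketch (hypothetical, not Yahoo's API)."""

    def __init__(self, k=4096):
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest one kept
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1) using 53 bits of SHA-1.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

    def update(self, item):
        """Process one item of the stream; each item is touched only once."""
        h = self._hash(item)
        if h in self._members:
            return  # already retained, so a repeat visitor changes nothing
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Return the approximate number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # still exact below capacity
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


# Usage: 300,000 page views coming from 100,000 distinct (made-up) visitor IDs.
sketch = KMVSketch(k=4096)
for i in range(300_000):
    sketch.update(f"visitor-{i % 100_000}")
print(round(sketch.estimate()))
```

Memory use is bounded by `k` no matter how many page views arrive. Under the usual analysis of this estimator, the relative standard error is roughly 1/sqrt(k), so a larger `k` trades memory for accuracy; with `k = 4096` that works out to about 1.6 percent.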
“The whole science is based on a very fundamental function: If you can tolerate a little bit of error in your results, then you can improve the speed of the computation immensely,”
says Lee Rhodes, an architect for Yahoo’s Advertising and Data Platforms division.
Data Sketches algorithms are already used in many Yahoo technologies, for example in Yahoo Mail and Yahoo Search, and by Yahoo-owned Flurry for calculating real-time counts. Because Yahoo is making these algorithms available under an open source license, anyone can implement them in their own systems.
This year has seen many top tech companies open-source their technologies for developers. Microsoft remains a front-runner among them, having open-sourced Visual Studio Code, the Chakra JavaScript engine, DMTK and Live Writer in the last month alone.