Yahoo Releases 13.5 TB Of User Interaction Data For Machine Learning Researchers To Deploy In Their Studies

In another example of corporations contributing to academia, Yahoo Inc. has decided to make the largest-ever machine learning dataset available to the research community. The company made the announcement this morning via a blog post.

The massive 13.5 TB dataset contains information, all anonymized of course, about the interactions of about 20 million users, recorded from February 2015 through May 2015, with pages including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, and Yahoo Real Estate.

The data includes details such as age range and gender, along with the title, summary, and key phrases of the news article under consideration, the time of the interaction, and some device information. Unsurprisingly, the data will prove to be a virtual treasure trove for researchers seeking patterns in the jumble, particularly those working in machine learning.

According to Suju Rajan, Director of Personalization Science at Yahoo Labs:

“Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies – and out of reach for most academic researchers.”

True enough. After all, theories can only take you so far unless you can test their practicality in the real world. Now, with this generous contribution from Yahoo, even freelance researchers will have all the data they could ever want to swim in.

According to TechCrunch, researchers at Carnegie Mellon University, the University of California, San Diego, and the UMass Amherst Center for Data Science are among those who will be working with this data to advance their studies in areas such as machine learning, artificial intelligence, and information retrieval.

While Yahoo uses datasets of this kind to improve its understanding and grasp of areas like search ranking, computational advertising, information retrieval, and core machine learning, this is one of those rare times that hardcore academics will be able to get their hands on something of the sort and use it, hopefully, for the greater good.

That is probably what the company is hoping. After all, you did not really expect a corporation to do something out of pure selflessness, did you? As with IBM Watson, Amazon Machine Learning, Azure Machine Learning, and Google, Yahoo’s attempts to aid research hinge upon a very important factor: all these companies depend heavily on the fruits of research. Thus, all the aid that they can provide to the folks actually doing the research will, in the end, come back to them in the form of enhanced and improved technologies. What’s more, AI and machine learning are proving particularly hard nuts to crack for these corporations.

Here is what Yahoo hopes to achieve with this data release:

We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, “real-world” dataset. We strongly believe that this dataset can become the benchmark for large-scale machine learning and recommender systems, and we look forward to hearing from the community about their applications of our data.

All in all, the move is very welcome and will go a long way towards advancing research in a score of fields. The data can be accessed through Yahoo Webscope, which is basically a library of interesting and scientifically useful datasets, although I wouldn’t recommend trying to download it all over your regular data connection.


A bibliophile and a business enthusiast.
