Google teams up with Cloudera to run Cloud Dataflow on Apache Spark

Google today announced that it has partnered with Apache Hadoop specialist Cloudera to bring Google Cloud Dataflow, whose SDK it recently open-sourced, to Apache Spark.

Announced in June 2014, Google Cloud Dataflow is a simple, flexible, and powerful system for data processing tasks of any size. Big data solutions providers around the world use it to solve complex big data problems. Google designed it to relieve operational burden and enable developers to focus on development.

As Google stresses, Cloud Dataflow has been designed as a platform to ‘democratize large scale data processing’ by enabling easier and more scalable access to data for data scientists, data analysts and data-centric developers. Google open sourced the Cloud Dataflow SDK, which offers a set of primitives for large-scale distributed computing, including rich semantics for stream processing.

Announcing the Cloudera partnership, Google said:

Today, we’re taking the next step in ensuring the portability of the Dataflow programming model by working with Cloudera to make Dataflow run on Spark. We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem. We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises.

There are currently three runners available to allow Dataflow programs to execute in different environments:

Direct Pipeline: The “Direct Pipeline” runner executes the program on the local machine.

Google Cloud Dataflow: The Google Cloud Dataflow service is a hosted, fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs are deployed to it through the corresponding runner. The service is currently in alpha and available to a limited number of users, who must apply for access.

Spark: Thanks to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the Cloudera Labs effort and is available on GitHub.
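
The runner idea can be illustrated with a small, self-contained Python sketch. It does not use the Cloud Dataflow SDK or Spark: `Pipeline`, `LocalRunner`, `BatchedRunner` and `build_pipeline` are hypothetical names invented here, and the "cluster" is only simulated by splitting the input into partitions. The point is simply that a program is described once and then handed to whichever runner suits the environment.

```python
from typing import Any, Callable, Iterable, List


class Pipeline:
    """Hypothetical pipeline: an ordered list of element-wise transforms."""

    def __init__(self) -> None:
        self.steps: List[Callable[[Any], Any]] = []

    def apply(self, fn: Callable[[Any], Any]) -> "Pipeline":
        self.steps.append(fn)
        return self  # returning self allows chained .apply() calls


class LocalRunner:
    """Stands in for a runner that executes everything on the local machine."""

    def run(self, pipeline: Pipeline, data: Iterable[Any]) -> List[Any]:
        results = []
        for item in data:
            for step in pipeline.steps:
                item = step(item)
            results.append(item)
        return results


class BatchedRunner:
    """Stands in for a cluster-style runner (assumption: partitions are
    processed one after another here, not on real workers)."""

    def __init__(self, partitions: int = 2) -> None:
        self.partitions = partitions

    def run(self, pipeline: Pipeline, data: Iterable[Any]) -> List[Any]:
        items = list(data)
        # Ceiling division gives the partition size; at least 1 to keep range() valid.
        size = max(1, -(-len(items) // self.partitions))
        results: List[Any] = []
        for start in range(0, len(items), size):
            chunk = items[start:start + size]
            results.extend(LocalRunner().run(pipeline, chunk))
        return results


def build_pipeline() -> Pipeline:
    # The program is defined once, independently of where it will run.
    return Pipeline().apply(str.strip).apply(str.lower).apply(len)


words = ["  Spark ", "Dataflow", " Cloudera"]
local_result = LocalRunner().run(build_pipeline(), words)
batched_result = BatchedRunner(partitions=2).run(build_pipeline(), words)
print(local_result, batched_result)
```

Both runners produce the same result from the same pipeline definition, which is the portability property the Spark runner is meant to provide for real Dataflow programs.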

The Spark runner is being incubated under Cloudera Labs. As an ‘incubating’ project, it is intended for experimentation only. Google’s Cloud Dataflow service is also still an ‘Alpha’ release, so deploy either of them on core production systems at your own risk.

Deepanshu Khandelwal
