Announced in June 2014, Google Cloud Dataflow is a simple, flexible, and powerful system you can use to perform data processing tasks of any size. Cloud Dataflow is employed by various Big Data solutions provider around the world, to solve complex big data problems. Google designed it to relieve operational burden and enabling developers to focus on development.
As Google stresses, Cloud Dataflow has been designed as a platform to ‘democratize large scale data processing’ by enabling easier and more scalable access to data for data scientists, data analysts and data-centric developers. Google open sourced the Cloud Dataflow SDK, which offers a set of primitives for large-scale distributed computing, including rich semantics for stream processing.
Announcing the Cloudera partnership, Google said,
Today, we’re taking the next step in ensuring the portability of the Dataflow programming model by working with Cloudera to make Dataflow run on Spark. We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem. We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises.
There are currently three runners available to allow Dataflow programs to execute in different environments:
- Direct Pipeline: The “Direct Pipeline” runner executes the program on the local machine.
- Google Cloud Dataflow: The Google Cloud Dataflow service is a hosted and fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs can be deployed on it via a runner. This service is currently in alpha phase and available to a limited number of users; you can apply here.
- Spark: Due to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the Cloudera Labs effort and is available in this GitHub repo.
The new partnership is being incubated under Cloudera Labs. However, do take note that since it is an ‘incubating’ project, it is hence built for experimentation purposes only. Also,Google’s Cloud Dataflow is still an ‘Alpha’ release and hence, if you are deploying this on your core production systems, do so at your own risk.