Google is finally providing more information about the Tensor Processing Units that power its in-house AI. The company has been deploying these custom chips in its own infrastructure since 2015, and they have become a very important part of its AI efforts, but this is the first time it has discussed them publicly in any detail. The company has published a paper on the subject (with no fewer than 75 co-authors), and Google’s David Patterson is also giving a talk about it today at the National Academy of Engineering in Mountain View, California.
Tensor Processing Units come into play when running neural networks, the technology everyone is so excited about these days and that has the potential to power all sorts of applications involving machine learning.
So here is where the TPUs come in. Once Google has trained a neural network on large amounts of data, it uses the TPUs to make inferences about new data quickly and efficiently. According to the paper, deploying the TPUs has boosted system performance significantly.
Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU.
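To make that training-versus-inference split a bit more concrete, here is a minimal sketch in plain NumPy. The model, data and numbers are entirely hypothetical; the point is simply that training happens once, up front, while inference (a matrix multiply followed by a non-linearity) is the step that runs over and over on new data, and it is that step the TPU accelerates.

```python
# A minimal, hypothetical sketch of the train-once / infer-many split described
# above. Training fits the model weights up front on lots of data; inference
# reuses those frozen weights on new data, and that is the stage offloaded
# to the accelerator.
import numpy as np

rng = np.random.default_rng(0)

# Training phase: fit a tiny logistic-regression model with gradient descent.
X_train = rng.normal(size=(1000, 8))                        # 1,000 examples, 8 features
y_train = (X_train @ rng.normal(size=8) > 0).astype(float)  # made-up labels

w = np.zeros(8)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

# Inference phase: score unseen data with the frozen weights. This
# multiply-then-nonlinearity pattern is the workload class the TPU targets.
X_new = rng.normal(size=(5, 8))
print(1.0 / (1.0 + np.exp(-(X_new @ w))))
```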
Google’s custom chips are already powering a slew of applications, including Image Search, Cloud Vision and Photos. Beyond raw speed, the TPU has several other advantages as well, including up to 3.5 times more on-chip memory than Nvidia’s K80 GPU, a smaller footprint and 30 to 80 times better performance per watt.
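Since performance per watt does the heavy lifting in that comparison, here is the back-of-the-envelope arithmetic behind the metric. The figures below are illustrative placeholders, not numbers taken from the paper; only the formula, delivered throughput divided by power drawn, matters.

```python
# Performance per watt is just delivered throughput divided by power drawn.
# The numbers here are illustrative placeholders, not figures from the paper.
def perf_per_watt(ops_per_second: float, watts: float) -> float:
    return ops_per_second / watts

accelerator = perf_per_watt(ops_per_second=40e12, watts=80.0)   # hypothetical TPU-like chip
gpu_card    = perf_per_watt(ops_per_second=3e12,  watts=300.0)  # hypothetical GPU card

print(f"advantage: {accelerator / gpu_card:.0f}x")              # prints "advantage: 50x"
```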
Rather than be tightly integrated with a CPU, to reduce the chances of delaying deployment, the TPU was designed to be a coprocessor on the PCIe I/O bus, allowing it to plug into existing servers just as a GPU does. Moreover, to simplify hardware design and debugging, the host server sends TPU instructions for it to execute rather than fetching them itself. Hence, the TPU is closer in spirit to an FPU (floating-point unit) coprocessor than it is to a GPU. The goal was to run whole inference models in the TPU to reduce interactions with the host CPU and to be flexible enough to match the NN needs of 2015 and beyond, instead of just what was required for 2013 NNs.
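To illustrate that design choice, here is a conceptual sketch in plain Python of a host-driven coprocessor. The class and instruction names are made up and only loosely echo the operations the paper describes; the idea is that the server composes the whole instruction stream and pushes it across the bus, and the accelerator simply executes what it is sent rather than fetching instructions itself.

```python
# Conceptual sketch of a host-driven coprocessor. All names here are
# illustrative; this is not the TPU's actual software interface.
from dataclasses import dataclass, field

@dataclass
class Instruction:
    op: str            # e.g. "read_host_memory", "matrix_multiply", "activate"
    arg: str = ""

@dataclass
class Coprocessor:
    """Stand-in for a PCIe-attached accelerator with an instruction buffer."""
    buffer: list = field(default_factory=list)

    def push(self, instr: Instruction) -> None:
        self.buffer.append(instr)      # host writes the instruction across the bus

    def run(self) -> None:
        for instr in self.buffer:      # device executes whatever the host sent
            print(f"executing {instr.op}({instr.arg})")
        self.buffer.clear()

# Host side: describe a whole inference pass up front, hand it off, and only
# talk to the device again when the results come back, keeping CPU interaction low.
device = Coprocessor()
device.push(Instruction("read_host_memory", "input_activations"))
device.push(Instruction("matrix_multiply", "layer_1_weights"))
device.push(Instruction("activate", "relu"))
device.push(Instruction("write_host_memory", "output_activations"))
device.run()
```

Keeping control flow on the host like this is what lets the device itself stay simple, which is exactly the hardware-design and debugging benefit the paper cites.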
You can read the full paper on the subject right here.