Tech titan Google claims that the supercomputers it uses to train its AI models are faster and “greener” (that is, more power-efficient) than comparable systems built around chips from Nvidia. The claims come in a scientific paper the Alphabet unit released detailing its supercomputers.
Titled “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings,” the paper describes the fourth generation of Google’s Tensor Processing Unit (TPU), the company’s custom AI chip. According to Google, these chips handle over 90% of its work on training AI models – the process of feeding models data so they become effective at tasks such as generating images or responding to queries with human-like text. The fourth-generation TPU is also Google’s fifth domain-specific architecture (DSA) and its third supercomputer for such ML models.
Google released the paper on Tuesday, detailing how it connected more than 4,000 TPUs into a single supercomputer, using custom optical switches the company developed to link the individual machines. AI models such as the company’s PaLM are split across thousands of these chips, which must then work together for weeks or more to complete training.
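To give a flavor of what splitting a model across chips means in practice (this is not Google’s code), here is a minimal sketch in JAX, Google’s own Python library for TPU workloads. It shards a single large weight matrix across whatever accelerator chips are available; all names and sizes are illustrative:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Lay out the available chips as a logical 1-D mesh along a "model" axis
# (on a real TPU v4 pod this would span thousands of chips).
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

# Shard one large weight matrix column-wise across the mesh, so each chip
# holds only its own slice of the parameters.
weights = jax.device_put(
    jnp.ones((4096, 4096)),
    NamedSharding(mesh, P(None, "model")),
)

# The XLA compiler inserts the cross-chip communication needed to
# assemble the full result from the distributed slices.
@jax.jit
def forward(x, w):
    return jnp.dot(x, w)

x = jnp.ones((8, 4096))
print(forward(x, weights).shape)  # (8, 4096), computed across all chips
```

On a real pod, the same mechanism scales to models like PaLM, with every chip holding a fraction of the parameters for the weeks-long training run.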
A spokesperson for Nvidia declined to comment on the matter.
“Circuit switching makes it easy to route around failed components,” Google Fellow Norm Jouppi and Google Distinguished Engineer David Patterson wrote about the system. “This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of an ML (machine learning) model.”
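The routing flexibility Jouppi and Patterson describe happens in optical hardware, but the underlying idea can be sketched in software: treat the interconnect as a graph, and when a component fails, search for a path that avoids it. A toy Python illustration with a made-up four-chip mesh (the graph and names are invented for this example):

```python
from collections import deque

def shortest_path(links, src, dst, failed=frozenset()):
    """Breadth-first search for a route from src to dst, skipping failed nodes."""
    queue = deque([[src]])
    seen = {src} | set(failed)
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in links[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route survives the failures

# A small mesh of four "chips", each linked to two neighbors.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

print(shortest_path(links, "A", "D"))                # ['A', 'B', 'D']
print(shortest_path(links, "A", "D", failed={"B"}))  # ['A', 'C', 'D'] – routed around the failure
```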
For comparably sized systems, Google says its supercomputer is up to 1.7 times faster and up to 1.9 times more power-efficient than a system built on Nvidia’s A100, the chip that was on the market at the same time as the fourth-generation TPU. The company declined to compare the TPU v4 directly with Nvidia’s current flagship H100, since the H100 reached the market after Google’s chip and is built with newer technology.
Google added that the supercomputer is four times larger than its TPU v3 predecessor at 4,096 chips, while using 1.3 to 1.9 times less power than the Nvidia A100. TPU v4s inside Google Cloud’s energy-optimized, warehouse-scale computers also use nearly three times less energy and produce nearly 20 times less CO2e than contemporary DSAs running in typical on-premises data centers.