Currently there is no widely accepted metric for measuring the quality of audio or video media produced by an AI system. Some argue that Fréchet Inception Distance (FID) is the best measure of image quality, as it compares the statistics of AI-generated images against those of their real-life counterparts.

But no matter how many measuring systems are currently making the rounds for audio and image quality, none of them is globally accepted; each is referenced only within its specific domain.

To resolve this, Google today proposed Fréchet Audio Distance (FAD) and Fréchet Video Distance (FVD) for measuring the quality of generated audio and video, respectively. FVD is said to assess the whole video without relying on any reference video. Similarly, FAD is reference-free and can be applied to all kinds of audio, in contrast to metrics such as source-to-distortion ratio (SDR), which require a time-aligned ground-truth signal.
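At their core, these Fréchet metrics share one computation: embed real and generated samples with a pretrained network (VGGish audio embeddings for FAD, I3D video features for FVD), fit a multivariate Gaussian to each embedding set, and measure the distance d² = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)) between the two Gaussians. The sketch below illustrates that computation in Python; the function name and array shapes are our own illustration, not code released by Google.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    """Fréchet distance between two embedding sets, each modeled as a
    multivariate Gaussian. Rows are samples, columns are embedding dims."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)

# Illustrative usage with random stand-ins for network embeddings:
real_emb = np.random.randn(1000, 128)        # e.g. VGGish-sized embeddings
gen_emb = np.random.randn(1000, 128) + 0.5   # shifted distribution
print(frechet_distance(real_emb, gen_emb))
```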

Software engineers Kevin Kilgour and Thomas Unterthiner wrote in a blog post:

Access to robust metrics for evaluation of generative models is crucial for measuring (and making) progress in the fields of audio and video understanding, but currently no such metrics exist. Clearly, some [generated] videos shown below look more realistic than others, but can the differences between them be quantified?

To gauge how closely FAD and FVD track human judgement, the engineers conducted a series of tests with human evaluators, who compared 10,000 video pairs and 69,000 5-second audio clips. Based on the results, the duo said the metrics correlated “quite well” with human judgement.
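To illustrate what such agreement can look like in practice, here is a minimal sketch: given one metric score and one average human rating per evaluated system, a rank correlation such as Spearman's rho quantifies how consistently the metric orders systems the way people do. The numbers below are invented for illustration and are not data from Google's study.

```python
import numpy as np
from scipy import stats

# Hypothetical per-system results: a FAD score (lower is better) paired
# with the mean human rating that system received (higher is better).
fad_scores = np.array([1.2, 2.9, 4.1, 6.3, 8.0])
human_ratings = np.array([4.5, 3.8, 3.1, 2.2, 1.4])

# A strongly negative rank correlation means the metric ranks systems
# in the same order as human judgement.
rho, p_value = stats.spearmanr(fad_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```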

Kilgour and Unterthiner said:

We are currently making great strides in generative [AI] models. FAD and FVD will help us [keep] this progress measurable and will hopefully lead us to improve our models for audio and video generation.