Google’s Foundation Model Decodes Whale and Bird Calls

Birds’ chirps, trills, and warbles echo through the air, while whales’ boings, “biotwangs,” and whistles vibrate underwater. Despite the differences in these sounds and the media through which they travel, both birdsong and whale vocalizations can be classified by Perch 2.0, an AI audio model from Google DeepMind.

As a bioacoustics foundation model, Perch 2.0 was trained on millions of recordings of birds and other land-based animals, including amphibians, insects, and mammals. Yet researchers were surprised by how well the model performed when repurposed for whales.

Scientists at Google DeepMind and Google Research have been working on whale bioacoustics for almost a decade. Their work includes algorithms that detect humpback whale calls, as well as a more recent multispecies whale model that identifies eight distinct species and multiple call types for two of them. With the release of Perch 2.0, the researchers saw an opportunity to reuse the model and save on computation time and experimentation effort.

“If [Perch 2.0] performs well for our whale use cases, then that means we don’t need to build an entirely separate new whale model—we can just build on top of that,” says Lauren Harrell, a data scientist at Google Research.

That notion is backed by a technique known as transfer learning, in which knowledge gained from one task or dataset is applied to a different but related one. In this case, Perch 2.0’s ability to classify bird calls can carry over to classifying whale calls. Transfer learning from a foundation model means you can “recycle all of the training that’s been done and just do a small model at the end for your use cases,” Harrell says. “We’re always making new discoveries about call types. We’re always learning new things about underwater sounds. There’s so many mysterious ocean noises that you can’t just have one fixed model.”
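In code, the pattern looks roughly like the sketch below. It is a minimal illustration, not Google's implementation: the `embed` function (a fixed random projection here) stands in for the frozen Perch 2.0 backbone, and the humpback and “biotwang” labels are made up. The key idea is that the expensive pretrained model runs once as a feature extractor, and each new use case only needs a small classifier trained on the cached embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fixed random projection standing in for the frozen Perch 2.0 backbone;
# in the real workflow this would be the pretrained network's forward pass.
PROJECTION = rng.normal(size=(512, 256))

def embed(spectrograms: np.ndarray) -> np.ndarray:
    """Map flattened spectrograms to embeddings; the backbone is never retrained."""
    return spectrograms @ PROJECTION

specs = rng.normal(size=(100, 512))  # made-up flattened spectrograms
cached = embed(specs)                # run the expensive model once, cache the result

# Hypothetical labels for two separate downstream tasks. As new call types are
# discovered, only a new lightweight classifier is trained on the same cache.
labels_humpback = rng.integers(0, 2, size=100)
labels_biotwang = rng.integers(0, 2, size=100)

humpback_head = LogisticRegression(max_iter=1000).fit(cached, labels_humpback)
biotwang_head = LogisticRegression(max_iter=1000).fit(cached, labels_biotwang)
```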

The team evaluated Perch 2.0 on three marine audio datasets containing whale sounds and other aquatic noises. They began by converting each five-second window of audio into a spectrogram, a visual representation of sound intensity across frequencies over time. These images were fed to the model, which produced embeddings: compact feature vectors that preserve the most salient attributes of the data and help distinguish, for example, the whistle of a humpback whale from that of an orca.
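A rough sketch of that preprocessing step is below. The sample rate, FFT size, and mel-bin count are illustrative choices, not Perch 2.0's actual frontend; only the five-second windowing comes from the article.

```python
import numpy as np
import librosa

SR = 32_000              # assumed sample rate, for illustration only
WINDOW_SAMPLES = SR * 5  # five-second windows, as described above

def audio_to_spectrograms(audio: np.ndarray) -> list[np.ndarray]:
    """Return one log-mel spectrogram per non-overlapping five-second window."""
    spectrograms = []
    for start in range(0, len(audio) - WINDOW_SAMPLES + 1, WINDOW_SAMPLES):
        window = audio[start:start + WINDOW_SAMPLES]
        mel = librosa.feature.melspectrogram(
            y=window, sr=SR, n_fft=2048, hop_length=512, n_mels=128
        )
        spectrograms.append(librosa.power_to_db(mel, ref=np.max))
    return spectrograms

# Example: 30 seconds of synthetic noise yields six spectrograms.
audio = np.random.default_rng(0).normal(size=SR * 30).astype(np.float32)
print(len(audio_to_spectrograms(audio)))  # 6
```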

Next, the scientists randomly selected a small number of embeddings (from a minimum of four to a maximum of 32) per dataset to train a logistic regression classifier, a type of linear model that predicts a discrete outcome. The results, detailed in a paper presented at the NeurIPS workshop on AI for Non-Human Animal Communication last December, showed that the classifier performed well even with just a handful of embeddings, and that performance improved as the number of embeddings increased.
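That few-shot setup can be mimicked with scikit-learn, as in the sketch below. The Gaussian “embeddings,” their dimensionality, and the two-class structure are synthetic stand-ins, and the sampling is per class for simplicity rather than per dataset as in the paper; only the 4-to-32 range comes from the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 256  # assumed embedding dimensionality

def make_split(n_per_class: int) -> tuple[np.ndarray, np.ndarray]:
    """Synthetic two-class embeddings (stand-ins for real Perch 2.0 features)."""
    class_a = rng.normal(loc=0.0, size=(n_per_class, DIM))
    class_b = rng.normal(loc=0.1, size=(n_per_class, DIM))
    X = np.vstack([class_a, class_b])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_test, y_test = make_split(200)  # a large held-out set for scoring

# Train on a handful of examples and watch accuracy climb as the count grows.
for n_shots in (4, 8, 16, 32):
    X_train, y_train = make_split(n_shots)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"{n_shots:>2} examples/class: test accuracy {clf.score(X_test, y_test):.2f}")
```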

The researchers also compared Perch…
