Audio is everywhere. While audio processing has existed for decades, the rise of artificial intelligence and big data has allowed us to uncover audio’s hidden secrets, providing previously inaccessible insights for businesses and end users.
A better understanding of audio can help the customer care industry, allowing businesses to maximize customer satisfaction. Sentiment analysis algorithms can recognize a customer’s tone on a customer service call and help identify the root cause of a problem, allowing businesses to change their strategies to better support their customers.
Another example is NASA’s SoundSee, an initiative that aimed to equip several mini-robots with arrays of microphones to monitor machine audio onboard the International Space Station. Using AI, these robots recognize irregularities and notify the responsible parties so the issue can be fixed, acting as a first line of defense against system failures.
In this article, we will explore what exactly sound is, how it can be measured, and how it can be leveraged using Artificial Intelligence.
So, what is sound?
In a nutshell, sound is generated when vibrating objects cause air molecules to bump into one another. The oscillation of these molecules creates tiny pressure differences in the air, and those differences propagate outward as sound waves. Sound waves are a type of mechanical wave: they travel through a medium, transferring energy from one position to another. This is precisely why there is no sound in space; the vacuum of space has no medium through which sound can travel.
In the image above, the plot of particles at the bottom represents the areas of low and high pressure within the air caused by the sound. Areas with low pressure have a lower particle density and the areas with higher pressure have a higher particle density. Based on this pressure differential, a curve can be generated with peaks at areas with higher air pressure and valleys at areas with low air pressure.
This visualization of a sound wave is known as a waveform, and it provides a plethora of details about the sound that can be leveraged when extracting features from the audio. A few of the most fundamental of these features are frequency, intensity, and timbre.
In a wave, the period is the time it takes to complete one cycle (refer to the image below). Frequency is the inverse of the period and is expressed in hertz (Hz), which means cycles per second. Essentially, the less time a cycle takes to complete, the greater the frequency, and vice versa. Visually, a wave whose peaks are closer together has a higher frequency than a wave whose peaks are farther apart.
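The relationship between period and frequency can be sketched in a few lines of Python. The function names and the example values are illustrative assumptions rather than part of any library, and the sine wave is only an idealized stand-in for a real sound.

```python
import math

def frequency_from_period(period_s):
    """Frequency in Hz is the inverse of the period in seconds."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    return 1.0 / period_s

def sine_wave(frequency_hz, t, amplitude=1.0):
    """Value of an idealized pure tone at time t (in seconds)."""
    return amplitude * math.sin(2 * math.pi * frequency_hz * t)

# A cycle that takes 10 ms repeats about 100 times per second.
low = frequency_from_period(0.010)
# A cycle that takes 1 ms repeats about 1000 times per second,
# so its peaks sit ten times closer together on the time axis.
high = frequency_from_period(0.001)

# One quarter of the way through a cycle, a sine wave reaches its peak.
peak = sine_wave(low, t=0.010 / 4)
```

Shortening the period is the only thing that changes between the two tones, which is why frequency alone determines how tightly the peaks are packed.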
But how do we perceive frequency?
We perceive frequency as the pitch of a sound. While frequency is a numerical measure of how quickly a waveform repeats, pitch is a more subjective term we use to describe a sound. The higher the frequency, the higher the pitch of the sound, and the lower the frequency, the lower the pitch.
Just like frequency, intensity is another crucial dimension for understanding the composition of a sound. Sound intensity is the sound power passing through a unit of area, measured in watts per square meter. The power of a sound is the rate at which it transfers energy, that is, energy per unit of time. In short, intensity describes how much energy a sound carries through a given area each second.
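As a small worked example, the two definitions above (power as energy per unit time, intensity as power per unit area) can be written out directly. The function names and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sound_power(energy_j, duration_s):
    """Power in watts: energy transferred per unit of time."""
    return energy_j / duration_s

def sound_intensity(power_w, area_m2):
    """Intensity in W/m^2: power spread over the area it passes through."""
    return power_w / area_m2

# Hypothetical values: a source delivers 0.02 J of acoustic energy in 2 s,
# and that power passes through a surface of 4 square meters.
power = sound_power(energy_j=0.02, duration_s=2.0)
intensity = sound_intensity(power_w=power, area_m2=4.0)
```

With these values the power works out to 0.01 W and the intensity to 0.0025 W/m². Spreading the same power over a larger area lowers the intensity.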
Now, just like with frequency, the way we perceive intensity is much more subjective. We usually perceive sounds with higher intensities as louder and sounds with lower intensities as softer. However, loudness isn’t consistent among all listeners. Factors such as the duration and frequency of the sound, as well as the age of the listener, can affect how loud a sound feels.
So far, we’ve discussed two unidimensional sound properties: frequency and intensity. Unlike these easily quantifiable properties, timbre is a rather mysterious property of sound that describes a multitude of properties that give a sound its character. Musicians like to describe timbre as the color of sound, which is an interesting but vague description.
To explore what timbre is, let’s look at a simple example. Imagine a trumpet playing a note at the same pitch, duration, and intensity as a violin. While the two sounds share most of the same properties, to you and me they would sound noticeably different. The combination of characteristics that separates those two sounds is what we call timbre.
From the physical world to the digital world
Now that we have a basic understanding of the physics of sound and its properties, how can we leverage these properties and do some audio processing? Well, first of all, we need to be able to convert audio into some digital signal that contains the information necessary to manipulate and process the audio.
How Microphones Work: Analog to Digital Conversion (ADC)
In nature, all audio exists as an analog signal. An analog signal is a continuous function of amplitude over time, with a value at every instant. Storing a raw analog signal exactly is impossible in practice, because it would require infinite storage. Instead, we perform a combination of operations to extract values from the analog signal at fixed intervals. This allows us to store the signal in a digital format using a finite amount of memory while still collecting enough data to reproduce the sound. This process, known as Analog to Digital Conversion (ADC), uses sampling and quantization to collect a finite set of values for any given analog signal.
Sampling: Instead of collecting every value in the continuous analog signal, sampling extracts values at fixed, equidistant time intervals. The most common sampling rate for audio is 44.1 kHz, or 44,100 values for every second of sound. By the Nyquist sampling theorem, this rate can represent frequencies up to about 22 kHz, which covers the human hearing range of roughly 20 Hz to 20 kHz.
Quantization: While sampling extracts values at fixed time intervals along the horizontal axis, quantization divides the vertical axis of a waveform into a set of fixed, equidistant levels. When a value is selected at a given time interval, quantization rounds the exact value to the nearest quantized level. The resolution, also known as the bit depth, is measured in bits, and it determines the number of quantized levels: a bit depth of n gives 2^n levels. A normal CD has a bit depth of 16 bits, meaning it has 65,536 quantized values. The higher the bit depth during quantization, the greater the dynamic range when converting an analog signal into a digital one.
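The sampling and quantization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not how a real sound card is implemented: the analog signal is stood in for by a 440 Hz sine function, all function names are hypothetical, and the quantizer uses a symmetric range for simplicity (real 16-bit PCM spans -32768 to 32767).

```python
import math

def sample(signal, duration_s, sample_rate_hz):
    """Read a continuous signal (a function of time) at equally spaced instants."""
    n_samples = round(duration_s * sample_rate_hz)
    return [signal(n / sample_rate_hz) for n in range(n_samples)]

def quantize(value, bit_depth):
    """Round a value in [-1.0, 1.0] to the nearest signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1          # largest positive level
    clipped = max(-1.0, min(1.0, value))          # keep out-of-range input in bounds
    return round(clipped * max_level)

def tone(t):
    """Assumed stand-in for the analog signal: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440.0 * t)

SAMPLE_RATE_HZ = 44_100   # CD-quality sampling rate
BIT_DEPTH = 16            # CD-quality bit depth

# Sampling: 10 ms of the tone, read at 44.1 kHz.
samples = sample(tone, duration_s=0.01, sample_rate_hz=SAMPLE_RATE_HZ)
# Quantization: each sample is rounded to one of the available integer levels.
digital = [quantize(s, BIT_DEPTH) for s in samples]

levels = 2 ** BIT_DEPTH                              # number of quantized values
bytes_per_second = SAMPLE_RATE_HZ * BIT_DEPTH // 8   # one channel, uncompressed
```

Raising `SAMPLE_RATE_HZ` captures more points along the time axis, while raising `BIT_DEPTH` makes the rounding step finer. Both increase `bytes_per_second`, which is the trade-off between fidelity and storage.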
When a microphone picks up audio, the diaphragm inside the microphone oscillates, forming an analog signal that is sent to a sound card. The sound card performs ADC and sends the newly generated digital signal to the computer for manipulation or processing.
Using Artificial Intelligence for Audio Processing
We understand what audio is and how we can convert it from a physical format to a digital format, but how can we actually do anything with it? While there are all sorts of different ways to process audio, we’ll be focusing on how Artificial Intelligence has permeated the audio space, allowing us to better understand, enhance, and reproduce audio.
While we won’t dwell on the specifics behind how to implement AI in audio processing, we will go over the different ways that AI can be applied to audio.
Broadly, artificial intelligence is the idea that computers can complete tasks that typically require a higher level of intelligence than following a fixed series of logical steps. Deep learning is a subset of artificial intelligence in which complex algorithms, loosely modeled after the human brain, learn from massive amounts of data.
For deep learning algorithms to provide valuable insights from audio, we need access to large amounts of audio data. These datasets not only need to be large but also need to be clean and organized. A large, clean dataset paired with an efficient AI algorithm generally yields the best results. Using these large datasets, an AI model learns patterns in the properties of different sounds, such as frequency, duration, intensity, and timbre.
Currently, one of the most common uses of AI in the audio field is speech recognition. Personal assistants such as Amazon Alexa, Google Home, and Apple’s Siri all leverage AI to convert a person’s speech to text, understand the meaning of the request, and produce an audible response.
Speech recognition is only a small portion of the opportunities that AI brings to the audio industry. Researchers are now implementing AI models that can create sound entirely from scratch. This ability is extremely useful for any text-to-speech program, which uses speech synthesis to produce audio. Deep learning models train on hundreds of hours of speech and the matching transcripts. Eventually, they learn how every word and character sounds in its context and can produce speech when given a piece of text.
Similarly, AI is being used in the music industry to produce complex musical pieces given a set of parameters. A great example of music synthesis is Google’s recent Blob Opera application, which generates beautiful-sounding harmonies regardless of how a user arranges the blobs.
It doesn’t stop there. We can also use AI to intelligently manipulate digital audio and improve existing recordings. For example, AI can be used to create cleaner speech by removing background noise or unwanted artifacts from a recording. Audio super resolution, another facet of audio enhancement, allows us to dramatically enhance low-quality audio by increasing its fidelity. All of these features could improve audio quality during calls and increase clarity in poorly recorded audio.
Today, we covered what sound is and the physics behind it. Specifically, we looked at the frequency, intensity, and timbre of a sound, how analog audio is converted into a digital signal, and how AI is being applied to audio. Understanding the properties of sound and waveforms is crucial for building better AI sound processing algorithms.
The potential for this industry is enormous, and as more researchers and developers recognize its prospects, we can expect AI audio processing to reach more and more industries.