I’m currently working on a project to help me improve my trumpet playing. The basic premise of the project is that I want a program where I can plug in a microphone, play music on my trumpet, and get feedback while I’m playing on my intonation (is my pitch correct, or am I sharp or flat?).

This post is part two of a two-parter. In part one, I wrote about what sound is physically, how a microphone measures sound, and more or less what it looks like once it’s in the computer. To summarize the parts that are important to this post: sound is vibrating air. Using a microphone, you can write a program that takes in a measurement of those vibrations as a list of numbers. On most sound cards, you will get a list of 44100 numbers for every second of sound.

Another important point that you need to know from my previous post to follow this one is that the musical pitch of a sound is determined by the frequency of the air’s vibrations. If the air is oscillating 440 times per second (440Hz), then we hear the note as an ‘A’. If the air is vibrating faster, the pitch goes up, and if it’s vibrating slower the pitch goes down.

In this post, I’ll be writing about the mathematical tools that I’m using to analyse the sound I’m recording in order to try to figure out its pitch. In other words, I want to take in a signal as a list of numbers from the microphone, and analyse it to figure out how the air is vibrating, quantify the frequency of the vibrations, and thus be able to display a musical pitch.
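Once a frequency has been measured, turning it into a musical pitch is a small calculation. Here’s a minimal sketch in Rust (the language of the code example later in this post), assuming equal temperament with A4 at 440Hz; the function name and the note-naming scheme are my own choices, not part of the project described here:

```rust
// Names the nearest equal-tempered note for a frequency, along with how
// many cents (hundredths of a semitone) sharp or flat it is.
// Assumes A4 = 440Hz as the tuning reference.
fn nearest_note(freq: f64) -> (String, f64) {
    const NAMES: [&str; 12] = [
        "A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#",
    ];
    // distance from A440 in semitones (may be fractional)
    let semitones = 12.0 * (freq / 440.0).log2();
    let nearest = semitones.round();
    // positive cents means sharp, negative means flat
    let cents = (semitones - nearest) * 100.0;
    let index = ((nearest as i64 % 12) + 12) % 12;
    (NAMES[index as usize].to_string(), cents)
}
```

The cents value is exactly the sharp/flat feedback I’m after: within a few cents of zero means the note is in tune.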

## The Sine Wave

Real signals can get fairly complicated, so it’s easier to start explaining the techniques using a simple signal. The easiest signal to work with in terms of modelling sound is a sine wave. For the sake of example, I’ve generated a 440Hz sine wave (musically, this is an A). It looks like this:

Something that you should notice here is the sort of timescales we’re working on. We can fit 5 complete repetitions of the waveform in under 0.012 seconds. This shows that even if the music you’re working with is changing notes quickly, there is probably enough data to identify the pitch of each individual note. On the other hand, I chose the number of repetitions to show based on what looks good on a graph, so you might need to tweak the amount of sound you use depending on which techniques you’re using and the precision you need.
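A test signal like this is easy to generate yourself. A minimal sketch in Rust (the function name and parameters are my own choices):

```rust
// Generates `n` samples of a sine wave at `freq` Hz, sampled at
// `sample_rate` Hz -- 44100Hz on most sound cards, as discussed above.
fn sine_wave(freq: f64, sample_rate: f64, n: usize) -> Vec<f64> {
    (0..n)
        .map(|i| (2.0 * std::f64::consts::PI * freq * i as f64 / sample_rate).sin())
        .collect()
}
```

Five repetitions of a 440Hz wave at 44100Hz is only around 500 samples, which matches the roughly 0.012 seconds shown on the graph.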

## A Real Signal

For the sake of illustrating how these techniques look on a real signal, I’ve also taken a recording on the trumpet of the same note.

If you compare this graph to the previous one, you’ll notice that it is also repeating regularly with the same interval. If you look just after the 0.002 second mark (0.00227 seconds to be precise), you’ll notice that the signal goes low and starts to repeat the exact same shape again.

Looking at the time between repetitions is actually just another way of looking at the frequency of the signal, so it also has a special name: the time between repetitions is the period of the signal.

Mathematically, the period is the inverse of the frequency, so if you can find one you can calculate the other:

`0.00227s = 1 / 440Hz`

or

`440Hz = 1 / 0.00227s`

## The Fourier Transform

Contrary to popular belief, the Fourier Transform has very little to do with figuring out how to represent a number using as many fours as possible.

No, the Fourier Transform is actually a way of mathematically modelling a signal as if it were the sum of a collection of sine waves with various frequencies. If you think of our example signal, which is just a sine wave, it would be transformed to show that it is zero everywhere, except for a spike at a single point.

I’d like to talk about the actual implementation of the Fourier Transform here as an algorithm, but it requires a bit too much background in calculus to explain easily. However, that doesn’t mean you can’t use it! Fourier Transforms come up in a surprising number of situations in math, statistics, science, and signal processing, so there are many libraries you can use to do the transform for you. If you’re working in software, very often what you’re looking for will be called a Discrete Fourier Transform (in signal processing terms, the numbers coming from the microphone are a discrete signal), or a Fast Fourier Transform (a specific, efficient algorithm that performs the Discrete Fourier Transform).
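To give a feel for what those libraries compute, here’s a naive Discrete Fourier Transform that returns the magnitude of each frequency bin. This is a sketch for illustration only: real libraries use the much faster FFT algorithm, and this direct version is far too slow for anything but small inputs.

```rust
// A naive Discrete Fourier Transform, returning the magnitude of each
// frequency bin. Bin k corresponds to the frequency
// k * sample_rate / input.len().
fn dft_magnitudes(input: &[f64]) -> Vec<f64> {
    let n = input.len();
    (0..n)
        .map(|k| {
            // accumulate the real and imaginary parts of bin k
            let mut re = 0.0;
            let mut im = 0.0;
            for (i, &x) in input.iter().enumerate() {
                let angle = -2.0 * std::f64::consts::PI * (k as f64) * (i as f64) / n as f64;
                re += x * angle.cos();
                im += x * angle.sin();
            }
            (re * re + im * im).sqrt()
        })
        .collect()
}
```

Feeding this a pure sine wave produces exactly the shape described above: near-zero everywhere, with a single spike in the bin matching the wave’s frequency.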

Our input signal was a sine wave with a frequency of 440Hz, so the output from the Fourier Transform was mostly around zero with a single spike at 440Hz. Like many signal processing algorithms that work on discrete numbers, it works fairly well but doesn’t give the same precision as working it out symbolically. What this means is we’ll have a clear peak on our Fourier Transform output, but it won’t be perfectly zero everywhere else.

The Fourier Transform will give you frequency information for frequencies up to half of your microphone’s sampling frequency (a limit known as the Nyquist frequency). The reasons for this are difficult to fit into this blog post, but interestingly they are the same reasons that the common sampling rate used in computers is 44100Hz, which is just a bit above double the maximum frequency that most people can hear: 20000Hz.

What do we get if we use it on our real signal?

A real signal (at least, a real musical signal) is composed of many sine waves. The 440Hz spike is still there, but there is a bit more work to do computationally to figure out which of the several spikes is the right one to report as the pitch of the note. In most cases, it will be the first peak, with the other peaks being harmonics of the first peak. Annoyingly, this first peak isn’t always going to be the highest peak.
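One simple heuristic for that peak-finding step is to take the first local maximum that rises above some fraction of the global maximum, which skips small noise bumps before the fundamental. This function and its threshold parameter are my own sketch, not the only way to do it, and the threshold is something you’d tune by experiment:

```rust
// Finds the first local maximum in a magnitude spectrum that is at
// least `threshold` times the global maximum, returning its bin index.
// The idea: the fundamental is usually the first substantial peak,
// even when a later harmonic is taller.
fn first_significant_peak(mags: &[f64], threshold: f64) -> Option<usize> {
    let max = mags.iter().cloned().fold(f64::MIN, f64::max);
    for i in 1..mags.len().saturating_sub(1) {
        let is_local_max = mags[i] > mags[i - 1] && mags[i] >= mags[i + 1];
        if is_local_max && mags[i] >= threshold * max {
            return Some(i);
        }
    }
    None
}
```

The returned bin index converts to a frequency by multiplying by the sampling rate and dividing by the number of samples analysed.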

Something problematic here for our application is that the resolution we’re getting in the frequency domain is only about 100Hz. So you can look at that spike and determine that the frequency is 440Hz, but with a potential error of up to 50Hz on either side. In fact, the actual value of 440Hz on the graph above sits between two points, which is why that first peak is so wide.

This clearly is not ideal, but generally speaking, you can increase the frequency resolution of your Fourier Transform if you give it a longer stretch of audio to work with. This would need to be balanced against the need for the application to respond quickly with the pitch.
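The trade-off is easy to quantify: the frequency resolution of a Discrete Fourier Transform is the sampling rate divided by the number of samples analysed. (The roughly 100Hz resolution above would correspond to about a hundredth of a second of audio, assuming a 44100Hz sampling rate.)

```rust
// The spacing in Hz between adjacent Fourier Transform bins:
// sampling rate divided by the number of samples analysed. A longer
// recording buys finer resolution at the cost of responsiveness.
fn fft_resolution_hz(sample_rate: f64, num_samples: usize) -> f64 {
    sample_rate / num_samples as f64
}
```

So about 0.01s of audio gives roughly 100Hz resolution, while a full second gets you down to 1Hz, at the cost of a full second of latency before the display can react.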

## Auto-Correlation

Auto-Correlation (a special case of cross-correlation in which a signal is compared with itself, and a close relative of convolution) focuses on finding the period rather than finding the frequency.

The idea of correlating signals is that if you have two signals, you can line them up and compare them at each point to see how similar they are.

How do we apply it to this problem? Well, let’s say we take our signal, make a copy, and correlate it with itself. It will correlate very highly with itself, because that is literally the idea of correlation. If we then try taking one of the signals and inserting a time delay to shift it slightly to the right, the correlation will start to drop.

If we go back to when we were looking at the signals earlier, we know that our signal repeats itself. This means that at some point while increasing our delay, we will reach a point where the signal is highly correlated with its delayed version.

It might be easiest to explain Auto-Correlation with an example implementation.

```rust
// builds a list of correlations with various time delays
fn autocorrelation(input: &[f64]) -> Vec<f64> {
    let mut output = Vec::with_capacity(input.len());
    for delay in 0..input.len() {
        output.push(correlation(input, delay));
    }
    output
}

// measures how 'the same' a signal is to itself after
// a time delay
fn correlation(input: &[f64], delay: usize) -> f64 {
    let mut c = 0.0;
    for i in 0..input.len() - delay {
        c += input[i] * input[i + delay];
    }
    c
}
```

For our sine wave, the auto-correlation plotted as a function of the time delay looks like this.

As you can see, there are peaks at regular intervals. If you compare the correlation with the signal, you’ll notice that the spikes in the correlation line up exactly with the completion of each repetition in the original signal. In other words, the spike happens at each period.

For our real signal, it looks like this:

I was over the moon when I saw how neat this graph looks, considering that it’s a real world signal. You still need to do some work to find the correct peak. Firstly, discard the first peak, centered at a time delay of zero. That isn’t really a repetition, just the signal unsurprisingly looking like itself before a time delay is introduced. It is a fair approach to just take the first maximum after that one, since that should correspond with the first time the signal is repeating itself. In this example, the first spike happens with a 0.00227 second delay, as we would have expected given our discussion earlier.
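That peak-selection step can be sketched in code too. This version uses a simplification of my own rather than literally taking the first maximum after the zero-delay peak: it only searches delays corresponding to a plausible range of pitches (roughly 100Hz to 2000Hz here, bounds I’ve picked to comfortably cover the trumpet) and takes the highest correlation in that range, which also sidesteps the zero-delay spike entirely.

```rust
// Estimates a pitch from auto-correlation output by finding the delay
// (in samples) with the highest correlation within a plausible range
// of periods, then converting that period to a frequency.
fn pitch_from_autocorrelation(correlations: &[f64], sample_rate: f64) -> Option<f64> {
    let min_delay = (sample_rate / 2000.0) as usize; // shortest period of interest
    let max_delay = (sample_rate / 100.0) as usize; // longest period of interest
    if max_delay >= correlations.len() {
        return None;
    }
    let best = (min_delay..=max_delay)
        .max_by(|&a, &b| correlations[a].partial_cmp(&correlations[b]).unwrap())?;
    // a delay of `best` samples is a period of best / sample_rate seconds
    Some(sample_rate / best as f64)
}
```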

Unlike the Fourier Transform, our resolution here depends only on the sampling frequency of our microphone, 44100Hz. This places each point in our time delay `1 / 44100Hz = 0.0000227s` apart. The significance of this error amount, when looked at as a frequency, will vary depending on where you are on the scale, but at 440Hz it puts your error in the region of about 5Hz on either side.
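The arithmetic behind that estimate can be sketched as follows: take the period of the note, perturb it by one sample, and see how far the corresponding frequency moves.

```rust
// How much a one-sample error in the measured delay shifts the
// resulting frequency. A single sample matters more for short periods
// (high pitches) than for long ones, so the error grows with pitch.
fn frequency_error_hz(freq: f64, sample_rate: f64) -> f64 {
    let period = 1.0 / freq;
    let sample_period = 1.0 / sample_rate;
    // the frequency we'd report if the period came out one sample short
    let high = 1.0 / (period - sample_period);
    high - freq
}
```

At 440Hz this works out to roughly 4-5Hz, but the same one-sample error at 880Hz is already worth several times that, so the precision of this approach degrades as you go up the scale.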

## Which Should I Use?

As I’ve shown in this post, the Fourier Transform and Auto-Correlation are both valid techniques for finding the frequency (and thus the musical pitch) of a signal. Computationally, they can both run reasonably quickly (they don’t call it the Fast Fourier Transform for nothing).

Personally, I find the Auto-Correlation approach easier to reason about. This is especially true when it comes to the trumpet signal, which has a more complicated Fourier Transform output. I also like how I can get better precision using Auto-Correlation without needing to use a longer recording. At this point, my codebase actually has both of these techniques implemented, and even though it’s using the Auto-Correlation to choose the current pitch, it prints out a graph of both the Fourier Transform and the Auto-Correlation to help me to debug any issues I might be having.

## Acknowledgements

Most of my knowledge on signal processing comes from the book Signal Processing and Linear Systems, written by B. P. Lathi and published in 2010 by Oxford University Press (ISBN 978-0-19-539257-9). I would heartily recommend this book to anyone who wants to take the techniques I’ve mentioned here further. Just a disclaimer: signal processing as a subject is inherently maths-heavy, so if you don’t already have at least a fair amount of calculus in your background, calculus may be a better place to start.

For looking up specific signal processing techniques, getting their formulas, and even animations visualizing the techniques, Wikipedia is an amazing resource. I’ve linked out to it inline a few times in this post already, but just to finish off I’d like to link to their articles on the Fourier Transform, and on Cross Correlation. If you’re worried that Wikipedia isn’t a great reference because it’s just people writing on the Internet, a blog probably isn’t much better.

The graphs in this post were generated using Gnu Octave. Octave is a free programming language, intended to be largely compatible with Matlab. It has a large library of functions that makes it very useful for prototyping signal processing problems like this one, including reading an audio file, computing the Fourier Transform, and computing the Auto-Correlation.

The sounds being analysed in this post were generated (in the case of the sine wave) and recorded (in the case of the trumpet) using Audacity.