
Why ATC radio is so hard for AI to transcribe

Fly Overhead Team · June 20, 2026 · 7 min read

Transcribing air traffic control radio sounds like it should be a solved problem. It is just people talking, and speech-to-text reads our voicemails and dictates our texts well enough. Point a good model at a tower frequency, though, and the output falls apart. ATC radio is one of the hardest audio domains there is. Here is why, from the work behind Fly Overhead's Live ATC feature.

The audio was never built for machines

Controllers and pilots talk over VHF AM radio, a technology that has barely changed since the 1940s. The signal is narrowband, heavily compressed, and stripped of most of the frequency range that helps a listener tell similar sounds apart. Add the squelch tail at the end of each transmission, the static on a weak signal from an aircraft 60 miles out, and the engine and slipstream noise in the cockpit behind the pilot's microphone, and you have audio that a person can follow only because they already know what to expect.

Then there is the problem that has no fix on the receiving end: stepped-on transmissions. AM radio is half duplex and the frequency is a single shared channel. When two stations key their microphones at the same moment, the carriers beat against each other and produce a squeal, and both transmissions are lost. A human controller hears the clip and says “say again,” because they know a readback was expected. A transcription model just produces confident nonsense for that second and a half. The party-line nature of the frequency also means you are not hearing a clean two-way dialogue. You hear the controller plus whichever aircraft happen to be in range, interleaved with no speaker labels.

It is not really English

Standard phraseology looks like English and is built from English words, but its grammar, vocabulary, and number system are their own thing. A general model trained on audiobooks, podcasts, and call-center recordings has never heard this distribution and guesses badly at it.

Numbers are the clearest example. A frequency is read digit by digit, so 121.9 becomes “one two one point niner,” never “a hundred twenty-one nine.” A heading is three digits. An altitude has its own pattern, where 11,000 feet is “one one thousand.” A squawk code is four octal digits. The same digit is even pronounced differently on purpose, because “niner,” “tree,” and “fife” exist precisely so that nine, three, and five do not get confused over a noisy channel. A model that rewrites “niner” as an ordinary word has thrown away the very thing that disambiguates it.
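To make the digit-by-digit idea concrete, here is a minimal sketch in Python of a parser that accepts the radio pronunciations alongside the ordinary ones, plus two shape checks. The function names are ours, invented for illustration, and the vocabulary is deliberately small: it does not handle altitude phrasing such as “one one thousand.”

```python
from typing import Optional

# Spoken-digit vocabulary, including the deliberate radio pronunciations.
SPOKEN_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "tree": "3",
    "four": "4", "fower": "4", "five": "5", "fife": "5", "six": "6",
    "seven": "7", "eight": "8", "nine": "9", "niner": "9",
}
SEPARATORS = {"point": ".", "decimal": "."}


def spoken_to_digits(phrase: str) -> Optional[str]:
    """Convert a digit-by-digit phrase to a numeric string.

    Returns None if any token is outside the vocabulary, so an
    unexpected word is surfaced instead of silently guessed at.
    """
    out = []
    for token in phrase.lower().split():
        if token in SPOKEN_DIGITS:
            out.append(SPOKEN_DIGITS[token])
        elif token in SEPARATORS:
            out.append(SEPARATORS[token])
        else:
            return None
    return "".join(out) or None


def is_valid_squawk(code: str) -> bool:
    """A transponder code is exactly four octal digits (0 through 7)."""
    return len(code) == 4 and all(c in "01234567" for c in code)


def is_valid_heading(value: str) -> bool:
    """A heading is three digits. Assumption: 001 through 360, with north as 360."""
    return (
        len(value) == 3
        and all(c in "0123456789" for c in value)
        and 1 <= int(value) <= 360
    )


print(spoken_to_digits("one two one point niner"))  # 121.9
print(is_valid_squawk("1289"))                      # False: 8 and 9 are not octal
```

Returning None for an unknown token is a design choice for this sketch: a phrase that does not fit the expected shape is more useful as a flagged failure than as a number that merely looks right.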

The grammar is clipped, the phonetic alphabet runs through every exchange, and readbacks repeat instructions verbatim, so the same string of numbers shows up twice in a row from two different voices. None of that matches the statistics of ordinary speech.

Callsigns are the hardest part of all

A callsign is the one token you most want correct, and it is the one the audio fights you on hardest. Airline callsigns use a telephony name that often has nothing to do with the company on the tail. Republic Airways is “Brickyard” on the radio. British Airways is “Speedbird.” A model that has never been told this maps the sounds to the nearest common words and gives you something that looks plausible and is wrong.

General aviation is worse, because the callsign is the tail number, spoken as a mix of phonetic letters and individual digits, and then it changes partway through the conversation. After the controller first abbreviates a tail number, both sides drop to the last three characters for the rest of the exchange. To follow it, the transcriber has to remember that the “Skyhawk three four five” talking now is the same aircraft that checked in two minutes ago as “Cessna one two three four five.” That is a tracking problem on top of a recognition problem, and a plain speech model has no memory of the conversation to lean on.
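The tracking half of that problem can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a description of any production system: the class name is hypothetical, tail numbers are assumed to arrive already decoded into text, and an abbreviation is resolved only when exactly one aircraft heard so far ends in those characters.

```python
from typing import List, Optional


class TailNumberTracker:
    """Remember full tail numbers and resolve later abbreviations to them."""

    def __init__(self) -> None:
        self._seen: List[str] = []  # full tail numbers, in order of first check-in

    def check_in(self, tail: str) -> None:
        """Record a full tail number the first time it is heard."""
        tail = tail.upper()
        if tail not in self._seen:
            self._seen.append(tail)

    def resolve(self, abbreviated: str) -> Optional[str]:
        """Return the full tail number for an abbreviation such as '345'.

        Returns None when no aircraft matches, or when more than one
        does, because an ambiguous match should not be guessed.
        """
        suffix = abbreviated.upper()
        matches = [t for t in self._seen if t.endswith(suffix)]
        return matches[0] if len(matches) == 1 else None


tracker = TailNumberTracker()
tracker.check_in("N12345")
print(tracker.resolve("345"))  # N12345
```

A real tracker would also need to expire aircraft that have left the frequency, which this sketch leaves out.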

Everyone is fast, and not everyone sounds the same

Listen to a busy approach controller at a major hub and the words per minute are well past normal conversation. Pilots compress too, because keeping a transmission short is good radio discipline. On top of the speed, the speaker population is global. A single en-route frequency might carry a controller with one regional accent, a domestic crew with another, and an international crew speaking English as a third or fourth language. Each of those is a distribution shift that a model handles worse the further it gets from the clean American English it was mostly trained on.

The feed itself is often the problem

A lot of publicly available ATC audio is not a single clean frequency. To save bandwidth, many feeds merge several frequencies onto one stream, so tower, ground, and approach are stacked together. The moment two of those have simultaneous traffic, you are back to overlapping speakers, except now it happens constantly by design rather than occasionally by accident. We found this out the hard way: the same model that produced a usable transcript on a single-frequency tower mount produced garbage on a merged scanner mount, not because the model got worse but because the input became impossible. Picking single-frequency sources is one of the highest-leverage decisions in the whole pipeline, and it happens before a single sample reaches the model.

What actually moves the needle

None of this means the problem is hopeless. It means the gains come from narrowing the problem, not from throwing a bigger general model at it.

Start with clean input. Single-frequency sources, sensible squelch, and dropping segments that are obviously two stations at once beat any amount of post-processing on bad audio.
Bias toward the vocabulary that is actually possible. The set of airports, navaids, fixes, and airlines in a given area is small and knowable. Constraining the model toward those terms, instead of the whole English language, removes most of the plausible-but-wrong guesses.
Use the traffic picture. When you already know from ADS-B which aircraft are within range of a frequency, you have a short list of the callsigns that can legitimately appear. Matching a noisy callsign against that list is far more reliable than recognizing it cold.
Respect the number formats. A heading is three digits between 001 and 360, with north spoken as 360. A squawk is four octal digits. An altimeter setting sits in a narrow range. Light post-processing that knows these shapes fixes a whole class of errors a general model makes.
Keep and show confidence. A transcript that flags its own uncertain spans is honest. One that renders every guess in the same confident type invites a reader to trust a word the model was barely sure of. For anything safety-adjacent, the second kind is worse than no transcript at all.
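
As a rough illustration of matching against the traffic picture and keeping the confidence, the Python sketch below scores a noisy transcript span against the spoken forms of the callsigns known to be in range. Everything here is an assumption made for the example: the function names, the two-entry telephony table, the 0.8 threshold, and the digit-by-digit rendering of flight numbers (real traffic often uses grouped forms). It uses the standard library's difflib for string similarity, which is a stand-in rather than a claim about how any particular product does it.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Tiny illustrative table: ICAO airline designator -> telephony name.
TELEPHONY = {"RPA": "brickyard", "BAW": "speedbird"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "niner"]


def spoken_form(callsign: str) -> Optional[str]:
    """Render e.g. 'RPA4512' as 'brickyard four five one two'."""
    prefix, number = callsign[:3].upper(), callsign[3:]
    if prefix not in TELEPHONY or not (number.isascii() and number.isdigit()):
        return None
    words = [TELEPHONY[prefix]] + [DIGIT_WORDS[int(d)] for d in number]
    return " ".join(words)


def match_callsign(
    heard: str, in_range: List[str], threshold: float = 0.8
) -> Tuple[Optional[str], float]:
    """Return the best-matching in-range callsign and its similarity score.

    The callsign is None when nothing clears the threshold, so the
    caller can mark the span as uncertain instead of printing a guess.
    """
    best, best_score = None, 0.0
    for callsign in in_range:
        spoken = spoken_form(callsign)
        if spoken is None:
            continue  # no telephony name known for this aircraft
        score = SequenceMatcher(None, heard.lower(), spoken).ratio()
        if score > best_score:
            best, best_score = callsign, score
    if best_score < threshold:
        return None, best_score
    return best, best_score


callsign, score = match_callsign(
    "brick yard four five one too", ["RPA4512", "BAW212"]
)
print(callsign)  # RPA4512
```

The score is returned alongside the callsign on purpose, so a display layer can render a low-scoring match differently from a high-scoring one.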

Where Fly Overhead fits

Live ATC in the Fly Overhead Electronic Flight Bag pairs the transcript with the live traffic on the map, so the words have context: the callsign on the frequency lines up with the target you can see. We lean on single-frequency sources, we treat the transcript as advisory rather than a record, and we are honest that it will get things wrong, especially callsigns and numbers in heavy traffic. It is a tool for situational awareness and for following along, not a substitute for listening.

The live traffic map is free to use on its own. Live ATC transcription is part of the EFB. Either way, the radio is still the radio, and your ears are still the primary instrument.

