In Part 1 of our discussion about Machine Voices we looked at sonic treatments to make a voice recording sound more like an automaton: re-recording, frequency manipulation, time effects, vocoding, speech synthesis, intentional misuse of tools, and layering.
Before we get into more treatments, it is worth noting that sonic effects alone are not the only factor in making human speech sound like an android voice. When a synthetic intelligence is the goal, script writing and voice acting can help give us robotic clues. For example, HAL 9000 from Kubrick’s 2001: A Space Odyssey speaks without emotion. It’s creepy that the computer has no feelings, a quality conveyed by the absence of the word stress and pitch variation that humans naturally use. The classic Robot B-9 from Lost in Space used a monotone delivery to let us know its “Non-Theorizing” status. Many androids have been portrayed by voice actors using a monotone delivery, though pitch variations have also been removed from regularly intoned speech using processing (for example, the Cylon Centurions from the late 1970s).
Pace and pause can be used to sound intentionally procedural. C3PO has quirky pauses and a steady pattern to his dialog that are part of the android speech presentation. Concatenated speech, the kind we hear when calling for the local time or via MovieFone, demonstrates how it sounds when a pre-recorded voice is presented piecemeal, organized by a computer for the listener based on current conditions. Interactive Voice Response is used by airlines, banks, and tech support so callers can get what they need with little or no “live human” time on the line. We can intentionally emulate these unnatural patterns before any treatment is applied.
Editing may also play a role. For example, the mechanical personality Max Headroom stuttered, like a series of bad edits, to let us know the voice was from a machine.
One of the sonic giveaways that we are listening to a machine is some kind of failure. Subtle to extreme distortion can help convince listeners that the electronics, transducers, and power supplies found in a machine are being used to reproduce the voice we are treating. Sometimes the best part of re-amping is pushing the system into distortion, a little or a lot. Of course this can be simulated with software, or by passing signal through a piece of gear like a guitar effects pedal. Sometimes massive distortion mixed back in subtly with the untreated signal helps prove the idea while maintaining intelligibility.
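To try this parallel-distortion idea in software, here is a minimal Python sketch. It assumes numpy and soundfile are installed and that voice.wav is a mono file; the drive amount and mix levels are placeholders to tune by ear.

```python
import numpy as np
import soundfile as sf

voice, sr = sf.read("voice.wav")  # assumes a mono file

# Heavy soft-clip distortion: tanh saturates the peaks; drive sets intensity.
drive = 20.0
distorted = np.tanh(drive * voice)

# Mix the mangled layer quietly under the clean voice to keep intelligibility.
mix = 0.85 * voice + 0.15 * distorted
mix /= max(1.0, float(np.max(np.abs(mix))))  # guard against output clipping

sf.write("voice_parallel_distortion.wav", mix, sr)
```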
Equipment failures don’t have to be limited to electronic or mechanical clip distortion. There are a wealth of opportunities to glitch a voice recording and make it sound less than human. Low-resolution digitization, such as 8-bit at 8 kHz, can give you some downright awful sounds. Hint: a lot of talking toys operate down in this range. You can downres using all kinds of different software, not just plugins. Or try hardware such as a guitar stomp box featuring bit crushing effects.
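Here is a hedged Python sketch of that lo-fi “talking toy” sound: quantize to 8-bit, then crudely hold samples to fake an 8 kHz rate. It assumes numpy and soundfile, a mono voice.wav, and that the crude sample-and-hold (with its aliasing) is the effect you want.

```python
import numpy as np
import soundfile as sf

voice, sr = sf.read("voice.wav")  # assumes a mono file

# Crush to 8-bit: quantize the waveform to 256 levels.
levels = 2 ** 8
crushed = np.round(voice * (levels / 2)) / (levels / 2)

# Crude downres toward 8 kHz by sample-and-hold; the aliasing is the point.
factor = max(1, sr // 8000)
crushed = np.repeat(crushed[::factor], factor)[: len(voice)]

sf.write("voice_8bit_8k.wav", crushed, sr)
```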
@stonevoiceovers suggested using the Pro Tools AIR Frequency Shifter. Early pitch processing such as the Eventide H910 Harmonizer could sound garbled and glitchy, especially when pushed to extremes. Most pitch processing still sounds pretty synthetic at excessive settings. One idea is to simply pitch something way up or down and then process again to return to normal pitch. The filters and shuffling will add some great artifacts. Of course you can keep the pitch changes, even variable pitch change, to make something both simulated and glitchy at the same time.
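A round trip like that might look like the following Python sketch, assuming librosa and soundfile are installed and voice.wav is mono; two octaves is an arbitrary, deliberately extreme choice.

```python
import librosa
import soundfile as sf

voice, sr = librosa.load("voice.wav", sr=None, mono=True)

up = librosa.effects.pitch_shift(voice, sr=sr, n_steps=24)   # two octaves up
back = librosa.effects.pitch_shift(up, sr=sr, n_steps=-24)   # back to pitch

# "back" lands at the original pitch, but the filtering and shuffling of two
# extreme shifts leave behind glitchy, synthetic artifacts.
sf.write("voice_pitch_roundtrip.wav", back, sr)
```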
You don’t have to pitch process to get chopped, stuttered sounds. There are plugins and hardware effects processors dedicated to these kinds of glitch effects. We can even get these kinds of sounds by abusing a simple tremolo or vibrato processor. And if you’re not afraid to experiment, try circuit bending an inexpensive consumer product that records and plays sound. Cheap toys can be especially fun to hack.
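For a DIY version of the chopped, stuttered sound, here is a small Python sketch that randomly repeats 50 ms grains, assuming numpy and soundfile and a mono voice.wav; the grain size and repeat odds are taste choices.

```python
import numpy as np
import soundfile as sf

rng = np.random.default_rng(42)   # seeded so the stutter is repeatable
voice, sr = sf.read("voice.wav")  # assumes a mono file

grain = int(0.05 * sr)            # 50 ms grains
out = []
for start in range(0, len(voice) - grain, grain):
    chunk = voice[start:start + grain]
    repeats = rng.choice([1, 1, 1, 2, 3])  # mostly normal, sometimes stuttered
    out.append(np.tile(chunk, int(repeats)))

sf.write("voice_stutter.wav", np.concatenate(out), sr)
```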
Some of my favorite plugins make processing easier by having several techniques readily available on a single display. For mediated voice futzes it’s hard to beat Speakerphone by Audio Ease. It has tons of great re-recording techniques simulated using convolution; both speaker and microphone emulations are included. Then you can pile on frequency manipulations with EQ, overdrive, room reflections, telecom codec simulations, and much more.
Whereas Speakerphone is a quick, accurate path for emulating the real world, FutzBox offers a palette of sound-shaping parameters to play around with and create your own flavor. Start with a speaker emulation, then select options to downres, filter, overdrive, even add a noise layer. The interface makes it fun to experiment with different combinations.
I’ve worked in several studios that had a rackmount Eventide H3000 Harmonizer. If you ever have the opportunity to play with this kind of quirky box full of crazy treatments, indulge your ears with synthetic weirdness. A significant number of robot voices have been created using Harmonizers over the years. Eventide makes plugins these days, which is a more convenient way for everyone to access multi-effect sounds.
Ben Burtt is a living legend. His creative use of sound tools inspires us. From classic ARP synthesizers to the Kyma, he blurred the lines between human voice and machine with iconic robots from R2-D2 to Wall-E. We don’t have to use someone else’s plugin; we can devise our own treatment paths too. Techniques like vocoding and speech synthesis demonstrate a confluence of artistic and technical thinking that asks us to create new processes. But powerful tools with no specific voice processing agenda require patience to wield well, and a steep learning curve may not mesh with an inspired moment. These are deep waters. Come mentally prepared.
Native Instruments makes powerful music-oriented software like Kontakt for sampling and Reaktor for synthesis. To write your own code, consider Pure Data and Max/MSP. For more tool recommendations see 10 Great Tools For Sound Designers and What’s The Deal With Procedural Game Audio, or search the web for new ideas in sonic tools, including audio-related discussion groups.
TIPS AND TRICKS
(1) Levels. Metallic resonances and other synthetic treatments can generate crazy level spikes. Watch your input and output levels so you don’t produce unwanted clipping.
(2) Dynamics. Just because your dialog was compressed and/or limited before you treated it doesn’t mean you can ignore dynamics after. Consider another pass through dynamics processing so your low-level sounds don’t get lost, and so the newly created signal peaks don’t prevent you from setting this dialog’s loudness on par with everything else.
(3) Diction. When you mangle the 2–5 kHz range, the intelligibility and presence of the dialog may become diminished. Our ear/brain system uses the 6–8 kHz range to distinguish S sounds from F sounds, with potential confusion for other sounds like TH, SH, or CH. Sometimes you just need EQ to enhance these areas. Other times you may need to blend in some unprocessed or less processed voice in these frequency ranges to recover the diction that was lost from treatment. If you lower resolution, remember that Nyquist showed the sampling rate must be at least twice the highest frequency we want to represent, meaning a downres to 8 kHz constrains the audible content to 4 kHz and below! (See the resampling sketch after these tips.)
(4) Distortion. If you choose overdrive as part of your processing chain, keep in mind that distortion is a form of dynamic range compression. If you’re unsure whether to use it or not, remember that distortion can be a substitute for, or complement to, any dynamics processing needed after applying other techniques. Even a little intentional clipping could help improve the signal chain, from a simple futz to a full-on sentient machine.
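To see the Nyquist limit from tip (3) in action, here is a small Python sketch using scipy’s resample_poly, which lowpasses at the new Nyquist frequency as it downsamples. It assumes scipy, numpy, and soundfile, and a mono voice.wav recorded at 48 kHz (an integer multiple of the 8 kHz target).

```python
import soundfile as sf
from scipy.signal import resample_poly

voice, sr = sf.read("voice.wav")  # assumes mono at 48 kHz
target = 8000

# Proper resampling lowpasses at the new Nyquist limit: target / 2 == 4000 Hz.
lofi = resample_poly(voice, up=1, down=sr // target)
print(f"new Nyquist limit: {target / 2} Hz")

sf.write("voice_8k.wav", lofi, target)
```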
If you’ve got any tips, tricks, or other suggestions please share. Have fun making Machine Voices!
A friend of mine was trying to make a voice sound like it was coming from a toy. He referenced the Extreme EQ article I wrote a while back. That conversation, combined with some recent projects I’ve been doing, inspired me to assemble more ideas about treating voice recordings for machine-like effect.
I first heard the term futz from some film mixing colleagues; it refers to changing a recording so that it sounds like it’s on the phone, an intercom, a megaphone, or some other mediated delivery of a voice. We can extend this idea to any kind of talking machine, whether it transmits a human voice or represents a sentient machine like HAL 9000, C3PO, or Optimus Prime.
Some of the earliest practical voice treatments were made by placing telephone speakers, megaphones, and the like inside a sound isolation box with a microphone. The interior of the box was often lined with sound absorptive material to help reduce audible reflections inside. Sound was fed into the emitter of choice and then recorded using the microphone. But you don’t have to build the box unless you plan to do this kind of re-recording on a regular basis; a dedicated, isolated rig is mainly worthwhile if you use it often.
I love re-amping, especially when I’m going for realism. You know, playing a sound file of someone speaking on my mobile phone sounds convincingly like that person talking on my phone! Sometimes old school re-recording is the better option: a quick and convincing method to get a machine-like voice treatment. And you don’t have to be a purist about it; you can add other forms of manipulation before and after.
Transducers found in machines often have a specific frequency response that we hear as machine-like. Using extreme roll-offs and obnoxious, narrow boosts can help simulate an android, toy, or other talking machine. Listen to these kinds of sounds in the cinema, on TV, and on real-life talking devices to help you decide when your EQ settings create a convincing treatment. For some specific ideas see: Extreme EQ.
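As a starting point for experimenting, here is a hedged Python sketch of the extreme roll-off plus narrow boost idea, assuming scipy, numpy, and soundfile, and a mono voice.wav; every frequency and gain here is a placeholder to tune by ear.

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, iirpeak, lfilter

voice, sr = sf.read("voice.wav")  # assumes a mono file

# Hard bandpass: roll off everything below 300 Hz and above 3.4 kHz.
b, a = butter(4, [300, 3400], btype="bandpass", fs=sr)
futzed = lfilter(b, a, voice)

# Obnoxious narrow boost: isolate a band near 1.8 kHz and mix it back in loud.
pb, pa = iirpeak(1800, Q=8, fs=sr)
futzed = futzed + 3.0 * lfilter(pb, pa, futzed)

futzed /= np.max(np.abs(futzed))  # those resonances spike levels; normalize
sf.write("voice_extreme_eq.wav", futzed, sr)
```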
Some of the earliest robot voices included the sound of the speaker inside the chassis of the machine. A really tight delay can simulate that kind of reflection. And you can create some great metallic resonances by cycling the result back into the delay again and again in a time-smeared feedback loop. Delays under 30 ms or so create comb filters, also known as “phasing.” If you slowly increase the delay time and then let it recover, you create flanging: comb filtering with variable notches and peaks. A closely related effect called chorusing also varies delay times back and forth for moving comb filters that can sound synthetic and hollow. @r0barr likes to use a ring modulator, which could be considered a frequency effect plus a time effect: multiplying the voice by a carrier oscillator produces inharmonic sum and difference frequencies that smear specific frequency ranges.
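Here is a bare-bones Python sketch of that time-smeared feedback loop: a recursive comb filter with a very short delay. It assumes numpy and soundfile and a mono voice.wav; the 6 ms delay and 0.7 feedback are starting points (shorter delays push the metallic resonance higher), and the plain Python loop favors clarity over speed.

```python
import numpy as np
import soundfile as sf

voice, sr = sf.read("voice.wav")  # assumes a mono file

delay = int(0.006 * sr)  # ~6 ms, well inside the sub-30 ms comb-filter zone
feedback = 0.7           # how much of the output cycles back into the delay

out = np.copy(voice)
for n in range(delay, len(out)):
    out[n] += feedback * out[n - delay]  # y[n] = x[n] + fb * y[n - delay]

out /= np.max(np.abs(out))  # feedback builds level fast; normalize the result
sf.write("voice_comb.wav", out, sr)
```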
All of these forms of delay can have a synthetic, manufactured kind of sound. If you’re going for a vintage machine, or something subtle, a simple time-based effect may be all you need. Or combine it with other manipulations to mash up old and new sonic characteristics.
Some of my favorite plugins for these kinds of legacy effects are made by Soundtoys: EchoBoy, PhaseMistress, and Crystallizer.
(See also: Phase)
Both @daviddas and @recordingreview mentioned the TAL Vocoder by name. Breaking speech into smaller components, vocoding was originally created to reduce the bandwidth needed to transmit a voice. It was also used to encrypt voice communications, including military use. By the 1960s, artists and technicians had collaborated to make several different models of a “talking synthesizer”, or put another way, a singing machine. Because the artistic use of vocoders was rooted in music, using one may benefit from knowledge of music. If you’re not a musician yourself, consider collaborating with your composer and musician friends.
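If you want to see how little a channel vocoder really needs, here is a crude Python sketch, assuming numpy, scipy, and soundfile and a mono voice.wav. The 8-band filter bank, the 110 Hz sawtooth carrier, and the 30 Hz envelope smoothing are deliberately simple assumptions, not any particular product’s design.

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, lfilter, sawtooth

voice, sr = sf.read("voice.wav")  # assumes a mono file
t = np.arange(len(voice)) / sr
carrier = sawtooth(2 * np.pi * 110 * t)  # droning 110 Hz machine tone

edges = np.geomspace(200, 5000, 9)                # 8 log-spaced bands
env_b, env_a = butter(2, 30, btype="low", fs=sr)  # envelope smoother

out = np.zeros_like(voice)
for lo, hi in zip(edges[:-1], edges[1:]):
    b, a = butter(2, [lo, hi], btype="bandpass", fs=sr)
    env = lfilter(env_b, env_a, np.abs(lfilter(b, a, voice)))  # band loudness
    out += env * lfilter(b, a, carrier)  # impose it on the same carrier band

out /= np.max(np.abs(out))
sf.write("voice_vocoded.wav", out, sr)
```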
There’s a fascinating history about vocoders on this wiki page if you’d like to read more.
@MikeHillier made a really obvious suggestion that I had completely overlooked: “get Appletalk to say it.” @chewedrock recommended “Atari speech synthesis – software automatic mouth.” Speech synthesis is a great option for a machine voice.
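On a Mac you can script this from Python with the built-in say command, a descendant of the classic Apple speech synthesis. A minimal sketch, assuming macOS (the voice name and output file are placeholders, and available voices vary by system):

```python
import subprocess

subprocess.run([
    "say",
    "-v", "Fred",              # a classic synthetic-sounding Apple voice
    "-o", "robot_line.aiff",   # render to a file instead of the speakers
    "Affirmative. All systems nominal.",
], check=True)
```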
YOU’RE DOING IT WRONG
Intentional misuse and reinvention can be incredibly fun. Some of our beloved music processors such as pitch correction can be applied to speaking dialog instead of music for some very tasty synthetic voices. @grhufnagl said, “I really love using Melodyne to control pitch & time, alongside its formant control.”
Convolution and noise reduction may have been intended to emulate the real world and to clean noise out of recordings, respectively. But we can choose to apply these in creative ways to generate interesting artifacts. Freeware and low-cost software tend to cut corners, making them more prone to audible errors that sound unnatural and weird. Almost any audio tool can be used in ways it wasn’t intended, to produce ear-catching flaws.
Good sound design often features the prominent sound that we notice, with layers of quieter elements adding color and flavor. This can work for machine voices too. We can use the voice signal as a key to open a gate on other sounds: static, digitization artifacts, droning guitars, and many, many more sonic clues that the voice is being mediated. Add samples, such as servos to move a mechanical mouth, or the hum of a power supply. These finishing touches are like highlights and shadows in visual art that add believable, three-dimensional characteristics.
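Here is a minimal Python sketch of that keyed-gate idea: an envelope follower on the voice opens a gate on a static layer, so the noise only sounds while the machine speaks. It assumes numpy, scipy, and soundfile and a mono voice.wav; the threshold and noise level are set by ear.

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, lfilter

voice, sr = sf.read("voice.wav")  # assumes a mono file

# Envelope follower: rectify, then smooth with a 20 Hz lowpass.
b, a = butter(2, 20, btype="low", fs=sr)
env = lfilter(b, a, np.abs(voice))

gate = (env > 0.02).astype(float)  # gate opens where the voice is present
noise = np.random.default_rng(7).uniform(-1, 1, len(voice))  # static layer

mix = voice + 0.1 * gate * noise   # tuck the keyed static under the voice
sf.write("voice_keyed_static.wav", mix / np.max(np.abs(mix)), sr)
```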
Next time: more treatments, more plugins, plus voice acting ideas and tips/tricks in Part 2
Here are some of the interesting things I saw on the exhibit floor.
(1) Triad Orbit was showing a very clever clamp with 5/8” threads for putting microphones in less traditional locations. The foam inside the clamp makes it safer to crank down on pretty fixtures, plus it adds gripping power to keep the clamp from sliding. It’s called the IO-C Mounting Clamp and I need several!
(2) I like to stop by the Latch Lake booth in case they are giving away their fabulous Jam Nuts, which they were. I used both of them on a recording gig immediately following the convention. Latch Lake introduced a burly new tripod mic stand with the same boom clutch as found on their weighted-base models. Want.
(3) I saw and heard the new Cliff Mics ribbon. It was impressive on a number of levels. The magnets were so massive and strong, I thought they were going to pull the hair off my face. Interestingly, the cover was made of mesh cloth rather than metal.
(4) On recommendation I took some time to check out Miktek. Apparently the late, great Oliver Archut of TAB Funkenwerks designed most of their microphones. I was especially interested to hear the figure-8 pattern of their multi-pattern mikes, with insanely good off-axis rejection and an even transition from on axis to off axis. Impressive.
Did you see something at AES that belongs on this list? Let me know, won’t you?
The key thing that separates linear media production from interactive is indeterminacy. Middleware helps us manage this difference.
Justification of middleware:
(1) Puts more audio control in the hands of audio people, and
(2) Simplifies work for coders.
FMOD Studio is now sample-based, not frame-based.
It was suggested during this discussion that Master Audio seems ideally suited for 2D and casual games. It supports all systems to which Unity can publish, including web. It has better documentation than Fabric.
UPDATE Jan 5, 2015:
I originally reported Master Audio as open source, but it is not. When you buy, you do get access to all of the source code. But game makers do not submit new code to Dark Tonic to update the product; rather, Dark Tonic takes responsibility for writing and publishing Master Audio, and gives game makers source access in case they want to add code for their own game. Brian Hunsaker of Dark Tonic clarified that Master Audio is used by AAA game studios, not merely 2D or indie developers. It is intended for any Unity-based product that does not require realtime audio parameter changes.
My notes from Game Audio 3: Sound Design and Mix: Challenges and Solutions – Games, Film, Advertisement at the 137th AES Convention 2014, Los Angeles.
Presented Oct 9 by a panel of industry veterans: Charles Deenen, John Fasal, Tim Gedemer, Csaba Wagner, and Bryan Watkins.
There were many comparisons between working in game audio and working in film sound.
The short timelines and high quality standards seem similar for both.
Game audio folks seem less set in their ways, and more collaborative with sound professionals, than people who make film trailers. This may be related to the veteran status of film trailer folks (typically 20+ years of experience) compared to game audio folks (typically under 20 years). One area of overlap: when people who are good at game trailers expand their career, there is a somewhat natural transition to film trailers.
Game audio source is typically pretty clean: studio recordings, ADR. Film source is typically noisy production audio that may need significant cleanup.
The expected playback system for a game trailer is a desktop or laptop computer, with limited frequency response, especially in the low end. Film trailers enjoy higher fidelity in movie theaters and home theaters. Also, volume measurement standards and best practices for loudness are different for theaters than for the internet.
Deenen mentioned that a unifying concept for his work in both film and game audio is trying to reduce source elements to cleaner, more fundamental sounds. Layers seem to combine better when they are stripped down and simplified.
Fasal showed a picture where a bike rack was attached to a car as a microphone mounting system. He also said that every car seems to need attention to where mikes are placed to get the best sounds… that there is no “tried and true” recipe. Though recording technology has changed in the last 30 years, the opportunity to listen, choose great sources, and carefully place microphones remains.
We tend to think interactive music started with video games like Frogger, but Mozart held public concerts where musicians would play a score arranged by audience participation: throwing dice (the “Musikalisches Würfelspiel”). What was true then is also true now: people enjoy influencing how music is played, or interacting with proxies that cause changes in music. Interactive music is often employed to reduce listener fatigue, because people tend to spend more time in a given play pattern than they would listening to the same music cue in linear media.
Horizontal Sequencing: using crossfades to switch between two different streams, or rearranging the order in which different musical “chunks” are presented.
Vertical Layering: Additive, where one, some or all layers can play simultaneously and everything still works; or Interchange, where some layers are mutually exclusive.
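A hedged Python sketch of both ideas, assuming numpy and soundfile; the loop file names, the 2-second crossfade, and the layer/gain scheme are all hypothetical placeholders:

```python
import numpy as np
import soundfile as sf

calm, sr = sf.read("calm_loop.wav")     # hypothetical music chunks,
battle, _ = sf.read("battle_loop.wav")  # mono, same sample rate

# Horizontal: equal-power crossfade from calm into battle over 2 seconds.
n = int(2.0 * sr)
fade = np.linspace(0, 1, n)
cross = np.cos(fade * np.pi / 2) * calm[-n:] + np.sin(fade * np.pi / 2) * battle[:n]
horizontal = np.concatenate([calm[:-n], cross, battle[n:]])
sf.write("horizontal_demo.wav", horizontal, sr)

# Vertical (additive): layers share one timeline; intensity sets their gains.
def vertical_mix(layers, intensity):    # layers: equal-length arrays; 0..1
    gains = [1.0, intensity, max(0.0, intensity - 0.5) * 2]  # drums enter last
    return sum(g * layer for g, layer in zip(gains, layers))
# e.g. vertical_mix([pads, strings, drums], intensity=0.8)
```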
Music Data: Individual notes/samples are available to play and a separate instruction stream (a la player piano roll) sequences how to play them. Examples include MIDI and MOD.
Generative Music, also known as Algorithmic Composition
Some random element is introduced to support indeterminacy, like rolling dice. Rules govern how likely different musical events are to happen. Wind chimes are a kind of generative music system.
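In that spirit, here is a tiny Python sketch of a dice-driven, rule-bounded system, somewhere between the Würfelspiel and wind chimes. The measure and chime file names are hypothetical, and the rules (two dice choose among eleven pre-composed measures, chimes fire 20% of the time) are arbitrary examples.

```python
import random

measures = [f"measure_{i}.wav" for i in range(1, 12)]  # 11 pre-composed bars

plan = []
for bar in range(16):
    roll = random.randint(1, 6) + random.randint(1, 6)  # two dice: 2..12
    plan.append(measures[roll - 2])                     # dice pick the bar
    if random.random() < 0.2:                           # 20% chance per bar
        plan.append("chime_accent.wav")                 # wind-chime event

print(plan)  # a new piece every run; the rules bound the randomness
```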
As powerful and compelling as these interactive music forms are, linear music continues to play an important role, often being the best solution for a given situation in a video game.