Julia Wohl, RAD intern
I came across a clip from a September 1979 episode of All Things Considered featuring NPR reporter Steve Proffitt. He explores the mechanisms behind the computer-generated voice of the Speak & Spell, a device from Texas Instruments with synthetic speech capabilities that quizzes its user on their spelling ability. I thought Proffitt’s piece offered an opportunity to dissect how the TI team defined a “normal” voice, how the TI team’s choice resonates with news broadcasting pronunciation standards, and how these similar standards have a bearing on the present.
The team of TI engineers is regarded as the first to build speech synthesis into a small, affordable computational device. Within two years of the Speak & Spell's introduction, both Bell Labs and Intel released similar devices that used digital signal processing. These advancements paved the way for smartphones and smart speakers.
Gene Frantz, interviewed by Proffitt for the story, and his team of engineers found a news broadcaster's vocal tract to be a viable model for the Speak & Spell. Alice Helton, a linguist with whom the TI engineers worked closely, chose the American Heritage Dictionary of the English Language to govern the standard of pronunciation. Helton chose the voice, too. She recalled that they decided to use the Dallas-area radio announcer Mitch Carr, who reported for NPR and is currently a radio broadcaster for KRLD in Texas.
NPR journalists have continued to examine why and how the voices of news broadcasters and AI assistants alike reinforce ideas of how people in certain roles are supposed to sound.
In a 2018 Code Switch episode, hosts Shereen Marisol Meraji and Gene Demby draw the same conclusion as Frantz: the meaning of "normal" is subjective. Meraji and Demby offer important context on the first formal academic definition of the "normal" American dialect. They explain that in 1924, linguist John Kenyon of Hiram College surveyed the accents of the people around him and developed a set of pronunciation standards. Kenyon published his standards in dictionary form in 1944, and by 1951 the National Broadcasting Company (now NBC) had adopted his dictionary to guide its news broadcasters toward clarity in their communication. A regional standard for communication became a national one.
The multitude of dialects spoken, heard and understood across the nation complicates Kenyon's and NBC's standards for clear communication. Speakers and listeners, including machines that are programmed to detect a voice and emit one, perceive specific dialects as more normal and clearer than others for a number of reasons. Sometimes a person's accent or native language can impede communication, depending on who the speaker and listener are, and can even result in discrimination.
In the same 2018 Code Switch episode, Meraji and Demby interview an aspiring broadcast journalist from Baltimore named Deion Broxton. Listen to Broxton recall how he visited a speech therapist to adjust his accent, starting at 00:13:33.
In other instances, a listener’s perception depends on the gender of the speaker. The Speak & Spell’s male-sounding voice stands in contrast to today’s familiar chorus of female-sounding voices, which guide users on smartphones, smart speakers and public transit.
I called Frantz to learn more about why the engineers chose the voice of a man, and a news broadcaster in particular. He explained to me that higher frequency voices, those often associated with being female, went above a threshold the device could capture. Additionally, the higher the sampling frequency, the more storage the device requires. Confined by their budget, the engineers could not give the initial device enough storage capacity for a higher frequency voice. So they were forced to model a lower frequency voice, one generally associated with men, which took up less storage space on the device.
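The storage tradeoff Frantz describes follows from the Nyquist criterion: capturing frequency content up to a given ceiling requires sampling at twice that rate, so every extra kilohertz of vocal range doubles into extra samples that must be stored. The sketch below works through that arithmetic; the specific frequency and bit-depth numbers are illustrative assumptions, not TI's actual 1978 specifications.

```python
# Back-of-the-envelope sketch of the sampling-rate/storage tradeoff.
# By the Nyquist criterion, capturing frequencies up to f_max requires
# sampling at 2 * f_max, so a higher-pitched voice (more high-frequency
# content) costs more memory per second of recorded speech.

def raw_storage_bits(max_frequency_hz: float,
                     bits_per_sample: int,
                     seconds: float) -> float:
    """Bits of raw storage needed to capture content up to max_frequency_hz."""
    sample_rate = 2 * max_frequency_hz  # Nyquist rate, samples per second
    return sample_rate * bits_per_sample * seconds

# Illustrative numbers (assumed, not TI's real figures):
low_voice = raw_storage_bits(max_frequency_hz=4000, bits_per_sample=8, seconds=1)
high_voice = raw_storage_bits(max_frequency_hz=8000, bits_per_sample=8, seconds=1)

print(low_voice)   # 64000.0 bits for one second of the lower voice
print(high_voice)  # 128000.0 bits, double the memory for the same second
```

Even these figures vastly overstate what the Speak & Spell could afford to store per word, which is why TI compressed speech rather than storing raw samples; but the proportional point stands: doubling the frequency ceiling doubles the storage bill, so a lower-pitched voice was the cheaper engineering choice.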
As Scott Simon found in a 2011 radio essay, there are several other explanations for the decision to assign a female-sounding voice to a computer. Rebecca Zorach, director of the Social Media Project at the University of Chicago's Center for the Study of Gender and Sexuality, told CNN, "Most such decisions are probably the result of market research, so they may be reflecting gender stereotypes that already exist in the general public."
Robert LoCascio, a leader of the Equal AI initiative, offered an alternative explanation to NPR’s Laura Sydell in 2018. He told her, “The male-dominated AI industry brings its own unconscious bias to the decision of what gender to make a virtual assistant.”
Since 1979, computer-generated speech has transformed from the glitchy and grating to the welcoming and warm. The techniques for approximating human inflection and intonation with computers have advanced, and voice-assistive devices have proliferated. Despite these innovations, digital technologies have retained the vestiges of traditional gender roles and of a specific type of pronunciation.