How the world’s biggest companies got millions of people to let temps analyze some very sensitive recordings.
Ruthy Hope Slatis couldn’t believe what she was hearing. She’d been hired by a temp agency outside Boston for a vague job: transcribing audio files for Amazon.com Inc. For $12 an hour, she and her fellow contractors, or “data associates,” listened to snippets of random conversations and jotted down every word on their laptops. Amazon would only say the work was critical to a top-secret speech-recognition product. The clips included recordings of intimate moments inside people’s homes.
This was in fall 2014, right around the time Amazon unveiled the Echo speaker featuring Alexa, its voice-activated virtual-assistant software. Amazon pitched Alexa as a miracle of artificial intelligence in its first Echo ad, in which a family asked for and received news updates, answers to trivia questions, and help with the kids’ homework. But Slatis soon began to grasp the extent to which humans were behind the robotic magic she saw in the commercial. “Oh my God, that’s what I’m working on,” she remembers thinking. Amazon was capturing every voice command in the cloud and relying on data associates like her to train the system. At first, Slatis figured she’d been listening to paid testers who’d volunteered their vocal patterns in exchange for a few bucks. She soon realized that couldn’t be the case.
The recordings she and her co-workers were listening to were often intense, awkward, or intensely awkward. Lonely-sounding people confessing intimate secrets and fears: a boy expressing a desire to rape; men hitting on Alexa like a crude version of Joaquin Phoenix in Her. And as the transcription program grew along with Alexa’s popularity, so did the private information revealed in the recordings. Other contractors recall hearing kids share their home address and phone number, a man trying to order sex toys, a dinner party guest wondering aloud whether Amazon was snooping on them at that very instant. “There’s no frickin’ way they knew they were being listened to,” Slatis says. “These people didn’t agree to this.” She quit in 2016.
In the five years since Slatis first felt her skin crawl, a quarter of Americans have bought “smart speaker” devices such as the Echo, Google Home, and Apple HomePod. (A relative few have even bought Facebook’s Portal, an adjacent smart video screen.) Amazon is winning the sales battle so far, reporting that more than 100 million Alexa devices have been purchased. But now a war is playing out between the world’s biggest companies to weave Alexa, Apple’s Siri, Alphabet’s Google Assistant, Microsoft’s Cortana, and Facebook’s equivalent service much deeper into people’s lives. Mics are built into phones, smartwatches, TVs, fridges, SUVs, and everything in between. Consulting firm Juniper Research Ltd. estimates that by 2023 the global annual market for smart speakers will reach $11 billion, and there will be about 7.4 billion voice-controlled devices in the wild. That’s about one for every person on Earth.
The question is, then what? These machines are not creating audio files of your every decibel—tech companies say their smart speakers record audio only when users activate them—but they are introducing always-on mics to kitchens and bedrooms, which could inadvertently capture sounds users never intended to share. “Having microphones that listen all the time is concerning. We’ve found that users of these devices close their eyes and trust that companies are not going to do anything bad with their recorded data,” says Florian Schaub, a University of Michigan professor who studies human behavior around voice-command software. “There’s this creeping erosion of privacy that just keeps going and going. People don’t know how to protect themselves.”
Amazon declined interview requests for this story. In an emailed statement, a spokeswoman wrote, “Privacy is foundational to how every team and employee designs and develops Alexa features and Echo devices. All Alexa employees are trained on customer data handling as part of our security training.” The company and its competitors have said computers perform the vast majority of voice requests without human review.
Yet so-called smart devices inarguably depend on thousands of low-paid humans who annotate sound snippets so tech companies can upgrade their electronic ears; our faintest whispers have become one of their most valuable datasets. Earlier this year, Bloomberg News was first to report on the scope of the technology industry’s use of humans to review audio collected from their users without disclosures, including at Apple, Amazon, and Facebook. Few of the executives and engineers who spoke with Bloomberg Businessweek for this story say they anticipated that setting up vast networks of human listeners would be problematic or intrusive. To them, it was and is simply an obvious way to improve their products.
Current and former contractors such as Slatis make clear that the downsides of pervasive audio surveillance were obvious to those with much less financial upside at stake. “It never felt right,” says a voice transcriber for an Alexa rival who, like most of the contractors, signed a nondisclosure agreement and spoke on condition of anonymity for fear of reprisals. “What are they really selling to customers?”
Nerds have imagined voice commands to be the future of computing for more than a half-century. (Thank Star Trek.) But for most of that time, teaching machines to identify and respond to spoken sentences required matching audio files verbatim to transcribed text, a slow and expensive process. Early pioneers bought or built massive libraries of recordings—people reading newspapers or other prewritten material into mics. The Sisyphean nature of the projects eventually became an industry joke. In the 1990s, a former product manager on the speech team at Apple Inc. recalls, the company offered each volunteer willing to record voice patterns at its labs a T-shirt emblazoned with the phrase “I Helped Apple Wreck a Nice Beach,” a computer’s garble of “recognize speech.”
Apple, which declined to comment for this story, became the first major company to flip the model in 2011, when it shipped the iPhone 4S with Siri, acquired the year before from a Pentagon-funded research spinoff. No longer did recordings have to be scripted and amassed in labs. Apple sold more than 4 million 4S phones within days, and soon began piling up an incalculable mountain of free, natural voice data. For the first few years, the company largely trusted outside speech-software specialists to use the data to improve Siri’s abilities, but Apple retook control around 2014. “The work was very tedious: After listening for 15 or 30 minutes, you’d get headaches,” Tao Ma, a former senior Siri speech scientist, says of transcribing user recordings. The in-house team farmed out much of this work to IT contractors in Europe, including Ireland-based GlobeTech.
Over the past few years, Apple has grown more aggressive in its harvesting and analysis of people’s voices, worried that Siri’s comprehension and speed were falling behind those of Alexa and Google Assistant. Apple treated Siri’s development like a verbal search engine that it had to prep to fulfill endless user queries and ramped up its dependence on audio analysis to feed the assistant’s lexicon. Temps were expected to account for the clips’ various languages, dialects, and cultural idiosyncrasies.
Former contractors describe the system as something out of the Tower of Babel or George Orwell’s 1984. At a GlobeTech office near an airport in Cork, Ireland, some say, they sat in silence at MacBooks wearing headphones, tasked with transcribing 1,300 clips a day, each of which could be a single sentence or an entire conversation. (This quota was reduced from as many as 2,500 clips, others say, to improve accuracy rates.) When a contractor clicked play on a voice recording, the computer filled a text box with the words it thought Siri “heard,” then prompted the worker to approve or correct the transcription and move on. GlobeTech didn’t respond to requests for comment.
A program the workers used, called CrowdCollect, included buttons to skip recordings for a variety of reasons—accidental trigger, missing audio, wrong language—but contractors say there was no specific mechanism to report or delete offensive or inappropriate audio, such as drunk-sounding users slurring demands into the mics or people dictating sexts. Contractors who asked managers whether they could skip overly private clips were told no clips were too private. They were expected to transcribe anything that came in. Contractors often lasted only a couple of months, and training on privacy issues was minimal. One former contractor who had no qualms about the work says listening in on real-world users was “absolutely hilarious.”
In 2015, the same year Apple Chief Executive Officer Tim Cook called privacy a “fundamental human right,” Apple’s machines were processing more than a billion requests a week. By then, users could turn on a feature so they no longer had to push a button on the iPhone to activate the voice assistant; it was always listening. Deep in its user agreement legalese, Apple said voice data might be recorded and analyzed to improve Siri, but nowhere did it mention that fellow humans might listen. “I felt extremely uncomfortable overhearing people,” says one of the former contractors, especially given how often the recordings were of children.
Ten former Apple executives in the Siri division say they didn’t and still don’t see this system as a violation of privacy. These former executives say recordings were disassociated from Apple user IDs, and they assumed users understood the company was processing their audio clips, so what did it matter if humans helped with the processing? “We felt emotionally safe, that this was the right thing to do,” says John Burkey, who worked in Siri’s advanced development group until 2016. “It wasn’t spying. It was, ‘This [Siri request] doesn’t work. Let’s fix it.’ It’s the same as when an app crashes and asks if you want to send the report to Apple. This is just a voice bug.”
The difference between this system and a bug on a MacBook, of course, is that MacOS clearly asks users if they’d like to submit a report directly after a program crashes. It’s an opt-in prompt for each malfunction, as opposed to Siri’s blanket consent. Current and former contractors say most Siri requests are banal—“play a Justin Bieber song,” “where’s the nearest McDonald’s”—but they also recall hearing extremely graphic messages and lengthy racist or homophobic rants. A former data analyst who worked on Siri transcriptions for several years says workers in Cork swapped horror stories during smoke breaks. A current analyst, asked to recount the most outrageous clip to come through CrowdCollect, says it was akin to a scene from Fifty Shades of Grey.
Apple has said less than 0.2% of Siri requests undergo human analysis, and former managers dismiss the contractors’ accounts as overemphases on mere rounding errors. “ ‘Oh, I heard someone having sex’ or whatever. You also hear people farting and sneezing—there’s all kinds of noise out there when you turn a microphone on,” says Tom Gruber, a Siri co-founder who led its advanced development group through 2018. “It’s not like the machine has an intention to record people making certain kinds of sounds. It’s like a statistical fluke.”
By 2019, after Apple brought Siri to products such as its wireless headphones and HomePod speaker, it was processing 15 billion voice commands a month; 0.2% of 15 billion is still 30 million potential flukes a month, or 360 million a year. The risks of inadvertent recording grew along with the use cases, says Mike Bastian, a former principal research scientist on the Siri team who left Apple earlier this year. He cites the Apple Watch’s “raise to speak” feature, which automatically activates Siri when it detects a wearer’s wrist being lifted, as especially dicey. “There was a high false positive rate,” he says.