Audio adversarial examples. Driven by deep neural networks (DNNs), speech recognition (SR) techniques are advancing rapidly and are widely used in practice. However, recent studies have revealed a crucial problem: given any audio clip I with transcript T (e.g., “this is for you” as shown in the figure below), an attacker can add a carefully chosen small perturbation sound δ (imperceptible to people) so that the resulting audio I+δ is recognized as some other targeted transcript T' (≠ T, e.g., “power off”) by a speech recognition system. This composed audio I+δ is known as an audio adversarial example.
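To make this concrete, the sketch below shows how such a δ is typically found in a white-box setting: a gradient-based optimizer minimizes the CTC loss of the SR model's output on I+δ with respect to the target transcript T', while keeping δ small. The `ToySRModel`, the alphabet indices, and all hyperparameters are placeholders for illustration only, not the actual DeepSpeech model or the exact optimization used by Metamorph.

```python
# Minimal sketch of targeted audio adversarial-example generation, assuming a
# white-box SR model that exposes per-frame character log-probabilities and a
# CTC loss. ToySRModel is a stand-in, NOT DeepSpeech.
import torch
import torch.nn as nn

class ToySRModel(nn.Module):
    """Placeholder SR model: raw audio -> (frames, batch, vocab) log-probs."""
    def __init__(self, vocab_size=29, hop=320):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=hop, stride=hop)
        self.head = nn.Linear(64, vocab_size)

    def forward(self, audio):                        # audio: (batch, samples)
        h = torch.relu(self.conv(audio.unsqueeze(1)))    # (batch, 64, frames)
        logits = self.head(h.transpose(1, 2))            # (batch, frames, vocab)
        return logits.log_softmax(-1).transpose(0, 1)    # (frames, batch, vocab)

def generate_delta(model, audio, target_ids, steps=500, lr=1e-3, eps=0.02):
    """Find a small delta so that model(audio + delta) decodes to the target."""
    delta = torch.zeros_like(audio, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    ctc = nn.CTCLoss(blank=0)
    target = torch.tensor([target_ids])
    target_len = torch.tensor([len(target_ids)])
    for _ in range(steps):
        log_probs = model(torch.clamp(audio + delta, -1.0, 1.0))
        input_len = torch.tensor([log_probs.shape[0]])
        loss = ctc(log_probs, target, input_len, target_len)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Keep the perturbation small (imperceptible) via an L-infinity bound.
        with torch.no_grad():
            delta.clamp_(-eps, eps)
    return delta.detach()

model = ToySRModel()
audio = torch.randn(1, 16000) * 0.1     # 1 s of dummy audio at 16 kHz
target_ids = [16, 15, 23, 5, 18]        # e.g., "power" as a=1..z=26 indices
delta = generate_delta(model, audio, target_ids)
```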
Over-the-air attacks. However, it is non-trivial to launch this attack against the neural network of a speech recognition (SR) system at the receiver side after over-the-air transmission, because the effective audio signal received by the SR system is H(I+δ) rather than I+δ, where H(⋅) represents the signal distortion introduced by the acoustic channel as well as by the device hardware (speaker and microphone). Due to H(⋅), the effective adversarial example may no longer lead to T'. Therefore, one open question is whether we can find a generic and robust δ that survives at any location in space, even when the attacker has no chance to measure H(⋅) in advance.
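To first order, H(⋅) can be approximated as a linear filter (convolution with a channel impulse response) plus ambient noise, as in the sketch below; the impulse response and noise level here are synthetic stand-ins rather than measured values.

```python
# Minimal sketch of why the received signal is H(I + delta) rather than
# I + delta: the air channel and the speaker/microphone hardware act roughly
# as a linear filter, and ambient noise is added on top. The CIR below is
# synthetic for illustration; in practice it would be measured.
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
adversarial = np.random.randn(fs).astype(np.float32) * 0.1   # I + delta (dummy)

# Toy impulse response: a direct path plus two attenuated, delayed echoes.
cir = np.zeros(512, dtype=np.float32)
cir[0], cir[120], cir[300] = 1.0, 0.4, 0.2

received = fftconvolve(adversarial, cir, mode="full")[:len(adversarial)]
received += np.random.normal(0.0, 0.005, size=received.shape)  # ambient noise

# `received` approximates H(I + delta); it is this signal, not I + delta,
# that the victim's speech recognizer actually decodes.
```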
Understanding over-the-air audio transmissions. When an attacker launches an over-the-air attack, the audio first passes through the transmitter's loudspeaker, then propagates over the air channel, and finally arrives at the victim's microphone. Overall, the adversarial audio is affected by three factors: device distortion, the channel effect, and ambient noise. We first experiment in an acoustic anechoic chamber (which avoids multi-path, as shown in figure (a) below) and find that, because devices are optimized for human hearing, the hardware distortion on the audio signal shares many common features in the frequency domain across devices (figure (b) below) and already undermines the over-the-air adversarial attack. In practice, the problem is even more challenging because the channel frequency selectivity is further superimposed, and it can become stronger and highly unpredictable as the distance increases (figures (c-d) below).
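As an illustration of how the frequency-domain distortion in figures (b)-(d) can be inspected, the sketch below estimates an aggregate frequency response from a played/recorded signal pair; the signals here are synthetic placeholders, and a real measurement would play a known wideband probe through the actual speaker and microphone.

```python
# Minimal sketch of inspecting aggregate (device + channel) frequency
# selectivity: compare the spectrum of the recorded signal with that of the
# played signal. The played/recorded pair below is synthetic for illustration.
import numpy as np

def frequency_selectivity(played, recorded, fs, nfft=4096):
    """Estimate the magnitude response (in dB) as |recorded(f)| / |played(f)|."""
    n = min(len(played), len(recorded))
    P = np.abs(np.fft.rfft(played[:n], nfft)) + 1e-9
    R = np.abs(np.fft.rfft(recorded[:n], nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, 20 * np.log10(R / P)

fs = 16000
played = np.random.randn(fs)                              # wideband probe (dummy)
recorded = 0.8 * played + 0.05 * np.random.randn(fs)      # stand-in recording
freqs, resp_db = frequency_selectivity(played, recorded, fs)
```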
“Generate-and-Clean” two-phase design. It is difficult to separate these two sources of frequency selectivity and compensate for them precisely. However, because the multi-path effect varies over distance while the hardware distortion shares similar features across devices, the above understanding suggests that, at least within a reasonable distance before the channel frequency selectivity dominates and causes H(⋅) to become highly unpredictable, we can focus on extracting the aggregate distortion effect from both the device and the channel. Once this core impact is captured, we can factor it into the generation of the audio adversarial example. Therefore, we propose a “generate-and-clean” two-phase design.
In phase one, we utilize a small set of public channel impulse response (CIR) measurements, collected in different environments with different devices, as a prior H(⋅) dataset to generate an initial δ that captures the major impact of the frequency selectivity (from both device and channel) in these measurements. This first phase achieves initial success for the over-the-air attack over short links, e.g., 1 m, but the resulting δ inevitably preserves some measurement-specific features, which still limit the attack performance. Therefore, in the second phase, we further leverage domain adaptation algorithms to clean δ: we compensate for the common device-specific features and minimize the unpredictable environment-dependent features in these CIR measurements, which further improves the attack distance and reliability. Finally, we also propose mechanisms to improve the audio quality of the adversarial examples generated by Metamorph. A sketch of the phase-one idea is given below.
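The sketch below averages the attack loss over a bank of prior CIR measurements so that δ captures their aggregate frequency selectivity. The names `model`, `ctc_attack_loss`, and `cir_bank` are hypothetical placeholders, and the phase-two domain-adaptation cleaning step is not shown.

```python
# Minimal sketch of the phase-one idea: optimize delta so the attack succeeds
# on average across a prior set of measured CIRs, i.e., across simulated
# H(I + delta) signals, rather than only on the clean I + delta.
import torch
import torch.nn.functional as F

def apply_cir(audio, cir):
    """Convolve a (1, samples) audio tensor with a measured 1-D CIR tensor."""
    kernel = cir.flip(0).view(1, 1, -1)              # flip: true convolution
    padded = F.pad(audio.unsqueeze(1), (cir.numel() - 1, 0))
    return F.conv1d(padded, kernel).squeeze(1)       # same length as input

def generate_robust_delta(model, ctc_attack_loss, audio, target_ids,
                          cir_bank, steps=500, lr=1e-3, eps=0.02):
    """ctc_attack_loss(model, distorted_audio, target_ids) -> scalar tensor."""
    delta = torch.zeros_like(audio, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for cir in cir_bank:                          # prior CIR measurements
            distorted = apply_cir(audio + delta, cir) # simulate H(I + delta)
            loss = loss + ctc_attack_loss(model, distorted, target_ids)
        loss = loss / len(cir_bank)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                   # keep delta imperceptible
    return delta.detach()
```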
Metamorph provides two versions of adversarial examples, named Meta-Enha (which prioritizes attack reliability) and Meta-Qual (which prioritizes audio quality).
Audio demos: Classical music, Pop music, Rock music, Rap music.
Audio demos: Example 1, Example 2, Example 3, Example 4.
| Schemes | Target Model | Attack Setting | Over-the-Air | Attack Distance | Success Rate | Audio Quality (MCD*) |
| --- | --- | --- | --- | --- | --- | --- |
| Black-box Attacks [1, 2] | DeepSpeech | Black-box | No | - | - | - |
| Qin et al. | Lingvo | White-box | No | Simulated | - | - |
| Carlini et al. | DeepSpeech | White-box | No | - | - | - |
| Abdullah et al. | DeepSpeech | White-box | Yes | 0.3 m (1 foot) | 15/15 (trials) | - |
| CommanderSong | Kaldi | White-box | Yes | 1.5 m | 78% | 22.3 |
| Yakura et al. | DeepSpeech | White-box | Yes | 0.5 m | 80% | 25.1 |
| Meta-Enha | DeepSpeech | White-box | Yes | 6 m (LoS) | 90% | 25.2 |
| Meta-Qual | DeepSpeech | White-box | Yes | 3 m (LoS) | 90% | 21.1 |
*Lower MCD value indicates better sound quality.
Authors: Tao Chen, Longfei Shangguan, Zhenjiang Li, and Kyle Jamieson.
This paper was published at NDSS 2020.
Cite the Paper:
@inproceedings{tao2020Metamorph,
  author    = {Chen, Tao and Shangguan, Longfei and Li, Zhenjiang and Jamieson, Kyle},
  title     = {Metamorph: Injecting Inaudible Commands into Over-the-air Voice Controlled Systems},
  booktitle = {Proceedings of NDSS},
  year      = {2020}
}