Learning to Hear While Walking

Introduction

Ego-noise separation is a fundamental capability for legged robot audition. When a robot walks, its motors, joints, foot contacts, and body vibration can dominate the onboard microphones, making it difficult to recognize surrounding sounds. This project studies how a robot can mine its own ego-noise from unlabeled recordings and adapt an audio separator for downstream zero-shot sound classification.

Example of ego-noise recorded by a humanoid robot, Unitree G1.

Example of ego-noise recorded by a quadruped robot, Unitree Go1.

Proposed Method

**Overview of the ego-noise separator.** First, the VAE encoder maps audio to a low-frame-rate latent representation. Adapt-DiT, described below, then predicts a velocity field that moves the latent representation h of ego-noise-contaminated audio toward the latent representation of clean audio. Finally, the VAE decoder converts the clean latent representation back to a waveform.

**Architecture of Adaptive Diffusion Transformer (Adapt-DiT).** Adapt-DiT builds on a DiT trained to separate text-specified sounds and efficiently adapts it to the fixed task of ego-noise separation. Conditioned on the latent representation of the mixture audio, Adapt-DiT predicts a velocity field that transports Gaussian noise toward the latent representation of the clean audio.
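To make the role of the velocity field concrete, the following is a minimal sketch of how a predicted velocity could be integrated from Gaussian noise to a clean latent with fixed-step Euler updates. It is an illustration under stated assumptions, not the project's sampler: the straight-line path between noise and clean latent, the step count, and the names `euler_sample` and `oracle_velocity` are all hypothetical, and a toy oracle stands in for the trained Adapt-DiT network.

```python
import random


def euler_sample(velocity_fn, mixture_latent, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t | mixture) from t=0 (noise) to t=1 with Euler steps.

    velocity_fn is a stand-in for the trained network (assumption).
    """
    rng = random.Random(seed)
    # Start from Gaussian noise with the same size as the conditioning latent.
    x = [rng.gauss(0.0, 1.0) for _ in mixture_latent]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t, mixture_latent)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data: a 3-dimensional "latent" instead of a real VAE latent sequence.
clean_latent = [0.5, -1.0, 2.0]
mixture_latent = [0.7, -0.6, 2.4]  # hypothetical conditioning input


def oracle_velocity(x, t, mixture_latent):
    """Toy oracle that already knows the clean latent (assumption).

    On the straight path x_t = (1 - t) * x0 + t * x1, the velocity x1 - x0
    equals (x1 - x_t) / (1 - t). This oracle ignores mixture_latent; a real
    model would have to infer the clean latent from it.
    """
    return [(c - xi) / (1.0 - t) for c, xi in zip(clean_latent, x)]


estimate = euler_sample(oracle_velocity, mixture_latent, num_steps=10)
```

With this oracle the trajectory is a straight line, so the Euler steps land on the clean latent up to floating-point error. A learned velocity field is only approximate, which is why the number of integration steps matters in practice.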

**Prompt-Anchored Ego-Noise Mining.** Robot recordings are first split into short audio clips. A pretrained VAE encoder and Transformer encoder map each clip to an embedding so that semantic similarity between clips can be measured. A pretrained ModernBERT text encoder maps ego-noise prompts to the same embedding space. Because the audio and text encoders are aligned, sounds and text with the same meaning are mapped to nearby vectors. Principal Component Analysis (PCA) is then applied to the embeddings to suppress factors of variation other than whether a clip contains only ego-noise or also includes environmental sound. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), which is robust to imbalanced cluster sizes, groups semantically similar clips. The cluster closest to the text-prompt embedding is selected as the ego-noise-only cluster. In this way, the framework automatically mines ego-noise-only clips from robot recordings that may also contain environmental sounds.
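The final selection step can be sketched as follows. This is a minimal illustration that assumes the clip embeddings have already been reduced and clustered, that "closest" means highest cosine similarity between a cluster centroid and the prompt embedding, and that the label -1 marks unclustered noise points (the convention used by common HDBSCAN implementations). The page does not state the distance measure, so these choices and the name `select_ego_noise_cluster` are assumptions.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_ego_noise_cluster(embeddings, labels, prompt_embedding):
    """Return (label, clip indices) of the cluster nearest to the prompt.

    Clips labeled -1 are treated as unclustered and skipped (assumption).
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            members[label].append(idx)
    if not members:
        raise ValueError("no clusters found")

    def centroid(indices):
        dim = len(embeddings[indices[0]])
        return [sum(embeddings[i][d] for i in indices) / len(indices)
                for d in range(dim)]

    best = max(members,
               key=lambda lab: cosine(centroid(members[lab]), prompt_embedding))
    return best, members[best]


# Toy 2-D embeddings: cluster 0 points along the prompt direction,
# cluster 1 is roughly orthogonal, and the last clip is an outlier.
embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9], [5.0, 5.0]]
labels = [0, 0, 1, 1, -1]
prompt_embedding = [1.0, 0.0]  # hypothetical embedding of an ego-noise prompt

best_label, ego_clip_indices = select_ego_noise_cluster(
    embeddings, labels, prompt_embedding)
```

In the toy data, cluster 0 is selected and its clip indices are returned as the mined ego-noise-only set.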

**Training method of the Ego-Noise Separator.** Automatically mined ego-noise is mixed at diverse ratios with sounds x(t) sampled from a large-scale sound-source dataset. The Ego-Noise Separator is trained to separate the mixture into ego-noise and the remaining non-ego sound.
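One common way to mix at diverse ratios is to draw a signal-to-noise ratio (SNR) for each training example and scale the ego-noise accordingly. The sketch below assumes that approach. The sampling range, the synthetic signals, and the function name `mix_at_snr` are hypothetical and are not taken from the project.

```python
import math
import random


def power(signal):
    """Mean power of a waveform given as a list of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(target, ego_noise, snr_db):
    """Scale ego_noise so that 10*log10(P_target / P_noise) == snr_db, then add.

    Returns the mixture and the scaled ego-noise, which together with the
    target form one training example.
    """
    if len(target) != len(ego_noise):
        raise ValueError("signals must have the same length")
    p_target, p_noise = power(target), power(ego_noise)
    if p_target == 0 or p_noise == 0:
        raise ValueError("signals must be non-silent")
    gain = math.sqrt(p_target / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in ego_noise]
    mixture = [t + n for t, n in zip(target, scaled_noise)]
    return mixture, scaled_noise


rng = random.Random(0)
sample_rate = 16000  # hypothetical sample rate
# Synthetic stand-ins: a 440 Hz tone for x(t) and Gaussian noise for ego-noise.
target = [math.sin(2 * math.pi * 440 * n / sample_rate)
          for n in range(sample_rate)]
ego_noise = [rng.gauss(0.0, 0.3) for _ in range(sample_rate)]

snr_db = rng.uniform(-10.0, 10.0)  # hypothetical training range
mixture, scaled_noise = mix_at_snr(target, ego_noise, snr_db)
achieved_snr_db = 10 * math.log10(power(target) / power(scaled_noise))
```

Scaling the noise rather than the target keeps the level of the target sound unchanged across examples. Scaling the target instead would be an equally valid design choice.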

Overall Evaluation

We evaluate ego-noise separation on environmental sounds mixed with robot ego-noise from Unitree G1 and Unitree Go1 at -6, 0, and +6 dB SNR. Tables report mean separation scores and zero-shot classification scores by method, robot, and SNR; the best value in each robot-SNR column is shown in bold.

Demo

Each example shows references first, followed by separated target estimates from each method.

CLAP Score is the cosine similarity between the CLAP audio embeddings of the target and the separated estimate. SAJ Score is the overall separation judgment score. For both scores, higher is better.