A Vision-Based Subtitle Generator: Text Reconstruction via Subtle Vibrations from Videos

Yan Wang; Yingchong Wang; Xiuqi Zhang; Xiaoyu Ding

doi:10.20944/preprints202512.1053.v1

Submitted:

10 December 2025

Posted:

11 December 2025

You are already at the latest version

Abstract

Subtle vibrations induced in everyday objects by ambient sound, especially speech, carry rich acoustic cues that can potentially be transformed into meaningful text. This paper presents a Vision-based Subtitle Generator (VSG). This is the first attempt to recover text directly from high-speed videos of sound-induced object vibrations using a generative approach. To this end, VSG introduces a phase-based motion estimation (PME) technique that treats each pixel as an “independent microphone”, and extracts thousands of pseudo-acoustic signals from a single video. Meanwhile, the pretrained Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) serves as the encoder of the proposed VSG-Transformer architecture, effectively transferring large-scale acoustic representation knowledge to the vibration-to-text task. These strategies significantly reduce reliance on large volumes of video data. Experimentally, text was generated from vibrations induced in a bag of chips by AISHELL-1 audio samples. Two VSG-Transformer variants with different parameter scales (Base and Large) achieved character error rates of 13.7 and 12.5%, respectively, demonstrating the remarkable effectiveness of the proposed generative approach. Furthermore, experiments using signal upsampling techniques showed that VSG-Transformer performance was promising even when low-frame-rate videos were used, highlighting the potential of the proposed VSG approach in scenarios featuring conventional cameras.

Keywords:

text reconstruction from vibrations

;

phase-based motion estimation (PME)

;

pretrained acoustic model

;

transformer

Subject:

Computer Science and Mathematics - Computer Vision and Graphics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

A Vision-Based Subtitle Generator: Text Reconstruction via Subtle Vibrations from Videos

Abstract

Keywords:

Subject:

MDPI Initiatives

Important Links

Subscribe