Subtle vibrations induced in everyday objects by ambient sound, especially speech, carry rich acoustic cues that can potentially be transformed into meaningful text. This paper presents a Vision-based Subtitle Generator (VSG), the first attempt to recover text directly from high-speed videos of sound-induced object vibrations using a generative approach. To this end, VSG introduces a phase-based motion estimation (PME) technique that treats each pixel as an “independent microphone” and extracts thousands of pseudo-acoustic signals from a single video. In addition, the pretrained Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) model serves as the encoder of the proposed VSG-Transformer architecture, effectively transferring large-scale acoustic representation knowledge to the vibration-to-text task. Together, these strategies significantly reduce the reliance on large volumes of video data. In the experiments, text was generated from vibrations induced in a bag of chips by AISHELL-1 audio samples. Two VSG-Transformer variants with different parameter scales (Base and Large) achieved character error rates of 13.7% and 12.5%, respectively, demonstrating the effectiveness of the proposed generative approach. Furthermore, experiments with signal upsampling showed that VSG-Transformer performance remained promising even on low-frame-rate videos, highlighting the potential of the proposed VSG approach in scenarios involving conventional cameras.
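To make the “pixel as an independent microphone” idea concrete, the following is a minimal sketch of per-pixel phase-based motion estimation, assuming a grayscale video array `frames` of shape (T, H, W). The Gabor filter parameters and the amplitude-weighted averaging are illustrative assumptions, not the paper's exact PME pipeline.

```python
# Illustrative sketch of phase-based motion estimation (PME); parameter values
# and the weighting scheme are assumptions, not the paper's configuration.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size=15, wavelength=6.0, theta=0.0, sigma=3.0):
    """Complex Gabor kernel whose local phase response shifts with sub-pixel motion."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(1j * 2 * np.pi * xr / wavelength)
    return envelope * carrier

def pseudo_acoustic_signal(frames, theta=0.0):
    """Treat each pixel as an 'independent microphone': the local phase change
    over time approximates local displacement, and amplitude-weighted averaging
    merges the per-pixel signals into one 1D pseudo-acoustic waveform."""
    kernel = gabor_kernel(theta=theta)
    phases, amps = [], []
    for frame in frames.astype(np.float64):
        response = fftconvolve(frame, kernel, mode="same")
        phases.append(np.angle(response))
        amps.append(np.abs(response))
    phases = np.unwrap(np.stack(phases), axis=0)   # (T, H, W) phase over time
    amps = np.stack(amps)
    local_motion = phases - phases[0]              # phase change ~ local displacement
    weights = amps.mean(axis=0)                    # trust well-textured pixels more
    signal = (local_motion * weights).sum(axis=(1, 2)) / weights.sum()
    return signal - signal.mean()                  # zero-mean waveform at the video frame rate
```

In such a sketch, the recovered waveform is sampled at the video frame rate, which is why high-speed capture (or the signal upsampling discussed in the abstract) matters before the signal can be fed to an acoustic encoder such as HuBERT.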