🤝🗣️ Speech-Hands

🀝 Speech Hands Agent with OpenClaw 🦞 since 2026

A Self-Reflection Voice Agentic Approach to Speech Recognition
and Audio Reasoning with Omni Perception

Anonymous Authors

ACL 2026 Main Conference · Oral

🀝 Agent integration with OpenClaw 🦞

Speech-Hands plugs into OpenClaw. The user speaks, OpenClaw passes the audio to Speech-Hands, and Speech-Hands picks one of three action tokens: trust itself, defer to a perception tool, or re-derive the answer from scratch. Four real samples below: two AudioQA and two ASR.

Draft PR openclaw/openclaw#69073 · upstreaming Speech-Hands as a MediaUnderstandingProvider

[Interactive demo: 60-second narrated walkthrough. User → OpenClaw → Speech-Hands, which selects one of three action tokens: <internal> (trust self: Speech-Hands (Qwen2.5-Omni) predicts directly), <external> (defer to external: a tool call to Audio Flamingo 3), or <rewrite> (re-derive from both). Speech-Hands emits the answer, then OpenClaw → User. Headline figures: 77.37% AudioQA accuracy; −12.1% relative WER on 7 benchmarks.]
Speech-Hands · self-reflection voice agent · drop-in OpenClaw skill

Abstract

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. On the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines, reducing WER by a relative 12.1% across seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets.

Method

Speech-Hands overview
Figure 1. Speech-Hands acts as a dynamic orchestrator that predicts a special action token to govern its cognitive strategy for ASR and multi-domain audio reasoning.

Self-reflection primitive

For every input, the model first picks one of three learnable tokens, then writes the answer. Each token is a different stance on the same trade-off:

<internal>

Trust self

The internal prediction is reliable; ignore the external candidate and answer directly from audio.

<external>

Defer

The external ASR / audio-reasoning model's top candidate is better; adopt it.

<rewrite>

Re-derive

Neither the internal nor the external prediction is correct; re-derive the answer using audio and the external candidate list as grounding context.
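The decision step described above can be sketched as a small parser and dispatcher. This is a minimal illustration, not the authors' implementation: the names `Decision`, `parse_decision`, and `resolve` are hypothetical, and the sketch assumes that the action token is the first thing in the decoded output and that `<external>` resolves to the external model's top candidate.

```python
from dataclasses import dataclass

ACTION_TOKENS = ("<internal>", "<external>", "<rewrite>")


@dataclass
class Decision:
    action: str  # one of ACTION_TOKENS
    answer: str  # text the model wrote after the token


def parse_decision(model_output: str) -> Decision:
    """Split raw model output into its leading action token and the answer."""
    text = model_output.strip()
    for token in ACTION_TOKENS:
        if text.startswith(token):
            return Decision(action=token, answer=text[len(token):].strip())
    raise ValueError(f"no action token at start of output: {text[:40]!r}")


def resolve(decision: Decision, external_top1: str) -> str:
    """Turn a decision into the final answer string (assumed behavior)."""
    if decision.action == "<external>":
        # Defer: adopt the external model's top candidate.
        return external_top1
    # <internal> and <rewrite> both keep the text the model itself wrote;
    # they differ in whether the external candidates were used as context.
    return decision.answer
```

For example, `resolve(parse_decision("<external> x"), "hello world")` returns the external candidate, while an `<internal>` or `<rewrite>` output returns the model's own text.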

Speech-Hands framework architecture
Figure 2. Overview of the Self-Reflection Multimodal GER framework. A special token is generated at the beginning to decide whether to use audio perception (i.e., transcription hypotheses or caption) or not.

Results

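ASR: WER (%) across 7 OpenASR benchmarks

Every Speech-Hands pairing achieves a lower average WER than the external ASR model it is paired with, the Qwen2.5-Omni backbone (7.33), and the corresponding cascaded GER pipeline. Paired with Parakeet, it reaches the lowest average WER in the table (5.69).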

Model                       AMI    Tedlium  GigaSpeech  SPGISpeech  VoxPopuli  Libri-clean  Libri-other  avg. WER ↓
ASR model or Omni-LLM
Whisper-v2-large            16.88  4.32     11.45       3.94        7.57       2.91         5.15         7.17
Canary-1b-v2                19.80  4.78     11.66       3.08        6.35       1.73         3.17         7.22
Parakeet-tdt-0.6b-v3        12.69  4.90     12.24       3.16        6.48       1.89         3.37         6.68
Qwen2.5-Omni                19.77  5.17     11.26       4.58        6.59       2.09         3.85         7.33
Phi-4-MM                    11.69  2.90     9.78        3.13        5.93       1.68         3.83         6.14
Gemini-2-Flash              21.58  3.01     10.71       3.82        7.89       2.49         5.84         8.56
GPT-4o-voice                57.76  5.79     13.64       5.66        10.83      3.48         7.97         15.76
Qwen2.5-Omni: Cascaded (GER)
GER ⇒ Whisper-v2-large      23.44  6.15     12.15       3.94        7.53       2.97         4.89         8.44
GER ⇒ Canary                24.58  6.38     12.43       4.02        7.72       3.05         5.01         8.74
GER ⇒ Parakeet              22.91  6.09     12.10       3.98        7.49       2.92         4.84         8.33
Qwen2.5-Omni: Speech-Hands (parallel agentic)
Speech-Hands ⇌ Whisper-v2   15.03  4.45     12.37       3.01        6.49       1.86         3.46         6.67
Speech-Hands ⇌ Canary       15.29  4.21     10.87       2.17        5.96       1.61         3.07         6.17
Speech-Hands ⇌ Parakeet     11.20  4.37     11.10       2.26        6.02       1.67         3.18         5.69
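As a worked example of reading the table above, the short script below computes the relative reduction in average WER for each Speech-Hands pairing against the external ASR model it is paired with. The page does not say how its headline relative-WER figure was computed, so treating it as the mean of these three per-pairing reductions is an assumption, although it does produce a figure of the same size from the table values.

```python
# Average WER (%) copied from the table above:
# (external ASR model alone, Speech-Hands paired with that model).
avg_wer = {
    "Whisper-v2": (7.17, 6.67),
    "Canary": (7.22, 6.17),
    "Parakeet": (6.68, 5.69),
}


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline WER removed by the system."""
    return (baseline - system) / baseline


per_pair = {name: relative_reduction(b, s) for name, (b, s) in avg_wer.items()}
mean_reduction = sum(per_pair.values()) / len(per_pair)

for name, r in per_pair.items():
    print(f"{name}: {100 * r:.1f}%")
print(f"mean: {100 * mean_reduction:.1f}%")
# Whisper-v2: 7.0%
# Canary: 14.5%
# Parakeet: 14.8%
# mean: 12.1%
```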

AudioQA: Accuracy (%) on DCASE 2025

Evaluated on bioacoustic QA, multi-sound-object soundscapes, and MMAU-style complex audio reasoning. Majority-sampled Speech-Hands gets the highest average accuracy.

Model / Setting                                Bio-acoustic  Soundscape  Complex QA  avg. Acc. ↑
Audio LM or Omni Model
Gemini-2-Flash                                 42.03         46.34       59.89       56.61
Qwen2.5-Omni                                   47.32         56.32       59.89       57.87
Audio Flamingo 3                               71.88         57.31       81.26       74.49
Qwen2.5-Omni Baselines
+ SFT with official training data              78.13         34.65       76.61       63.13
+ GRPO with official training data             78.09         39.43       79.12       65.54
+ GRPO with external audio data                62.32         72.10       82.15       75.10
GER ⇒ Audio Flamingo 3 (cascaded)              76.29         52.02       77.48       68.93
Qwen2.5-Omni: Speech-Hands (parallel agentic)
Speech-Hands ⇌ AF3 (SFT)                       67.86         58.29       83.34       75.75
Speech-Hands ⇌ AF3 (SFT + majority sampling)   81.25         59.40       85.70       77.37
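The "majority sampling" row can be illustrated with a plain majority vote over several sampled answers. This is a sketch under stated assumptions: the page does not give the number of samples, the decoding settings, or whether the vote is taken over the final answer, the action token, or both. Here the vote is over the final multiple-choice answer, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(samples: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer sampled first."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(samples)
    best = max(counts.values())
    # Walk the samples in order so tie-breaking is deterministic.
    for answer in samples:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")
```

For instance, `majority_vote(["B", "A", "B", "C"])` returns `"B"`.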

Action-token routing quality

How often each action token appears in the training and test sets, plus per-token test F1 on each OpenASR benchmark. <rewrite> is rare by design: it fires only when both the internal and external predictions are wrong.

Token           AMI     Tedlium  GigaSpeech  SPGISpeech  VoxPopuli  Libri-clean  Libri-other
Training distribution
<internal>      67.95%  86.48%   87.76%      96.25%      93.73%     98.96%       98.96%
<external>      31.01%  11.18%   11.80%      3.64%       6.09%      0.96%        0.96%
<rewrite>       1.04%   2.34%    0.44%       1.21%       0.18%      0.10%        0.10%
Test distribution
<internal>      70.28%  83.57%   85.94%      95.42%      92.68%     98.92%       98.75%
<external>      26.91%  15.12%   12.41%      3.81%       6.47%      0.98%        1.08%
<rewrite>       2.27%   1.31%    1.65%       0.77%       0.85%      0.10%        0.17%
F1 on test
<internal> F1   0.81    0.67     0.83        0.91        0.87       0.94         0.90
<external> F1   0.89    0.88     0.80        0.79        0.65       0.77         0.74
<rewrite> F1    0.39    0.28     0.08        0.43        0.33       0.00         0.00
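Per-token F1 treats each action token as its own binary class. The sketch below shows the standard computation on a toy example. How the reference ("gold") action labels are derived is not described on this page, so the label lists here are purely illustrative and `per_token_f1` is a hypothetical helper. The toy data also shows how a token with very few test examples, such as `<rewrite>`, can land at an F1 of 0.00 after a single miss.

```python
TOKENS = ("<internal>", "<external>", "<rewrite>")


def per_token_f1(gold: list[str], pred: list[str]) -> dict[str, float]:
    """One-vs-rest F1 for each action token."""
    scores = {}
    for t in TOKENS:
        tp = sum(g == t and p == t for g, p in zip(gold, pred))
        fp = sum(g != t and p == t for g, p in zip(gold, pred))
        fn = sum(g == t and p != t for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the token never occurs.
        scores[t] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: ten utterances, only one of which is a gold <rewrite> case.
gold = ["<internal>"] * 8 + ["<external>", "<rewrite>"]
pred = ["<internal>"] * 8 + ["<external>", "<internal>"]
scores = per_token_f1(gold, pred)
```

In this toy run the single missed `<rewrite>` case gives that token an F1 of 0.0, while `<internal>` still scores 16/17.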

Case Gallery

Audio QA

Seven DCASE 2025 AudioQA dev samples, grouped by the action token Speech-Hands picked. Play the audio; compare what Qwen2.5-Omni alone and Audio Flamingo 3 alone would say; see Speech-Hands' final answer.


Speech Recognition

Three OpenASR cases, one per action token. WER is normalized (Whisper-style), so "tomorrow" vs "to morrow" or missing apostrophes don't count. Each panel shows the reference, Qwen's first pass, the external ASR 5-best, and Speech-Hands' final.
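To make the idea of normalized WER concrete, here is a minimal sketch. It is not the Whisper normalizer: the `normalize` function below is a simplified stand-in that only lowercases text and drops apostrophes and punctuation, so it would not merge split words such as "to morrow". WER is then the word-level edit distance divided by the number of reference words.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified normalization: lowercase, drop apostrophes and punctuation."""
    text = text.lower()
    text = re.sub(r"['\u2019]", "", text)      # "don't" -> "dont"
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # other punctuation -> space
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate after normalization, via Levenshtein distance."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference has no words after normalization")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

With this sketch, `wer("Don't stop.", "dont stop")` is 0.0 because the apostrophe and the period are normalized away, while `wer("the cat sat", "the cat sit")` is one substitution over three reference words.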


BibTeX

@inproceedings{anonymous2026speechhands,
  title     = {Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech
               Recognition and Audio Reasoning with Omni Perception},
  author    = {Anonymous Authors},
  booktitle = {Proceedings of the Annual Meeting of the Association for
               Computational Linguistics (ACL)},
  year      = {2026}
}