Voice Agent Paper Hunt
Curated collection of papers on Conversational AI, Voice Agents, Speech LLMs, and Real-time Voice Interaction
Whisper, introduced in "Robust Speech Recognition via Large-Scale Weak Supervision," is trained on 680,000 hours of multilingual and multitask data, achieving robust speech recognition that generalizes well across domains and languages.
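For reference, transcription with the openai-whisper package takes only a few lines. A minimal sketch; the model size and audio path below are placeholder choices:

```python
# Minimal transcription sketch using the openai-whisper package
# (pip install openai-whisper; requires ffmpeg on the system path).
import whisper

model = whisper.load_model("base")      # model size is a placeholder choice
result = model.transcribe("audio.mp3")  # audio path is a placeholder
print(result["text"])                   # recognized transcript
```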
Google Duplex uses a recurrent neural network to conduct natural-sounding conversations over the phone for tasks like making restaurant reservations.
A benchmark for evaluating speech processing capabilities across critical tasks like Automatic Speech Recognition, Keyword Spotting, Speaker Identification, Intent Classification, and Emotion Recognition.
GPT-4o can respond to audio input in as little as 232 ms (320 ms on average), comparable to human response times in conversation, enabling natural real-time voice interaction with full-duplex capabilities.
A foundational multilingual and multitask model supporting nearly 100 languages for speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation.
MT-Bench assesses LLMs in multi-turn dialogues, focusing on their capacity to maintain context and demonstrate reasoning skills across eight categories.
Chatbot Arena offers an open environment for evaluating LLMs based on human preferences through pairwise comparisons.
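Chatbot Arena aggregates these pairwise votes into Elo-style leaderboard ratings (later refined with a Bradley-Terry model). The sketch below shows the core Elo update; the K-factor and initial rating are illustrative assumptions, not the Arena's exact parameters:

```python
# Elo-style rating update from one pairwise comparison.
# K and the starting rating of 1000 are illustrative assumptions.

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated ratings after a model-A vs. model-B vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # P(A wins)
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: two models start at 1000; A wins one comparison.
print(elo_update(1000.0, 1000.0, a_wins=True))  # -> (1016.0, 984.0)
```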
Moshi enables simultaneous listening and speaking (full-duplex), processing speech directly without text intermediaries, achieving natural turn-taking.
A benchmark suite covering tasks like Named Entity Recognition, Sentiment Analysis, and Automatic Speech Recognition for advancing conversational AI.
LLaMA-Omni is built on LLaMA-3.1-8B for low-latency, high-quality speech interaction, generating speech responses directly from speech instructions.
A comprehensive survey reviewing methodologies, architectural components, training approaches, and evaluation metrics for Speech Language Models.
DialogBench evaluates whether LLMs can act as human-like dialogue systems, comprising 12 distinct dialogue tasks with GPT-4-generated evaluation instances.
A comprehensive benchmark comprising 10,497 curated examples spanning 13 task categories, including natural sounds, music, spoken dialogue, multi-turn dialogue, and role-play imitation.
Sparrow-1 focuses on real-time conversational flow and 'floor transfer,' predicting when a system should listen, wait, or speak to mimic human conversation timing.
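To make the "floor transfer" idea concrete, here is a purely hypothetical sketch of the listen/wait/speak decision such a model produces; the enum, thresholds, and function names are illustrative inventions, not Sparrow-1's actual API:

```python
# Hypothetical floor-transfer decision (not Sparrow-1's API).
# A turn-taking model scores the live audio; thresholds are invented.
from enum import Enum

class FloorAction(Enum):
    LISTEN = "listen"  # user holds the floor; keep transcribing
    WAIT = "wait"      # ambiguous pause; hold back to avoid interrupting
    SPEAK = "speak"    # user has yielded the floor; agent may respond

def decide_floor_action(p_turn_end: float) -> FloorAction:
    """Map a turn-end probability to an action (thresholds illustrative)."""
    if p_turn_end > 0.85:
        return FloorAction.SPEAK
    if p_turn_end > 0.50:
        return FloorAction.WAIT
    return FloorAction.LISTEN

print(decide_floor_action(0.9))  # FloorAction.SPEAK
```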
A benchmark designed to assess speech conversational abilities using 9,400 instances across semantic quality, acoustic performance, conversational abilities, and robustness.
MiniMax Speech 2.5 achieves end-to-end latency under 250 milliseconds, enabling real-time voice interaction.
An evaluation system for generative speech LLMs that quantifies performance in general knowledge and the ability to recognize, understand, and generate speech flow.
A benchmark for evaluating LLMs in expert-level intelligent outbound calling scenarios with user simulation and dynamic evaluation methods.
A benchmark to evaluate speech reasoning capabilities of large audio-language models in factual, procedural, and normative tasks.
A benchmark for end-to-end SpeechLLMs that addresses limitations of existing evaluations and provides a comprehensive assessment of real-world speech interaction.