Realtime Client Tools

This directory hosts realtime audio tools that power end-to-end speech experiences, including automatic speech recognition (ASR), text-to-speech (TTS), and bidirectional streaming pipelines.

📋 Component Catalog

1. ModelstudioAsrClient – ModelStudio ASR Client

A client wrapper around the ModelStudio (DashScope) automatic speech recognition service.

Prerequisites

  • Valid DashScope API key

  • Audio input device or prerecorded audio files

  • Stable network connection

Configuration Model (ModelstudioAsrConfig)

  • Accepts common audio formats (WAV, MP3, PCM, etc.)

  • Configurable sampling rate and channel count

  • Supports realtime streaming and batch transcription

  • Language and dialect options

Key Features

  • Realtime transcription: stream audio to get live text results

  • Batch recognition: convert stored audio files to text

  • Multilingual: Mandarin, English, and more

  • Punctuation restoration: inserts punctuation automatically

  • Confidence scores: exposes recognition confidence for each segment

2. ModelstudioTtsClient – ModelStudio TTS Client

Client for ModelStudio’s text-to-speech service.

Prerequisites

  • Valid DashScope API key

  • Audio output device or permission to write audio files

  • Stable network connection

Configuration Model (ModelstudioTtsConfig)

  • Multiple output formats (WAV, MP3, PCM, etc.)

  • Adjustable sampling rate and audio quality

  • Supports realtime streaming and batch synthesis

  • Voice timbre and speech parameter controls

Key Features

  • Voice selection: male, female, child voices, and more

  • Speed control: tune playback speed

  • Pitch control: raise or lower the voice pitch

  • Streaming synthesis: handle long-form text as a stream

  • Multi-format output: export WAV, MP3, and other formats

3. AzureAsrClient – Azure Speech Recognition Client

Wraps Microsoft Azure Speech Service for ASR workloads.

Prerequisites

  • Valid Azure Speech API key

  • Configured Azure region

  • Audio input source

Configuration Model (AzureAsrConfig)

  • Works with multiple audio formats and sampling rates

  • Language and dialect configuration

  • Continuous and single-utterance recognition modes

  • Silence timeout and recognition parameters

Highlighted Features

  • High accuracy: powered by Azure’s cutting-edge ASR models

  • Custom models: plug in domain-specific trained models

  • Speaker identification: handle multi-speaker conversations

  • Noise suppression: built-in noise and echo cancellation

4. AzureTtsClient – Azure Text-to-Speech Client

Client for Azure Speech Service TTS.

Prerequisites

  • Valid Azure Speech subscription

  • Configured Azure region

  • Audio output settings

Configuration Model (AzureTtsConfig)

  • Multiple audio formats with quality controls

  • Neural voice model selection

  • Accepts SSML or plain-text inputs

  • Tunable speech parameters and output formats

Highlighted Features

  • Neural voices: lifelike speech powered by neural TTS

  • Expressive tones: choose emotions and speaking styles

  • SSML support: leverage Speech Synthesis Markup Language (example below)

  • Multilingual: major languages and dialects worldwide
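
Because the client accepts SSML, a single request can select the voice, speaking style, and prosody inline. A minimal sketch, assuming send_text_data forwards SSML strings unmodified (the voice and style names below are illustrative):

# Hypothetical: verify that the client passes raw SSML through to Azure.
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
  <voice name="zh-CN-XiaoxiaoNeural">
    <mstts:express-as style="cheerful">
      <prosody rate="+10%">Welcome to Azure Speech Service!</prosody>
    </mstts:express-as>
  </voice>
</speak>"""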

5. RealtimeTool – Base Class for Realtime Components

Provides shared infrastructure for realtime audio tools; a sketch of the pattern follows the trait list below.

Core Traits

  • Async processing: non-blocking audio streaming

  • Buffer management: smart buffering to keep streams smooth

  • State tracking: monitor connection status in realtime

  • Error recovery: automatic retries and reconnection
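
The concrete implementation lives in the codebase; as a rough illustration of the pattern these traits describe, a hypothetical sketch (all names here are illustrative, not the actual API):

import asyncio

class RealtimeToolSketch:
    """Hypothetical sketch of shared realtime infrastructure."""

    def __init__(self, buffer_size: int = 1024, max_retries: int = 3):
        self._queue = asyncio.Queue(maxsize=buffer_size)  # buffer management
        self.connected = False                            # state tracking
        self._max_retries = max_retries                   # error recovery

    async def send(self, chunk: bytes):
        # Async processing: producers never block the event loop.
        await self._queue.put(chunk)

    async def _pump(self):
        # Drain buffered chunks, retrying transient connection failures.
        while True:
            chunk = await self._queue.get()
            for _ in range(self._max_retries):
                try:
                    await self._transmit(chunk)
                    break
                except ConnectionError:
                    await self._reconnect()

    async def _transmit(self, chunk: bytes): ...
    async def _reconnect(self): ...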

🔧 Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DASHSCOPE_API_KEY | Yes (ModelStudio) | - | DashScope API key for ModelStudio services |
| AZURE_SPEECH_KEY | Yes (Azure) | - | Azure Speech Service key |
| AZURE_SPEECH_REGION | Yes (Azure) | - | Azure region for the speech resource |
| ASR_SAMPLE_RATE | No | 16000 | Sampling rate used for ASR audio |
| TTS_AUDIO_FORMAT | No | wav | Default TTS output format |
| REALTIME_BUFFER_SIZE | No | 1024 | Buffer size for realtime audio streaming |
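
Whether the clients read these variables themselves is implementation-specific; a minimal sketch of resolving them by hand with the documented defaults:

import os

# Required keys: fail fast if missing.
dashscope_key = os.environ["DASHSCOPE_API_KEY"]

# Optional settings fall back to the documented defaults.
sample_rate = int(os.getenv("ASR_SAMPLE_RATE", "16000"))
audio_format = os.getenv("TTS_AUDIO_FORMAT", "wav")
buffer_size = int(os.getenv("REALTIME_BUFFER_SIZE", "1024"))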

🚀 Usage Examples

ModelStudio ASR Example

from agentscope_runtime.tools.realtime_clients import (
    ModelstudioAsrClient,
    ModelstudioAsrCallbacks,
)
from agentscope_runtime.engine.schemas.realtime import ModelstudioAsrConfig
import asyncio

# Configure ASR parameters
config = ModelstudioAsrConfig(
    model="paraformer-realtime-v2",
    format="pcm",
    sample_rate=16000,
    language="zh-CN"
)

# Define callback functions
def on_asr_event(is_final: bool, text: str):
    if is_final:
        print("Final result:", text)
    else:
        print("Partial result:", text)

callbacks = ModelstudioAsrCallbacks(
    on_event=on_asr_event,
    on_open=lambda: print("ASR connection opened"),
    on_complete=lambda: print("ASR session complete"),
    on_error=lambda msg: print(f"ASR error: {msg}"),
    on_close=lambda: print("ASR connection closed")
)

# Initialize ASR client
asr_client = ModelstudioAsrClient(config, callbacks)

async def asr_example():
    # Start ASR service
    asr_client.start()

    # Simulate sending audio data
    # In real usage, send actual audio bytes via asr_client.send_audio_data
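    # A minimal sketch (hypothetical file path and chunk size; assumes
    # 16 kHz, 16-bit mono PCM to match the config above):
    # with open("sample.pcm", "rb") as f:
    #     while chunk := f.read(3200):  # ~100 ms of audio per chunk
    #         asr_client.send_audio_data(chunk)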

    # Stop ASR service
    asr_client.stop()

asyncio.run(asr_example())

ModelStudio TTS Example

from agentscope_runtime.tools.realtime_clients import (
    ModelstudioTtsClient,
    ModelstudioTtsCallbacks,
)
from agentscope_runtime.engine.schemas.realtime import ModelstudioTtsConfig
import asyncio

# Configure TTS parameters
config = ModelstudioTtsConfig(
    model="cosyvoice-v1",
    voice="longwan",
    sample_rate=22050,
    chat_id="demo_chat"
)

# Define callback functions
audio_chunks = []

def on_tts_data(data: bytes, chat_id: str, index: int):
    audio_chunks.append(data)
    print(f"Received audio chunk {index}, size: {len(data)} bytes")

callbacks = ModelstudioTtsCallbacks(
    on_data=on_tts_data,
    on_open=lambda: print("TTS connection opened"),
    on_complete=lambda chat_id: print(f"TTS synthesis complete: {chat_id}"),
    on_error=lambda msg: print(f"TTS error: {msg}"),
    on_close=lambda: print("TTS connection closed")
)

# Initialize TTS client
tts_client = ModelstudioTtsClient(config, callbacks)

async def tts_example():
    # Start TTS service
    tts_client.start()

    # Send text for synthesis
    tts_client.send_text_data("Hello, welcome to the agentscope_runtime framework!")

    # Stop TTS service
    tts_client.stop()

    # Save audio file (raw bytes as received; see the WAV header note below)
    if audio_chunks:
        with open("output.wav", "wb") as f:
            for chunk in audio_chunks:
                f.write(chunk)
        print("Synthesis finished, saved to output.wav")

asyncio.run(tts_example())
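
Note that on_data delivers raw audio bytes. If the service is configured for raw PCM output, simply concatenating chunks does not produce a standards-compliant WAV file; the standard-library wave module can add the header. A sketch continuing the example above, assuming 16-bit mono PCM at the configured rate:

import wave

# Wrap raw 16-bit mono PCM chunks in a proper WAV container.
with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(22050)  # must match the configured sample_rate
    wf.writeframes(b"".join(audio_chunks))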

Azure Speech Service Example

from agentscope_runtime.tools.realtime_clients import (
    AzureAsrClient,
    AzureAsrCallbacks,
    AzureTtsClient,
    AzureTtsCallbacks,
)
from agentscope_runtime.engine.schemas.realtime import AzureAsrConfig, AzureTtsConfig
import asyncio

# Azure ASR configuration and example
asr_config = AzureAsrConfig(
    language="zh-CN",
    sample_rate=16000,
    bits_per_sample=16,
    nb_channels=1
)

def on_azure_asr_event(is_final: bool, text: str):
    if is_final:
        print("Azure ASR final result:", text)
    else:
        print("Azure ASR partial result:", text)

asr_callbacks = AzureAsrCallbacks(
    on_event=on_azure_asr_event,
    on_started=lambda: print("Azure ASR started"),
    on_stopped=lambda: print("Azure ASR stopped"),
    on_canceled=lambda: print("Azure ASR canceled")
)

azure_asr_client = AzureAsrClient(asr_config, asr_callbacks)

# Azure TTS configuration and example
tts_config = AzureTtsConfig(
    voice="zh-CN-XiaoxiaoNeural",
    sample_rate=16000,
    bits_per_sample=16,
    nb_channels=1,
    format="pcm",
    chat_id="azure_demo"
)

azure_audio_chunks = []

def on_azure_tts_data(data: bytes, chat_id: str, index: int):
    azure_audio_chunks.append(data)
    print(f"Azure TTS received audio chunk {index}")

tts_callbacks = AzureTtsCallbacks(
    on_data=on_azure_tts_data,
    on_started=lambda: print("Azure TTS started"),
    on_complete=lambda chat_id: print(f"Azure TTS complete: {chat_id}"),
    on_canceled=lambda: print("Azure TTS canceled")
)

azure_tts_client = AzureTtsClient(tts_config, tts_callbacks)

async def azure_example():
    # Azure ASR example
    azure_asr_client.start()
    # Send audio data via azure_asr_client.send_audio_data(audio_bytes)
    azure_asr_client.stop()

    # Azure TTS example
    azure_tts_client.start()
    azure_tts_client.send_text_data("Welcome to Azure Speech Service!")
    azure_tts_client.stop()

    # Save Azure TTS output (raw PCM bytes; add a WAV header as shown
    # earlier if a standards-compliant .wav file is needed)
    if azure_audio_chunks:
        with open("azure_output.wav", "wb") as f:
            for chunk in azure_audio_chunks:
                f.write(chunk)
        print("Azure TTS synthesis finished, saved to azure_output.wav")

asyncio.run(azure_example())

🏗️ Architectural Traits

Realtime Processing Architecture

  • Async streaming: asyncio-based non-blocking pipelines

  • Buffer management: smart buffers to prevent audio loss

  • Connection pooling: reuse sessions to boost throughput

  • State sync: monitor and synchronize realtime states

Audio Processing Flow

  1. Capture: collect audio from microphones or files

  2. Preprocess: convert format, denoise, adjust gain

  3. Stream: push audio to cloud services in realtime

  4. Result handling: consume recognition or synthesis output

  5. Post-process: polish results or change formats (a schematic sketch of the full flow follows)
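
Put together, these steps map directly onto the clients above. A schematic sketch (preprocess is a hypothetical placeholder, not part of the library):

def preprocess(frame: bytes) -> bytes:
    # Placeholder for format conversion, denoising, and gain control.
    return frame

async def run_pipeline(asr_client, frames):
    asr_client.start()                                 # open the streaming session
    for frame in frames:                               # 1. capture: mic or file frames
        asr_client.send_audio_data(preprocess(frame))  # 2-3. preprocess and stream
    asr_client.stop()                                  # 4-5. results arrive via callbacks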

🎵 Supported Audio Formats

Input Formats (ASR)

  • PCM: raw, uncompressed audio

  • WAV: standard waveform audio

  • MP3: compressed audio

  • FLAC: lossless compression

  • OGG: open-source container

Output Formats (TTS)

  • WAV: high-quality uncompressed audio

  • MP3: compressed format to save storage

  • PCM: raw audio samples

  • OGG: open-source format

📦 Dependencies

  • dashscope.audio.asr: ModelStudio ASR SDK

  • dashscope.audio.tts: ModelStudio TTS SDK

  • azure-cognitiveservices-speech: Azure Speech SDK

  • pyaudio: audio device access

  • numpy: audio data utilities

  • asyncio: async runtime (part of the Python standard library, not a separate install)

⚠️ Notes and Best Practices

Audio Quality

  • Keep audio input clean to avoid noise interference

  • Use appropriate sampling rates (typically 16 kHz or 48 kHz; see the buffer arithmetic below)

  • Control end-to-end latency for better UX

  • Periodically check microphone status
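
The link between sample rate, buffer size, and latency is simple arithmetic. For 16 kHz, 16-bit mono audio with the default buffer:

sample_rate = 16000   # Hz
sample_width = 2      # bytes per sample (16-bit)
channels = 1
buffer_size = 1024    # bytes (REALTIME_BUFFER_SIZE default)

bytes_per_second = sample_rate * sample_width * channels  # 32000 B/s
buffer_ms = buffer_size / bytes_per_second * 1000         # 32 ms per buffer
print(f"{buffer_ms:.0f} ms of audio per buffer")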

Network and Performance

  • Ensure network stability to prevent streaming gaps

  • Tune buffer size to balance latency vs. robustness

  • Monitor API call frequency to stay within quotas

  • Implement reconnect logic for transient failures (sketch below)
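
A minimal reconnect sketch with exponential backoff and jitter (connect stands in for whatever coroutine opens the session; the helper is illustrative, not part of the library):

import asyncio
import random

async def with_reconnect(connect, max_retries: int = 5):
    # Retry a failing connection with exponential backoff plus jitter.
    for attempt in range(max_retries):
        try:
            return await connect()
        except ConnectionError:
            backoff = min(2 ** attempt, 30) + random.random()
            await asyncio.sleep(backoff)  # wait before the next attempt
    raise ConnectionError("gave up after max_retries attempts")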

Privacy and Security

  • Encrypt audio if it contains sensitive information

  • Follow data-protection regulations for speech data

  • Capture user consent and disclose usage

  • Delete unnecessary recordings routinely