Online Demo
Visit the official demo page to experience the model's capabilities firsthand.
API Quick Start
Minimal runnable curl example.
Key Information
Architecture
4B MTP
Engine-side RTF
≈ 0.0053
~19 seconds to transcribe 1 hour of audio
API Pricing
See pricing details
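The engine-side RTF (real-time factor) listed above is the ratio of processing time to audio duration, so the 19-second figure follows from simple arithmetic. A minimal sketch in Python; the function name is illustrative and not part of any API:

```python
def transcription_seconds(audio_seconds: float, rtf: float = 0.0053) -> float:
    """Estimate engine-side processing time from the real-time factor (RTF).

    RTF = processing time / audio duration, so processing time = RTF * duration.
    """
    return audio_seconds * rtf


one_hour = transcription_seconds(3600)  # 3600 * 0.0053 = 19.08 seconds
print(f"1 hour of audio: ~{one_hour:.0f} s")
```

Because the figure is engine-side, it presumably excludes upload and network time, so end-to-end latency through the API would be somewhat higher.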
Core Capabilities
⚡ Extreme-speed inference
Introduces MTP (Multi-Token Prediction) technology. Predicting multiple tokens per step in parallel boosts throughput by 400% and cuts latency by 60% compared with traditional ASR. A 5-minute audio clip is fully transcribed in 1 second.
🎯 SOTA transcription accuracy
Built on a deeply optimized 4B-parameter model. Achieves industry-leading Chinese and English error rates across diverse scenarios, including news, meetings, and noisy environments.
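As a rough illustration of why predicting several tokens per step raises throughput, the toy calculation below counts decoding steps for a transcript. It is an idealized sketch, not the model's actual decoder: it assumes every token predicted in a step is kept, and the token counts are made up for the example.

```python
import math


def decoding_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Number of decoder steps needed when each step emits `tokens_per_step` tokens.

    Idealized: assumes all tokens predicted in a step are accepted.
    """
    if num_tokens < 0 or tokens_per_step < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(num_tokens / tokens_per_step)


baseline = decoding_steps(1000, 1)  # one token per step
multi = decoding_steps(1000, 4)     # hypothetical: four tokens per step
print(baseline, multi)              # prints: 1000 250
```

Fewer sequential steps for the same transcript is the source of the speedup; the real gain depends on how many predicted tokens are accepted and on the cost of each step.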
Use Cases
Voice Agents, large-scale transcription services, real-time captions / live streaming.
API Endpoint
Speech Recognition (Streaming Output)
POST /v1/audio/asr/sse
Submit audio as Base64 once; the server streams the transcription back over SSE. Supports PCM / OGG / MP3 / WAV, Chinese and English recognition, and the enable_itn / prompt parameters.
Pricing
Step Plan subscribers can use this model directly. See pricing & rate limits for full details.
Quick Start
The server returns transcript.text.delta events incrementally and ends with transcript.text.done.
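A minimal sketch of this flow in Python, using only the standard library: encode the audio as Base64, build a request body, and collect text from transcript.text.delta events until transcript.text.done arrives. The endpoint path, event names, and the enable_itn / prompt parameters come from this page. The JSON field names (`audio`, `format`, `type`, `delta`, `text`) and the payload shape are assumptions made for illustration, so check the Speech Recognition (Streaming Output) API reference for the real schema.

```python
import base64
import json
from typing import Iterable, Iterator


def build_request_body(audio_bytes: bytes, audio_format: str = "wav",
                       enable_itn: bool = True, prompt: str = "") -> str:
    """Build a JSON body for POST /v1/audio/asr/sse.

    "audio" and "format" are hypothetical field names, not the documented schema.
    """
    body = {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),  # assumed field
        "format": audio_format,                                  # assumed field
        "enable_itn": enable_itn,
        "prompt": prompt,
    }
    return json.dumps(body)


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each SSE event from an iterable of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):  # SSE strips one optional leading space
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:  # a blank line terminates an event
            yield json.loads("\n".join(data_parts))
            data_parts = []
    if data_parts:  # stream ended without a trailing blank line
        yield json.loads("\n".join(data_parts))


def collect_transcript(lines: Iterable[str]) -> str:
    """Concatenate delta events until the done event arrives.

    Assumes each payload names its event in "type" and carries text in
    "delta" (incremental) or "text" (final); both are assumptions.
    """
    pieces = []
    for event in iter_sse_data(lines):
        if event.get("type") == "transcript.text.delta":
            pieces.append(event.get("delta", ""))
        elif event.get("type") == "transcript.text.done":
            return event.get("text") or "".join(pieces)
    return "".join(pieces)


# Simulated stream; a real client would read these lines from the HTTP response.
sample = [
    'data: {"type": "transcript.text.delta", "delta": "Hello"}', "",
    'data: {"type": "transcript.text.delta", "delta": " world"}', "",
    'data: {"type": "transcript.text.done", "text": "Hello world"}', "",
]
print(collect_transcript(sample))  # prints: Hello world
```

The parser is kept separate from the HTTP call so it can be exercised against recorded streams. If the service carries the event name in the SSE `event:` field instead of the JSON payload, `iter_sse_data` would need to track that field as well.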
Related Resources
Demo Page
Product demo page.
Model Card
Model card with architecture and benchmark details.
Speech Recognition (Streaming Output) API
Full parameters, response events, and error handling.