StepAudio 2.5 ASR is a 4B-parameter speech recognition model. It introduces Multi-Token Prediction (MTP) technology to predict multiple tokens per step in parallel, maintaining SOTA transcription accuracy while dramatically reducing serial wait cycles: a 5-minute audio clip can be fully transcribed in 1 second.

Online Demo

Visit the official demo page to experience the model's capabilities firsthand.

API Quick Start

Minimal runnable curl example.

Key Information

Architecture

4B MTP

Engine-side RTF

≈ 0.0053
~19 seconds to transcribe 1 hour of audio

API Pricing

See pricing details
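The engine-side RTF listed above can be turned into a rough processing-time estimate. Real-time factor is processing time divided by audio duration, so multiplying a clip's length by the RTF gives the expected engine time. The sketch below is only that arithmetic; the function name is illustrative, and because the figure is engine-side it presumably excludes network transfer and queueing.

```python
def processing_seconds(audio_seconds: float, rtf: float = 0.0053) -> float:
    """Estimate engine-side processing time from the real-time factor (RTF).

    RTF = processing time / audio duration, so processing time = duration * RTF.
    The default of 0.0053 is the approximate engine-side figure quoted on this page.
    """
    return audio_seconds * rtf


one_hour = 60 * 60  # seconds of audio
print(f"{processing_seconds(one_hour):.1f} s")  # about 19 s for one hour of audio
```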

Core Capabilities

⚡ Extreme-speed inference

Introduces MTP (Multi-Token Prediction) technology. Predicting multiple tokens per step in parallel boosts throughput by 400% and cuts latency by 60% compared with traditional ASR. A 5-minute audio clip is fully transcribed in 1 second.

🎯 SOTA transcription accuracy

Deeply optimized at the 4B-parameter scale. Achieves industry-leading Chinese and English error rates across diverse scenarios, including news, meetings, and noisy environments.
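To see why predicting several tokens per step reduces serial waiting, it can help to count decoding steps. The sketch below is an idealized illustration, not the model's actual decoding loop: it assumes every step emits exactly `tokens_per_step` tokens and that all of them are kept. Both the function and the example values are hypothetical; this page does not state how many tokens MTP predicts per step.

```python
import math


def serial_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Number of sequential decoding steps needed to emit num_tokens.

    Idealized assumption: each step yields exactly tokens_per_step tokens.
    """
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    return math.ceil(num_tokens / tokens_per_step)


# Hypothetical numbers, chosen only to show the trend.
for k in (1, 2, 4):
    print(f"{k} token(s) per step -> {serial_steps(1000, k)} serial steps")
```

In this simplified picture the number of serial steps falls in proportion to the tokens emitted per step, which is the effect the MTP description refers to.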

Use Cases

Voice Agents, large-scale transcription services, real-time captions / live streaming.

API Endpoint

Speech Recognition (Streaming Output)

POST /v1/audio/asr/sse
Submit audio as Base64 once; the server streams transcription back over SSE. Supports PCM / OGG / MP3 / WAV, Chinese and English recognition, and the enable_itn / prompt parameters.
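Since the audio is submitted as Base64 in a single request, the JSON body can be assembled from raw bytes before sending. The sketch below mirrors the field names used in the Quick Start request on this page and covers only 16 kHz mono `pcm_s16le` input. The helper name `build_asr_request` is hypothetical, and the `prompt` parameter is left out because its position in the body is not shown here.

```python
import base64
import json


def build_asr_request(pcm_bytes: bytes, language: str = "zh", enable_itn: bool = True) -> dict:
    """Build the JSON body for POST /v1/audio/asr/sse from raw PCM bytes.

    Field names follow the Quick Start example; the format block describes
    16 kHz, 16-bit, mono PCM exactly as in that example.
    """
    return {
        "audio": {
            # The API expects the whole clip as one Base64 string.
            "data": base64.b64encode(pcm_bytes).decode("ascii"),
            "input": {
                "transcription": {
                    "model": "stepaudio-2.5-asr",
                    "language": language,
                    "enable_itn": enable_itn,
                },
                "format": {
                    "type": "pcm",
                    "codec": "pcm_s16le",
                    "rate": 16000,
                    "bits": 16,
                    "channel": 1,
                },
            },
        }
    }


# One second of silence: 16000 samples, 2 bytes each.
body = json.dumps(build_asr_request(b"\x00\x00" * 16000))
print(len(body), "bytes of JSON")
```

The resulting string can be passed as the request body in place of the inline JSON in the curl command.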

Pricing

Step Plan subscribers can use this model directly. See pricing & rate limits for full details.

Quick Start

curl https://api.stepfun.ai/v1/audio/asr/sse \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "audio": {
      "data": "base64_encoded_audio",
      "input": {
        "transcription": {
          "model": "stepaudio-2.5-asr",
          "language": "zh",
          "enable_itn": true
        },
        "format": {
          "type": "pcm",
          "codec": "pcm_s16le",
          "rate": 16000,
          "bits": 16,
          "channel": 1
        }
      }
    }
  }'
The server emits transcript.text.delta events incrementally and ends with transcript.text.done.
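A client can rebuild the full transcript by reading the stream and concatenating the incremental events. The following is a minimal sketch under two assumptions that this page does not confirm: that the event name arrives in the standard SSE `event:` field, and that each delta's `data:` line is a JSON object with the text under a `delta` key. Check the full API reference for the actual payload shape before relying on it.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from the text lines of an SSE stream."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))


def collect_transcript(lines: Iterable[str]) -> str:
    """Concatenate delta events until the done event arrives."""
    parts = []
    for event, data in parse_sse(lines):
        if event == "transcript.text.delta":
            # ASSUMPTION: the payload is JSON with the text under "delta".
            parts.append(json.loads(data).get("delta", ""))
        elif event == "transcript.text.done":
            break
    return "".join(parts)


# Hand-written sample stream for illustration; not captured server output.
sample = [
    "event: transcript.text.delta\n", 'data: {"delta": "hello "}\n', "\n",
    "event: transcript.text.delta\n", 'data: {"delta": "world"}\n', "\n",
    "event: transcript.text.done\n", "data: {}\n", "\n",
]
print(collect_transcript(sample))
```

Because `parse_sse` accepts any iterable of text lines, it can be fed from a streaming HTTP response as well as from a list.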

Demo Page

Product demo page.

Model Card

Model card with architecture and benchmark details.

Speech Recognition (Streaming Output) API

Full parameters, response events, and error handling.