Text to Speech

Learn how to turn text into lifelike spoken audio.

Overview

The Audio API provides a speech endpoint based on our TTS (text-to-speech) model. It comes with 6 built-in voices and can be used to:

  • Narrate a written blog post
  • Produce spoken audio in multiple languages
  • Give real-time audio output using streaming

Quickstart

The speech endpoint takes in three key inputs: the model, the text that should be turned into audio, and the voice to be used for the audio generation.

Generate Spoken Audio from Input Text

Python:

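import os
from pathlib import Path
from openai import OpenAI

# Read the API key from the environment instead of hard-coding it.
client = OpenAI(
    api_key=os.environ["ROCKAPI_API_KEY"],
    base_url="https://api.rockapi.ru/openai/v1",
)

# Save the audio next to this script.
speech_file_path = Path(__file__).parent / "speech.mp3"

# Request speech for the input text with the chosen model and voice.
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!",
)

# Write the returned MP3 bytes to disk.
response.stream_to_file(speech_file_path)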

By default, the endpoint will output an MP3 file of the spoken audio but it can also be configured to output any of our supported formats.
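
As a rough sketch of how a non-default format might be requested, the helper below assembles the keyword arguments for the speech call and rejects formats this page does not list. The function name `build_speech_request` and the `SUPPORTED_FORMATS` set are hypothetical names introduced for illustration; `response_format` is the request parameter that selects the output format.

```python
# Formats named on this page; treat this set as an assumption, not an exhaustive list.
SUPPORTED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_request(text, voice="alloy", model="tts-1", response_format="mp3"):
    """Return keyword arguments for client.audio.speech.create (hypothetical helper)."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    return {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": response_format,
    }


def save_speech(client, text, path, response_format="mp3"):
    """Generate speech with the given client and write it to `path`."""
    kwargs = build_speech_request(text, response_format=response_format)
    client.audio.speech.create(**kwargs).stream_to_file(path)
```

With a configured client, `save_speech(client, "Hello!", "hello.flac", response_format="flac")` would request FLAC output instead of the default MP3.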

Audio Quality

For real-time applications, the standard tts-1 model provides the lowest latency but at a lower quality than the tts-1-hd model. Due to the way the audio is generated, tts-1 is likely to generate content that has more static in certain situations than tts-1-hd. Depending on your listening device and the individual listener, the difference may not be noticeable.

Voice Options

Experiment with different voices (alloy, echo, fable, onyx, nova, and shimmer) to find one that matches your desired tone and audience. The current voices are optimized for English.
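
One way to compare the voices is to render the same sentence once per voice and listen to the results side by side. The sketch below is illustrative only: `plan_voice_samples`, `render_voice_samples`, and the `samples` directory are hypothetical names, and the client is assumed to be configured as in the quickstart.

```python
from pathlib import Path

# The six built-in voices listed above.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]


def plan_voice_samples(text, out_dir="samples", model="tts-1"):
    """Return one (request kwargs, output path) pair per built-in voice."""
    plans = []
    for voice in VOICES:
        kwargs = {"model": model, "voice": voice, "input": text}
        plans.append((kwargs, Path(out_dir) / f"{voice}.mp3"))
    return plans


def render_voice_samples(client, text, out_dir="samples"):
    """Generate one MP3 per voice using an already configured client."""
    for kwargs, path in plan_voice_samples(text, out_dir=out_dir):
        path.parent.mkdir(parents=True, exist_ok=True)
        client.audio.speech.create(**kwargs).stream_to_file(path)
```

Calling `render_voice_samples(client, "The quick brown fox.")` would leave files such as `samples/alloy.mp3` and `samples/nova.mp3` to audition.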

Streaming Real-Time Audio

The Speech API supports real-time audio streaming using chunked transfer encoding. This means the audio can be played before the full file has been generated and made accessible.

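from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# with_streaming_response keeps the connection open so the body is
# written to disk chunk by chunk instead of being buffered in memory first.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
) as response:
    response.stream_to_file("output.mp3")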
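
To act on audio as it arrives rather than only saving it, each chunk can be forwarded to a sink (a file, a socket, or an audio player's input) as soon as it is received. The following is a minimal sketch under stated assumptions: `forward_chunks` and `stream_speech` are hypothetical helpers, and the second one assumes the openai-python v1 client, where `with_streaming_response` yields a response object with an `iter_bytes` method.

```python
from typing import BinaryIO, Iterable


def forward_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each audio chunk to `sink` as it arrives; return total bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        sink.write(chunk)
        sink.flush()  # hand the bytes on immediately instead of buffering
        total += len(chunk)
    return total


def stream_speech(client, text, sink, model="tts-1", voice="alloy"):
    """Stream generated speech into `sink` chunk by chunk.

    Assumption: `client` is an openai-python v1 client.
    """
    with client.audio.speech.with_streaming_response.create(
        model=model, voice=voice, input=text
    ) as response:
        return forward_chunks(response.iter_bytes(chunk_size=4096), sink)
```

Because `forward_chunks` only needs an iterable of bytes, it can be exercised with in-memory data before being pointed at a live response.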

Supported Output Formats

The default response format is "mp3", but other formats like "opus", "aac", "flac", "wav", and "pcm" are available.

  • Opus: For internet streaming and communication, low latency.
  • AAC: For digital audio compression, preferred by YouTube, Android, iOS.
  • FLAC: For lossless audio compression, favored by audio enthusiasts for archiving.
  • WAV: Uncompressed WAV audio, suitable for low-latency applications to avoid decoding overhead.
  • PCM: Similar to WAV but containing the raw samples at 24 kHz (16-bit signed, little-endian), without the header.
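
Since PCM output is described as WAV without the header, raw PCM bytes can be made playable by adding that header back. The sketch below uses Python's standard `wave` module with the 24 kHz, 16-bit parameters given above. The helper name `pcm_to_wav` is hypothetical, and mono audio is an assumption, since the channel count is not stated here.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit signed little-endian PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)    # assumption: mono output
        wav.setsampwidth(2)           # 16-bit samples = 2 bytes each
        wav.setframerate(sample_rate) # 24 kHz, as described above
        wav.writeframes(pcm)
    return buf.getvalue()
```

The returned bytes can be written to a `.wav` file or passed to any player that accepts WAV data.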

Supported Languages

The TTS model generally follows the Whisper model in terms of language support. Whisper supports the following languages, and the TTS model performs well in them even though the current voices are optimized for English:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

You can generate spoken audio in these languages by providing the input text in the language of your choice.

FAQ

How can I control the emotional range of the generated audio?

There is no direct mechanism to control the emotional output of the generated audio. Certain factors, such as capitalization or grammar, may influence the output audio, but our internal tests with these have yielded mixed results.

Can I create a custom copy of my own voice?

No, this is not something we support.

Do I own the output audio files?

Yes, like with all outputs from our API, the person who created them owns the output. You are still required to inform end users that they are hearing audio generated by AI and not a real person talking to them.