PodCraft - AI Tutor x Pixio Documentation

Overview

PodCraft is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

A core innovation of PodCraft is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. PodCraft employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.

The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
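To make the 7.5 Hz frame rate concrete, the arithmetic below counts how many tokenizer frames a 90-minute recording needs. This is only a back-of-the-envelope sketch: the function name is hypothetical, and the 50 Hz comparison rate is an assumed figure chosen for illustration, not a number taken from PodCraft.

```python
# Frame rate quoted above for PodCraft's continuous speech tokenizers.
PODCRAFT_FRAME_RATE_HZ = 7.5

# Assumed comparison rate, for illustration only (not from the PodCraft docs).
ASSUMED_BASELINE_FRAME_RATE_HZ = 50.0


def frames_for_duration(minutes, frame_rate_hz=PODCRAFT_FRAME_RATE_HZ):
    """Return the number of frames needed to cover `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


podcraft_frames = frames_for_duration(90)
baseline_frames = frames_for_duration(90, ASSUMED_BASELINE_FRAME_RATE_HZ)

print(podcraft_frames)   # 40500
print(baseline_frames)   # 270000
```

Under these assumptions, the lower frame rate shortens the sequence the LLM has to process for the same audio length, which is why it matters for long-form generation.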

How to Use PodCraft

Compose a Script

Write the script with one speaker turn per line, each prefixed with its speaker label:

Speaker 0: Hello there, and today we are talking about this awesome new tool in Pixio called PodCraft.
Speaker 1: PodCraft lets you generate podcasts with up to 4 speakers.
Speaker 0: And it also allows you to upload custom voices.
Speaker 1: It's really fast too!

The script must start with Speaker 0, and it can have up to 4 speakers.
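The script rules can be checked before pasting a script into PodCraft. The following is a minimal sketch, not part of PodCraft itself: `parse_script` is a hypothetical helper, and it assumes each turn sits on its own line in the form "Speaker N: text".

```python
import re

# Assumed line shape: "Speaker <number>: <text>", one turn per line.
SPEAKER_LINE = re.compile(r"^Speaker (\d+):\s*(.+)$")
MAX_SPEAKERS = 4  # limit stated in the documentation


def parse_script(script):
    """Split a script into (speaker_id, text) turns and check the rules."""
    turns = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines between turns
        match = SPEAKER_LINE.match(line)
        if match is None:
            raise ValueError(f"Line does not start with 'Speaker N:': {line!r}")
        turns.append((int(match.group(1)), match.group(2).strip()))

    if not turns:
        raise ValueError("Script is empty")
    if turns[0][0] != 0:
        raise ValueError("Script must start with Speaker 0")
    if len({speaker for speaker, _ in turns}) > MAX_SPEAKERS:
        raise ValueError(f"Script uses more than {MAX_SPEAKERS} speakers")
    return turns


example = """Speaker 0: Hello there!
Speaker 1: PodCraft lets you generate podcasts with up to 4 speakers."""
for speaker, text in parse_script(example):
    print(speaker, text)
```

A script that opens with any speaker other than Speaker 0, or that uses more than four distinct speakers, raises a `ValueError` with a message naming the rule that was broken.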

Add Speakers

In the section below, add each speaker by choosing one of the existing voices, or select Custom and upload an MP3 recording of the voice to reproduce.

Fine-Tune Options

Adjust the Generation speed and CFG scale options. A slower generation speed produces higher-quality results.

Generate

Click Generate. Depending on the length of the script, generation can take up to 5 minutes.
