Looking for the right text-to-speech model? The 1.6-billion-parameter Dia might be the one for you. You may also be surprised to hear that this model was created by two undergraduates with zero funding! In this article, you’ll learn about the model, how to access and use it, and see its results so you know exactly what it is capable of. Before using the model, let’s get acquainted with it.
Models trained to take text as input and produce natural speech as output are called text-to-speech models. Dia-1.6B, developed by Nari Labs, belongs to this family. It is an interesting model capable of generating realistic dialogue from a transcript, and it can also produce non-verbal cues such as laughs, sneezes, and whistles. Exciting, isn’t it?
There are two ways in which we can access the Dia-1.6B model:
The first requires a Hugging Face access token and a bit of code in Google Colab. The second is a no-code route that lets us use Dia-1.6B interactively through Hugging Face Spaces.
The model is available on Hugging Face and needs roughly 10 GB of VRAM to run, which the T4 GPU in a Google Colab notebook provides. We’ll demonstrate this with a mini conversation.
Before we begin, let’s get our Hugging Face access token, which will be required to run the code. Go to https://huggingface.co/settings/tokens and generate a token if you don’t have one already.
Make sure to enable the following permissions:
Open a new notebook in Google Colab and add this token in the secrets (the name should be HF_Token):
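To confirm the secret is wired up correctly, a minimal sketch like the one below (assuming you kept the name HF_Token) reads it inside the notebook and logs the session in to Hugging Face:
from google.colab import userdata
from huggingface_hub import login

hf_token = userdata.get("HF_Token")  # Read the token stored in Colab's secrets panel
login(token=hf_token)  # Authenticate this Colab session with Hugging Face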
Note: Switch the runtime to a T4 GPU before running this notebook. Only then will you have the roughly 10 GB of VRAM required to run this model.
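To quickly verify that the runtime switch worked, a small check like this can be run before loading the model:
import torch

# Should print True and the GPU name (e.g. Tesla T4) if the runtime is set correctly
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))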
Let’s now get our hands on the model:
!git clone https://github.com/nari-labs/dia.git
!pip install ./dia
!pip install soundfile
After running the previous commands, restart the session before proceeding.
import soundfile as sf
from dia.model import Dia
import IPython.display as ipd

# Download the model weights from Hugging Face and load them
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] tag the speakers; parentheses add non-verbal cues like (laugh)
text = "[S1] This is how Dia sounds. (laugh) [S2] Don't laugh too much. [S1] (clears throat) Do share your thoughts on the model."

# Generate the dialogue as a waveform array
output = model.generate(text)

sampling_rate = 44100  # Dia uses a 44.1 kHz sampling rate
output_file = "dia_sample.mp3"  # Use a .wav name here if your libsndfile build lacks MP3 support
sf.write(output_file, output, sampling_rate)  # Saving the audio
ipd.Audio(output_file)  # Displaying the audio
Output:
The speech is very human-like, and the model handles non-verbal cues well. It’s worth noting that the results aren’t reproducible by default: the model isn’t conditioned on fixed voice templates, so each run can produce different voices.
Note: You can try fixing the seed of the model to reproduce the results.
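As a rough sketch of that idea, you can fix the common random seeds before calling generate; this assumes Dia’s sampling relies on Python’s, NumPy’s, and PyTorch’s default generators, so it may not pin down every source of randomness:
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix the usual sources of randomness (an assumption about what Dia's sampling uses)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
output = model.generate(text)  # With seeds fixed, repeated runs should be much closer to identical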
Let’s try to clone a voice using the model via Hugging Face Spaces. Here we have the option to use the model directly through the online interface: https://huggingface.co/spaces/nari-labs/Dia-1.6B
Here you can pass the input text, and you can additionally use the ‘Audio Prompt’ field to replicate a voice. I passed the audio we generated in the previous section.
The following text was passed as an input:
[S1] Dia is an open weights text to dialogue model.
[S2] You get full control over scripts and voices.
[S1] Wow. Amazing. (laughs)
[S2] Try it now on Git hub or Hugging Face.
I’ll let you be the judge: do you feel that the model has successfully captured and replicated the earlier voices?
Note: I got multiple errors while generating speech using Hugging Face Spaces; try changing the input text or audio prompt to get the model to work.
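If the web interface keeps failing, one alternative worth trying is calling the same Space programmatically with the gradio_client library. The sketch below only connects and lists the Space’s endpoints, since the exact endpoint names and parameters depend on how the Space’s interface is defined and may change:
!pip install gradio_client

from gradio_client import Client

client = Client("nari-labs/Dia-1.6B")  # Connect to the public Dia-1.6B Space
client.view_api()  # Inspect the available endpoints and their parameters before calling predict()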
Here are a few things that you should keep in mind while using Dia-1.6B:
The model’s results are very promising, especially compared to the competition. Its biggest strength is its support for a wide range of non-verbal communication. The model has a distinct tone and its speech feels natural, but since it isn’t fine-tuned on specific voices, reproducing a particular voice can be difficult. Like any other generative AI tool, this model should be used responsibly.
A. No, you’re not limited to just two speakers. While having two speakers (e.g., [S1] and [S2]) is common for simplicity, you can include more by labeling them as [S1], [S2], [S3], and so on. This is especially useful when simulating group dialogues, interviews, or multi-party conversations. Just be sure to clearly indicate who is speaking in your prompt so the model can correctly follow and generate coherent replies for each speaker. This flexibility allows for more dynamic and context-rich interactions.
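For illustration, and assuming the model honours speaker tags beyond [S2] as described above, a three-speaker prompt passed to the same generate call from earlier could look like this:
# Hypothetical three-speaker transcript; tags beyond [S2] are an assumption based on the answer above
text = "[S1] Welcome to the panel. [S2] Glad to be here. [S3] Same here, thanks for having us. (laughs)"
output = model.generate(text)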
A. No, Dia 1.6B is entirely free to use. It’s an open-access conversational model hosted on Hugging Face, which means there are no subscription fees or licensing costs involved. Whether you’re a student, developer, or researcher, you can access it without any upfront payment. This makes it a great choice for experimentation, prototyping, or educational use.
A. You can use Dia 1.6B directly through Hugging Face Spaces, which provides a web-based interface. This means you don’t need to set up Python environments, install libraries, or worry about GPU availability. Simply visit https://huggingface.co/spaces/nari-labs/Dia-1.6B, and you can interact with the model instantly in your browser.
A. Yes, if you have specific data and want the model to perform better for your domain, Dia 1.6B can be fine-tuned. You’ll need some technical expertise and compute resources, or you can use Hugging Face’s training tools.
A. No hard limits are enforced by default, but Hugging Face Spaces may have rate or session time restrictions to manage server load.