Experience real-time voice output from your ESP32 using Google TTS and a simple I2S audio amp. Speak instructions, time, alerts, or sensor data—no display needed! Overcome MCU limits with clever code chunking and WiFi streaming.

ESP32 Speech Function

Prelude: Over the last 20 years, electronic projects have evolved dramatically. What was once purely electronics has now become hardware computing—a blend of computer science and information technology. Transistors have given way to powerful single-core and multi-core microprocessors, and the complex circuit diagrams of the past have been simplified into logic gates within these microprocessors.

In this context, displays of various shapes and sizes have become an integral part of most electronic projects. Even more strikingly, in the last 5-6 years, AI has advanced to the point where speech capabilities, once primarily the domain of more powerful single-board computers, are now accessible even at the microcontroller (MCU) level. This was unimaginable just a few years ago, given the limited computing power and memory of MCUs, which are designed for small form factors and low power consumption.

Visuals are great, but at times speech capability makes a project many times simpler. Both channels demand two faculties: visuals require seeing and comprehending, and audio requires hearing and comprehending. The advantage of speech is that it can be followed from a distance, which a small electronic project cannot match unless you give it a big screen. Audio can also be understood in a dark room or without looking at the device, unlike visuals or light signals. Finally, audio output is easier to amplify than video output.

Speech capability & AI: Speech capability in technology primarily has two dimensions: Text-to-Speech (TTS) and Speech-to-Text (STT). In this project, we’ll focus on TTS. A few years ago, I purchased a small Chinese speech MCU module that claimed to perform text-to-speech conversion. After countless nights of tinkering and finally cracking its mystery, the result was utterly frustrating. While it did convert text to speech, it only worked at the alphanumeric level. Instead of pronouncing entire words and forming sentences, it merely spelled out individual letters—like this: l e t t e r s.

Take my name, SOMNATH, for example: the module pronounced each letter individually, S O M N A T H. By the time you heard the last letter, the first was long forgotten, making it impossible to reconstruct the word phonetically. It was an utterly nonsensical module that wasted my time!

Analyzing this module, it was clear that it only handled 36 sound units: the 26 letters of the alphabet and the 10 digits (0-9). In essence, if you recorded 36 short voice samples and programmed an MCU to play them back as needed, you could build a similar module, though one equally limited in functionality. In contrast, consider the Oxford English Dictionary, which contains around 600,000 words built from those same 26 letters (excluding proper nouns like Tom, Dick, and Harry). Handling this vast vocabulary at the MCU level with appropriate voice modulation unquestionably requires AI.

On larger computers, software such as espeak or espeak-ng can perform text-to-speech conversion with limited voice modulation. However, AI-generated text-to-speech far outperforms these fixed software solutions in both naturalness and quality.

Project Schematic: The inexpensive MAX98357A I2S amplifier (mono) is connected to an ESP32. Stereo I2S parts such as the UDA1334A are also available, but for simplicity we’re using the mono version here. An I2S amplifier requires three GPIO pins (bit clock, word select, and data), which can be any output-capable pins; GPIO 34 to 39 cannot be used because they are input-only. A 4-ohm speaker is connected to the output; check the specifications on your board for the supported impedance. Note that the speaker has ‘+’ and ‘-’ terminals, and it’s important to connect them correctly. If reversed, the speaker output may be distorted.
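As a quick sanity check on the pin choice, the rule can be written down in a few lines of plain C++. This is only a sketch: the names `I2SPins`, `isInputOnly` and `pinsUsable` are hypothetical, and the pin numbers 26, 25 and 22 are an assumed example, not the wiring of this project. On the original ESP32, GPIO 34 to 39 are the input-only pins.

```cpp
// Hypothetical helper: checks a candidate I2S pin assignment for an ESP32.
struct I2SPins {
  int bclk;  // bit clock            -> MAX98357A BCLK
  int lrc;   // word select          -> MAX98357A LRC
  int dout;  // data out of the ESP32 -> MAX98357A DIN
};

// GPIO 34..39 on the original ESP32 have no output driver.
constexpr bool isInputOnly(int gpio) { return gpio >= 34 && gpio <= 39; }

// True when all three pins are distinct and none is input-only.
// Other reserved pins (for example the flash pins) are not checked here.
constexpr bool pinsUsable(const I2SPins& p) {
  return !isInputOnly(p.bclk) && !isInputOnly(p.lrc) && !isInputOnly(p.dout) &&
         p.bclk != p.lrc && p.bclk != p.dout && p.lrc != p.dout;
}

// Example assignment (an assumption; adjust to your own wiring).
constexpr I2SPins kAmpPins = {26, 25, 22};
static_assert(pinsUsable(kAmpPins),
              "I2S pins must be distinct, output-capable GPIOs");
```

With the `static_assert` in place, a bad assignment such as `{34, 25, 22}` fails at compile time instead of producing a silent amplifier.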


BOM:


| Item | Source | Cost
| ESP32 | amazon.com / robu.in / aliexpress.com | USD $6
| MAX98357A | amazon.com / robu.in / aliexpress.com | USD $3 to $5
| 4-ohm speaker | amazon.com / robu.in / aliexpress.com | USD $1
| 7805 regulator | amazon.com / robu.in / aliexpress.com | USD $1

Operation: The code connects to the Internet using your WiFi credentials. It then sends the input string to the Google TTS endpoint, which returns the speech as audio. The longest sentence I could get spoken in one shot was 265 characters, but up to 200 characters it speaks reliably. This is shown in our first small sketch. The crux of the code is shown here:


void playLongText(String text) {
  const int maxChunkSize = 200;            // reliable limit; 265 sometimes works
  const int textLen = (int)text.length();  // cast once so the comparisons are int vs int
  int startPos = 0;

  while (startPos < textLen) {
    int endPos = min(startPos + maxChunkSize, textLen);
    String chunk = text.substring(startPos, endPos);

    // URL-encode the text chunk (only spaces are handled here)
    chunk.replace(" ", "%20");

    // Construct the URL for Google TTS
    String tts_url = "http://translate.google.com/translate_tts?ie=UTF-8&q=" + chunk + "&tl=en&client=tw-ob";

    // Play the audio
    audio.connecttohost(tts_url.c_str());

    // Wait for the chunk to finish playing
    while (audio.isRunning()) {
      audio.loop();
    }

    // Advance to the next chunk; without this the loop never ends
    startPos = endPos;
  }
}
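The `chunk.replace(" ", "%20")` call above only handles spaces. Characters such as `&`, `?` or `#` in the spoken text would otherwise be read as part of the URL itself. A minimal sketch of fuller percent-encoding follows, written with `std::string` so it can be tried on a PC first. `urlEncode` and `buildTtsUrl` are hypothetical helper names, and the URL is the one used in the sketch.

```cpp
#include <string>

// Percent-encode every byte except the RFC 3986 "unreserved" characters.
std::string urlEncode(const std::string& text) {
  static const char hex[] = "0123456789ABCDEF";
  std::string out;
  out.reserve(text.size() * 3);  // worst case: every byte becomes %XX
  for (unsigned char c : text) {
    const bool unreserved =
        (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.' || c == '~';
    if (unreserved) {
      out += static_cast<char>(c);
    } else {
      out += '%';
      out += hex[c >> 4];
      out += hex[c & 0x0F];
    }
  }
  return out;
}

// Builds the same Google TTS URL as the sketch, with the query fully encoded.
std::string buildTtsUrl(const std::string& chunk) {
  return "http://translate.google.com/translate_tts?ie=UTF-8&q=" +
         urlEncode(chunk) + "&tl=en&client=tw-ob";
}
```

On the ESP32, the result could be passed on as `buildTtsUrl(text).c_str()`. Note that encoding makes the URL longer than the text, so the character limit is best applied to the text before encoding.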

This does not mean it will reliably speak several 200-character chunks back to back. Try it and you will find that it often misses chunks, which I attribute to the limited memory of the ESP32. The first chunk, however, is always spoken properly. The second sketch takes a string from the serial console and speaks it. Short sentences of 3 to 4 words can be spoken continuously, one after another. You can try different combinations in the second sketch.
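Since short pieces play more reliably, one thing to experiment with is splitting the text at word boundaries into chunks well below the 200-character limit, so that no word is cut in half. The following is a sketch under that assumption and has not been tested on the hardware. `splitAtSpaces` is a hypothetical helper, and the chunk size is a parameter to tune.

```cpp
#include <string>
#include <vector>

// Split text into chunks of at most maxLen characters, breaking at the last
// space inside each window. A single word longer than maxLen is cut hard.
std::vector<std::string> splitAtSpaces(const std::string& text,
                                       std::size_t maxLen) {
  std::vector<std::string> chunks;
  if (maxLen == 0) return chunks;

  std::size_t pos = 0;
  while (pos < text.size()) {
    // Skip spaces left over from the previous break.
    while (pos < text.size() && text[pos] == ' ') ++pos;
    if (pos >= text.size()) break;

    std::size_t end = pos + maxLen;
    if (end >= text.size()) {
      end = text.size();  // the remainder fits in one chunk
    } else {
      // Look for the last space at or before position `end`.
      const std::size_t space = text.rfind(' ', end);
      if (space != std::string::npos && space > pos) end = space;
    }
    chunks.push_back(text.substr(pos, end - pos));
    pos = end;
  }
  return chunks;
}
```

Each returned chunk can then be encoded and played in turn, in the same way `playLongText` handles its fixed-size chunks.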

Aftermath: Overall, short text-to-speech conversion, which was until recently beyond the ambit of MCUs, is now possible by using Google TTS and playing the result on a low-powered MCU as small as the ESP32! Possible uses include:


  1. Voice output for simple results, such as a classification result or a yes/no answer.
  2. Speaking out roll calls.
  3. Speaking out instructions.
  4. A speaking clock that announces the time every minute, half hour, or hour.
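For the speaking clock, the text handed to the TTS call has to be built from the current hour and minute. A minimal sketch of that formatting step is shown below. `timeToSpeech` is a hypothetical helper, the wording of the phrase is an assumption, and obtaining the time itself (for example over NTP) is left out.

```cpp
#include <string>

// Turn a 24-hour time into a short phrase suitable for text-to-speech.
// Returns an empty string for out-of-range input.
std::string timeToSpeech(int hour24, int minute) {
  if (hour24 < 0 || hour24 > 23 || minute < 0 || minute > 59) return "";

  const char* suffix = (hour24 < 12) ? "A M" : "P M";
  int hour12 = hour24 % 12;
  if (hour12 == 0) hour12 = 12;  // 0:xx and 12:xx are both spoken as "12"

  std::string phrase = "The time is " + std::to_string(hour12);
  if (minute == 0) {
    phrase += " o'clock";
  } else if (minute < 10) {
    phrase += " oh " + std::to_string(minute);  // e.g. "9 oh 5"
  } else {
    phrase += " " + std::to_string(minute);
  }
  phrase += " ";
  phrase += suffix;
  return phrase;
}
```

The resulting phrase is short enough to fall into the range that the article reports as reliable.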

Video links:

Link1 : https://youtu.be/LwBFdaJjhvc

Link2 : https://youtu.be/Sk2DlSlc1vo



Kolkata

Somnath Bera