
Deep Learning Speech Commands Recognition on ESP32

Train a neural network model in 10 minutes, and use it on an ESP32 with MicroPython to control a light switch. Everything is done in the browser.

Hackster.io


Tinkerdoodle DIY

Things used in this project

SG90 Micro-servo motor


Software apps and online services

Tinkerdoodle online IDE

You can get a speech commands model with your own words and run it in 10 minutes!

Inexpensive ESP32 chips are hugely popular today, and I wanted to see how well one can run a deep neural network. The M5StickC is ESP32-powered with a built-in microphone, which comes in handy for a speech recognition project.

There are various tutorials on how to train and run a speech commands model on an ESP32. However, most of these tutorials train the model using the Google speech commands dataset, which is large but only has 20+ predefined speech commands. Also, the training must be done on a powerful machine, which can be a barrier for beginner makers.

So I decided to do things differently. The model training is split into two parts: base model training and custom model training. The base model is trained using the full Google speech commands dataset, and it serves as the feature extractor for the custom model. The custom model is trained using TensorFlow.js in the browser. It takes far fewer samples to train a custom model than a base model; you can get pretty good recognition with as few as 50 samples.

Furthermore, the base model is compiled into a custom MicroPython firmware, and the custom model is loaded dynamically as a Python module. The M5StickC is able to run one model inference in 220 ms, which is pretty impressive.

I've shared the custom model training UI and the custom MicroPython firmware, so you can try it out with minimal coding at the Tinkerdoodle online IDE.

The model training UI is shared at this Tinkerdoodle page. It is written in JavaScript, so you can view the page source to see how it is done, if you're interested. In my demo video, I got a pretty good model to recognize two words ("dark" and "bright") in 5 minutes. You can do it too!

The MicroPython code, and instructions on how to flash the custom firmware, are available in this Jupyter notebook. Please note that the stock firmware on the M5StickC won't work.

The demo video also serves as a step-by-step tutorial.

Give it a try and let me know what you think!

Connect SG-90 micro servo to M5StickC


Model training using TensorFlow.js

MicroPython program running on ESP32


ESP32 Speech Synthesizer Experiment With XFS5152CE


Introduction: ESP32 Speech Synthesizer Experiment With XFS5152CE


My recent microcontroller experiments have been about voice and AI. This reminds me of the text-to-speech synthesizer experiment I did a while back -- Arduino Speech Synthesizer Experiment with XFS5152CE.

Being able to synthesize arbitrary text with a small chip -- XFS5152CE -- without connecting to a server, is definitely an advantage. Even though the synthesized English sounds robotic, Putonghua sounds pretty human-like to me.

In this post, I will demonstrate that experiment again, with some code refactoring enhancements. Nevertheless, the core of everything is basically the same as described in the video.

Step 1: The UI


DumbDisplay will be used as UI for this experiment.

With the UI, you will be able to trigger text-to-speech synthesis in two ways.

You can click on the News button to have a piece of headline news acquired from the news web service NewsApi.

The piece of headline news certainly contains some text about the headline; it might also contain an image URL to accompany the headline. Both the text and the image will be displayed. And since this experiment is about text-to-speech synthesis with XFS5152CE, you will certainly also hear the headline text synthesized.

You can also input your own text for synthesis. Simply click on the text area (text layer); a keyboard will pop up allowing you to enter whatever text you desire. After receiving the text as "feedback", XFS5152CE will be requested to synthesize the text you entered.

As mentioned previously, my feeling is that XFS5152CE synthesizes Putonghua better. Indeed, the UI provides the option to acquire "English only" or "English & Chinese" news headlines. You switch between the options by clicking the English button.

To help visualize, you may want to watch the demo in the video Arduino Speech Synthesizer Experiment with XFS5152CE.

Step 2: Connections


The ESP32 board will be powered by an external 5V power source, which will also power the XFS5152CE board. Please note that the XFS5152CE board here is to be powered with 5V; some other XFS5152CE boards might be powered with 3.3V.

  • Connect 5V of ESP32 to the positive terminal of the 5V power source
  • Connect GND of ESP32 to the negative terminal of the 5V power source
  • Connect GPIO17 of ESP32 to RXD of XFS5152CE
  • Connect GPIO16 of ESP32 to TXD of XFS5152CE
  • Connect DC5V of XFS5152CE to the positive terminal of the 5V power source
  • Connect GND of XFS5152CE to the negative terminal of the 5V power source
  • Connect speaker pins of XFS5152CE to the two terminals of the speaker

You may notice that this XFS5152CE board only exposes the pins for UART. Indeed, in this experiment, ESP32 will communicate with XFS5152CE using UART.

Step 3: Preparation


In order to be able to compile and run the sketch shown here, you will first need to install the DumbDisplay Arduino library. Open your Arduino IDE, go to the menu item Tools | Manage Libraries, and type "dumbdisplay" in the search box there.

On the other side -- your Android phone side -- you will need to install the DumbDisplay Android app.

Step 4: The Sketch

You can download the sketch here.

To connect to the DumbDisplay app, the sketch will make use of the ESP32's Bluetooth support; i.e. the connection is via Bluetooth, with the name "ESP32".

To communicate with the XFS5152CE, the sketch will make use of the ESP32's UART2 with baud rate 115200.
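For reference, this kind of initialization in an ESP32 Arduino sketch typically looks like the snippet below; the pin assignment matches the wiring in Step 2, though the author's actual sketch may differ slightly:

// Minimal sketch (assumed, not the author's exact code): open UART2 to the
// XFS5152CE at 115200 baud; RX on GPIO16 (chip's TXD), TX on GPIO17 (chip's RXD).
void setupXfsUart() {
  Serial2.begin(115200, SERIAL_8N1, 16, 17);
}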

The various UI actions are triggered by "feedbacks" for the different layers, and those "feedbacks" are handled by the "feedback" handler FeedbackHandler, registered to the different layers.

Here is the "feedback" handler FeedbackHandler

It simply changes the "pending value" -- a helper object that tracks a newly set value -- of different global variables, according to which layer has "feedback".

If the langsButton layer has "feedback" (clicked), the englishOnly boolean "pending value" is toggled. Notice that the "pending value" is initially set to true.

If the newsButton layer has "feedback" (clicked), the requestNews boolean "pending value" is set to true.

If the textLayer has "feedback" with the text you entered for synthesizing, the text will be assigned to adhocText string "pending value".

In the loop block, the setting of the "pending values" will be detected and "acknowledged" (i.e. handled).
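The actual helper objects come from the DumbDisplay library; purely to illustrate the pattern, a generic version might look like this (illustrative names, not the library's actual types):

// Illustrative "pending value" holder: the "feedback" handler sets a value,
// and loop() detects and acknowledges it exactly once.
template <typename T>
struct PendingValue {
  T value;
  bool pending = false;
  void set(const T &v) { value = v; pending = true; }  // called by the handler
  bool acknowledge(T &out) {                           // called from loop()
    if (!pending) return false;
    out = value;
    pending = false;
    return true;
  }
};

PendingValue<bool> englishOnly;   // toggled when langsButton has "feedback"
PendingValue<bool> requestNews;   // set when newsButton has "feedback"
PendingValue<String> adhocText;   // set when textLayer has "feedback"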

As mentioned previously, headline news will be acquired from the news web service NewsApi, and in order to be able to call the service, you will need an API key, which you can get from NewsApi.

To specify your API key, you define it with the macro NEWSAPI_API_KEY.

And the API key is used to construct the news web service API endpoint like
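That snippet is not reproduced in this capture; based on NewsApi's documented top-headlines endpoint and the parameters mentioned below, the construction is presumably along these lines (the exact string in the original sketch may differ):

#define NEWSAPI_API_KEY "your-api-key-here"

// Assumed reconstruction: NewsApi top-headlines endpoint with the
// "pageSize", "category" and "country" parameters described below.
String newsEndpoint =
    String("https://newsapi.org/v2/top-headlines") +
    "?pageSize=1&category=general&country=us" +
    "&apiKey=" + NEWSAPI_API_KEY;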

A piece of headline news is acquired via DumbDisplay app in the subroutine HandleGetAnotherNews like

Please note that the parameters "pageSize", "category" and "country" are added to the endpoint in order to customize what is returned.

The result of the API call is a JSON like
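The response itself is not shown in this capture; an abbreviated example of the JSON shape that NewsApi's top-headlines call returns (fields other than the ones used here are omitted):

{
  "status": "ok",
  "totalResults": 1,
  "articles": [
    {
      "title": "<headline text>",
      "urlToImage": "<image URL>"
    }
  ]
}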

DumbDisplay app will extract results from the JSON as "id-value" pairs. For example, the "id" of the entry for "title" is "article.0.title", and the "value" is "<headline text>". The extracted "id-value" pairs will be passed back to ESP32 as "feedbacks". Here is how the sketch reads the "id-value" pairs

After getting the title, the subroutine SynthesizeVoice is called to synthesize the text.

And in case an image is associated with that piece of headline news, the image at imageUrl is downloaded to the DumbDisplay app and displayed like

Here is the subroutine SynthesizeVoice

As shown, the text to synthesize will first be converted to UTF-16 (from UTF-8, since Arduino String is UTF-8). The data (the converted text) is sent to the XFS5152CE via UART, with the needed headers.
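The subroutine itself is in the downloadable sketch; a simplified sketch of the framing follows. The frame layout (0xFD header, two-byte big-endian data-area length, command byte 0x01, then an encoding byte) follows the commonly documented XFS5152CE UART protocol, but the Unicode encoding code 0x03 is an assumption to verify against your board's datasheet:

// Simplified sketch (assumptions noted above): send UTF-16LE text to the
// XFS5152CE over Serial2 as a single speech-synthesis frame.
void xfsSynthesize(const uint8_t *utf16le, uint16_t byteLen) {
  uint16_t dataLen = byteLen + 2;          // command byte + encoding byte + text
  Serial2.write((uint8_t)0xFD);            // frame header
  Serial2.write((uint8_t)(dataLen >> 8));  // data-area length, high byte first
  Serial2.write((uint8_t)(dataLen & 0xFF));
  Serial2.write((uint8_t)0x01);            // command: start synthesis
  Serial2.write((uint8_t)0x03);            // text encoding: Unicode (assumed)
  Serial2.write(utf16le, byteLen);         // the UTF-16LE text bytes
}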

Similarly, arbitrary text you entered for synthesizing is handled by the subroutine HandleAdhocText like

Step 5: Enjoy!

Have fun with this speech synthesizer experiment! Enjoy!

Peace be with you! May God bless you! Jesus loves you!


DroneBot Workshop Forums


ESP32 I2S voice memo recorder with text to speech and language translation


This is my latest. It's an ESP32 voice memo recorder with Azure-powered speech to text and language translation… looks like we can finally finish building that great big tower we started on all those centuries ago.

In this video, I give an explanation of how I made this, including a description of the code used.

The code is available from my github repo:

https://github.com/jonathanrandall/esp32_voice_memo

I thought I'd share it here because I was inspired by Bill's article and YouTube video on I2S. I was waiting in the car for my daughter to finish orchestra practice (it's a long practice and a long drive, so I can't drop her and come back), and I decided to watch the Dronebot workshop video on I2S. I thought it was really cool. So, I decided to do a project. 

jjs357

Looks like a neat project. I am considering adding speech-to-text functionality for a drone controller project I am working on that uses ESP32 WiFi to control a Tello Edu model drone. I will be pairing the ESP32 circuit with a Raspberry Pi that will provide the speech conversion. I am looking at the Google APIs for speech, but there are a lot of account hoops to jump through. Maybe Azure is better in that regard. I may have more questions once I watch your video all of the way through.

@jjs357 Hi. It's quite simple to set up the Azure stuff, and I haven't supplied any credit card information. I'm not sure how long it will take for the free sign-up credit to run out or if it gets renewed. I've been a member for about six months and used it in a few projects, but still, the pricing for Azure doesn't seem very transparent. You can also try Facebook's wit.ai, to see if they do text to speech.

@jonnyr Thanks for the encouragement. Based on what I saw here:

https://deepgram.com/built-with-deepgram/voice-controlled-car/

I think I will give Deepgram a first shot: no-cost API key, no credit card needed, and some free speech recognition credits, plus lots of Python examples on Deepgram's blog. My use case is a Raspberry Pi providing the WiFi connection to Deepgram's speech recognition, with a USB serial connection to my ESP32 device, which connects through its WiFi to the drone. Spoken words become text that is then fed to the drone command by command. Text commands are already implemented, so I am hoping the speech-to-command text is a simple add-on. The question I have is what the real-time response is like from when a spoken command is uttered to when it can be sent to the drone. I don't expect joystick-level response times, of course.


DEV Community


Panlee123

Posted on Aug 5, 2022

Let's Talk Espressif ESP32-S3 Voice--Text-to-Speech (TTS)

We all know that Espressif's ESP32 module is very famous. Today, let's talk about the Chinese speech synthesis routine in Espressif's voice assistant framework, ESP-Skainet.

Compile the original routine

Simplify and analyze the original routine

The original routine is roughly divided into two functions. The first function speaks the sentence "Lexin speech synthesis", and the other speaks text input through the serial port.

To summarize: TTS is over-encapsulated, which to a certain extent limits what you can do with it. Moreover, judging from the routines I have run, the audio still has pronunciation problems; compared with mature TTS solutions, Espressif's TTS still has a certain gap, and this shortcoming may keep it from being used in commercial projects. If text to speech is involved in a project, one option is to send text through the API provided by a cloud platform and receive PCM audio in return. Alternatively, if the vocabulary is limited, the corresponding audio can be stored in the file system and the specified content played through a mapping, splicing clips together into a complete sentence. For example, with clips for "Alipay Collection", "Yuan", "One", "Ten", "Hundred", "Thousand" and "Ten thousand", you can basically achieve the Alipay voice-broadcast function by piecing the audio together.



Command Word 

MultiNet command word recognition model.

MultiNet is a lightweight model designed to recognize multiple speech command words offline based on ESP32. Currently, up to 200 speech commands, including customized commands, are supported.

  • Supports Chinese speech commands recognition
  • Supports user-defined commands
  • Supports adding / deleting / modifying commands during operation
  • Supports up to 200 commands
  • Supports single recognition and continuous recognition
  • Lightweight and low resource consumption
  • Low delay, within 500 ms
  • The model is partitioned separately, so users can apply OTA updates

The MultiNet input is audio processed by the audio front-end (AFE) algorithm, in 16 kHz, 16-bit, mono format. By recognizing these audio signals, speech commands can be recognized.

Please refer to Models Benchmark to check the models supported by Espressif SoCs.

For details on flashing models, see Section Flashing Models.

Models ending with Q8 represent the 8-bit version of the model, which is more lightweight.

Commands Recognition Process 

Please see the flow diagram for commands recognition below:

[Flow diagram: speech command recognition system]

Speech Commands Customization Methods 

Mixed Chinese and English is not supported in command words.

Command words cannot contain Arabic numerals or special characters.

Please refer to Chinese version documentation for Chinese speech commands customization methods.

MultiNet7 customize speech commands 

MultiNet7 uses phonemes for English speech commands. Please modify the text file model/multinet_model/fst/commands_en.txt in the following format:

# command_id,command_grapheme,command_phoneme
1,tell me a joke,TfL Mm c qbK
2,sing a song,Sgl c Sel

Column 1: command ID; it should start from 1 and cannot be set to 0.

Column 2: command_grapheme, the command sentence. It is recommended to use lowercase letters unless it is an acronym that is meant to be pronounced differently.

Column 3: command_phoneme, the phoneme sequence of the command, which is optional. To fill this column, please use tool/multinet_g2p.py to do the Grapheme-to-Phoneme conversion and paste the results into the third column correspondingly (this is the recommended way).

If Column 3 is left empty, an internal Grapheme-to-Phoneme tool will be called at runtime, but there might be a small accuracy drop due to the different Grapheme-to-Phoneme algorithms used.

MultiNet6 customize speech commands 

MultiNet6 uses graphemes for English speech commands, so you can add/modify speech commands by words directly. Please modify the text file model/multinet_model/fst/commands_en.txt in the following format:

# command_id,command_grapheme
1,TELL ME A JOKE
2,MAKE A COFFEE

Column 2: command_grapheme, the command sentence. It is recommended to use all capital letters.

The extra column in the default commands_en.txt is there to keep it compatible with MultiNet7; there is no need to fill the third column when using MultiNet6.

MultiNet5 customize speech commands 

MultiNet5 uses phonemes for English speech commands. For simplicity, characters are used to denote different phonemes. Please use tool/multinet_g2p.py to do the conversion.

Via menuconfig

Navigate to idf.py menuconfig > ESP Speech Recognition > Add Chinese speech commands/Add English speech commands to add speech commands. For details, please refer to the example in ESP-Skainet.

Please note that a single command ID can correspond to more than one command phrase. For example, “da kai kong tiao” and “kai kong tiao” have the same meaning, so users can assign the same command ID to both and separate them with “,” (no space required before or after).

Then call the following API:

/**
 * @brief Update the speech commands of MultiNet by menuconfig
 *
 * @param multinet   The multinet handle
 * @param model_data The model object to query
 *
 * @return
 *     - ESP_OK                Success
 *     - ESP_ERR_INVALID_STATE Fail
 */
esp_err_t esp_mn_commands_update_from_sdkconfig(esp_mn_iface_t *multinet, const model_iface_data_t *model_data);

Customize Speech Commands Via API calls 

Alternatively, speech commands can be modified via API calls; this method works for MultiNet5, MultiNet6 and MultiNet7.

MultiNet5 requires the input command string to be phonemes, while MultiNet6 and MultiNet7 only accept grapheme inputs to API calls.

Apply the new changes. The add/remove/modify/clear actions will not take effect until this function is called:

/**
 * @brief Update the speech commands of MultiNet
 *
 * @warning Must be used after the [add/remove/modify/clear] functions,
 *          otherwise the language model of MultiNet cannot be updated.
 *
 * @return
 *     - NULL   Success
 *     - others The list of error phrases which cannot be parsed by MultiNet.
 */
esp_mn_error_t *esp_mn_commands_update();

Note: The modifications will not be applied, thus not printed out, until you call esp_mn_commands_update().

Add a new speech command. This will return ESP_ERR_INVALID_STATE if the input string is not in the correct format.

/**
 * @brief Add one speech command with command string and command ID
 *
 * @param command_id The command ID
 * @param string     The command string of the speech command
 *
 * @return
 *     - ESP_OK                Success
 *     - ESP_ERR_INVALID_STATE Fail
 */
esp_err_t esp_mn_commands_add(int command_id, char *string);

Remove a speech command. This will return ESP_ERR_INVALID_STATE if the command does not exist.

/**
 * @brief Remove one speech command by command string
 *
 * @param string The command string of the speech command
 *
 * @return
 *     - ESP_OK                Success
 *     - ESP_ERR_INVALID_STATE Fail
 */
esp_err_t esp_mn_commands_remove(char *string);

Modify a speech command. This will return ESP_ERR_INVALID_STATE if the command does not exist.

/**
 * @brief Modify one speech command with a new command string
 *
 * @param old_string The old command string of the speech command
 * @param new_string The new command string of the speech command
 *
 * @return
 *     - ESP_OK                Success
 *     - ESP_ERR_INVALID_STATE Fail
 */
esp_err_t esp_mn_commands_modify(char *old_string, char *new_string);

Clear all speech commands.

/**
 * @brief Clear all speech commands in the linked list
 *
 * @return
 *     - ESP_OK                Success
 *     - ESP_ERR_INVALID_STATE Fail
 */
esp_err_t esp_mn_commands_clear(void);

Print the cached speech commands. This function prints all cached speech commands; cached speech commands are applied after esp_mn_commands_update() is called.

/**
 * @brief Print all commands in the linked list.
 */
void esp_mn_commands_print(void);

Print the active speech commands. This function prints all active speech commands.

/**
 * @brief Print all active commands.
 */
void esp_mn_active_commands_print(void);
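As a sketch of how these calls fit together at runtime (assuming a MultiNet6/7 model, so the strings are graphemes, and that the command list has already been initialized for the loaded model):

// Replace the active command set, then apply and inspect it.
esp_mn_commands_clear();                         // remove all existing commands
esp_mn_commands_add(1, "turn on the light");     // two phrases sharing ID 1
esp_mn_commands_add(1, "switch on the light");
esp_mn_commands_add(2, "turn off the light");

esp_mn_error_t *err = esp_mn_commands_update();  // nothing takes effect before this
if (err != NULL) {
    // err holds the list of phrases MultiNet could not parse
}
esp_mn_commands_print();                         // show the cached command list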

Use MultiNet 

We suggest using MultiNet together with the audio front end (AFE) in ESP-SR. For details, see Section AFE Introduction and Use.

After configuring AFE, users can follow the steps below to configure and run MultiNet.

Initialize MultiNet 

Load and initialize MultiNet. For details, see Section flash_model.

Customize speech commands. For details, see Section Speech Commands Customization Methods.

Run MultiNet 

Users can start MultiNet after enabling AFE and WakeNet, but must pay attention to the following limitations:

  • The frame length of MultiNet must be equal to the AFE fetch frame length
  • The supported audio format is 16 kHz, 16-bit, mono; the data obtained by AFE fetch is also in this format

Get the length of the frame that needs to be passed to MultiNet:

int mu_chunksize = multinet->get_samp_chunksize(model_data);

mu_chunksize is the number of int16_t samples in each frame passed to MultiNet. This is exactly the same as the number of data points per frame obtained from AFE.

Start the speech recognition

We send the data from AFE fetch to the following API:

esp_mn_state_t mn_state = multinet->detect(model_data, buff);

The length of buff is mu_chunksize * sizeof(int16_t).

MultiNet Output 

Speech command recognition must be used with WakeNet. After wake-up, MultiNet detection can start.

After running, MultiNet returns the recognition output of the current frame in real time as mn_state, which is currently divided into the following identification states:

ESP_MN_STATE_DETECTING

Indicates that MultiNet is detecting but the target speech command word has not been recognized yet.

ESP_MN_STATE_DETECTED

Indicates that the target speech command has been recognized. At this time, the user can call the get_results interface to obtain the recognition results:

esp_mn_results_t *mn_result = multinet->get_results(model_data);

The recognition result is stored in the return value of the get_results API in the following format:

typedef struct {
    esp_mn_state_t state;
    int num;                               // The number of phrases in the list, num <= 5. When num = 0, no phrase is recognized.
    int phrase_id[ESP_MN_RESULT_MAX_NUM];  // The list of phrase IDs.
    float prob[ESP_MN_RESULT_MAX_NUM];     // The list of probabilities.
} esp_mn_results_t;

where:

  • state is the recognition status of the current frame
  • num is the number of recognized commands; num <= 5, i.e. up to 5 possible results are returned
  • phrase_id is the list of phrase IDs of the speech commands
  • prob is the list of recognition probabilities of the recognized entries, arranged from largest to smallest

Users can use phrase_id[0] and prob[0] to get the recognition result with the highest probability.

ESP_MN_STATE_TIMEOUT

Indicates that no speech command has been detected for a long time; MultiNet will exit automatically and wait to be woken up again.

Single recognition mode and continuous recognition mode:

  • Single recognition mode: exit speech recognition when the return status is ESP_MN_STATE_DETECTED
  • Continuous recognition mode: exit speech recognition when the return status is ESP_MN_STATE_TIMEOUT
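Putting the pieces above together, a skeleton of a single-recognition loop might look like the following; model loading, AFE creation and the task that feeds microphone data to the AFE are omitted, and the exact initialization calls should be checked against the ESP-Skainet examples for your ESP-SR version:

// Skeleton only: afe_handle/afe_data and mn_name come from AFE/model init.
esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
model_iface_data_t *model_data = multinet->create(mn_name, 6000); // 6 s timeout

int mu_chunksize = multinet->get_samp_chunksize(model_data); // equals AFE frame size

while (true) {
    afe_fetch_result_t *res = afe_handle->fetch(afe_data);   // one processed frame
    esp_mn_state_t mn_state = multinet->detect(model_data, res->data);

    if (mn_state == ESP_MN_STATE_DETECTED) {
        esp_mn_results_t *mn_result = multinet->get_results(model_data);
        printf("command %d, prob %f\n", mn_result->phrase_id[0], mn_result->prob[0]);
        break;  // single recognition mode: exit on detection
    } else if (mn_state == ESP_MN_STATE_TIMEOUT) {
        break;  // no command for a while: go back to waiting for the wake word
    }
}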

Resource Occupancy 

For the resource occupancy of this model, see Resource Occupancy.


ESP32 based Webserver for Text to Speech (TTS) Conversion

Text to speech (TTS) has been used in many applications like voice assistants, announcements, and ATMs. It is also used to help children learn to speak and to help blind people communicate. Today we will use the ESP32 to build a TTS (text to speech) engine which can convert any number into speech. The number will be entered from a webpage, and the speech will be generated from a speaker attached to the ESP32. Also check out other ESP32 based IoT projects.

Required Components

  • ESP32 Microcontroller
  • An Amplifier Circuit
  • Regulated Power Supply
  • Connecting Wires

Circuit Diagram

We could drive the speaker directly by connecting it to the ESP32, but the sound would have a lot of noise, so an amplifier circuit is used here to get a clear and loud sound. An LM386 based amplifier circuit is used; the circuit diagram is given below:

LM386 Based Audio Amplifier Circuit

This circuit diagram has been taken from this link: LM386 Based Audio Amplifier Circuit. Learn more about the functioning of this amplifier by going through the link.

Connecting Speaker to ESP32

After making the above amplifier circuit, connect the ESP32 to it as shown below. Connect digital pin 25 of the ESP32 to the 10K resistor, and the ground of the ESP32 to the ground of the amplifier circuit.

ESP32 Text To Speech Converter Project Circuit Diagram

Programming ESP32 for Text To Speech

The programming part for converting text into speech using the ESP32 is simple. The complete code with a demo video is given at the end of this tutorial.

First we have to install a library called Talkie, which can be downloaded from here. Then add it to the Arduino IDE by going to Sketch -> Include Library -> Add .ZIP Library.

Then start writing the code by including all the required libraries. WiFi.h and WiFiClient.h are used to create a client that connects to the access point over Wi-Fi. WebServer.h is used to create the web server, and ESPmDNS.h is used for local mDNS requests.

Next define an object voice for Talkie.

In the next portion, we define the numbers (i.e. one, two, three, etc.). You can also add more words/phonemes by recording the sound for each one and converting it into hex code. Various tools are available for converting audio into hex, like Binary Viewer.

Now define a function which can speak out any number between -999,999 and 999,999. This function can be found in the Talkie library. It uses simple logic: first it checks whether the number is negative, and if so, it says "minus" before saying the rest. If the number is zero, it speaks "zero".

Then it checks whether the number is greater than or equal to one thousand, so that it can say "thousand" in the right place, and then takes the remainder of dividing by 1000 to speak the digits after the thousands. The same logic is used for the hundreds, i.e. three-digit numbers.

Now set up the Wi-Fi. Replace the Wi-Fi name and password in the code with your own. Because we are using the HTTP protocol, we pass 80 to the server() constructor, since 80 is the default port number for HTTP.

Now define an array htmlResponse that will hold the HTML page served to the browser.

Next define a function handleRoot() to create the HTML page. The snprintf() function is used to produce the HTML page, passing the htmlResponse array to it along with its size. First comes the heading "Text To Speech", then the textbox and button are created in HTML.

Next comes the JavaScript code. It defines a variable Number for the input we are getting and assigns the value entered in the text box to Number. At the end, the webpage is sent out by the server.

In the next portion, define a function handleSave(). Here we convert the string into an integer, because the input we receive is a string, and to speak it out we have to use it as an integer.

Now, in the setup() function, we first define digital pin 25 as an output and keep it high. Then we initialize the Wi-Fi using the WiFi.begin() function and print some status messages. In the next lines we register handleRoot() to serve the page and handleSave() to receive the input from the webpage, start the server, and finally print the message "HTTP server started".


Finally, upload the code to the ESP32 using the Arduino IDE and run it. Open the serial monitor and copy the IP address displayed there, as shown in the image below.

Text to Speech conversion using ESP32 via webpage

The webpage will look like this:

Webpage for Text to Speech conversion using ESP32

Enter any number and hit the Speak button, and the system will speak the number.

So this is how the ESP32 can be used to convert text into speech; this webpage can even be opened from anywhere in the world by forwarding the port in your Wi-Fi router.

// Libraries
#include <WiFi.h>
#include <WiFiClient.h>
#include <WebServer.h>
#include <ESPmDNS.h>
#include <Talkie.h>

Talkie voice;

const uint8_t spZERO[]     PROGMEM = {0x69, 0xFB, 0x59, 0xDD, 0x51, 0xD5, 0xD7, 0xB5, 0x6F, 0x0A, 0x78, 0xC0, 0x52, 0x01, 0x0F, 0x50, 0xAC, 0xF6, 0xA8, 0x16, 0x15, 0xF2, 0x7B, 0xEA, 0x19, 0x47, 0xD0, 0x64, 0xEB, 0xAD, 0x76, 0xB5, 0xEB, 0xD1, 0x96, 0x24, 0x6E, 0x62, 0x6D, 0x5B, 0x1F, 0x0A, 0xA7, 0xB9, 0xC5, 0xAB, 0xFD, 0x1A, 0x62, 0xF0, 0xF0, 0xE2, 0x6C, 0x73, 0x1C, 0x73, 0x52, 0x1D, 0x19, 0x94, 0x6F, 0xCE, 0x7D, 0xED, 0x6B, 0xD9, 0x82, 0xDC, 0x48, 0xC7, 0x2E, 0x71, 0x8B, 0xBB, 0xDF, 0xFF, 0x1F}; const uint8_t spONE[]      PROGMEM = {0x66, 0x4E, 0xA8, 0x7A, 0x8D, 0xED, 0xC4, 0xB5, 0xCD, 0x89, 0xD4, 0xBC, 0xA2, 0xDB, 0xD1, 0x27, 0xBE, 0x33, 0x4C, 0xD9, 0x4F, 0x9B, 0x4D, 0x57, 0x8A, 0x76, 0xBE, 0xF5, 0xA9, 0xAA, 0x2E, 0x4F, 0xD5, 0xCD, 0xB7, 0xD9, 0x43, 0x5B, 0x87, 0x13, 0x4C, 0x0D, 0xA7, 0x75, 0xAB, 0x7B, 0x3E, 0xE3, 0x19, 0x6F, 0x7F, 0xA7, 0xA7, 0xF9, 0xD0, 0x30, 0x5B, 0x1D, 0x9E, 0x9A, 0x34, 0x44, 0xBC, 0xB6, 0x7D, 0xFE, 0x1F}; const uint8_t spTWO[]      PROGMEM = {0x06, 0xB8, 0x59, 0x34, 0x00, 0x27, 0xD6, 0x38, 0x60, 0x58, 0xD3, 0x91, 0x55, 0x2D, 0xAA, 0x65, 0x9D, 0x4F, 0xD1, 0xB8, 0x39, 0x17, 0x67, 0xBF, 0xC5, 0xAE, 0x5A, 0x1D, 0xB5, 0x7A, 0x06, 0xF6, 0xA9, 0x7D, 0x9D, 0xD2, 0x6C, 0x55, 0xA5, 0x26, 0x75, 0xC9, 0x9B, 0xDF, 0xFC, 0x6E, 0x0E, 0x63, 0x3A, 0x34, 0x70, 0xAF, 0x3E, 0xFF, 0x1F}; const uint8_t spTHREE[]    PROGMEM = {0x0C, 0xE8, 0x2E, 0x94, 0x01, 0x4D, 0xBA, 0x4A, 0x40, 0x03, 0x16, 0x68, 0x69, 0x36, 0x1C, 0xE9, 0xBA, 0xB8, 0xE5, 0x39, 0x70, 0x72, 0x84, 0xDB, 0x51, 0xA4, 0xA8, 0x4E, 0xA3, 0xC9, 0x77, 0xB1, 0xCA, 0xD6, 0x52, 0xA8, 0x71, 0xED, 0x2A, 0x7B, 0x4B, 0xA6, 0xE0, 0x37, 0xB7, 0x5A, 0xDD, 0x48, 0x8E, 0x94, 0xF1, 0x64, 0xCE, 0x6D, 0x19, 0x55, 0x91, 0xBC, 0x6E, 0xD7, 0xAD, 0x1E, 0xF5, 0xAA, 0x77, 0x7A, 0xC6, 0x70, 0x22, 0xCD, 0xC7, 0xF9, 0x89, 0xCF, 0xFF, 0x03}; const uint8_t spFOUR[]     PROGMEM = {0x08, 0x68, 0x21, 0x0D, 0x03, 0x04, 0x28, 0xCE, 0x92, 0x03, 0x23, 0x4A, 0xCA, 0xA6, 0x1C, 0xDA, 0xAD, 0xB4, 0x70, 0xED, 0x19, 0x64, 0xB7, 0xD3, 0x91, 0x45, 0x51, 0x35, 0x89, 0xEA, 0x66, 0xDE, 0xEA, 0xE0, 0xAB, 0xD3, 0x29, 0x4F, 0x1F, 0xFA, 0x52, 0xF6, 0x90, 0x52, 0x3B, 0x25, 0x7F, 0xDD, 0xCB, 0x9D, 0x72, 0x72, 0x8C, 0x79, 0xCB, 0x6F, 0xFA, 0xD2, 0x10, 0x9E, 0xB4, 0x2C, 0xE1, 0x4F, 0x25, 0x70, 0x3A, 0xDC, 0xBA, 0x2F, 0x6F, 0xC1, 0x75, 0xCB, 0xF2, 0xFF}; const uint8_t spFIVE[]     PROGMEM = {0x08, 0x68, 0x4E, 0x9D, 0x02, 0x1C, 0x60, 0xC0, 0x8C, 0x69, 0x12, 0xB0, 0xC0, 0x28, 0xAB, 0x8C, 0x9C, 0xC0, 0x2D, 0xBB, 0x38, 0x79, 0x31, 0x15, 0xA3, 0xB6, 0xE4, 0x16, 0xB7, 0xDC, 0xF5, 0x6E, 0x57, 0xDF, 0x54, 0x5B, 0x85, 0xBE, 0xD9, 0xE3, 0x5C, 0xC6, 0xD6, 0x6D, 0xB1, 0xA5, 0xBF, 0x99, 0x5B, 0x3B, 0x5A, 0x30, 0x09, 0xAF, 0x2F, 0xED, 0xEC, 0x31, 0xC4, 0x5C, 0xBE, 0xD6, 0x33, 0xDD, 0xAD, 0x88, 0x87, 0xE2, 0xD2, 0xF2, 0xF4, 0xE0, 0x16, 0x2A, 0xB2, 0xE3, 0x63, 0x1F, 0xF9, 0xF0, 0xE7, 0xFF, 0x01}; const uint8_t spSIX[]      PROGMEM = {0x04, 0xF8, 0xAD, 0x4C, 0x02, 0x16, 0xB0, 0x80, 0x06, 0x56, 0x35, 0x5D, 0xA8, 0x2A, 0x6D, 0xB9, 0xCD, 0x69, 0xBB, 0x2B, 0x55, 0xB5, 0x2D, 0xB7, 0xDB, 0xFD, 0x9C, 0x0D, 0xD8, 0x32, 0x8A, 0x7B, 0xBC, 0x02, 0x00, 0x03, 0x0C, 0xB1, 0x2E, 0x80, 0xDF, 0xD2, 0x35, 0x20, 0x01, 0x0E, 0x60, 0xE0, 0xFF, 0x01}; const uint8_t spSEVEN[]    PROGMEM = {0x0C, 0xF8, 0x5E, 0x4C, 0x01, 0xBF, 0x95, 0x7B, 0xC0, 0x02, 0x16, 0xB0, 0xC0, 0xC8, 0xBA, 0x36, 0x4D, 0xB7, 0x27, 0x37, 0xBB, 0xC5, 0x29, 0xBA, 0x71, 0x6D, 0xB7, 0xB5, 0xAB, 0xA8, 0xCE, 0xBD, 0xD4, 0xDE, 0xA6, 0xB2, 0x5A, 0xB1, 0x34, 0x6A, 0x1D, 0xA7, 0x35, 0x37, 0xE5, 0x5A, 0xAE, 0x6B, 0xEE, 0xD2, 0xB6, 0x26, 0x4C, 0x37, 0xF5, 0x4D, 
0xB9, 0x9A, 0x34, 0x39, 0xB7, 0xC6, 0xE1, 0x1E, 0x81, 0xD8, 0xA2, 0xEC, 0xE6, 0xC7, 0x7F, 0xFE, 0xFB, 0x7F}; const uint8_t spEIGHT[]    PROGMEM = {0x65, 0x69, 0x89, 0xC5, 0x73, 0x66, 0xDF, 0xE9, 0x8C, 0x33, 0x0E, 0x41, 0xC6, 0xEA, 0x5B, 0xEF, 0x7A, 0xF5, 0x33, 0x25, 0x50, 0xE5, 0xEA, 0x39, 0xD7, 0xC5, 0x6E, 0x08, 0x14, 0xC1, 0xDD, 0x45, 0x64, 0x03, 0x00, 0x80, 0x00, 0xAE, 0x70, 0x33, 0xC0, 0x73, 0x33, 0x1A, 0x10, 0x40, 0x8F, 0x2B, 0x14, 0xF8, 0x7F}; const uint8_t spNINE[]     PROGMEM = {0xE6, 0xA8, 0x1A, 0x35, 0x5D, 0xD6, 0x9A, 0x35, 0x4B, 0x8C, 0x4E, 0x6B, 0x1A, 0xD6, 0xA6, 0x51, 0xB2, 0xB5, 0xEE, 0x58, 0x9A, 0x13, 0x4F, 0xB5, 0x35, 0x67, 0x68, 0x26, 0x3D, 0x4D, 0x97, 0x9C, 0xBE, 0xC9, 0x75, 0x2F, 0x6D, 0x7B, 0xBB, 0x5B, 0xDF, 0xFA, 0x36, 0xA7, 0xEF, 0xBA, 0x25, 0xDA, 0x16, 0xDF, 0x69, 0xAC, 0x23, 0x05, 0x45, 0xF9, 0xAC, 0xB9, 0x8F, 0xA3, 0x97, 0x20, 0x73, 0x9F, 0x54, 0xCE, 0x1E, 0x45, 0xC2, 0xA2, 0x4E, 0x3E, 0xD3, 0xD5, 0x3D, 0xB1, 0x79, 0x24, 0x0D, 0xD7, 0x48, 0x4C, 0x6E, 0xE1, 0x2C, 0xDE, 0xFF, 0x0F}; const uint8_t spTEN[]      PROGMEM = {0x0E, 0x38, 0x3C, 0x2D, 0x00, 0x5F, 0xB6, 0x19, 0x60, 0xA8, 0x90, 0x93, 0x36, 0x2B, 0xE2, 0x99, 0xB3, 0x4E, 0xD9, 0x7D, 0x89, 0x85, 0x2F, 0xBE, 0xD5, 0xAD, 0x4F, 0x3F, 0x64, 0xAB, 0xA4, 0x3E, 0xBA, 0xD3, 0x59, 0x9A, 0x2E, 0x75, 0xD5, 0x39, 0x6D, 0x6B, 0x0A, 0x2D, 0x3C, 0xEC, 0xE5, 0xDD, 0x1F, 0xFE, 0xB0, 0xE7, 0xFF, 0x03}; const uint8_t spELEVEN[]   PROGMEM = {0xA5, 0xEF, 0xD6, 0x50, 0x3B, 0x67, 0x8F, 0xB9, 0x3B, 0x23, 0x49, 0x7F, 0x33, 0x87, 0x31, 0x0C, 0xE9, 0x22, 0x49, 0x7D, 0x56, 0xDF, 0x69, 0xAA, 0x39, 0x6D, 0x59, 0xDD, 0x82, 0x56, 0x92, 0xDA, 0xE5, 0x74, 0x9D, 0xA7, 0xA6, 0xD3, 0x9A, 0x53, 0x37, 0x99, 0x56, 0xA6, 0x6F, 0x4F, 0x59, 0x9D, 0x7B, 0x89, 0x2F, 0xDD, 0xC5, 0x28, 0xAA, 0x15, 0x4B, 0xA3, 0xD6, 0xAE, 0x8C, 0x8A, 0xAD, 0x54, 0x3B, 0xA7, 0xA9, 0x3B, 0xB3, 0x54, 0x5D, 0x33, 0xE6, 0xA6, 0x5C, 0xCB, 0x75, 0xCD, 0x5E, 0xC6, 0xDA, 0xA4, 0xCA, 0xB9, 0x35, 0xAE, 0x67, 0xB8, 0x46, 0x40, 0xB6, 0x28, 0xBB, 0xF1, 0xF6, 0xB7, 0xB9, 0x47, 0x20, 0xB6, 0x28, 0xBB, 0xFF, 0x0F}; const uint8_t spTWELVE[]   PROGMEM = {0x09, 0x98, 0xDA, 0x22, 0x01, 0x37, 0x78, 0x1A, 0x20, 0x85, 0xD1, 0x50, 0x3A, 0x33, 0x11, 0x81, 0x5D, 0x5B, 0x95, 0xD4, 0x44, 0x04, 0x76, 0x9D, 0xD5, 0xA9, 0x3A, 0xAB, 0xF0, 0xA1, 0x3E, 0xB7, 0xBA, 0xD5, 0xA9, 0x2B, 0xEB, 0xCC, 0xA0, 0x3E, 0xB7, 0xBD, 0xC3, 0x5A, 0x3B, 0xC8, 0x69, 0x67, 0xBD, 0xFB, 0xE8, 0x67, 0xBF, 0xCA, 0x9D, 0xE9, 0x74, 0x08, 0xE7, 0xCE, 0x77, 0x78, 0x06, 0x89, 0x32, 0x57, 0xD6, 0xF1, 0xF1, 0x8F, 0x7D, 0xFE, 0x1F}; const uint8_t spTHIR_[]    PROGMEM = {0x04, 0xA8, 0xBE, 0x5C, 0x00, 0xDD, 0xA5, 0x11, 0xA0, 0xFA, 0x72, 0x02, 0x74, 0x97, 0xC6, 0x01, 0x09, 0x9C, 0xA6, 0xAB, 0x30, 0x0D, 0xCE, 0x7A, 0xEA, 0x6A, 0x4A, 0x39, 0x35, 0xFB, 0xAA, 0x8B, 0x1B, 0xC6, 0x76, 0xF7, 0xAB, 0x2E, 0x79, 0x19, 0xCA, 0xD5, 0xEF, 0xCA, 0x57, 0x08, 0x14, 0xA1, 0xDC, 0x45, 0x64, 0x03, 0x00, 0xC0, 0xFF, 0x03}; const uint8_t spFIF_[]     PROGMEM = {0x08, 0x98, 0x31, 0x93, 0x02, 0x1C, 0xE0, 0x80, 0x07, 0x5A, 0xD6, 0x1C, 0x6B, 0x78, 0x2E, 0xBD, 0xE5, 0x2D, 0x4F, 0xDD, 0xAD, 0xAB, 0xAA, 0x6D, 0xC9, 0x23, 0x02, 0x56, 0x4C, 0x93, 0x00, 0x05, 0x10, 0x90, 0x89, 0x31, 0xFC, 0x3F}; const uint8_t sp_TEEN[]    PROGMEM = {0x09, 0x58, 0x2A, 0x25, 0x00, 0xCB, 0x9F, 0x95, 0x6C, 0x14, 0x21, 0x89, 0xA9, 0x78, 0xB3, 0x5B, 0xEC, 0xBA, 0xB5, 0x23, 0x13, 0x46, 0x97, 0x99, 0x3E, 0xD6, 0xB9, 0x2E, 0x79, 0xC9, 0x5B, 0xD8, 0x47, 0x41, 0x53, 0x1F, 0xC7, 0xE1, 0x9C, 0x85, 0x54, 0x22, 0xEC, 0xFA, 0xDB, 0xDD, 0x23, 0x93, 0x49, 0xB8, 0xE6, 0x78, 0xFF, 0x3F}; const uint8_t spTWENTY[]  
 PROGMEM = {0x0A, 0xE8, 0x4A, 0xCD, 0x01, 0xDB, 0xB9, 0x33, 0xC0, 0xA6, 0x54, 0x0C, 0xA4, 0x34, 0xD9, 0xF2, 0x0A, 0x6C, 0xBB, 0xB3, 0x53, 0x0E, 0x5D, 0xA6, 0x25, 0x9B, 0x6F, 0x75, 0xCA, 0x61, 0x52, 0xDC, 0x74, 0x49, 0xA9, 0x8A, 0xC4, 0x76, 0x4D, 0xD7, 0xB1, 0x76, 0xC0, 0x55, 0xA6, 0x65, 0xD8, 0x26, 0x99, 0x5C, 0x56, 0xAD, 0xB9, 0x25, 0x23, 0xD5, 0x7C, 0x32, 0x96, 0xE9, 0x9B, 0x20, 0x7D, 0xCB, 0x3C, 0xFA, 0x55, 0xAE, 0x99, 0x1A, 0x30, 0xFC, 0x4B, 0x3C, 0xFF, 0x1F}; const uint8_t spT[]        PROGMEM = {0x01, 0xD8, 0xB6, 0xDD, 0x01, 0x2F, 0xF4, 0x38, 0x60, 0xD5, 0xD1, 0x91, 0x4D, 0x97, 0x84, 0xE6, 0x4B, 0x4E, 0x36, 0xB2, 0x10, 0x67, 0xCD, 0x19, 0xD9, 0x2C, 0x01, 0x94, 0xF1, 0x78, 0x66, 0x33, 0xEB, 0x79, 0xAF, 0x7B, 0x57, 0x87, 0x36, 0xAF, 0x52, 0x08, 0x9E, 0x6B, 0xEA, 0x5A, 0xB7, 0x7A, 0x94, 0x73, 0x45, 0x47, 0xAC, 0x5A, 0x9C, 0xAF, 0xFF, 0x07}; const uint8_t spHUNDRED[]  PROGMEM = {0x04, 0xC8, 0x7E, 0x5C, 0x02, 0x0A, 0xA8, 0x62, 0x43, 0x03, 0xA7, 0xA8, 0x62, 0x43, 0x4B, 0x97, 0xDC, 0xF2, 0x14, 0xC5, 0xA7, 0x9B, 0x7A, 0xD3, 0x95, 0x37, 0xC3, 0x1E, 0x16, 0x4A, 0x66, 0x36, 0xF3, 0x5A, 0x89, 0x6E, 0xD4, 0x30, 0x55, 0xB5, 0x32, 0xB7, 0x31, 0xB5, 0xC1, 0x69, 0x2C, 0xE9, 0xF7, 0xBC, 0x96, 0x12, 0x39, 0xD4, 0xB5, 0xFD, 0xDA, 0x9B, 0x0F, 0xD1, 0x90, 0xEE, 0xF5, 0xE4, 0x17, 0x02, 0x45, 0x28, 0x77, 0x11, 0xD9, 0x40, 0x9E, 0x45, 0xDD, 0x2B, 0x33, 0x71, 0x7A, 0xBA, 0x0B, 0x13, 0x95, 0x2D, 0xF9, 0xF9, 0x7F}; const uint8_t spTHOUSAND[] PROGMEM = {0x0C, 0xE8, 0x2E, 0xD4, 0x02, 0x06, 0x98, 0xD2, 0x55, 0x03, 0x16, 0x68, 0x7D, 0x17, 0xE9, 0x6E, 0xBC, 0x65, 0x8C, 0x45, 0x6D, 0xA6, 0xE9, 0x96, 0xDD, 0xDE, 0xF6, 0xB6, 0xB7, 0x5E, 0x75, 0xD4, 0x93, 0xA5, 0x9C, 0x7B, 0x57, 0xB3, 0x6E, 0x7D, 0x12, 0x19, 0xAD, 0xDC, 0x29, 0x8D, 0x4F, 0x93, 0xB4, 0x87, 0xD2, 0xB6, 0xFC, 0xDD, 0xAC, 0x22, 0x56, 0x02, 0x70, 0x18, 0xCA, 0x18, 0x26, 0xB5, 0x90, 0xD4, 0xDE, 0x6B, 0x29, 0xDA, 0x2D, 0x25, 0x17, 0x8D, 0x79, 0x88, 0xD4, 0x48, 0x79, 0x5D, 0xF7, 0x74, 0x75, 0xA1, 0x94, 0xA9, 0xD1, 0xF2, 0xED, 0x9E, 0xAA, 0x51, 0xA6, 0xD4, 0x9E, 0x7F, 0xED, 0x6F, 0xFE, 0x2B, 0xD1, 0xC7, 0x3D, 0x89, 0xFA, 0xB7, 0x0D, 0x57, 0xD3, 0xB4, 0xF5, 0x37, 0x55, 0x37, 0x2E, 0xE6, 0xB2, 0xD7, 0x57, 0xFF, 0x0F}; const uint8_t spAND[]      PROGMEM = {0xA9, 0x6B, 0x21, 0xB9, 0x22, 0x66, 0x9F, 0xAE, 0xC7, 0xE1, 0x70, 0x7B, 0x72, 0xBB, 0x5B, 0xDF, 0xEA, 0x56, 0xBB, 0x5C, 0x65, 0xCB, 0x66, 0xC5, 0x3D, 0x67, 0xD7, 0xAB, 0x6D, 0x2E, 0x64, 0x30, 0x93, 0xEE, 0xB1, 0xCD, 0x3D, 0x92, 0xB9, 0x9A, 0xDA, 0xB2, 0x8E, 0x40, 0x12, 0x9A, 0x6A, 0xEB, 0x96, 0x8F, 0x78, 0x98, 0xB3, 0x2A, 0xB4, 0xD3, 0x48, 0xAA, 0x2F, 0x7D, 0xA7, 0x7B, 0xFB, 0x0C, 0x73, 0x71, 0x5C, 0xCE, 0x6E, 0x5C, 0x52, 0x6C, 0x73, 0x79, 0x9A, 0x13, 0x4B, 0x89, 0x45, 0xE9, 0x6E, 0x49, 0x42, 0xA9, 0x57, 0xFF, 0x3F}; const uint8_t spMINUS[]    PROGMEM = {0xE6, 0x28, 0xC4, 0xF8, 0x44, 0x9A, 0xFB, 0xCD, 0xAD, 0x8D, 0x2A, 0x4E, 0x4A, 0xBC, 0xB8, 0x8C, 0xB9, 0x8A, 0xA9, 0x48, 0xED, 0x72, 0x87, 0xD3, 0x74, 0x3B, 0x1A, 0xA9, 0x9D, 0x6F, 0xB3, 0xCA, 0x5E, 0x8C, 0xC3, 0x7B, 0xF2, 0xCE, 0x5A, 0x5E, 0x35, 0x66, 0x5A, 0x3A, 0xAE, 0x55, 0xEB, 0x9A, 0x57, 0x75, 0xA9, 0x29, 0x6B, 0xEE, 0xB6, 0xD5, 0x4D, 0x37, 0xEF, 0xB5, 0x5D, 0xC5, 0x95, 0x84, 0xE5, 0xA6, 0xFC, 0x30, 0xE0, 0x97, 0x0C, 0x0D, 0x58, 0x40, 0x03, 0x1C, 0xA0, 0xC0, 0xFF, 0x03};

/* Say any number between -999,999 and 999,999 */
void sayNumber(long n) {
  if (n < 0) {
    voice.say(spMINUS);
    sayNumber(-n);
  } else if (n == 0) {
    voice.say(spZERO);
  } else {
    if (n >= 1000) {
      int thousands = n / 1000;
      sayNumber(thousands);
      voice.say(spTHOUSAND);
      n %= 1000;
      if ((n > 0) && (n < 100)) voice.say(spAND);
    }
    if (n >= 100) {
      int hundreds = n / 100;
      sayNumber(hundreds);
      voice.say(spHUNDRED);
      n %= 100;
      if (n > 0) voice.say(spAND);
    }
    if (n > 19) {
      int tens = n / 10;
      switch (tens) {
        case 2: voice.say(spTWENTY); break;
        case 3: voice.say(spTHIR_); voice.say(spT); break;
        case 4: voice.say(spFOUR); voice.say(spT); break;
        case 5: voice.say(spFIF_); voice.say(spT); break;
        case 6: voice.say(spSIX); voice.say(spT); break;
        case 7: voice.say(spSEVEN); voice.say(spT); break;
        case 8: voice.say(spEIGHT); voice.say(spT); break;
        case 9: voice.say(spNINE); voice.say(spT); break;
      }
      n %= 10;
    }
    switch (n) {
      case 1: voice.say(spONE); break;
      case 2: voice.say(spTWO); break;
      case 3: voice.say(spTHREE); break;
      case 4: voice.say(spFOUR); break;
      case 5: voice.say(spFIVE); break;
      case 6: voice.say(spSIX); break;
      case 7: voice.say(spSEVEN); break;
      case 8: voice.say(spEIGHT); break;
      case 9: voice.say(spNINE); break;
      case 10: voice.say(spTEN); break;
      case 11: voice.say(spELEVEN); break;
      case 12: voice.say(spTWELVE); break;
      case 13: voice.say(spTHIR_); voice.say(sp_TEEN); break;
      case 14: voice.say(spFOUR); voice.say(sp_TEEN); break;
      case 15: voice.say(spFIF_); voice.say(sp_TEEN); break;
      case 16: voice.say(spSIX); voice.say(sp_TEEN); break;
      case 17: voice.say(spSEVEN); voice.say(sp_TEEN); break;
      case 18: voice.say(spEIGHT); voice.say(sp_TEEN); break;
      case 19: voice.say(spNINE); voice.say(sp_TEEN); break;
    }
  }
}

// WiFi network
const char* ssid     = "CircuitLoop";
const char* password = "circuitdigest101";

WebServer server(80);

char htmlResponse[3000];

void handleRoot() {
  snprintf(htmlResponse, 3000,
"<!DOCTYPE html>\
<html lang=\"en\">\
  <head>\
    <meta charset=\"utf-8\">\
    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\
  </head>\
  <body>\
    <h1>Text To Speech</h1>\
    <input type='text' name='msg' id='msg' size=7 autofocus> Number \
    <div>\
      <br><button id=\"speak_button\">Speak</button>\
    </div>\
    <script src=\"https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js\"></script>\
    <script>\
      var Number;\
      $('#speak_button').click(function(e){\
        e.preventDefault();\
        Number = $('#msg').val();\
        $.get('/save?Number=' + Number, function(data){\
          console.log(data);\
        });\
      });\
    </script>\
  </body>\
</html>");
  server.send(200, "text/html", htmlResponse);
}
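The listing above stops at handleRoot(); the remaining pieces described in the article (handleSave(), setup() and loop()) are not included in this capture. A reconstruction based on the article's description follows; treat it as a sketch rather than the author's exact code (the "/save" route and "Number" parameter names follow the JavaScript in handleRoot()):

// Reconstructed sketch of the missing parts, based on the article's description.
void handleSave() {
  if (server.arg("Number") != "") {              // value sent by the webpage
    long number = server.arg("Number").toInt();  // convert the string to an integer
    sayNumber(number);                           // speak it through Talkie
  }
  server.send(200, "text/plain", "OK");
}

void setup() {
  pinMode(25, OUTPUT);
  digitalWrite(25, HIGH);          // digital pin 25 feeds the amplifier input
  Serial.begin(115200);
  WiFi.begin(ssid, password);      // connect to the Wi-Fi network
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println(WiFi.localIP());  // note this IP to open the webpage
  server.on("/", handleRoot);      // serve the input page
  server.on("/save", handleSave);  // receive the number and speak it
  server.begin();
  Serial.println("HTTP server started");
}

void loop() {
  server.handleClient();
}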


Sir, can we change the English text into any other language like Telugu, Punjabi, or Gujarati? If yes, can you please tell me how?

I am trying to use your information to make the ESP32 speak. I am having a hard time with the conflicting instructions. It says the complete code is at the bottom, but it is not complete; some of the partial excerpts of code at the top are missing. Could you help by sending the complete code? Thanks, I appreciate it. Retro

It seems the code at the bottom is missing some of the code at the top. Could you send the complete code to me? Thanks.

Respected Sir, I am Nerella Ome, a research scholar (ECE) at SR University, Warangal. Your explanation of the text to speech converter is very good. I am working on a weather announcement system using the ESP32. I converted customized speech into hex code using Audacity and Binary Viewer, but the customized audio is not playing. Can you please explain the sequence of steps for playing customized audio using the Talkie library?


Make Your ESP32 Talk Like It’s The 80s Again


80s-era electronic speech certainly has a certain retro appeal to it, but it can sometimes be a useful data output method since it can be implemented on very little hardware. [luc] demonstrates this with a talking thermometer project that requires no display and no special hardware to communicate temperatures to a user.

Back in the day, there were chips like the Votrax SC-01A that could play phonemes (distinct sounds that make up a language) on demand. These would be mixed and matched to create identifiable words, in that distinctly synthesized Speak & Spell manner that is so charming-slash-uncanny.


Nowadays, even hobbyist microcontrollers have more than enough processing power and memory to do a similar job entirely in software, which is exactly what [luc]'s talking thermometer project does. All this is done with the Talkie library, originally written for the Arduino and updated for the ESP32 and other microcontrollers. With it, one only needs headphones or a simple audio amplifier and speaker to output canned voice data from a project.

[luc] uses it to demonstrate how to communicate to a user in a hands-free manner without needing a display, and we also saw this output method in an electric unicycle which had a talking speedometer (judged to better allow the user to keep their eyes on the road, as well as minimizing the parts count).

Would you like to listen to an authentic, somewhat-understandable 80s-era text-to-speech synthesizer? You're in luck, because we can show you an authentic vintage MicroVox unit in action. Give it a listen, and compare it to a demo of the Talkie library in the video below.


20 thoughts on “Make Your ESP32 Talk Like It’s The 80s Again”

I like it :) Years ago I played with one of those phoneme chips driven by a 6511AQ running FORTH. This looks like a good thing as I have a friend with low sight.

Changes for ESP-32: Original code was spending too much time in an interrupt handler which the ESP32 hates. And use of the DAC instead of PWM because—why not.

“Too much time in an interrupt handler which the ESP32 hates” -- they all do.

Interesting. I put together an ESP32 with an audio playback device and pre-programmed it with audio segments for numerical readout. Not a difficult task, as it's just playback and driving the files dynamically. I didn't know about the Talkie library; might play with that next. Like the retro feel.

“Back in the day, there were chips like the Votrax SC-01A that could play phonemes (distinct sounds that make up a language) on demand. These would be mixed and matched to create identifiable words, in that distinctly synthesized Speak & Spell manner that is so charming-slash-uncanny.”

Just…..Wow. No, no, no.

Speak and Spell sounds were not synthesized; it played back very, very lossily compressed audio. All the sounds were audio recordings made by Mitch Carr, a radio announcer TI hired to record the samples. The distinct electronic sound of the Speak and Spell was compression artifacts.

Using a Votrax to string phonemes together was a completely different sound, and not that “distinct synthesized Speak & Spell manner”.

Why is it so hard for HAD to find people who know the most basic things about what they are writing about? Do you really think it's better to make things up in your head about topics you know nothing about rather than spend 30 seconds with Google?

And to be clear, the Speak and Spell played back complete words and phrases. Nothing was done with the Speak and Spell by stringing together phonemes.

Spend a few seconds more, and you’ll find that the Speak and Spell used the TI TMC0280 (also known as the TMS5100) to synthesize the spoken words. The TMC0280 used phonemes stored in ROM to synthesize the words. The English words were recorded in Dallas then processed into phonemes. Other languages were recorded in Nice, France then sent to Dallas to be converted to the needed phoneme data and sent back for corrections and further work.

The Speak and Spell did not merely use prerecorded sounds. It used a synthesizer that worked from phonemes generated from recordings.

Modern synthesizers do something similar. They start with a large body of recordings from which individual sounds and sound sequences are clipped. These are then used to make the phonemes used in the synthesizer. They generally use more than simple phonemes; they will include phoneme pairs in order to make the transitions more natural.

Speak and Spell: https://en.wikipedia.org/wiki/Speak_%26_Spell_(toy)
TI speech synthesis chips: https://en.wikipedia.org/wiki/Texas_Instruments_LPC_Speech_Chips
Linear predictive coding: https://en.wikipedia.org/wiki/Linear_predictive_coding

Soc Rat is 100% correct. The Speak and Spell truly did “merely use prerecorded sounds”. It did NOT use phonemes in any way. The speech part of the ROM only contained entire words. For the S&S to gain any vocabulary, you had to buy additional cartridges which contained more complete words. We all agree that phonemes allow for arbitrary words to be generated. Many have tried to hack the S&S to say new words, but nothing short of generating new and complete LPC-encoded words/phrases has ever been successful.

Hack-a-day even covered another project on this topic: https://hackaday.com/2012/12/03/teaching-the-speak-spell-four-and-more-letter-words/

If someone does want to use a phoneme-based speech synthesizer on the ESP32, definitely look at the ESP8266SAM Arduino library: https://github.com/earlephilhower/ESP8266SAM


What you’re describing as the way “modern synthesizers” work was broadly true 10-20 years ago, but these days, state of the art systems tend to use deep learning based synthesis: https://en.wikipedia.org/wiki/Deep_learning_speech_synthesis

Yes, this struck me as well — Speak & Spell was LPC, not phoneme synthesis like e.g. the Votrax or SP0256. Thanks for the Mitch Carr tidbit. I did make a quick-and-dirty SP0256-AL2 simulation once with a BluePill. It simply played recordings of the phonemes rather than implementing the actual digital filter. Still, it worked surprisingly well, with low resources.

To be fair they aren’t any worse than ChatGPT. Hmmmm….

While you could get a chip from Radio Shack that would speak with phonemes, Atari 8-bit computers had a program called SAM (Software Automatic Mouth). No hardware was needed, and it was fairly understandable. It reminded me of an old man with sinus congestion speaking.

Yes! My exposure to the SAM program was on the C64. Good memories. By the way, the code has now been ported to the ESP8266/ESP32 in the form of the ESP8266SAM Arduino library. It works great, complete with the software knobs you need to recreate your favorite voices like the little elf, strange alien, little old lady, and of course the stuffy old man!

Talkie looks like a pretty nifty library, although the “Danger Danger” from the example on github made me think of Lost In Space.. “Danger Will Robinson! Danger!” :D

Instant smile on my face when the thing started talking. I feel old now :)

Seems like these guys don’t remember SAM: Software Automatic Mouth

The quality of the TI/LPC TTS is not up to current standards if you were to develop a modern product (like some Bose BT speakers; they are just lame in 2023).

With cheap storage (SD card) and online neural TTS, any MCU can play pre-recorded sentences, indistinguishable from an actual human, in any language.

What we are missing is a quality, lightweight, open TTS that is easy to implement on embedded systems (i.e. not Android etc.) and able to generate any text on the fly. Anyway, in the end, if you are running on an ESP32 it would actually be better to simply stream from an online TTS…

I’ll just leave this here: https://www.youtube.com/watch?v=t8wyUsaDAyI

Is the code available? Does this use a version of Talkie? Would it work with MAX98357 I2S amplifier? Thanks


Reddit

ESP32 is a series of low cost, low power system on a chip microcontrollers with integrated Wi-Fi and dual-mode Bluetooth. The ESP32 series employs either a Tensilica Xtensa LX6, Xtensa LX7 or a RiscV processor, and both dual-core and single-core variations are available. It includes in-built antenna switches, RF balun, power amplifier, low-noise receive amplifier, filters, and power management modules as well.

I built a Text to Speech WiFi Speaker


ESP32 Forum


Esp32 and audio (text to speech)

Post by rankit0092 » Mon Sep 17, 2018 11:44 am



Deouss wrote (Mon Sep 17, 2018 4:31 pm): I heard these are some of the best: Flite, a small, fast run-time synthesis engine (C++), and the University of Edinburgh's Festival Speech Synthesis System (pure C).

Espressif intelligent voice assistant

espressif/esp-skainet

ESP-Skainet [中文]

ESP-Skainet is Espressif's intelligent voice assistant, which currently supports the Wake Word Engine and Speech Commands Recognition.

ESP32-S3 is recommended for running speech commands recognition, as it supports AI instructions and high-speed octal SPI PSRAM. The latest models will be deployed on ESP32-S3 first.

ESP-Skainet makes it easy to develop wake word detection and speech commands recognition applications based on Espressif Systems' ESP32 series chips.

At a high level, ESP-Skainet supports the features shown below:

(overview diagram)

Input Voice Stream

The input audio stream can come from any voice source, such as a microphone, or from WAV/PCM files in flash or on an SD card.

Wake Word Engine

Espressif's wake word engine, WakeNet, is designed to provide a high-performance, low-memory-footprint wake word detection algorithm, which lets devices listen at all times for wake words such as "Alexa", "天猫精灵" (Tian Mao Jing Ling, Tmall Genie), and "小爱同学" (Xiao Ai Tong Xue, Xiaomi's XiaoAI).

Currently, Espressif not only provides the official wake word "Hi, Lexin" to the public for free, but also allows customized wake words. For details on how to customize your own wake words, please see the Espressif Speech Wake Words Customization Process.

Speech Commands Recognition

Espressif's speech command recognition model, MultiNet, is designed to provide flexible offline speech command recognition. With this model, you can easily add your own speech commands, eliminating the need to retrain the model.

Currently, Espressif MultiNet supports up to 200 Chinese or English speech commands, such as “打开空调” (Turn on the air conditioner) and “打开卧室灯” (Turn on the bedroom light).
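
The README stops short of showing code at this point, but for a sense of how an application drives MultiNet, here is a rough C sketch based on my reading of the esp-sr interface. Treat every name as an assumption that may vary between esp-sr versions (esp_srmodel_init, esp_srmodel_filter, esp_mn_handle_from_name, and the detect/get_results calls appear in recent esp-sr releases), and get_audio_frame is a hypothetical stand-in for your audio capture:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include "esp_mn_iface.h"
#include "esp_mn_models.h"
#include "model_path.h"

extern void get_audio_frame(int16_t *buf, int samples); // hypothetical capture helper

void speech_commands_task(void)
{
    // Load models from the "model" flash partition and pick an English MultiNet.
    srmodel_list_t *models = esp_srmodel_init("model");
    char *mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_ENGLISH);

    esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
    model_iface_data_t *mn_data = multinet->create(mn_name, 6000); // 6000 ms timeout

    int chunk = multinet->get_samp_chunksize(mn_data); // samples expected per detect()
    int16_t *buffer = malloc(chunk * sizeof(int16_t));

    while (true) {
        // Feed 16 kHz / 16-bit / mono audio, ideally the AFE's cleaned output.
        get_audio_frame(buffer, chunk);

        if (multinet->detect(mn_data, buffer) == ESP_MN_STATE_DETECTED) {
            esp_mn_results_t *res = multinet->get_results(mn_data);
            printf("detected command id %d\n", res->command_id[0]);
        }
    }
}
```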

Audio Front End

Espressif's Audio Front End (AFE) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation), and NS (Noise Suppression).

(AFE pipeline diagram)

Quick Start with ESP-Skainet

Hardware Preparation

To run ESP-Skainet, you need an ESP32 or ESP32-S3 development board that integrates an audio input module. Supported development boards:

(The table mapping each example to its latest model and supported boards did not survive extraction; the recoverable model names are MultiNet7 for speech commands recognition, WakeNet9 for wake word detection, and esp-tts-v1.7 for text-to-speech. See the repository README for the full example/board matrix.)

For details on how to configure your application, please refer to the README.md of each example.

Software Preparation

ESP-Skainet

Clone this project as follows:
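
The clone command itself did not survive extraction. Since esp-skainet pulls in the esp-sr library as a git submodule, the usual recursive clone would be:

```sh
git clone --recursive https://github.com/espressif/esp-skainet.git
```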

ESP-IDF v4.4 and ESP-IDF v5.0 are supported. If you have already configured ESP-IDF and do not want to change your existing setup, you can point the IDF_PATH environment variable at your ESP-IDF installation.

For details on how to set up the ESP-IDF, please refer to the Getting Started Guide for the ESP-IDF release/v4.4 branch.
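
For example, reusing an existing checkout might look like this (the path below is a hypothetical placeholder for wherever your ESP-IDF lives):

```sh
export IDF_PATH=~/esp/esp-idf   # hypothetical location of your ESP-IDF checkout
. "$IDF_PATH/export.sh"         # puts idf.py and the toolchain on your PATH
```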

The examples folder contains applications demonstrating the API features of ESP-Skainet.

Please start with the wake_word_detection example.

  • Navigate to an example folder, e.g. esp-skainet/examples/wake_word_detection.
  • Compile and flash the project (a typical command sequence is sketched after this list).
  • Advanced users can add or modify speech commands by using the idf.py menuconfig command.
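
Assuming a standard ESP-IDF environment, a typical configure/build/flash sequence for the wake_word_detection example would look something like this (the chip target and serial port are placeholders for your own):

```sh
cd esp-skainet/examples/wake_word_detection
idf.py set-target esp32s3              # or esp32, depending on your board
idf.py menuconfig                      # optional: add or modify speech commands
idf.py -p /dev/ttyUSB0 flash monitor   # replace the port with your own
```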

For details, please read the README file in each example.

If you find a bug or have a feature request, view the Issues section on GitHub; please check existing issues before opening a new one.

If you are interested in contributing to ESP-Skainet, please check the Contributions Guide.

COMMENTS

  1. Speech To Text using ESP32

    This video will guide you through converting any speech to text, which you can then use in any of your projects.

  2. MhageGH/esp32_CloudSpeech

    Transcribe your voice with Google's Cloud Speech-to-Text API on the ESP32 (an Arduino project targeting the M5Stack Fire; MIT license).

  3. GitHub

    ESP-SR Speech Recognition Framework: Espressif's ESP-SR helps users build AI speech solutions based on ESP32 or ESP32-S3 chips.

  4. ESP32

    Google Speech-to-Text enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API. In this video, I'm showing...

  5. ESP32-Voice-Assistant-with-Speech-to-Text-Perplexity-AI-and ...

    A dedicated, low-cost AI voice assistant based on the ESP32 microcontroller. This project leverages Google Colab's free computing services for speech-to-text and text-to-speech processing, and integrates with the Perplexity AI API for intelligent conversation and query handling.

  6. Speech To Text using #ESP32

    To learn how to do speech to text using the ESP32, watch this tutorial video: https://youtu.be/VoanFTpCTU4

  7. Deep Learning Speech Commands Recognition on ESP32

    The M5StickC is ESP32-powered, with a built-in microphone. This comes handy for a speech recognition project. There are various tutorials on how to train and run a speech commands model on a ESP32. However, most of these tutorials train the model using the Google speech commands data set, which is a large data set but only has 20+ pre-defined ...

  8. ESP32 Speech Synthesizer Experiment With XFS5152CE

    ESP32 Speech Synthesizer Experiment With XFS5152CE: My recent microcontroller experiments have been about voice and AI. This reminds me of the text-to-speech synthesizer experiment I did a while back -- Arduino Speech Synthesizer Experiment with XFS5152CE Being able to synthesize arbitrary text with …

  9. ESP32 I2S voice memo recorder with text to speech and langua

    This is my latest: an ESP32 voice memo recorder with Azure-powered speech to text and language translation… looks like we can finally finish buildin...

  10. Let's Talk Espressif ESP32-S3 Voice--Text-to-Speech (TTS)

    We all know that Espressif's ESP32 module is very famous. Today, let's talk about the Chinese speech synthesis routine in Espressif's voice assistant framework, ESP-Skainet.

  11. GitHub

    GitHub - Fayflimban/Voice-To-Text-ESP32-Setup: This repository contains two tasks. TASK 1: HTML, CSS, and JS were used to convert audio to text. Initially, a start HTML page was created that links to the speech-to-text page and contains the start button.

  12. Command Word

    MultiNet is a lightweight model designed to recognize multiple speech command words offline on the ESP32. Currently, up to 200 speech commands, including customized commands, are supported. The MultiNet input is the audio processed by the audio front-end (AFE) algorithm, in 16 kHz, 16-bit, mono format.

  13. ESP32 I2S voice memo recorder with azure powered speech-to-text and

    An ESP32 voice memo recorder with Azure-powered speech to text and language translation… looks like we can finally finish building that great big tower we started on all those centuries ago.

  14. ESP32 Tensorflow micro speech with the external microphone

    This tutorial covers how to use TensorFlow micro_speech on the ESP32 with an external I2S microphone. In other words, we want to customize…

  15. ESP32 based Webserver for Text to Speech (TTS) Conversion

    ESP32 based Webserver for Text to Speech (TTS) Conversion. Text to speech (TTS) has been used in many applications such as voice assistants, announcements, and ATMs. It is also used to help children learn to speak and blind people to communicate.

  16. Make Your ESP32 Talk Like It's The 80s Again

    Make Your ESP32 Talk Like It's The 80s Again. 80s-era electronic speech certainly has a retro appeal to it, but it can sometimes be a useful data output method since it can be ...

  17. ESP32 Speech Recognition using Tensorflow I2S Microphone

    ESP32 Speech Recognition using Tensorflow I2S Microphone. Post by survivingwithandroid » Sat Feb 27, 2021 2:10 pm: This project shows how to use TensorFlow micro_speech with the ESP32 and an external microphone. You can recognize several words to use in your projects:

  18. GitHub

    This project demonstrates speech synthesis on the ESP32. It performs the synthesis locally using the CMU Flite library, rather than offloading this task to cloud providers. For this project, Flite 2.2 (commit hash e9880474) was ported to the esp-idf 3.2.2 framework and is now a set of reusable components that can be found in the "components ...

  19. I built a Text to Speech WiFi Speaker : r/esp32

    I built a TTS WiFi Speaker based on an ESP32. This project features a frontend that you can use to let the box read anything out loud. You can also send sounds from a predefined soundboard. I wrote a blog post about it: https://joszuijderwijk.nl/barrybox. Do you know if you can use it with Bluetooth speakers?

  20. "Talkie" text-to-speech for ESP-32

    Re: "Talkie" text-to-speech for ESP-32 Postby bobolink » Sat May 12, 2018 9:04 pm I converted the "Talkie" Library without knowing much about it. Someone in the General Discussion section of this forum asked for help in getting it to work on the ESP-32. But I researched the project today and it has a whole story.

  21. bootrino/esp32_text_to_speech

    This example plays audio generated from the text "HELLO, MY NAME IS SAM." using SAM - Tiny Speech Synthesizer. To run this example you need an ESP32 with PSRAM, such as a WROVER-B ESP32. Alternatively, you may be smart enough to work out how to reduce the RAM usage of this project so it runs on an ordinary ESP32.

  22. Esp32 and audio (text to speech)

    Re: Esp32 and audio (text to speech). Post by Vader_Mester » Mon Sep 17, 2018 2:29 pm: A better approach would be to get a sample library in .wav format, convert it down to 8 kHz sample-rate PCM (10-bit or 8-bit is enough for speech), and store every sample as a binary array in a header file, with every character having its own array (a rough sketch of this layout follows this list).

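A minimal sketch of the header-file layout Vader_Mester describes, with everything hypothetical: the words, the placeholder sample bytes, and the lookup helper. In practice each array would be generated by a script from your down-converted WAV files:

```c
/* speech_samples.h -- hypothetical sketch of the per-word PCM header layout.
 * Each array would be generated from a WAV file converted to
 * 8 kHz, 8-bit unsigned mono PCM. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Placeholder sample data; real arrays would hold thousands of bytes. */
static const uint8_t pcm_hello[] = { 0x80, 0x83, 0x7d, 0x7f /* ... */ };
static const uint8_t pcm_world[] = { 0x7f, 0x81, 0x7e, 0x80 /* ... */ };

typedef struct {
    const char    *word;  /* text the clip corresponds to       */
    const uint8_t *pcm;   /* 8 kHz, 8-bit unsigned mono samples */
    size_t         len;   /* number of samples in the clip      */
} speech_sample_t;

static const speech_sample_t speech_table[] = {
    { "hello", pcm_hello, sizeof(pcm_hello) },
    { "world", pcm_world, sizeof(pcm_world) },
};

/* Find a clip by word; the caller streams sample->pcm to the DAC or I2S. */
static inline const speech_sample_t *find_sample(const char *word)
{
    for (size_t i = 0; i < sizeof(speech_table) / sizeof(speech_table[0]); i++) {
        if (strcmp(speech_table[i].word, word) == 0)
            return &speech_table[i];
    }
    return NULL;
}
```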