Nowadays, users are accustomed to controlling devices or sending messages by voice. Most current technologies record the user’s voice and send it to an external server for recognition, because recognizing speech with high accuracy requires substantial computational power.
Recording such data and sending it to an external server raises serious privacy issues and limits how widely speech recognition can be deployed. Because of security, privacy, and data confidentiality concerns, it remains unclear whether and how speech recognition can be applied in many areas where it would be beneficial, such as automatic transcription of doctor-patient conversations and business meetings.
Our solution to this issue is end-to-end speech recognition in constrained environments such as IoT devices, with quality comparable to current server-side speech recognition.
- Real-time speech recognition on Raspberry Pi 3
- For Korean, character error rates* (CER) lower than or comparable to those of Google Speech API and Facebook (wit.ai)
- *Based on a news and audiobook dataset.
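For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference characters. The sketch below illustrates that common definition only; it is not the evaluation code behind the figures above. The function names `edit_distance` and `cer` are illustrative, and removing spaces before scoring is an assumption (evaluation setups differ on this point).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing from hyp
                curr[j - 1] + 1,     # insertion: extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / number of reference characters."""
    # Assumption: spaces are ignored, so word spacing does not affect the score.
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

With this definition, each precomposed Hangul syllable block counts as one character, so a single wrong syllable in a five-syllable reference yields a CER of 0.2.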
Current speech synthesis technology falls into three categories: concatenative, parametric, and deep learning (DL)-based. Traditionally, concatenative synthesis has been the choice for creating natural speech output. However, a concatenative system can only synthesize the voice of the speaker it was recorded from, so supporting each additional speaker requires a new set of recordings.
In contrast, parametric methods can synthesize the voices of various speakers through parameter adjustment. The downside is that the resulting voice sounds mechanical.
Meanwhile, deep learning-based methods have started to reach a ‘close to human voice’ level as measured by mean opinion scores (MOS). While the concatenative method requires up to a few dozen hours of speech from a single speaker, deep learning-based synthesis can learn a speaker's voice from just several minutes of speech. These DL-based models are capable of synthesizing the voices of various speakers. However, limitations remain: they can only mimic the voices of speakers used in the training process, and they require large speech datasets with speaker labels.
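As a brief illustration of the MOS metric mentioned above: listeners typically rate each synthesized utterance on a 1 (bad) to 5 (excellent) scale, and the MOS is the arithmetic mean of those ratings. The following is a minimal sketch of that calculation under those assumptions; the function name `mean_opinion_score` and the sample ratings are hypothetical and do not come from any actual listening test.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings given on a 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        # Assumption: the common five-point absolute category rating scale.
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings)


# Hypothetical ratings from five listeners for one synthesized utterance.
sample_ratings = [4, 5, 4, 3, 4]
score = mean_opinion_score(sample_ratings)
```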
Skelter Labs is researching synthesis technologies that produce voices with various personalities, using speech datasets without speaker labels. By utilizing the intermediate results of our speech recognition technology, we separate the speech content from the speaker characteristics in a voice, then recover the original speech from the separated data. With this technology, our objective is to create a generative model that can synthesize the voices of various speakers with a range of emotions and tonalities.