This may be broken since Cortana was decoupled from the system in update 20H1.

Microsoft Speech API 5.3
Speech API Overview

The SAPI application programming interface (API) dramatically reduces the code overhead required for an application to use speech recognition and text-to-speech, making speech technology more accessible and robust for a wide range of applications.

SAPI provides a high-level interface between an application and speech engines. It implements all the low-level details needed to control and manage the real-time operations of various speech engines.

The two basic types of SAPI engines are text-to-speech (TTS) systems and speech recognizers. TTS systems synthesize text strings and files into spoken audio using synthetic voices. Speech recognizers convert human spoken audio into readable text strings and files.

Applications can control text-to-speech (TTS) using the ISpVoice Component Object Model (COM) interface. Once an application has created an ISpVoice object (see Text-to-Speech Tutorial), the application only needs to call ISpVoice::Speak to generate speech output from some text data. In addition, the ISpVoice interface provides several methods for changing voice and synthesis properties, such as speaking rate (ISpVoice::SetRate), output volume (ISpVoice::SetVolume), and the current speaking voice (ISpVoice::SetVoice).

Special SAPI controls can also be inserted along with the input text to change real-time synthesis properties like voice, pitch, word emphasis, speaking rate and volume. This synthesis markup, using standard XML format, is a simple but powerful way to customize the TTS speech, independent of the specific engine or voice currently in use. See the XML TTS Tutorial for more details.

The ISpVoice::Speak method can operate either synchronously (return only when completely finished speaking) or asynchronously (return immediately and speak as a background process). When speaking asynchronously (SPF_ASYNC), real-time status information such as speaking state and current text location can be polled using ISpVoice::GetStatus. Also while speaking asynchronously, new text can be spoken either by immediately interrupting the current output (SPF_PURGEBEFORESPEAK) or by automatically appending the new text to the end of the current output.

In addition to the ISpVoice interface, SAPI also provides many utility COM interfaces for more advanced TTS applications.

SAPI communicates with applications by sending events using standard callback mechanisms (Window Message, callback proc or Win32 Event). For TTS, events are mostly used for synchronizing to the output speech. Applications can sync to real-time actions as they occur, such as word boundaries, phoneme or viseme (mouth animation) boundaries, or application custom bookmarks. Applications can initialize and handle these real-time events using ISpNotifySource, ISpNotifySink, ISpNotifyTranslator, ISpEventSink, ISpEventSource, and ISpNotifyCallback.

Lexicons
Applications can provide custom word pronunciations for speech synthesis engines using methods provided by ISpContainerLexicon, ISpLexicon and ISpPhoneConverter.
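As an illustration of the synthesis markup mentioned in this overview, the following sketch assembles a SAPI XML string with volume, rate, and emphasis controls. It is written in Python purely for brevity and only builds text; it does not call SAPI. The helper names (`plain`, `emph`, `rate`, `volume`) are hypothetical, and the tag and attribute names (`<volume level>`, `<rate absspeed>`, `<emph>`) are assumed to follow the XML TTS Tutorial, which should be consulted for the authoritative schema.

```python
from xml.sax.saxutils import escape


def plain(text: str) -> str:
    """Escape literal text so characters like '&' and '<' are not read as markup."""
    return escape(text)


def emph(fragment: str) -> str:
    """Wrap an already-escaped fragment in a SAPI <emph> element."""
    return f"<emph>{fragment}</emph>"


def rate(fragment: str, abs_speed: int) -> str:
    """Wrap a fragment in <rate>; absspeed is assumed to range from -10 to 10."""
    if not -10 <= abs_speed <= 10:
        raise ValueError("absspeed must be between -10 and 10")
    return f'<rate absspeed="{abs_speed}">{fragment}</rate>'


def volume(fragment: str, level: int) -> str:
    """Wrap a fragment in <volume>; level is assumed to range from 0 to 100."""
    if not 0 <= level <= 100:
        raise ValueError("level must be between 0 and 100")
    return f'<volume level="{level}">{fragment}</volume>'


# Compose: slightly slower speech at 80% volume, with one emphasized word.
markup = volume(
    rate(plain("Fish & chips are ") + emph(plain("ready")), abs_speed=-2),
    level=80,
)
print(markup)
```

A string built this way would be passed as the text argument to ISpVoice::Speak. Because the markup is independent of the engine, the same string can be reused when the speaking voice changes.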
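The synchronous, asynchronous, and interrupting behaviors of ISpVoice::Speak are selected by combining flag bits in the call's flags argument. The sketch below, again in Python for brevity, mirrors a few of those flags locally and shows how they combine. The numeric values are assumed to match the SPEAKFLAGS enumeration in sapi.h and should be checked against the header; `speak_flags` is a hypothetical helper, not part of SAPI.

```python
from enum import IntFlag


class SpeakFlags(IntFlag):
    """Local mirror of a few SPEAKFLAGS values (assumed to match sapi.h)."""
    SPF_DEFAULT = 0            # synchronous; new text is appended to the queue
    SPF_ASYNC = 1              # return immediately, speak in the background
    SPF_PURGEBEFORESPEAK = 2   # interrupt and discard pending output first
    SPF_IS_XML = 8             # treat the input text as SAPI XML markup


def speak_flags(asynchronous=False, interrupt=False, is_xml=False):
    """Combine the flags for one Speak call (hypothetical helper)."""
    flags = SpeakFlags.SPF_DEFAULT
    if asynchronous:
        flags |= SpeakFlags.SPF_ASYNC
    if interrupt:
        flags |= SpeakFlags.SPF_PURGEBEFORESPEAK
    if is_xml:
        flags |= SpeakFlags.SPF_IS_XML
    return flags


# Background speech that cuts off whatever is currently being spoken.
urgent = speak_flags(asynchronous=True, interrupt=True)
print(int(urgent))
```

Leaving out SPF_PURGEBEFORESPEAK gives the appending behavior described in the overview: the new text is queued behind the current output instead of replacing it.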