
v0.0.50

@aconchillo aconchillo released this 11 Dec 19:51
· 209 commits to main since this release
8e140b2

Added

  • Added GeminiMultimodalLiveLLMService. This is an integration for Google's Gemini Multimodal Live API, supporting:

    • Real-time audio and video input processing
    • Streaming text responses with TTS
    • Audio transcription for both user and bot speech
    • Function calling
    • System instructions and context management
    • Dynamic parameter updates (temperature, top_p, etc.)
  • Added AudioTranscriber utility class for handling audio transcription with Gemini models.

  • Added new context classes for Gemini:

    • GeminiMultimodalLiveContext
    • GeminiMultimodalLiveUserContextAggregator
    • GeminiMultimodalLiveAssistantContextAggregator
    • GeminiMultimodalLiveContextAggregatorPair
  • Added new foundational examples for GeminiMultimodalLiveLLMService:

    • 26-gemini-multimodal-live.py
    • 26a-gemini-multimodal-live-transcription.py
    • 26b-gemini-multimodal-live-video.py
    • 26c-gemini-multimodal-live-video.py
  • Added SimliVideoService. This is an integration for Simli AI avatars.
    (see https://www.simli.com)

  • Added NVIDIA Riva's FastPitchTTSService and ParakeetSTTService.
    (see https://www.nvidia.com/en-us/ai-data-science/products/riva/)

  • Added IdentityFilter. This is the simplest frame filter that lets through all incoming frames.

  • Added a new STTMuteStrategy called FUNCTION_CALL, which mutes the STT service during LLM function calls.

  • DeepgramSTTService now exposes two event handlers, on_speech_started and on_utterance_end, that can be used to implement interruptions. See the new example examples/foundational/07c-interruptible-deepgram-vad.py.

  • Added GroqLLMService, GrokLLMService, and NimLLMService for Groq, Grok, and NVIDIA NIM API integration, with an OpenAI-compatible interface.

  • New examples demonstrating function calling with Groq, Grok, Azure OpenAI, Fireworks, and NVIDIA NIM: 14f-function-calling-groq.py, 14g-function-calling-grok.py, 14h-function-calling-azure.py, 14i-function-calling-fireworks.py, and 14j-function-calling-nvidia.py.

  • To obtain the audio stored by the AudioBufferProcessor, you can now also register an on_audio_data event handler. The handler is called every time buffer_size (a new constructor argument) is reached. If buffer_size is 0 (the default), you need to get the audio manually, as before, using AudioBufferProcessor.merge_audio_buffers().

    # audiobuffer is an AudioBufferProcessor instance; save_audio is a
    # placeholder for your own coroutine that persists the audio.
    @audiobuffer.event_handler("on_audio_data")
    async def on_audio_data(processor, audio, sample_rate, num_channels):
        await save_audio(audio, sample_rate, num_channels)

  • Added a new RTVI message called disconnect-bot. When handled, it pushes an EndFrame to trigger the pipeline to stop.
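
The buffer_size behavior of the on_audio_data handler described above can be pictured with a small standalone sketch. It does not use Pipecat: ChunkedAudioBuffer, add_audio and take_audio are hypothetical names invented for illustration, and the real AudioBufferProcessor may differ in details (for example, it may deliver everything buffered rather than fixed-size chunks).

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical stand-in for illustration only, not Pipecat's AudioBufferProcessor.
AudioHandler = Callable[[bytes], Awaitable[None]]


class ChunkedAudioBuffer:
    def __init__(self, buffer_size: int = 0, on_audio_data: Optional[AudioHandler] = None):
        self._buffer_size = buffer_size
        self._on_audio_data = on_audio_data
        self._buffer = bytearray()

    async def add_audio(self, audio: bytes) -> None:
        """Append audio and call the handler each time buffer_size bytes are available."""
        self._buffer.extend(audio)
        # With buffer_size == 0 nothing is delivered automatically.
        if self._buffer_size > 0 and self._on_audio_data is not None:
            while len(self._buffer) >= self._buffer_size:
                chunk = bytes(self._buffer[: self._buffer_size])
                del self._buffer[: self._buffer_size]
                await self._on_audio_data(chunk)

    def take_audio(self) -> bytes:
        """Manually fetch and clear whatever is currently buffered."""
        audio = bytes(self._buffer)
        self._buffer.clear()
        return audio


async def demo() -> tuple:
    received: List[bytes] = []

    async def handler(chunk: bytes) -> None:
        received.append(chunk)

    # Automatic delivery: the handler fires whenever 4 bytes have accumulated.
    auto = ChunkedAudioBuffer(buffer_size=4, on_audio_data=handler)
    await auto.add_audio(b"\x00" * 6)  # one chunk delivered, 2 bytes stay buffered
    await auto.add_audio(b"\x00" * 2)  # the remaining bytes complete a second chunk

    # Manual retrieval: buffer_size=0 means the caller fetches the audio itself.
    manual = ChunkedAudioBuffer()
    await manual.add_audio(b"\x01" * 5)
    return received, manual.take_audio()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A non-zero buffer_size suits streaming the recording out in pieces, while the default of 0 suits fetching the whole recording once at the end of a session.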

Changed

  • STTMuteFilter now supports multiple simultaneous muting strategies.

  • XTTSService language now defaults to Language.EN.

  • SoundfileMixer no longer resamples input files, to avoid startup delays. The sample rate of the provided sound files now needs to match the sample rate of the output transport.

  • Input frames (audio, image and transport messages) are now system frames. This means they are processed immediately by all processors instead of being queued internally.

  • Expanded the transcriptions.language module to support a superset of languages.

  • Updated STT and TTS services with language options that match the supported languages for each service.

  • Updated the AzureLLMService to use the OpenAILLMService. Updated the api_version to 2024-09-01-preview.

  • Updated the FireworksLLMService to use the OpenAILLMService. Updated the default model to accounts/fireworks/models/firefunction-v2.

  • Updated the simple-chatbot example to include a JavaScript and React client example, using RTVI JS and React.
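
Because SoundfileMixer no longer resamples, it can help to verify sound files before handing them to the mixer. The following is a minimal sketch using only Python's standard wave module, so it assumes WAV input. The helper names (wav_sample_rate, check_matches_output, make_test_wav) are hypothetical and not part of Pipecat, and the output sample rate is whatever your transport is configured with.

```python
import io
import wave


def wav_sample_rate(source) -> int:
    """Return the sample rate of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def check_matches_output(source, output_sample_rate: int) -> None:
    """Raise ValueError if the WAV sample rate differs from the output sample rate."""
    rate = wav_sample_rate(source)
    if rate != output_sample_rate:
        raise ValueError(
            f"sound file is {rate} Hz but the output transport uses "
            f"{output_sample_rate} Hz; resample the file offline first"
        )


def make_test_wav(sample_rate: int) -> io.BytesIO:
    """Build a tiny in-memory mono 16-bit WAV, used here only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * 10)  # ten silent frames
    buf.seek(0)
    return buf


if __name__ == "__main__":
    check_matches_output(make_test_wav(16000), 16000)  # passes silently
    try:
        check_matches_output(make_test_wav(44100), 16000)
    except ValueError as err:
        print(f"mismatch detected: {err}")
```

Running a check like this at startup surfaces a mismatch as a clear error instead of audio that plays at the wrong speed or pitch.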

Removed

  • Removed AppFrame. This was used as a special user custom frame, but there's actually no use case for that.

Fixed

  • Fixed a ParallelPipeline issue that would cause system frames to be queued.

  • Fixed FastAPIWebsocketTransport so it can work with binary data (e.g. using the protobuf serializer).

  • Fixed an issue in CartesiaTTSService that could cause previous audio to be received after an interruption.

  • Fixed Cartesia, ElevenLabs, LMNT and PlayHT TTS websocket reconnection. Previously, no reconnection was attempted after an error occurred.

  • Fixed a BaseOutputTransport issue that was causing audio to be discarded after an EndFrame was received.

  • Fixed an issue in WebsocketServerTransport and FastAPIWebsocketTransport that would cause a busy loop when using an audio mixer.

  • Fixed a DailyTransport and LiveKitTransport issue where connections were being closed in the input transport prematurely. This was causing frames queued inside the pipeline to be discarded.

  • Fixed an issue in DailyTransport that would cause some internal callbacks to not be executed.

  • Fixed an issue where other frames were being processed while a CancelFrame was being pushed down the pipeline.

  • AudioBufferProcessor now handles interruptions properly.

  • Fixed a WebsocketServerTransport issue that would prevent interruptions with TwilioSerializer from working.

  • DailyTransport.capture_participant_video now allows capturing a user's screen share by simply passing video_source="screenVideo".

  • Fixed Google Gemini message handling to properly convert appended messages to Gemini's required format.

  • Fixed an issue with FireworksLLMService where chat completions were failing by removing the stream_options from the chat completion options.