
uhlive.stream.conversation

The Stream Conversation SDK API for human-to-human interactions.

This API is used to consume a real-time audio stream and get enriched transcription events.

The protocol is message-based and uses websockets as transport. You are free to use whatever websocket client library you like to communicate with the API, and use our SDK to encode/decode the messages.

Quickstart

First retrieve a one-time access token with the Auth API.

Then use that token to build an authenticated URL and open a websocket connection to it with the websocket client library of your choice. Instantiate a Conversation to join a conversation, generate audio stream messages and decode transcription and enrichment events.

As the API is asynchronous, streaming the audio and reading the returned events should be done in two different threads/tasks.

from uhlive.stream.conversation import *

stream_h2h_url = build_conversation_url(token)

# The subscription identifier was given to you with your other credentials.
# The conversation id can be any string you like: if a conversation by that name
# already exists in your subscription identifier domain, the speaker joins it;
# otherwise the conversation is created first and the speaker joins it.
# The speaker id helps you identify who is speaking.
conversation = Conversation("subscription_identifier", "a_conversation_id", "a_speaker_id")

Now you can connect and interact with the API:

Synchronous example:

import time

import websocket as ws

socket = ws.create_connection(stream_h2h_url, timeout=10)
socket.send(
    conversation.join(
        model="fr",
        interim_results=False,
        rescoring=True,
        origin=int(time.time() * 1000),
        country="fr",
    )
)
# check we didn't get an error on join
reply = conversation.receive(socket.recv())
assert isinstance(reply, Ok)

Asynchronous example:

import time

from aiohttp import ClientSession

async def main(uhlive_client, uhlive_secret):
    async with ClientSession() as session:
        async with session.ws_connect(stream_h2h_url) as socket:
            await socket.send_str(
                conversation.join(
                    model="fr",
                    interim_results=False,
                    rescoring=True,
                    origin=int(time.time() * 1000),
                    country="fr",
                )
            )
            # check we didn't get an error on join
            msg = await socket.receive()
            reply = conversation.receive(msg.data)
            assert isinstance(reply, Ok)

As you can see, the I/O is cleanly decoupled from the protocol handling: the Conversation object is only used to create the messages to send to the API and to decode the received messages as Event objects.
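
Putting the two together, a common layout runs the audio sender and the event reader as two concurrent asyncio tasks, as recommended above. The sketch below is illustrative only: audio_source stands for any async iterator yielding raw PCM chunks (it is not part of this SDK), and it assumes that receive() flips the left property once the server confirms our departure.

import asyncio
import time

from aiohttp import ClientSession
from uhlive.stream.conversation import Conversation, Ok, build_conversation_url


async def stream_audio(socket, conversation, audio_source):
    # Send one binary websocket message per audio chunk, then announce we leave.
    async for chunk in audio_source:
        await socket.send_bytes(conversation.send_audio_chunk(chunk))
    await socket.send_str(conversation.leave())


async def read_events(socket, conversation):
    # Decode every incoming message until the server confirms our departure.
    async for msg in socket:
        event = conversation.receive(msg.data)
        print(event)
        if conversation.left:
            break


async def main(token, audio_source):
    conversation = Conversation("subscription_identifier", "a_conversation_id", "a_speaker_id")
    async with ClientSession() as session:
        async with session.ws_connect(build_conversation_url(token)) as socket:
            await socket.send_str(
                conversation.join(model="fr", origin=int(time.time() * 1000))
            )
            reply = conversation.receive((await socket.receive()).data)
            assert isinstance(reply, Ok)
            # Stream the audio and read the events concurrently, as the API is asynchronous.
            await asyncio.gather(
                stream_audio(socket, conversation, audio_source),
                read_events(socket, conversation),
            )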

See the complete examples in the source distribution.

Conversation

Conversation(identifier: str, conversation_id: str, speaker: str)

To join a conversation on the API, you need a Conversation object.

You can only have one Conversation per connection (socket); otherwise you risk unexpected behavior (and exceptions!).

Create a Conversation.

Parameters:

  • identifier (str) –

    is the identifier you got when you subscribed to the service;

  • conversation_id (str) –

    is the conversation you wish to join,

  • speaker (str) –

    is your alias in the conversation, to identify you and your events

left property

left

Did the server confirm we left the conversation?

join

join(
    model: str = "fr",
    country: str = "fr",
    readonly: bool = False,
    interim_results: bool = True,
    rescoring: bool = True,
    origin: int = 0,
    audio_codec: str = "linear",
) -> str

Join the conversation.

Parameters:

  • readonly (bool, default: False ) –

    if you are not going to stream audio, set it to True.

  • model (str, default: 'fr' ) –

    (if readonly is False) the ASR language model to be used to recognize the audio you will stream.

  • country (str, default: 'fr' ) –

    the ISO 3166-1 two-letter country code of the place where the speaker is located.

  • interim_results (bool, default: True ) –

    (readonly = False only) should the ASR trigger interim result events?

  • rescoring (bool, default: True ) –

    (readonly = False only) should the ASR refine the final segment with a bigger Language Model? May give slightly degraded results for very short segments.

  • origin (int, default: 0 ) –

    The UNIX time, in milliseconds, to which the event timeline origin is set.

  • audio_codec (str, default: 'linear' ) –

    the speech audio codec of the audio data:

    • "linear": (default) linear 16 bit SLE raw PCM audio at 8khz;
    • "g711a": G711 a-law audio at 8khz;
    • "g711u": G711 μ-law audio at 8khz.

Returns:

  • str –

    The text websocket message to send to the server.

Raises:

  • ProtocolError –

    if still in a previously joined conversation.
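
For illustration, a read-only observer (consuming events without streaming audio) would join as sketched below. The sketch assumes its own Conversation object and websocket connection (calling join again on a Conversation that is still in a conversation raises ProtocolError); "an_observer_id" is a placeholder.

observer = Conversation("subscription_identifier", "a_conversation_id", "an_observer_id")
socket.send(observer.join(readonly=True, origin=int(time.time() * 1000)))
assert isinstance(observer.receive(socket.recv()), Ok)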

leave

leave() -> str

Leave the current conversation.

It's a good idea to leave a conversation and continue to consume messages until you receive a SpeakerLeft event for your speaker, before you close the connection. Otherwise, you may miss parts of the transcription.

Returns:

  • str –

    The text websocket message to send to the server.
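
A minimal sketch of that shutdown sequence, using the synchronous socket from the Quickstart and assuming receive() flips the left property once the server confirms our departure:

socket.send(conversation.leave())
# Drain remaining events so no transcription is lost.
while not conversation.left:
    event = conversation.receive(socket.recv())
    # ... handle the last transcription events here ...
socket.close()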


send_audio_chunk

send_audio_chunk(chunk: bytes) -> bytes

Build an audio chunk for streaming.

Returns:

  • bytes –

    The binary websocket message to send to the server.

Raises:

  • ProtocolError –

    if not currently in a conversation.
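
For illustration, a minimal sender loop over a raw PCM file, assuming the default "linear" codec (16 bit, 8kHz) and the synchronous socket from the Quickstart; the file name and pacing are placeholders:

import time

# 100 ms of 16 bit / 8kHz audio is 800 samples, i.e. 1600 bytes.
with open("speaker_leg.raw", "rb") as audio:
    while True:
        chunk = audio.read(1600)
        if not chunk:
            break
        socket.send_binary(conversation.send_audio_chunk(chunk))
        time.sleep(0.1)  # pace the stream roughly at real time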

receive

receive(data: Union[str, bytes]) -> Event

Decode received websocket message.

The server only sends text messages.

Returns:

  • Event –

    The appropriate Event subclass instance.
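
For example, a reader loop typically dispatches on the concrete Event subclass. This sketch reuses the synchronous socket from the Quickstart; the event classes are those documented below and are assumed to be importable from the module:

from uhlive.stream.conversation import (
    AudioSegmentDecoded,
    AudioWordsDecoded,
    EntityRecognized,
    SpeakerLeft,
)

while True:
    event = conversation.receive(socket.recv())
    if isinstance(event, AudioWordsDecoded):
        print(f"[interim] {event.speaker}: {event.value}")
    elif isinstance(event, AudioSegmentDecoded):
        print(f"[final]   {event.speaker}: {event.value}")
    elif isinstance(event, EntityRecognized):
        print(f"[entity]  {event.entity_name} = {event.value!r}")
    elif isinstance(event, SpeakerLeft):
        break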

ProtocolError

Bases: RuntimeError

Exception raised when a Conversation method is not available in the current state.

AudioSegmentDecoded

AudioSegmentDecoded(join_ref, ref, conversation, event, payload)

Bases: AudioSpeechDecoded

Final segment transcript event.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

value property

value: str

Get the transcript of the whole segment as a string

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

id property

id: str

The Utterance id identifies the speech utterance this event transcribes.

components property

components: List[Word]

Get the transcript of the whole segment as a list of timestamped words.

confidence property

confidence: float

The ASR confidence for this segment.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.
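
For instance, a final segment can be rendered either as a whole string or word by word with timings, using the properties documented above:

def print_segment(segment):
    # segment is an AudioSegmentDecoded event
    print(f"{segment.speaker} ({segment.confidence:.2f}): {segment.value}")
    for word in segment.components:
        print(f"  {word.start}-{word.end} ms: {word.value} ({word.confidence:.2f})")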

AudioSpeechDecoded

AudioSpeechDecoded(join_ref, ref, conversation, event, payload)

Bases: TimeScopedEvent

The base class of all transcription events.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

value property

value: str

Get the transcript of the whole segment as a string

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

id property

id: str

The Utterance id identifies the speech utterance this event transcribes.

components property

components: List[Word]

Get the transcript of the whole segment as a list of timestamped words.

confidence property

confidence: float

The ASR confidence for this segment.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

AudioWordsDecoded

AudioWordsDecoded(join_ref, ref, conversation, event, payload)

Bases: AudioSpeechDecoded

Interim segment transcript event.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

value property

value: str

Get the transcript of the whole segment as a string

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

id property

id: str

The Utterance id identifies the speech utterance this event transcribes.

components property

components: List[Word]

Get the transcript of the whole segment as a list of timestamped words.

confidence property

confidence: float

The ASR confidence for this segment.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

EntityRecognized

EntityRecognized(join_ref, ref, conversation, event, payload)

Bases: TimeScopedEvent

The class for all entity annotation events.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

entity_name property

entity_name: str

The name of the named entity found.

lang property

lang: str

Natural Language of the interpretation.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

display property

display: str | None

The well-formatted form of the entity in the language (string).

source property

source: str

The transcript excerpt that was interpreted, as string.

value property

value: Any

The interpreted value in machine understandable form.

The exact type depends on the entity.

confidence property

confidence: float

The confidence of the interpretation.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.
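
For instance, an entity annotation can be logged with its source excerpt, display form (which may be None) and machine-readable value:

def print_entity(entity):
    # entity is an EntityRecognized event
    shown = entity.display if entity.display is not None else entity.source
    print(
        f"{entity.speaker} mentioned {entity.entity_name}: {shown!r} "
        f"-> {entity.value!r} (confidence {entity.confidence:.2f})"
    )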

EntityReference

EntityReference(entity_name: str, speaker: str, id: str)

Reference to a unique previously found Entity in the conversation.

kind instance-attribute

kind: str = entity_name

The name of the Entity referenced.

speaker instance-attribute

speaker: str = speaker

The speaker identifier.

id instance-attribute

id: str = id

The id of the referenced Entity.

Event

Event(join_ref, ref, conversation, event, payload)

Bases: object

The base class of all events.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

Ok

Ok(join_ref, ref, conversation, event, payload)

Bases: Event

API asynchronous command acknowledgements.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

RelationRecognized

RelationRecognized(join_ref, ref, conversation, event, payload)

Bases: TimeScopedEvent

The class for all Relation events.

Relations express a semantic relationship between two or more entities.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

relation_name property

relation_name: str

The type of the relation.

lang property

lang: str

Natural Language of the interpretation.

As ISO 639-1 code.

confidence property

confidence: float

The confidence on the discovered relationship.

components property

components: List[EntityReference]

References to the Entities involved in this relationship.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.
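
For instance, a relation can be logged together with the references to its member entities, using the EntityReference attributes documented below:

def print_relation(relation):
    # relation is a RelationRecognized event
    members = ", ".join(
        f"{ref.kind}#{ref.id} (by {ref.speaker})" for ref in relation.components
    )
    print(f"{relation.relation_name} ({relation.confidence:.2f}): {members}")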

SpeakerJoined

SpeakerJoined(join_ref, ref, conversation, event, payload)

Bases: Event

A new speaker joined the conversation (after us).

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

timestamp property

timestamp: int

The UNIX time when the speaker joined the conversation.

interim_results property

interim_results: bool

Are interim results activated for this speaker?

rescoring property

rescoring: bool

Is rescoring enabled for this speaker?

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

SpeakerLeft

SpeakerLeft(join_ref, ref, conversation, event, payload)

Bases: Event

Event emitted by the associated speaker when they left the conversation.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

timestamp property

timestamp: int

UNIX time when the speaker left the conversation.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

Tag

Tag(value: str, display: str, confidence: float)

A tag represents a behavioral feature found in the conversation.

value instance-attribute

value: str = value

The unique id of the Tag.

display instance-attribute

display: str = display

The human readable name of the Tag.

TagsSet

TagsSet(join_ref, ref, conversation, event, payload)

Bases: TimeScopedEvent

One or more tags were found on this time range.

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

lang property

lang: str

Natural Language of the interpretation.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

confidence property

confidence: float

Tagger confidence.

tags property

tags: List[Tag]

The tags that were found on this time range

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.
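
For instance, the tags found on a time range can be logged like this:

def print_tags(tags_set):
    # tags_set is a TagsSet event
    for tag in tags_set.tags:
        print(
            f"{tags_set.speaker} [{tags_set.start}-{tags_set.end} ms] "
            f"tagged {tag.display!r} (id={tag.value})"
        )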

Unknown

Unknown(join_ref, ref, topic, event, payload)

Bases: Event

The server emitted an event unknown to this SDK. Time to upgrade!

conversation property

conversation: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

Word

Bases: dict

Timestamped word.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Word length in milliseconds, according to the audio timeline.

value property

value: str

Transcript token string for this word.

confidence property

confidence: float

ASR confidence for this word.

build_conversation_url

build_conversation_url(token: str) -> str

Make an authenticated URL to connect to the Conversation Service.