uhlive.stream.conversation

The Stream Conversation SDK API for human to human interactions.

This API is used to consume a real-time audio stream and get enriched transcription events.

The protocol is message-based and uses websockets as transport. You are free to use whichever websocket client library you like to communicate with the API, and use our SDK to encode and decode the messages.

Quickstart

First, retrieve a one-time access token with the Auth API.

Then use that token to build an authenticated URL, open a websocket connection to it with the websocket client library of your choice, and instantiate a Conversation to join a conversation, generate audio stream messages, and decode transcription and enrichment events.

As the API is asynchronous, streaming the audio and reading the returned events should be done in two different threads/tasks.

from uhlive.stream.conversation import *

stream_h2h_url = build_conversation_url(token)

# The subscription identifier was given to you with your other credentials.
# The conversation id can be any string you like. If a conversation by that name
# already exists in your subscription identifier domain, you will join it as a
# new speaker; otherwise it will be created and you will join it.
# The speaker id helps you identify who is speaking.
conversation = Conversation("subscription_identifier", "a_conversation_id", "a_speaker_id")

Now you can connect and interact with the API:

Synchronous example:

import time

import websocket as ws

socket = ws.create_connection(stream_h2h_url, timeout=10)
socket.send(
    conversation.join(
        model="fr",
        interim_results=False,
        rescoring=True,
        origin=int(time.time() * 1000),
        country="fr",
    )
)
# check we didn't get an error on join
reply = conversation.receive(socket.recv())
assert isinstance(reply, Ok)

Asynchronous example:

import time

from aiohttp import ClientSession

async def main(uhlive_client, uhlive_secret):
    async with ClientSession() as session:
        async with session.ws_connect(stream_h2h_url) as socket:
            await socket.send_str(
                conversation.join(
                    model="fr",
                    interim_results=False,
                    rescoring=True,
                    origin=int(time.time() * 1000),
                    country="fr",
                )
            )
            # check we didn't get an error on join
            msg = await socket.receive()
            reply = conversation.receive(msg.data)
            assert isinstance(reply, Ok)

As you can see, the I/O is cleanly decoupled from the protocol handling: the Conversation object is only used to create the messages to send to the API and to decode the received messages as Event objects.

See the complete examples in the source distribution.
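The advice to stream audio and read events in two separate tasks can be sketched with asyncio as follows. This is only an illustration under stated assumptions: `FakeSocket` is a hypothetical in-memory stand-in for a real websocket (its `send_bytes` and `receive_str` methods mirror the aiohttp client websocket), and the chunks and events are placeholder values. In real code the sender would send the result of `conversation.send_audio_chunk(chunk)` and the reader would pass each message to `conversation.receive(...)`.

```python
import asyncio
from collections import deque


class FakeSocket:
    """Hypothetical in-memory stand-in for a websocket connection."""

    def __init__(self, incoming):
        self._incoming = deque(incoming)
        self.sent = []

    async def send_bytes(self, data):
        self.sent.append(data)

    async def receive_str(self):
        await asyncio.sleep(0)  # simulate waiting on the network
        return self._incoming.popleft()


async def stream_audio(socket, chunks):
    """Sender task: push audio chunks without waiting for events."""
    for chunk in chunks:
        await socket.send_bytes(chunk)
        await asyncio.sleep(0)  # yield so the reader task can run


async def read_events(socket, expected, received):
    """Reader task: collect incoming messages independently of the sender."""
    for _ in range(expected):
        received.append(await socket.receive_str())


async def run_both(socket, chunks, expected):
    received = []
    # Both coroutines run concurrently on the same connection.
    await asyncio.gather(
        stream_audio(socket, chunks),
        read_events(socket, expected, received),
    )
    return received


demo_socket = FakeSocket(["event-1", "event-2"])
demo_chunks = [b"\x00" * 320, b"\x00" * 320]
demo_received = asyncio.run(run_both(demo_socket, demo_chunks, expected=2))
```

With a synchronous client, the same split would typically be done with two threads instead of two tasks.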

Conversation

Conversation(identifier: str, conversation_id: str, speaker: str)

To join a conversation on the API, you need a Conversation object.

Use only one Conversation per connection (socket); otherwise you risk unexpected behavior (and exceptions!).

Create a Conversation.

Parameters:

  • identifier (str) –

    the identifier you got when you subscribed to the service.

  • conversation_id (str) –

    the conversation you wish to join.

  • speaker (str) –

    your alias in the conversation, to identify you and your events.

left property

left

Did the server confirm we left the conversation?

join

join(model: str = 'fr', country: str = 'fr', readonly: bool = False, interim_results: bool = True, rescoring: bool = True, origin: int = 0, audio_codec: str = 'linear') -> str

Join the conversation.

Parameters:

  • readonly (bool, default: False ) –

    if you are not going to stream audio, set it to True.

  • model (str, default: 'fr' ) –

    (if readonly is False) the ASR language model to be used to recognize the audio you will stream.

  • country (str, default: 'fr' ) –

    the two-letter ISO 3166-1 country code of the place where the speaker is.

  • interim_results (bool, default: True ) –

    (readonly = False only) should the ASR trigger interim result events?

  • rescoring (bool, default: True ) –

    (readonly = False only) should the ASR refine the final segment with a bigger Language Model? May give slightly degraded results for very short segments.

  • origin (int, default: 0 ) –

    The UNIX time, in milliseconds, to which the event timeline origin is set.

  • audio_codec (str, default: 'linear' ) –

    the speech audio codec of the audio data:

    • "linear": (default) linear 16-bit SLE raw PCM audio at 8 kHz;
    • "g711a": G711 a-law audio at 8 kHz;
    • "g711u": G711 μ-law audio at 8 kHz.

Returns:

  • str –

    The text websocket message to send to the server.

Raises:

  • ProtocolError –

    if still in a previously joined conversation.
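The audio_codec values above determine how many bytes one chunk of audio occupies. The sketch below computes chunk sizes from the documented formats (16-bit linear PCM uses 2 bytes per sample, G.711 uses 1 byte per sample, all at 8 kHz). The helper names `chunk_size` and `split_chunks` and the 100 ms chunk duration are illustrative assumptions, not part of the SDK or a documented requirement.

```python
# Bytes per sample for each documented audio_codec value.
BYTES_PER_SAMPLE = {"linear": 2, "g711a": 1, "g711u": 1}
SAMPLE_RATE_HZ = 8000  # all three codecs are documented at 8 kHz


def chunk_size(audio_codec: str, duration_ms: int) -> int:
    """Number of bytes holding `duration_ms` of audio for the given codec."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE[audio_codec]


def split_chunks(audio: bytes, size: int) -> list:
    """Cut a raw audio buffer into consecutive chunks of at most `size` bytes."""
    return [audio[i:i + size] for i in range(0, len(audio), size)]


# 100 ms is an arbitrary illustrative duration.
linear_100ms = chunk_size("linear", 100)
g711_100ms = chunk_size("g711a", 100)
demo_lengths = [len(c) for c in split_chunks(bytes(4000), linear_100ms)]
```

Each resulting chunk would then be passed to send_audio_chunk before being sent over the socket.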

leave

leave() -> str

Leave the current conversation.

It's a good idea to leave a conversation and continue to consume messages until you receive a SpeakerLeft event for your speaker, before you close the connection. Otherwise, you may miss parts of the transcription.

Returns:

  • str –

    The text websocket message to send to the server.

Raises:

  • ProtocolError –

    if not currently in a conversation.
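The recommendation to keep consuming messages until your own SpeakerLeft event arrives can be sketched as a simple drain loop. The `Stub*` classes below are hypothetical stand-ins for the SDK event classes so the example runs on its own. In real code, you would first send `conversation.leave()` and then feed each received message through `conversation.receive(...)`.

```python
from dataclasses import dataclass


@dataclass
class StubEvent:
    """Hypothetical stand-in for an SDK event; only `speaker` is modeled."""
    speaker: str


class StubSegmentDecoded(StubEvent):
    pass


class StubSpeakerLeft(StubEvent):
    pass


def drain_until_left(events, my_speaker):
    """Keep every event until our own SpeakerLeft stand-in shows up."""
    kept = []
    for event in events:
        if isinstance(event, StubSpeakerLeft) and event.speaker == my_speaker:
            break  # the server confirmed we left; safe to close the socket
        kept.append(event)
    return kept


demo_events = [
    StubSegmentDecoded("alice"),
    StubSpeakerLeft("bob"),       # another speaker leaving does not stop us
    StubSegmentDecoded("alice"),  # late transcript we would otherwise miss
    StubSpeakerLeft("alice"),
    StubSegmentDecoded("bob"),    # never consumed: we already stopped
]
demo_kept = drain_until_left(demo_events, "alice")
```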

send_audio_chunk

send_audio_chunk(chunk: bytes) -> bytes

Build an audio chunk for streaming.

Returns:

  • bytes –

    The binary websocket message to send to the server.

Raises:

  • ProtocolError –

    if not currently in a conversation.

receive

receive(data: Union[str, bytes]) -> Event

Decode received websocket message.

The server only sends text messages.

Returns:

  • Event –

    The appropriate Event subclass instance.

ProtocolError

Bases: RuntimeError

Exception raised when a Conversation method is not available in the current state.

EntityFound

EntityFound(join_ref, ref, topic, event, payload)

Bases: TimeScopedEvent

The class for all entity annotation events.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

entity_name property

entity_name: str

The name of the named entity found.

lang property

lang: str

Natural Language of the interpretation.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

canonical property

canonical: str

The well formatted form of the entity in the language (string).

original property

original: str

The transcript excerpt that was interpreted, as string.

value property

value: Any

The interpreted value in machine understandable form.

The exact type depends on the entity.

confidence property

confidence: float

The confidence of the interpretation.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

EntityReference

EntityReference(entity_name: str, speaker: str, start: int)

Reference to a unique previously found Entity in the conversation.

kind instance-attribute

kind: str = entity_name

The name of the Entity referenced.

speaker instance-attribute

speaker: str = speaker

The speaker identifier.

start instance-attribute

start: int = start

The UNIX start time of the referenced Entity.

Event

Event(join_ref, ref, topic, event, payload)

Bases: object

The base class of all events.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

Ok

Ok(join_ref, ref, topic, event, payload)

Bases: Event

API asynchronous command acknowledgements.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

RelationFound

RelationFound(join_ref, ref, topic, event, payload)

Bases: TimeScopedEvent

The class for all Relation events.

Relations express a semantic relationship between two or more entities.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

relation_name property

relation_name: str

The type of the relation.

lang property

lang: str

Natural Language of the interpretation.

As ISO 639-1 code.

confidence property

confidence: float

The confidence on the discovered relationship.

members property

members: List[EntityReference]

References to the Entities involved in this relationship.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.
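One possible way to resolve the members of a relation back to previously received entities is to index entities as they arrive. The sketch below assumes that the triple (entity name, speaker, start), which is what EntityReference carries, is enough to identify an entity. `FakeEntity` and `FakeReference` are hypothetical stand-ins for EntityFound and EntityReference, and the entity name "city" is made up for illustration.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class FakeEntity(NamedTuple):
    """Hypothetical stand-in for an EntityFound event."""
    entity_name: str
    speaker: str
    start: int
    canonical: str


class FakeReference(NamedTuple):
    """Hypothetical stand-in for an EntityReference."""
    kind: str
    speaker: str
    start: int


Key = Tuple[str, str, int]


def index_entities(entities) -> Dict[Key, FakeEntity]:
    """Index entities by (name, speaker, start) for later lookup."""
    return {(e.entity_name, e.speaker, e.start): e for e in entities}


def resolve(ref: FakeReference, index: Dict[Key, FakeEntity]) -> Optional[FakeEntity]:
    """Return the entity a reference points to, or None if it was never seen."""
    return index.get((ref.kind, ref.speaker, ref.start))


demo_index = index_entities([FakeEntity("city", "alice", 1000, "Paris")])
demo_hit = resolve(FakeReference("city", "alice", 1000), demo_index)
demo_miss = resolve(FakeReference("city", "bob", 1000), demo_index)
```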

SegmentDecoded

SegmentDecoded(join_ref, ref, topic, event, payload)

Bases: SpeechDecoded

Final segment transcript event.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

transcript property

transcript: str

Get the transcript of the whole segment as a string

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

utterance_id property

utterance_id: str

The Utterance id identifies the speech utterance this event transcribes.

words property

words: List[Word]

Get the transcript of the whole segment as a list of timestamped words.

confidence property

confidence: float

The ASR confidence for this segment.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

SpeakerJoined

SpeakerJoined(join_ref, ref, topic, event, payload)

Bases: Event

A new speaker joined the conversation (after us).

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

timestamp property

timestamp: int

The UNIX time when the speaker joined the conversation.

interim_results property

interim_results: bool

Are interim results activated for this speaker?

rescoring property

rescoring: bool

Is rescoring enabled for this speaker?

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

SpeakerLeft

SpeakerLeft(join_ref, ref, topic, event, payload)

Bases: Event

Event emitted by the associated speaker when they left the conversation.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

timestamp property

timestamp: int

UNIX time when the speaker left the conversation.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

SpeechDecoded

SpeechDecoded(join_ref, ref, topic, event, payload)

Bases: TimeScopedEvent

The base class of all transcription events.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

transcript property

transcript: str

Get the transcript of the whole segment as a string

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

utterance_id property

utterance_id: str

The Utterance id identifies the speech utterance this event transcribes.

words property

words: List[Word]

Get the transcript of the whole segment as a list of timestamped words.

confidence property

confidence: float

The ASR confidence for this segment.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

Tag

Tag(uuid: str, label: str)

A tag represents a behavioral feature found in the conversation.

uuid instance-attribute

uuid: str = uuid

The unique id of the Tag.

label instance-attribute

label: str = label

The human readable name of the Tag.

TagsFound

TagsFound(join_ref, ref, topic, event, payload)

Bases: TimeScopedEvent

One or more tags were found on this time range.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

lang property

lang: str

Natural Language of the interpretation.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

confidence property

confidence: float

Tagger confidence.

tags property

tags: List[Tag]

The tags that were found on this time range

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

Unknown

Unknown(join_ref, ref, topic, event, payload)

Bases: Event

The server emitted an event unknown to this SDK. Time to upgrade!

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.

Word

Bases: dict

Timestamped word.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Word length in milliseconds, according to the audio timeline.

word property

word: str

Transcript token string for this word.

confidence property

confidence: float

ASR confidence for this word.
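Because start and end are Unix timestamps in milliseconds, they usually need converting before display. The sketch below shows two conversions using only the standard library: to a timezone-aware datetime, and to an offset in seconds relative to the origin passed to join. The helper names are illustrative and not part of the SDK.

```python
from datetime import datetime, timezone


def to_datetime(unix_ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)


def offset_seconds(unix_ms: int, origin_ms: int) -> float:
    """Position of a timestamp relative to the timeline origin, in seconds."""
    return (unix_ms - origin_ms) / 1000


# Illustrative values: an origin and a word starting 1.5 s after it.
demo_origin_ms = 1_700_000_000_000
demo_word_start_ms = 1_700_000_001_500
demo_offset = offset_seconds(demo_word_start_ms, demo_origin_ms)
demo_when = to_datetime(demo_origin_ms)
```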

WordsDecoded

WordsDecoded(join_ref, ref, topic, event, payload)

Bases: SpeechDecoded

Interim segment transcript event.

topic property

topic: str

The conversation identifier

speaker property

speaker: str

The speaker whose speech triggered this event.

All events are relative to a speaker.

start property

start: int

Start time as a Unix timestamp in milliseconds, according to the audio timeline.

end property

end: int

End time as a Unix timestamp in milliseconds, according to the audio timeline.

length property

length: int

Event length in milliseconds, according to the audio timeline.

transcript property

transcript: str

Get the transcript of the whole segment as a string

lang property

lang: str

Natural Language of the speech.

As ISO 639-1 code.

country property

country: str

Country location of speaker.

As ISO 3166-1 code.

utterance_id property

utterance_id: str

The Utterance id identifies the speech utterance this event transcribes.

words property

words: List[Word]

Get the transcript of the whole segment as a list of timestamped words.

confidence property

confidence: float

The ASR confidence for this segment.

from_message staticmethod

from_message(message)

Private method to instantiate the right type of event from the raw websocket message.
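A client that enables interim results typically shows the latest WordsDecoded text and replaces it once the SegmentDecoded event arrives. The sketch below assumes that interim and final events for the same utterance share the same utterance_id. The `Fake*` classes are hypothetical stand-ins for the SDK classes, reduced to the two properties used here.

```python
class FakeSpeechDecoded:
    """Hypothetical stand-in for SpeechDecoded."""

    def __init__(self, utterance_id: str, transcript: str):
        self.utterance_id = utterance_id
        self.transcript = transcript


class FakeWordsDecoded(FakeSpeechDecoded):
    """Stand-in for an interim transcript event."""


class FakeSegmentDecoded(FakeSpeechDecoded):
    """Stand-in for a final transcript event."""


def apply_event(interim: dict, final: list, event: FakeSpeechDecoded) -> None:
    """Update the interim and final transcript stores with one event."""
    if isinstance(event, FakeSegmentDecoded):
        # The final transcript supersedes any interim text for the utterance.
        interim.pop(event.utterance_id, None)
        final.append(event.transcript)
    elif isinstance(event, FakeWordsDecoded):
        # Keep only the most recent interim transcript per utterance.
        interim[event.utterance_id] = event.transcript


demo_interim, demo_final = {}, []
for demo_event in [
    FakeWordsDecoded("u1", "bon"),
    FakeWordsDecoded("u1", "bonjour"),
    FakeSegmentDecoded("u1", "Bonjour."),
    FakeWordsDecoded("u2", "comment"),
]:
    apply_event(demo_interim, demo_final, demo_event)
```

Both event types derive from SpeechDecoded, so the isinstance checks must test the specific subclasses rather than the shared base class.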

build_conversation_url

build_conversation_url(token: str) -> str

Make an authenticated URL to connect to the Conversation Service.