uhlive.stream.conversation
The Stream Conversation SDK API for human-to-human interactions.
This API is used to consume a real-time audio stream and get enriched transcription events.
The protocol is message based and uses websockets as transport. You are free to use whatever websocket client library you like to communicate with the API, and use our SDK to encode/decode the messages.
Quickstart
First retrieve a one-time access token with the Auth API.
Then use that token to build an authenticated URL, open a websocket connection to it with the websocket client library
of your choice and instantiate a Conversation
to join a conversation, generate
audio stream messages and decode transcription and enrichment events.
As the API is asynchronous, streaming the audio and reading the returned events should be done in two different threads/tasks, as shown in the concurrent sketch at the end of this quickstart.
from uhlive.stream.conversation import *

stream_h2h_url = build_conversation_url(token)

# The subscription identifier was given to you with your other credentials.
# The conversation id can be any string you like: if a conversation by that
# name already exists in your subscription identifier domain, you will join it
# as a new speaker; otherwise it will be created and you will be joined in.
# The speaker id helps you identify who is speaking.
conversation = Conversation("subscription_identifier", "a_conversation_id", "a_speaker_id")
Now you can connect and interact with the API:
Synchronous example:
import time

import websocket as ws

socket = ws.create_connection(stream_h2h_url, timeout=10)
socket.send(
    conversation.join(
        model="fr",
        interim_results=False,
        rescoring=True,
        origin=int(time.time() * 1000),
        country="fr",
    )
)
# Check we didn't get an error on join
reply = conversation.receive(socket.recv())
assert isinstance(reply, Ok)
Asynchronous example:
import time

from aiohttp import ClientSession

async def main(uhlive_client, uhlive_secret):
    async with ClientSession() as session:
        async with session.ws_connect(stream_h2h_url) as socket:
            await socket.send_str(
                conversation.join(
                    model="fr",
                    interim_results=False,
                    rescoring=True,
                    origin=int(time.time() * 1000),
                    country="fr",
                )
            )
            # Check we didn't get an error on join
            msg = await socket.receive()
            reply = conversation.receive(msg.data)
            assert isinstance(reply, Ok)
As you can see, the I/O is cleanly decoupled from the protocol handling: the Conversation object is only used to create the messages to send to the API and to decode the received messages as Event objects.
See the complete examples in the source distribution.
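To make the two-threads/tasks advice concrete, here is a minimal end-to-end sketch that streams audio and reads events concurrently with asyncio. It assumes the audio comes from a raw 8kHz 16-bit linear PCM file named call.pcm, that send_audio_chunk takes the raw audio bytes as its argument, and that token was obtained from the Auth API; adapt the names to your setup.

import asyncio
import time

from aiohttp import ClientSession
from uhlive.stream.conversation import (
    Conversation,
    Ok,
    SpeakerLeft,
    build_conversation_url,
)

async def stream_audio(socket, conversation):
    # 100ms of 8kHz 16-bit linear PCM is 1600 bytes.
    with open("call.pcm", "rb") as audio:
        while chunk := audio.read(1600):
            await socket.send_bytes(conversation.send_audio_chunk(chunk))
            await asyncio.sleep(0.1)  # pace the stream at real time
    await socket.send_str(conversation.leave())

async def read_events(socket, conversation):
    async for msg in socket:
        event = conversation.receive(msg.data)
        print(event)
        # Stop once our own speaker has left: no more events will concern us.
        if isinstance(event, SpeakerLeft) and event.speaker == "a_speaker_id":
            break

async def main(token):
    conversation = Conversation("subscription_identifier", "a_conversation_id", "a_speaker_id")
    async with ClientSession() as session:
        async with session.ws_connect(build_conversation_url(token)) as socket:
            await socket.send_str(conversation.join(origin=int(time.time() * 1000)))
            reply = conversation.receive((await socket.receive()).data)
            assert isinstance(reply, Ok)
            await asyncio.gather(
                stream_audio(socket, conversation),
                read_events(socket, conversation),
            )

# asyncio.run(main(token))  # token obtained from the Auth API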
Conversation
To join a conversation on the API, you need a Conversation object.
You can only have one Conversation per connection (socket), otherwise you risk
unexpected behavior (and exceptions!).
Create a Conversation.
Parameters:
- identifier (str) – the identifier you got when you subscribed to the service;
- conversation_id (str) – the conversation you wish to join;
- speaker (str) – your alias in the conversation, to identify you and your events.
join
join(
    model: str = "fr",
    country: str = "fr",
    readonly: bool = False,
    interim_results: bool = True,
    rescoring: bool = True,
    origin: int = 0,
    audio_codec: str = "linear",
) -> str
Join the conversation.
Parameters:
- readonly (bool, default: False) – if you are not going to stream audio, set it to True.
- model (str, default: 'fr') – (if readonly is False) the ASR language model to be used to recognize the audio you will stream.
- country (str, default: 'fr') – the ISO two-letter country code of the place where the speaker is.
- interim_results (bool, default: True) – (readonly=False only) should the ASR trigger interim result events?
- rescoring (bool, default: True) – (readonly=False only) should the ASR refine the final segment with a bigger language model? May give slightly degraded results for very short segments.
- origin (int, default: 0) – the UNIX time, in milliseconds, to which the event timeline origin is set.
- audio_codec (str, default: 'linear') – the speech audio codec of the audio data:
  - "linear": (default) linear 16-bit SLE raw PCM audio at 8kHz;
  - "g711a": G711 a-law audio at 8kHz;
  - "g711u": G711 μ-law audio at 8kHz.
Returns:
- str – The text websocket message to send to the server.

Raises:
- ProtocolError – if still in a previously joined conversation.
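For example, a pure listener that only consumes events could join in read-only mode. A sketch, reusing the synchronous socket from the quickstart:

# No audio will be streamed, so the ASR parameters
# (model, interim_results, rescoring) don't apply here.
socket.send(
    conversation.join(
        readonly=True,
        origin=int(time.time() * 1000),
        country="fr",
    )
)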
leave
Leave the current conversation.
It's a good idea to leave a conversation and continue to consume messages
until you receive a SpeakerLeft event for your speaker before you
close the connection. Otherwise, you may miss parts of the transcription (see the sketch below).

Returns:
- str – The text websocket message to send to the server.

Raises:
- ProtocolError – if not currently in a conversation.
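A sketch of that shutdown sequence with the synchronous client from the quickstart, assuming our speaker alias is "a_speaker_id":

from uhlive.stream.conversation import SpeakerLeft

socket.send(conversation.leave())
# Keep consuming events until our own SpeakerLeft arrives,
# so we don't miss the tail of the transcription.
while True:
    event = conversation.receive(socket.recv())
    print(event)
    if isinstance(event, SpeakerLeft) and event.speaker == "a_speaker_id":
        break
socket.close()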
send_audio_chunk
Build an audio chunk for streaming.
Returns:
- bytes – The binary websocket message to send to the server.

Raises:
- ProtocolError – if not currently in a conversation.
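A minimal synchronous streaming loop might look like the following; it assumes send_audio_chunk takes the raw audio bytes and that the source is a raw 8kHz 16-bit linear PCM file (the default "linear" codec):

import time

# 100ms of 8kHz 16-bit linear PCM is 8000 * 2 * 0.1 = 1600 bytes.
with open("call.pcm", "rb") as audio:
    while chunk := audio.read(1600):
        socket.send_binary(conversation.send_audio_chunk(chunk))
        time.sleep(0.1)  # pace the stream at real time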
ProtocolError
Bases: RuntimeError
Exception raised when a Conversation method is not available in the current state.
EntityFound
Bases: TimeScopedEvent
The class for all entity annotation events.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
value
property
The interpreted value in machine understandable form.
The exact type depends on the entity.
EntityReference
Reference to a unique previously found Entity in the conversation.
Event
Bases: object
The base class of all events.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
RelationFound
Bases: TimeScopedEvent
The class for all Relation events.
Relations express a semantic relationship between two or more entities.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
members
property
References to the Entities involved in this relationship.
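As an illustration, a handler might surface entities and relations like this. A sketch using only the documented properties; the exact shapes of value and members depend on the entity and relation types:

from uhlive.stream.conversation import EntityFound, RelationFound

event = conversation.receive(socket.recv())
if isinstance(event, EntityFound):
    # value is the machine-understandable interpretation;
    # its exact type depends on the entity.
    print(event.speaker, "->", event.value)
elif isinstance(event, RelationFound):
    # members holds EntityReference objects pointing at
    # previously found entities in the conversation.
    print(event.speaker, "relation over", list(event.members))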
SegmentDecoded
Bases: SpeechDecoded
Final segment transcript event.
SpeakerJoined
Bases: Event
A new speaker joined the conversation (after us).
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
SpeakerLeft
Bases: Event
Event emitted when the associated speaker leaves the conversation.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
SpeechDecoded
Bases: TimeScopedEvent
The base class of all transcription events.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
utterance_id
property
The Utterance id identifies the speech utterance this event transcribes.
words
property
Get the transcript of the whole segment as a list of timestamped words.
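For instance, a transcript consumer can distinguish interim from final results by dispatching on the event type. A sketch (Word is a dict subclass, so we print each one as a plain dict):

from uhlive.stream.conversation import SegmentDecoded, WordsDecoded

event = conversation.receive(socket.recv())
if isinstance(event, WordsDecoded):
    print("interim:", event.words)  # may still be refined
elif isinstance(event, SegmentDecoded):
    # Final, possibly rescored, transcript of the segment.
    for word in event.words:
        print(dict(word))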
Tag
TagsFound
Bases: TimeScopedEvent
One or more tags were found on this time range.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
Unknown
Bases: Event
The server emitted an event unknown to this SDK. Time to upgrade!
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
Word
Bases: dict
Timestamped word.
WordsDecoded
Bases: SpeechDecoded
Interim segment transcript event.