uhlive.stream.conversation
The Stream Conversation SDK API for human-to-human interactions.
This API is used to consume a real-time audio stream and get enriched transcription events.
The protocol is message based and uses websockets as transport. You are free to use whatever websocket client library you like to communicate with the API, and use our SDK to encode/decode the messages.
Quickstart
First retrieve a one-time access token with the Auth API.
Then use that token to build an authenticated URL, open a websocket connection to it with the websocket client library
of your choice and instantiate a Conversation
to join a conversation, generate
audio stream messages and decode transcription and enrichment events.
As the API is asynchronous, streaming the audio and reading the returned events should be done in two different threads/tasks, as shown in the concurrent sketch at the end of this quickstart.
from uhlive.stream.conversation import *

stream_h2h_url = build_conversation_url(token)

# The subscription identifier was given to you with your other credentials.
# The conversation id can be any string you like: if a conversation by that
# name already exists in your subscription identifier domain, you will join it
# as a new speaker; otherwise it will be created and you will be joined in.
# The speaker id helps you identify who is speaking.
conversation = Conversation("subscription_identifier", "a_conversation_id", "a_speaker_id")
Now you can connect and interact with the API:
Synchronous example:
import time

import websocket as ws

socket = ws.create_connection(stream_h2h_url, timeout=10)
socket.send(
    conversation.join(
        model="fr",
        interim_results=False,
        rescoring=True,
        origin=int(time.time() * 1000),
        country="fr",
    )
)
# Check we didn't get an error on join
reply = conversation.receive(socket.recv())
assert isinstance(reply, Ok)
Asynchronous example:
import time

from aiohttp import ClientSession

async def main(uhlive_client, uhlive_secret):
    async with ClientSession() as session:
        async with session.ws_connect(stream_h2h_url) as socket:
            await socket.send_str(
                conversation.join(
                    model="fr",
                    interim_results=False,
                    rescoring=True,
                    origin=int(time.time() * 1000),
                    country="fr",
                )
            )
            # Check we didn't get an error on join
            msg = await socket.receive()
            reply = conversation.receive(msg.data)
            assert isinstance(reply, Ok)
As you can see, the I/O is cleanly decoupled from the protocol handling: the Conversation object is only used to create the messages to send to the API and to decode the received messages as Event objects.
See the complete examples in the source distribution.
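To make the two-threads/tasks advice concrete, here is a minimal end-to-end sketch that streams audio and reads events concurrently with asyncio. It assumes the audio comes from a raw 8kHz 16-bit linear PCM file named call.pcm, that send_audio_chunk takes the raw audio bytes as its argument, and that token was obtained from the Auth API; adapt the names to your setup.

import asyncio
import time

from aiohttp import ClientSession
from uhlive.stream.conversation import (
    Conversation,
    Ok,
    SpeakerLeft,
    build_conversation_url,
)

async def stream_audio(socket, conversation):
    # 100ms of 8kHz 16-bit linear PCM is 1600 bytes.
    with open("call.pcm", "rb") as audio:
        while chunk := audio.read(1600):
            await socket.send_bytes(conversation.send_audio_chunk(chunk))
            await asyncio.sleep(0.1)  # pace the stream at real time
    await socket.send_str(conversation.leave())

async def read_events(socket, conversation):
    async for msg in socket:
        event = conversation.receive(msg.data)
        print(event)
        # Stop once our own speaker has left: no more events will concern us.
        if isinstance(event, SpeakerLeft) and event.speaker == "a_speaker_id":
            break

async def main(token):
    conversation = Conversation("subscription_identifier", "a_conversation_id", "a_speaker_id")
    async with ClientSession() as session:
        async with session.ws_connect(build_conversation_url(token)) as socket:
            await socket.send_str(conversation.join(origin=int(time.time() * 1000)))
            reply = conversation.receive((await socket.receive()).data)
            assert isinstance(reply, Ok)
            await asyncio.gather(
                stream_audio(socket, conversation),
                read_events(socket, conversation),
            )

# asyncio.run(main(token))  # token obtained from the Auth API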
Conversation
To join a conversation on the API, you need a Conversation object.
You can only have one Conversation per connection (socket), otherwise you risk
unexpected behavior (and exceptions!).
Create a Conversation.
Parameters:
- identifier (str) – the identifier you got when you subscribed to the service;
- conversation_id (str) – the conversation you wish to join;
- speaker (str) – your alias in the conversation, to identify you and your events.
join
join(
    model: str = "fr",
    country: str = "fr",
    readonly: bool = False,
    interim_results: bool = True,
    rescoring: bool = True,
    origin: int = 0,
    audio_codec: str = "linear",
) -> str
Join the conversation.
Parameters:
- readonly (bool, default: False) – if you are not going to stream audio, set it to True.
- model (str, default: 'fr') – (if readonly is False) the ASR language model to be used to recognize the audio you will stream.
- country (str, default: 'fr') – the ISO two-letter country code of the place where the speaker is.
- interim_results (bool, default: True) – (readonly=False only) should the ASR trigger interim result events?
- rescoring (bool, default: True) – (readonly=False only) should the ASR refine the final segment with a bigger language model? May give slightly degraded results for very short segments.
- origin (int, default: 0) – the UNIX time, in milliseconds, to which the event timeline origin is set.
- audio_codec (str, default: 'linear') – the speech audio codec of the audio data:
  - "linear": (default) linear 16-bit SLE raw PCM audio at 8kHz;
  - "g711a": G711 a-law audio at 8kHz;
  - "g711u": G711 μ-law audio at 8kHz.
Returns:
- str – The text websocket message to send to the server.

Raises:
- ProtocolError – if still in a previously joined conversation.
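For example, a pure listener that only consumes events could join in read-only mode. A sketch, reusing the synchronous socket from the quickstart:

# No audio will be streamed, so the ASR parameters
# (model, interim_results, rescoring) don't apply here.
socket.send(
    conversation.join(
        readonly=True,
        origin=int(time.time() * 1000),
        country="fr",
    )
)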
leave
Leave the current conversation.
It's a good idea to leave a conversation and continue to consume messages
until you receive a SpeakerLeft event for your speaker before you
close the connection. Otherwise, you may miss parts of the transcription (see the sketch below).

Returns:
- str – The text websocket message to send to the server.

Raises:
- ProtocolError – if not currently in a conversation.
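A sketch of that shutdown sequence with the synchronous client from the quickstart, assuming our speaker alias is "a_speaker_id":

from uhlive.stream.conversation import SpeakerLeft

socket.send(conversation.leave())
# Keep consuming events until our own SpeakerLeft arrives,
# so we don't miss the tail of the transcription.
while True:
    event = conversation.receive(socket.recv())
    print(event)
    if isinstance(event, SpeakerLeft) and event.speaker == "a_speaker_id":
        break
socket.close()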
send_audio_chunk
Build an audio chunk for streaming.
Returns:
- bytes – The binary websocket message to send to the server.

Raises:
- ProtocolError – if not currently in a conversation.
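A minimal synchronous streaming loop might look like the following; it assumes send_audio_chunk takes the raw audio bytes and that the source is a raw 8kHz 16-bit linear PCM file (the default "linear" codec):

import time

# 100ms of 8kHz 16-bit linear PCM is 8000 * 2 * 0.1 = 1600 bytes.
with open("call.pcm", "rb") as audio:
    while chunk := audio.read(1600):
        socket.send_binary(conversation.send_audio_chunk(chunk))
        time.sleep(0.1)  # pace the stream at real time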
ProtocolError
Bases: RuntimeError
Exception raised when a Conversation method is not available in the current state.
EntityFound
Bases: TimeScopedEvent
The class for all entity annotation events.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
value
property
The interpreted value in machine understandable form.
The exact type depends on the entity.
EntityReference
Reference to a unique previously found Entity in the conversation.
Event
Bases: object
The base class of all events.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
RelationFound
Bases: TimeScopedEvent
The class for all Relation events.
Relations express a semantic relationship between two or more entities.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
members
property
References to the Entities involved in this relationship.
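As an illustration, a handler might surface entities and relations like this. A sketch using only the documented properties; the exact shapes of value and members depend on the entity and relation types:

from uhlive.stream.conversation import EntityFound, RelationFound

event = conversation.receive(socket.recv())
if isinstance(event, EntityFound):
    # value is the machine-understandable interpretation;
    # its exact type depends on the entity.
    print(event.speaker, "->", event.value)
elif isinstance(event, RelationFound):
    # members holds EntityReference objects pointing at
    # previously found entities in the conversation.
    print(event.speaker, "relation over", list(event.members))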
SegmentDecoded
Bases: SpeechDecoded
Final segment transcript event.
SpeakerJoined
Bases: Event
A new speaker joined the conversation (after us).
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
SpeakerLeft
Bases: Event
Event emitted when the associated speaker leaves the conversation.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
SpeechDecoded
Bases: TimeScopedEvent
The base class of all transcription events.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
utterance_id
property
The Utterance id identifies the speech utterance this event transcribes.
words
property
Get the transcript of the whole segment as a list of timestamped words.
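For instance, a transcript consumer can distinguish interim from final results by dispatching on the event type. A sketch (Word is a dict subclass, so we print each one as a plain dict):

from uhlive.stream.conversation import SegmentDecoded, WordsDecoded

event = conversation.receive(socket.recv())
if isinstance(event, WordsDecoded):
    print("interim:", event.words)  # may still be refined
elif isinstance(event, SegmentDecoded):
    # Final, possibly rescored, transcript of the segment.
    for word in event.words:
        print(dict(word))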
Tag
TagsFound
Bases: TimeScopedEvent
One or more tags were found on this time range.
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
Unknown
Bases: Event
The server emitted an event unknown to this SDK. Time to upgrade!
speaker
property
The speaker whose speech triggered this event.
All events are relative to a speaker.
Word
Bases: dict
Timestamped word.
WordsDecoded
Bases: SpeechDecoded
Interim segment transcript event.