Stream audio to Speech to Text engine

Mišo Belica
Thu, Jun 18, 2020 1:56 PM

Hello,

a few days ago I started experimenting with the PJSUA Python binding, and
I am now able to receive a call and play a WAV file back. My goal,
however, is to receive the call and redirect the incoming RTP stream to a
Speech to Text engine (https://download.phonexia.com/docs/spe/#RTP_HTTP_streams)
that gives me the text. In response I will play WAV recordings back as
I get the utterances. The problem is that I cannot find the right
approach. My idea is to modify the SDP payload in the callback
"Call.onCallSdpCreated" so that it points to the opened Speech-to-Text RTP
stream, but I have had no luck with it. I wanted to simply replace the
port in the SDP payload, because the app is running on the same host.
Can you suggest how to do that, and tell me whether I am going in the
right direction? Or, if there is a better way, point me towards it?

I have already searched the documentation, GitHub and StackOverflow, and
the best I could find is
https://stackoverflow.com/questions/31023274/how-to-catch-and-translate-incoming-audio-stream-in-other-languages-for-an-ios-c
but that feels hacky, and I have no idea how I would redirect the stream
from that file to a socket. I also need to handle more than one call, and
I guess all of them would be mixed in the file, so it's not the right way
to go.

The code below is my poor attempt, but I always get an error from SWIG,
for example: TypeError: Attempt to append a non SwigPyObject

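    def onCallSdpCreated(self, prm: pj.OnCallSdpCreatedParam):
        log("SDP created: ", dir(prm.sdp.pjSdpSession))
        log("SDP created: ", prm.sdp.wholeSdp)
        # Raises the TypeError above: pjSdpSession is an opaque SWIG pointer,
        # and its append() accepts only other SwigPyObject instances, not str.
        prm.sdp.pjSdpSession.append("m=audio 10000 RTP/AVP 0 101")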

Thanks in advance for the answer and thanks for your work :)
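
For illustration only, here is a minimal sketch of what "replacing the port in the SDP payload" means at the text level. It works on a plain SDP string with the standard library and does not touch PJSUA2. Whether the Python binding would accept a modified string back is not established in this thread. The function name replace_audio_port and the sample SDP are made up for the example.

```python
import re


def replace_audio_port(sdp: str, new_port: int) -> str:
    """Return a copy of the SDP text with the port of the first m=audio line replaced."""
    if not 0 < new_port < 65536:
        raise ValueError("port out of range: %r" % new_port)
    # Match "m=audio <port>" at the start of a line; the port is followed
    # by a space or by "/<number of ports>".
    pattern = re.compile(r"^(m=audio )(\d+)(?=[ /])", re.MULTILINE)
    new_sdp, count = pattern.subn(
        lambda m: "%s%d" % (m.group(1), new_port), sdp, count=1
    )
    if count == 0:
        raise ValueError("no m=audio line found")
    return new_sdp


# Hypothetical SDP body, written by hand for this example.
EXAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=- 3801 3801 IN IP4 127.0.0.1",
    "s=example",
    "c=IN IP4 127.0.0.1",
    "t=0 0",
    "m=audio 4000 RTP/AVP 0 101",
    "a=rtpmap:0 PCMU/8000",
    "a=rtpmap:101 telephone-event/8000",
    "",
])

if __name__ == "__main__":
    print(replace_audio_port(EXAMPLE_SDP, 10000))
```

Only the media line changes; the connection address (c=) and the codec attributes are left as they were.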

Andreas Wehrmann
Thu, Jun 18, 2020 2:50 PM

First, let me say that I'm not familiar with the Python bindings, so I'm
not sure this will help...
Here's how I would do it in C (maybe the functions my solution needs
are exported to Python as well):

If I understand correctly, your app sits between a SIP caller and
something generating an RTP stream.
With the raw C API you can create media streams manually and
add them to the conference bridge.
That way you can simply connect the two streams on the conference
bridge, and voilà.

First you need to create a (UDP) transport that the stream will use for
sending/receiving RTP packets:
https://www.pjsip.org/docs/latest-2/pjmedia/docs/html/group__PJMEDIA__TRANSPORT__UDP.htm

Create a media (RTP) stream with pjmedia_stream_create():
https://www.pjsip.org/docs/latest-2/pjmedia/docs/html/group__PJMED__STRM.htm#ga67575c8e7b15e325b98ebaa89639b550

After creation, get the media port using pjmedia_stream_get_port():
https://www.pjsip.org/docs/latest-2/pjmedia/docs/html/group__PJMED__STRM.htm#gae3cb31df5aa921ef3085d5eb539af063

Add the media port to the confbridge with pjsua_conf_add_port():
https://www.pjsip.org/docs/latest-2/pjsip/docs/html/group__PJSUA__LIB__MEDIA.htm#ga833528c1019f4ab5c8fb216b4b5f788b

And then start the stream with pjmedia_stream_start():
https://www.pjsip.org/docs/latest-2/pjmedia/docs/html/group__PJMED__STRM.htm#ga93d59e3be009de86a3823303784d31a2

Start the underlying media transport with pjmedia_transport_media_start():
https://www.pjsip.org/docs/latest-2/pjmedia/docs/html/group__PJMEDIA__TRANSPORT.htm#ga74ab1c1b9b09d75865a231519bb58aa7

Connect the two streams using pjsua_conf_connect().
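
For orientation, the UDP transport in the first step carries RTP packets, each starting with the 12-byte fixed header defined in RFC 3550. The sketch below builds and parses that header with the Python standard library only. It is not pjmedia code, and the names RtpHeader, build_rtp_packet and parse_rtp_header are made up for the example. It ignores CSRC entries and header extensions.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def build_rtp_packet(payload_type: int, sequence: int, timestamp: int,
                     ssrc: int, payload: bytes) -> bytes:
    """Build an RTP packet: version 2, no padding, no extension, no CSRCs."""
    header = struct.pack(
        "!BBHII",
        2 << 6,                  # V=2, P=0, X=0, CC=0
        payload_type & 0x7F,     # M=0, 7-bit payload type
        sequence & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + payload


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=sequence,
        timestamp=timestamp,
        ssrc=ssrc,
    )


if __name__ == "__main__":
    # 160 bytes of payload is 20 ms of 8 kHz PCMU (payload type 0).
    packet = build_rtp_packet(0, 1, 160, 0x1234, b"\xff" * 160)
    print(parse_rtp_header(packet))
```

Payload type 0 here matches the "0" in the m=audio line from the original question.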

Tinkering with the SDP seems error-prone to me, which is why I try to
avoid it whenever I can.

All the best,
Andreas

Mišo Belica
Wed, Jun 24, 2020 11:38 AM

Thanks Andreas,

the Python bindings have the same API as the PJSUA2 C++ library. Still,
the C API is even lower level, so maybe I will need to use it
instead of Python :(

You described it correctly. My app receives the call, negotiates via SDP
and then should redirect the incoming RTP stream to the STT engine. After
the call ends, the app should close the RTP stream in the STT engine, and
that's all.

I checked the resources you sent and I think you are right that
modifying the SDP is not a good idea. Also, after reading the
resources and references I found this:
https://trac.pjsip.org/repos/wiki/FAQ#audio-man. Still, I think it's
overkill to implement a new port just to pass the traffic to another app.
I would like to somehow hook into the PJSUA2 initialization and, instead
of creating a new RTP socket in PJSUA2, simply create it in the
STT engine and provide the opened port to PJSUA2. The RTP stream
would be redirected and PJSUA2 wouldn't need to know about it.
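
Taken on its own, "passing the traffic to another app" on the same host amounts to relaying UDP datagrams from one port to another. The sketch below shows that idea with the Python standard library. It is not a PJSUA2 hook, and the thread does not confirm that PJSUA2 offers one. The function name forward_datagrams and its parameters are made up for the example.

```python
import socket
from typing import Tuple


def forward_datagrams(listen_sock: socket.socket,
                      dest: Tuple[str, int],
                      max_packets: int) -> int:
    """Forward up to max_packets datagrams from listen_sock to dest.

    Stops early if listen_sock has a timeout set and it expires.
    Returns the number of datagrams forwarded.
    """
    out_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    forwarded = 0
    try:
        while forwarded < max_packets:
            try:
                # 2048 bytes is ample for a typical RTP audio packet.
                data, _sender = listen_sock.recvfrom(2048)
            except socket.timeout:
                break
            out_sock.sendto(data, dest)
            forwarded += 1
    finally:
        out_sock.close()
    return forwarded
```

The payload is copied byte for byte, so the receiving side sees the same RTP packets but with the relay as the sender address.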

The steps you provided helped me understand PJSIP in more detail,
but unfortunately I still don't understand what you suggest. Creating
the streams and registering them with the conference bridge is done
automatically in PJSUA2, and I am not sure whether you suggest creating an
extra stream for RTP, so that the conference bridge acts as a kind of
proxy where RTP goes from one stream to another. Do I understand it
correctly, or is it something else?

Thanks in advance for the clarification and have a nice day :)

Andreas Wehrmann
Wed, Jun 24, 2020 12:27 PM

Hey there,

you're right, implementing a new media port would be overkill in this
case, because the functionality you need is already implemented in
another way.
I'll try to summarize what my solution would be, to give you an idea of
how it's supposed to work:

  1. Create a new RTP stream on a specific (UDP) port and set its remote
    endpoint to your STT engine
  2. Set up the STT engine so that it streams RTP to your local port
  3. Add the newly created RTP stream to the conference bridge
  4. When a call comes in and is established,
        use pjsua_conf_connect( conf_slot_of_call, conf_slot_of_stt_stream );
        and pjsua_conf_connect( conf_slot_of_stt_stream, conf_slot_of_call );
        to interconnect the audio streams.
  5. The rest should be handled by your STT engine, I guess
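
Step 4 uses two calls because pjsua_conf_connect() sets up media flow in one direction only, from a source slot to a sink slot. The toy model below shows that routing idea in plain Python. It is not the PJSIP conference bridge: the ToyBridge class, its methods and the string "frames" are made up for the example.

```python
from typing import Dict, List, Set


class ToyBridge:
    """Toy model of a conference bridge: slots plus one-way routes."""

    def __init__(self) -> None:
        self._names: List[str] = []           # slot index -> port name
        self._routes: Dict[int, Set[int]] = {}  # source slot -> sink slots

    def add_port(self, name: str) -> int:
        """Register a port and return its slot number."""
        self._names.append(name)
        return len(self._names) - 1

    def connect(self, source: int, sink: int) -> None:
        """Route audio from source to sink (one direction only)."""
        for slot in (source, sink):
            if not 0 <= slot < len(self._names):
                raise ValueError("unknown slot: %r" % slot)
        self._routes.setdefault(source, set()).add(sink)

    def deliver(self, source: int, frame: str) -> Dict[str, str]:
        """Return which ports would receive a frame sent by source."""
        sinks = sorted(self._routes.get(source, set()))
        return {self._names[sink]: frame for sink in sinks}


bridge = ToyBridge()
call_slot = bridge.add_port("call")
stt_slot = bridge.add_port("stt_stream")

# Two one-way connections give a two-way path, as in step 4.
bridge.connect(call_slot, stt_slot)
bridge.connect(stt_slot, call_slot)

if __name__ == "__main__":
    print(bridge.deliver(call_slot, "caller audio"))
    print(bridge.deliver(stt_slot, "stt audio"))
```

If only the caller's audio needs to reach the STT engine, the first connection alone would be enough in this model.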

All the best,
Andreas
