I’ve created a Presentation that goes over these points as well:

I found this awesome work, which uses Google’s speech API to transcribe the audio to text:

I now use this at the very end of the process, to verify that the text heard is what is expected!

I’ve been working on this VOIP/SIP automation framework for a few months now. I started with a Cucumber framework, and then added on with some VOIP/SIP specific tools like SIPP and SIPCLI.

I got to the point where the test harnesses I built with these tools would use Jenkins to drive traffic to a phone number (at the push of a button, on a schedule, or on a build commit) and verify that the call reached it by the acknowledgements sent back. But what if the phone number was routed to the wrong destination, and still sent back acknowledgements?

At that point I used TollFreeForwarding.com’s technology to set an email alert as an endpoint on a phone number. For example, you call 888-888-8888 and you get an IVR. You press 1 and are sent to a voicemail; you pass in audio and hang up. Then TollFreeForwarding.com emails the recording to the email address configured on the account.

It was better, but it required a voicemail-to-email application at every endpoint. It also doesn’t verify that audio actually occurred on the call. What if no audio played back? Or what if there was enough jitter to make it unintelligible?

To further this testing, I started thinking of recording the call and using some sort of analysis of the recording to verify it’s what was expected.  

This is my first draft at answering that need.  It can be improved.  But it’s a step in the right direction.

What I’m doing

  1. This automation dials a number, with a known IVR or greeting.  
  2. It does a packet capture during the recording
  3. It filters out the RTP channels from the packet capture and then creates a wav out of the pcap file.
  4. Once there is a wav file, it runs diagnostics on it… generating some visual graphs like the image on this blog… but more importantly (and more useful) it generates audio information that I use as a footprint for the audio playback.
  5. This audio is also sent to Google, which transcribes it and sends back the text, which is then compared to the expected string.

Tools used

  1. sipp to drive an automated command line sip call
  2. tshark (command line version of wireshark)
  3. jenkins (for the GUI to drive and schedule these tests)
  4. sox (linux based audio conversion and analysis tool)
  5. some shell scripting

How it Works

The test has a parent job that kicks off two sub jobs, which run simultaneously. One places a call to a phone number with a recorded greeting/IVR. The other runs a shell script that contains the test itself: it uses tshark to record the packets and filter the RTP, then uses sox to convert the raw audio to a wav and do some analysis on the wav.
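As a rough sketch of the same idea outside Jenkins, two tasks can be started side by side from a single shell script and then waited on. The functions capture_job and call_job are hypothetical stand-ins for the two sub jobs, not part of my actual setup:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two sub jobs; in a real run these would
# wrap the packet capture script and the automated SIP call.
capture_job() { echo "capture: started"; sleep 1; echo "capture: finished"; }
call_job()    { echo "call: placed"; }

# Launch both in the background so they run at the same time.
capture_job &
capture_pid=$!
call_job &
call_pid=$!

# Wait for both, and remember whether either one failed.
status=0
wait "$capture_pid" || status=1
wait "$call_pid" || status=1
echo "exit status: $status"
```

Jenkins handles this for me through the parent job, so treat the above only as an illustration of the timing: the capture has to be running for the whole duration of the call.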

The Shell Script

First I set tshark to record for a specific duration, that I think will encompass the call:
tshark -a duration:20 -w /jenkins/userContent/sip_1call.pcap

I assign a variable to a tshark task that scans the RTP packets and finds the hex SSRC value of the RTP stream (I learned these three parts from an online tutorial, but lost the bookmark):
ssrc=$(tshark -n -r /jenkins/userContent/sip_1call.pcap -R rtp -T fields -e rtp.ssrc -Eseparator=, | sort -u | awk 'FNR == 1 {print}')

The above would return a hex value like:

Which is followed by:
sudo tshark -n -r /jenkins/userContent/sip_1call.pcap -R rtp -R "rtp.ssrc == $ssrc" -T fields -e rtp.payload | tee payloads

The above filters on the hex value captured previously, and writes the matching RTP payloads to a file named payloads.

Finally, we have a for statement in the shell script to convert the payload value from above, to a raw audio file:
for payload in `cat payloads`; do IFS=:; for byte in $payload; do printf "\\x$byte" >> /jenkins/userContent/sip_1call.raw; done; done
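To make the conversion step easier to follow, here is a minimal sketch of the same hex-to-bytes idea as a function, run on a made-up payload line instead of a real capture. The name hex_payloads_to_raw is hypothetical, and the sample bytes are just the ASCII codes for "Hello":

```shell
#!/usr/bin/env bash
# Hypothetical helper: read colon-separated hex bytes (one payload per line)
# on stdin and write the corresponding raw bytes to stdout.
hex_payloads_to_raw() {
    local line byte
    local -a bytes
    while IFS= read -r line; do
        # Split the line on ':' into an array of two-digit hex strings.
        IFS=: read -r -a bytes <<< "$line"
        for byte in "${bytes[@]}"; do
            # "\\x48" becomes the printf escape \x48, i.e. one raw byte.
            printf "\\x${byte}"
        done
    done
}

# Made-up sample payload, not taken from a real call.
printf '48:65:6c:6c:6f\n' | hex_payloads_to_raw
```

Setting IFS only for the read keeps the separator change from leaking into the rest of the script, which the one-liner does not guard against.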

At this point I had a raw audio file. I found a Linux tool called sox that was a good fit for this conversion, so I installed it and added these lines to my script.
Sox is then invoked to convert the raw audio to a wav:
sox -t raw -r 8000 -v 4 -c 1 -U /jenkins/userContent/sip_1call.raw /jenkins/userContent/sip_1call.wav

Then I run a couple more Sox commands:
This one creates stats, which Jenkins captures in the log file of the test run:
sox /jenkins/userContent/sip_1call.wav -n stat

The stats generated will look like this:

Samples read:             15680
Length (seconds): 1.960000
Scaled by: 2147483647.0
Maximum amplitude: 0.425659
Minimum amplitude: -0.285034
Midline amplitude: 0.070313
Mean norm: 0.043354
Mean amplitude: -0.000055
RMS amplitude: 0.070984
Maximum delta: 0.243896
Minimum delta: 0.000000
Mean delta: 0.019919
RMS delta: 0.034190
Rough frequency: 613
Volume adjustment: 2.349

Two of these values, the maximum amplitude and the rough frequency, seem to be consistent for the same audio. At this point, that’s what the test assertion is based on. I have a better plan in the works for a future upgrade to this test. But for now, I’m using the rough frequency and max amplitude to determine the pass / fail criteria.

Is it perfect? No. There is potential for false negatives. The rough frequency *could* change, but so far it hasn’t for the same audio I expect.
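A pass / fail check on those two values could be sketched roughly like this. The helper names stat_value and within_tolerance are hypothetical, the tolerances are assumptions picked for illustration, and the sample text is copied from the stats shown above rather than read from a live sox run:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value for a given label from sox stat
# output on stdin. Runs of spaces in the label are collapsed before comparing.
stat_value() {
    awk -F: -v label="$1" '{
        key = $1
        gsub(/[ \t]+/, " ", key)
        if (key == label) { gsub(/[ \t]/, "", $2); print $2 }
    }'
}

# Hypothetical helper: succeed if |actual - expected| <= tolerance.
within_tolerance() {
    awk -v a="$1" -v e="$2" -v t="$3" \
        'BEGIN { d = a - e; if (d < 0) d = -d; exit !(d <= t) }'
}

# Sample values copied from the stats above. sox prints stat output on
# stderr, so a real run would capture it with 2>&1.
stats='Maximum amplitude: 0.425659
Rough frequency: 613'

freq=$(printf '%s\n' "$stats" | stat_value "Rough frequency")
amp=$(printf '%s\n' "$stats" | stat_value "Maximum amplitude")

# Assumed tolerances: 10 for the frequency, 0.05 for the amplitude.
if within_tolerance "$freq" 613 10 && within_tolerance "$amp" 0.425659 0.05; then
    echo "PASS"
else
    echo "FAIL"
fi
```

Comparing within a tolerance rather than for exact equality leaves a little room for small variations between calls, at the cost of having to choose the tolerance.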

If you’re into spectrograms (and who isn’t?), sox will also output one if you like. I end the shell script with this:

sox /jenkins/userContent/sip_1call.wav -n spectrogram -y 2 -l -o /jenkins/userContent/sip_1call.png

If anyone has any other tools that can pull out more data, please let me know.

The Upshot?

One shell script, called by Jenkins, running 3 tools gets this job done.

Verify Audio via Speech To Text

A few people approached me and mentioned that rough frequency may not remain constant as the test call goes through different hops. So I began to investigate this some more, and I found someone who had created a way to use a shell script to send audio files to Google for transcription.

I modified his script a little to work for my needs, and added a text assertion. If the text fails the comparison, I exit the script with an error code, which forces Jenkins to regard the run as a total failure.

Here’s the part I added to the bottom of my previous script:
echo "1 - Translate with SOX - Convert WAV to FLAC with 16000"
sox /jenkins/userContent/sip_1call.wav input.flac rate 16k
echo "2 - Submit to Google Voice API"
wget -q -U "Mozilla/5.0" --post-file input.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v1/recognize?lang=en-us&client=chromium" > output.ret
echo "3 - Extract recognized text"

cat output.ret | sed 's/.*utterance":"//' | sed 's/","confidence.*//' > output.txt
echo "4 - Display text"
a=`cat output.txt`
echo "$a"
if [ "$a" = "tollfreeforwarding.com" ]; then
        echo "Verified audio is tollfreeforwarding.com"
else
        echo "FAIL audio is not tollfreeforwarding.com"
        # Any non-zero status fails the Jenkins build; statuses only range
        # from 0 to 255, so a value like 666 would wrap around.
        exit 1
fi


In my scenario, I’ve seeded the phone greeting on the number that is called with an announcement that says, “Toll Free Forwarding Dot Com”, which Google correctly transcribes as “tollfreeforwarding.com”, and I validate against that.
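The extract-and-compare step can be tried without calling the service at all. Below is a minimal sketch that runs the same sed expressions over a made-up response. The JSON in sample is an assumption about the response shape, inferred only from the fields the sed expressions look for, and extract_utterance and check_transcript are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the recognized text out of a response on stdin,
# using the same two sed substitutions as the script above.
extract_utterance() {
    sed -e 's/.*"utterance":"//' -e 's/","confidence.*//'
}

# Hypothetical helper: compare the heard text against the expected text.
# Returns non-zero on a mismatch so the caller can fail the build.
check_transcript() {
    local expected=$1 actual=$2
    if [ "$actual" = "$expected" ]; then
        echo "Verified audio is $expected"
        return 0
    fi
    echo "FAIL audio is not $expected (heard: $actual)"
    return 1
}

# Made-up sample response, assumed shape only.
sample='{"status":0,"hypotheses":[{"utterance":"tollfreeforwarding.com","confidence":0.9}]}'

heard=$(printf '%s\n' "$sample" | extract_utterance)
check_transcript "tollfreeforwarding.com" "$heard"
```

Printing what was actually heard in the failure message makes a red build in Jenkins much quicker to diagnose.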


4 Responses

  1. Hi Brian,

    Great blog! It’s a treasure trove of useful information that I will be implementing as well.

    Do you plan on open-sourcing the suite? I’m sure I’ll be sending some pull requests in no time 🙂
