Rebuilding an audio stream programmatically

When testing audio on a VOIP connection, the audio itself travels in RTP packets. It can be rebuilt by putting the RTP packets in order and writing their payloads out as a raw audio file. Wireshark has a GUI that does all this fairly easily: it can detect VOIP calls and play back the audio. Doing the same thing programmatically requires a different strategy.

One approach is to use tshark, the command-line version of Wireshark. With tshark we can run filters and field extractions against saved pcaps, which is enough to rebuild audio streams from VOIP sessions.

Shell Script Version

Many years ago I tackled the problem of checking audio quality by rebuilding the VOIP audio with a shell script and comparing it to the original source audio. In other words, I had a number that, when called, would play a known audio file. Comparing that file to the audio recorded from the VOIP call gave me a measure of call quality. Today, utilities like VOIP Monitor compute their own metrics very efficiently. For my purposes, though, I needed something that worked with my carrier call quality testing.

Initially I cobbled a shell script together, using some guidance online:

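# Grab the first SSRC (stream identifier) found in the capture.
# Current tshark releases take display filters via -Y; -R now requires two-pass mode (-2).
ssrc=$(tshark -n -r voip.pcap -Y rtp -T fields -e rtp.ssrc -Eseparator=, | sort -u | awk 'FNR ==1 {print}')

echo "SSRC: $ssrc"

# Dump the payload of every RTP packet in that stream as hex text.
tshark -n -r voip.pcap -Y "rtp && rtp.ssrc == $ssrc" -T fields -e rtp.payload | tee payloads

# Turn the colon-separated hex bytes into raw binary.
for payload in `cat payloads`; do IFS=:; for byte in $payload; do printf "\\x$byte" >> server_script.raw; done; done

# Convert the raw mu-law audio (8 kHz, mono) to a wav file.
sox -t raw -r 8000 -v 4 -c 1 -U server_script.raw server_script.wav
echo 'sox has converted pcap to wav file'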

The shell script above requires tshark and SoX to be installed on the server. RTP packets are filtered by their SSRC field, which identifies a single stream within the call, and the payloads of that stream are written out in capture order. At the end, a raw audio file is produced, which is then converted to a wav (for convenience) using SoX.
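If packets were captured out of order, the RTP sequence number can be used to re-sort them within a stream. Below is a minimal sketch of that idea, assuming each packet is available as the raw bytes of the RTP datagram (the UDP payload) with the fixed 12-byte header defined in RFC 3550. The function names and the make_packet helper are illustrative only, and padding and 16-bit sequence wraparound are ignored for brevity.

```python
import struct


def parse_rtp(datagram):
    """Return (ssrc, sequence, payload) from one raw RTP datagram."""
    first, _marker_pt, seq, _timestamp, ssrc = struct.unpack(
        "!BBHII", datagram[:12]
    )
    csrc_count = first & 0x0F            # low 4 bits: number of CSRC entries
    has_extension = bool(first & 0x10)   # X bit
    offset = 12 + 4 * csrc_count
    if has_extension:
        # Extension header: 16-bit profile id, then length in 32-bit words.
        (ext_words,) = struct.unpack("!H", datagram[offset + 2:offset + 4])
        offset += 4 + 4 * ext_words
    return ssrc, seq, datagram[offset:]


def rebuild_stream(datagrams, ssrc):
    """Concatenate the payloads of one SSRC, ordered by sequence number."""
    parsed = [parse_rtp(d) for d in datagrams]
    selected = [(seq, payload) for s, seq, payload in parsed if s == ssrc]
    selected.sort(key=lambda item: item[0])
    return b"".join(payload for _, payload in selected)


def make_packet(ssrc, seq, payload):
    """Hypothetical helper: build a bare RTP datagram for the demo below."""
    return struct.pack("!BBHII", 0x80, 0, seq, seq * 160, ssrc) + payload


packets = [
    make_packet(0x1234, 2, b"\x02"),
    make_packet(0x9999, 1, b"\xff"),   # a different stream, filtered out
    make_packet(0x1234, 1, b"\x01"),
]
print(rebuild_stream(packets, 0x1234))  # b'\x01\x02'
```

Filtering on the SSRC first and sorting second keeps the two concerns separate: the SSRC picks the stream, the sequence number orders it.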

It worked, but it wasn’t very readable, and I wanted a cleaner solution in Python.

Python Version

Python has a library called “pyshark” that is a wrapper for tshark. Pyshark requires tshark to be installed prior to use.

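import pyshark

rtp_list = []
cap = pyshark.FileCapture('voip.pcap', display_filter='rtp')
for i in cap:
    try:
        # Index 3 assumes the usual Ethernet/IP/UDP/RTP layering.
        rtp = i[3]
        if rtp.payload:
            rtp_list.append(rtp.payload.split(":"))
    except (IndexError, AttributeError):
        # Skip packets with fewer layers or without a payload field.
        pass
cap.close()

# Join each packet's hex bytes with spaces and write them out as binary.
with open('my_audio.raw', 'wb') as raw_audio:
    for rtp_packet in rtp_list:
        packet = " ".join(rtp_packet)
        audio = bytearray.fromhex(packet)
        raw_audio.write(audio)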

There might be a better way, but that’s how I got it done in Python. I used the pyshark FileCapture class to load the pcap and filter on the RTP layer.

I iterate over the packets in the pcap and pull out the RTP layer by index (rtp = i[3]). If the layer has an rtp.payload, I push its content onto a list. The raw payload looks like “FF:FF:DE:AB:FF:”. Following the logic of the shell script, I split on the colon and append the result to the list called rtp_list.

At this point I had a list of lists: each line of payload output was its own entry in a master list. This is where I think the code could be improved, but for now it works well enough. My solution was to iterate over the master list, pull out each entry (each line), and join its hex values with a space between them…

So it looks more like:

93 90 90 97 97 96 ee e8 95 ec e3 e3 e4 e4 fb f0 f9 fc ff f2 f9 db d6 f4 d7 56 77 72 7c 6d 6d 6e 14 11 13 1d 18 1b 1a 04 07 07 01 01 06 06 06 07 04 1a 1b 1f 16 11 16 61 7b 64 75 d3 cd f5 e7 e0 e6 eb ee e8 eb ec e8 ef e2 e8 e6 ed e2 c8 e2 fd d3 f0 4e 79 71 7e 62 72 7a 6e 51 78 66 dd 7a 72 54 76 4e d0 41 5b fd d5 dc f8 c4 fe e7 fc e7 e1 e7 e3 e3 e4 e3 e4 fd f8 f5 db c7 df d9 f1 f3 fb ed ec 94 90 90 9d 9e 9e 9e 9e 9e 99 9c 92 93 96 94 eb ed f9 c1 53 73 64 63 69 69 69 14 15 15 6a 53 53 5d 50 51 50 51 d5 d1 d3 d6 51 d5 54 5d 55 d5 d6 50 5f 5c 51 57 58 52 55 d6 d6 54 d7 d2 d8 da 55 d7 d7 5c 5d 53 52 5f 53 5d 53 5c 53 d4 54 51 df dc dc c4 c2 cb f7 f5 cc c3 c9 f4 f5 ce cb cf cf c4 d8 d8 d3 d5 d3 dd d0 55 5c 59 5c 5f 44 46 5d 5d 5e 5c 5e 5a 5e 5c 51 50 5d 51 d1 d6 d7 d8 d9 c7 c4 c7 c6 c0 cf c8 f5 c3 c1 c3 c3 c3 c3 d8 d8 da c5 c5 c0 d8 df dd c4 dd de df df de c5 dd d2 dc db c5 de dd db de dc d4 54 d5 d4 56 53 5c 5e 51 5c 51 5f 58 5a 44 44 44 40 4c 41 47 5a 15 69 6c 68 6d 61 60 67 79 7b 66 67 67 65 7a 7c 7e 64 70 7b 66 7f 73 7c 76 57 d5 56 f1 f8 f6 f8 e7 e0 e4 e6 e3 fb e4 fc c2 f7 c6 58 4f 74 7f 61 63 62 69 68 15 14 15 17 14 6a 15 17 68 63 14 6a 67 6d 68 7c 65 69 72 46 7d 7d d7 c6 5e d2 fb e4 fe e3 ea e8 ed 95 97 ef ee ea ea e9 ef e9 ef e1 fa e7 fb d8 de f1 dc 71 56 db 71 79 54 54 67 4a c1 4e 7d 41 41 72 40 d2 c1 f3 e6 e8 94 96 93 9c 9f 99 9a 85 85 9a 9a 9a 99 9c 9c 93 ea ea ed f0 d2 4f 78 6f 15 11 13 10 12 12 10 14 14 16 68 69 58 5f 43 45 5b 45 44 44 46 43 40 4f 47 47 40 44 5b 5c 5c 57 d7 d4 d2 d2 d3 d2 d9 dc d8 d8 d9 da c3 cd c3 ce f5 ca ce cc c0 c2 c1 c6 c6 da dc dd de dc d0 d0 56 5c 58 46 41 45 5b 58 45 58 58 59 44 5b 58 59 5c 53 55 d5 d6 d4 54 51 50 51 5c 5a 5c d5 5f 5d 54 55 51 d5 d7 d0 d0 d7 d7 d4 d5 d5 53 53 50 57 51 d7 57 51 56 57 54 51 d2 d4 51 5c 52 52 5a 5f 52 53 57 52 5e 52 5d 50 57 d3 d4 d4 d0 d0 d3 d0 50 50 51 46 46 45 42 4c 42 41 43 4e 75 4e 4c 41 40 44 5b 58 5a 53 52 d7 d0 d6 d6 dc 14 6b 6e 6a 17 69 6a 16 14 6b 6a 15 6d 61 7a 64 67 47 74 7d d4 c0 4d 75 da dc 
7d 66 4d 78 11 14 69 12 1b 13 10 18 1f 10 11 13 11 63 6f 6b 64 40 48 61 59 ca 74 71 ca f5 71 d1 e6 f0 c0 e3 97 ef e2 97 9d 92 93 99 85 85 9e 9a 86 9b 9c 9a 9a 90 94 92 97 e7 e7 e2 e7 c6 f0 ed e4 ff ec e8 e1 e7 ec e3 e7 e7 f8 fd cc 57 42 72 67 6d 6e 68 15 16 16 17 13 13 14 15 14 6d 7d 74 42 56 f6 e4 f0 e0 94 ee eb 91 96 ef e9 96 e1 d8 e6 f2 65 46 ca 6b 11 4a 62 1a 10 7e 1d 04 68 6d 04 05 67 6a 1a 15

Finally, the output above is converted to bytes with bytearray.fromhex and written to a file. This creates a raw audio file containing the audio from the call that was captured in the pcap.
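Two of these steps can probably be tightened up in pure Python. The sketch below converts each payload string to bytes directly, without the intermediate list of lists, and wraps the result in a WAV container much as sox did in the shell version. It assumes the payload is G.711 mu-law at 8 kHz mono (matching the -U and -r 8000 flags passed to sox above). The function names are illustrative, and the -v 4 volume boost is not applied.

```python
import struct
import wave


def payload_to_bytes(payload):
    """Convert 'ff:de:ab' (or plain 'ffdeab') hex text to bytes."""
    return bytes.fromhex(payload.replace(":", ""))


def ulaw_to_pcm16(value):
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    value = ~value & 0xFF                      # mu-law bytes are stored inverted
    magnitude = (((value & 0x0F) << 3) + 0x84) << ((value >> 4) & 0x07)
    magnitude -= 0x84                          # remove the encoder bias
    return -magnitude if value & 0x80 else magnitude


def write_ulaw_wav(path, payloads, rate=8000):
    """Write mu-law payload strings to a 16-bit mono WAV file."""
    raw = b"".join(payload_to_bytes(p) for p in payloads)
    samples = [ulaw_to_pcm16(b) for b in raw]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)                    # 2 bytes = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

In the pyshark loop, this would mean collecting rtp.payload strings as they are and calling write_ulaw_wav('my_audio.wav', payloads) at the end. If the call used a different codec (A-law, G.729, Opus), the decode step would need to change accordingly.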
