OLD | NEW |
1 #Conversational Speech generator tool | 1 #Conversational Speech generator tool |
2 | 2 |
3 Python tool to generate multiple-end audio tracks to simulate conversational | 3 Python tool to generate multiple-end audio tracks to simulate conversational |
4 speech with two or more participants. | 4 speech with two or more participants. |
5 | 5 |
6 The input to the tool is a directory containing a number of audio tracks and | 6 The input to the tool is a directory containing a number of audio tracks and |
7 a text file indicating how to time the sequence of speech turns (see the Example | 7 a text file indicating how to time the sequence of speech turns (see the Example |
8 section). | 8 section). |
9 | 9 |
10 Since the timing of the speaking turns is specified by the user, the generated | 10 Since the timing of the speaking turns is specified by the user, the generated |
11 tracks may not be suitable for testing scenarios in which there is unpredictable | 11 tracks may not be suitable for testing scenarios in which there is unpredictable |
12 network delay (e.g., end-to-end RTC assessment). | 12 network delay (e.g., end-to-end RTC assessment). |
13 | 13 |
14 Instead, the generated pairs can be used when the delay is constant (obviously | 14 Instead, the generated pairs can be used when the delay is constant (obviously |
15 including the case in which there is no delay). | 15 including the case in which there is no delay). |
16 For instance, echo cancellation in the APM module can be evaluated using two-end | 16 For instance, echo cancellation in the APM module can be evaluated using two-end |
17 audio tracks as input and reverse input. | 17 audio tracks as input and reverse input. |
18 | 18 |
19 By indicating negative and positive time offsets, one can reproduce cross-talk | 19 By indicating negative and positive time offsets, one can reproduce cross-talk |
20 and silence in the conversation. | 20 and silence in the conversation. |
21 | 21 |
22 IMPORTANT: **the whole code has not been landed yet.** | 22 IMPORTANT: **the whole code has not been landed yet.** |
23 | 23 |
24 ###Example | 24 ###Example |
25 | 25 |
26 For each end, there is a set of audio tracks, e.g., a1, a2, a3 and a4 (speaker A) | 26 For each end, there is a set of audio tracks, e.g., a1, a2, a3 and a4 (speaker A) |
27 and b1, b2 (speaker B). | 27 and b1, b2 (speaker B). |
28 The text file with the timing information may look like this: | 28 The text file with the timing information may look like this: |
29 ``` A a1 0 | 29 |
30 B b1 0 | 30 ``` |
31 A a2 100 | 31 A a1 0 |
32 B b2 -200 | 32 B b1 0 |
33 A a3 0 | 33 A a2 100 |
34 A a4 0``` | 34 B b2 -200 |
| 35 A a3 0 |
| 36 A a4 0 |
| 37 ``` |
| 38 |
35 The first column indicates the speaker name, the second contains the audio track | 39 The first column indicates the speaker name, the second contains the audio track |
36 file names, and the third the offsets (in milliseconds) used to concatenate the | 40 file names, and the third the offsets (in milliseconds) used to concatenate the |
37 chunks. | 41 chunks. |
38 | 42 |
39 Assume that all the audio tracks in the example above are 1000 ms long. | 43 Assume that all the audio tracks in the example above are 1000 ms long. |
40 The tool will then generate two tracks (A and B) that look like this: | 44 The tool will then generate two tracks (A and B) that look like this: |
41 | 45 |
42 ```Track A: | 46 **Track A** |
| 47 ``` |
43 a1 (1000 ms) | 48 a1 (1000 ms) |
44 silence (1100 ms) | 49 silence (1100 ms) |
45 a2 (1000 ms) | 50 a2 (1000 ms) |
46 silence (800 ms) | 51 silence (800 ms) |
47 a3 (1000 ms) | 52 a3 (1000 ms) |
48 a4 (1000 ms)``` | 53 a4 (1000 ms) |
| 54 ``` |
49 | 55 |
50 ```Track B: | 56 **Track B** |
| 57 ``` |
51 silence (1000 ms) | 58 silence (1000 ms) |
52 b1 (1000 ms) | 59 b1 (1000 ms) |
53 silence (900 ms) | 60 silence (900 ms) |
54 b2 (1000 ms) | 61 b2 (1000 ms) |
55 silence (2000 ms)``` | 62 silence (2000 ms) |
| 63 ``` |
56 | 64 |
57 The two tracks can also be visualized as follows (one character represents | 65 The two tracks can also be visualized as follows (one character represents |
58 100 ms, "." is silence and "*" is speech). | 66 100 ms, "." is silence and "*" is speech). |
59 | 67 |
60 ```t: 0 1 2 3 4 5 6 (s) | 68 ``` |
| 69 t: 0 1 2 3 4 5 6 (s) |
61 A: **********...........**********........******************** | 70 A: **********...........**********........******************** |
62 B: ..........**********.........**********.................... | 71 B: ..........**********.........**********.................... |
63 ^ 200 ms cross-talk | 72 ^ 200 ms cross-talk |
64 100 ms silence ^``` | 73 100 ms silence ^ |
| 74 ``` |
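The timing rules in the README example can be checked with a small script. The following is a minimal sketch, not the tool's actual implementation: the function names `parse_timing`, `schedule` and `layout` are hypothetical. It assumes that each offset is measured from the end of the previous turn (regardless of speaker) and that a speaker never overlaps their own previous track. Both assumptions are consistent with the example, but the README does not state them explicitly.

```python
from collections import defaultdict


def parse_timing(text):
    """Parse 'speaker track offset_ms' lines into (speaker, track, offset) tuples."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        speaker, track, offset = line.split()
        turns.append((speaker, track, int(offset)))
    return turns


def schedule(turns, durations_ms):
    """Return {speaker: [(track, begin_ms, end_ms), ...]}.

    Assumption: each offset is relative to the end of the previous turn,
    whoever spoke it. A negative offset yields cross-talk, a positive one
    yields silence.
    """
    placed = defaultdict(list)
    previous_end = 0
    for speaker, track, offset in turns:
        begin = previous_end + offset
        end = begin + durations_ms[track]
        placed[speaker].append((track, begin, end))
        previous_end = end
    return dict(placed)


def layout(placed):
    """Expand each schedule into (label, duration_ms) segments padded with silence.

    Assumption: a speaker's own tracks never overlap each other.
    """
    total = max(end for items in placed.values() for _, _, end in items)
    tracks = {}
    for speaker, items in placed.items():
        segments, cursor = [], 0
        for track, begin, end in items:
            if begin > cursor:
                segments.append(("silence", begin - cursor))
            segments.append((track, end - begin))
            cursor = end
        if cursor < total:
            segments.append(("silence", total - cursor))
        tracks[speaker] = segments
    return tracks


# Timing file from the example; every track is assumed to last 1000 ms.
timing = "A a1 0\nB b1 0\nA a2 100\nB b2 -200\nA a3 0\nA a4 0\n"
durations = {name: 1000 for name in ("a1", "a2", "a3", "a4", "b1", "b2")}
tracks = layout(schedule(parse_timing(timing), durations))

for speaker, segments in sorted(tracks.items()):
    print(speaker, segments)
```

Under these assumptions the script reproduces the Track A and Track B listings from the example, including the 200 ms of cross-talk between a2 and b2.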