The following 12 samples are from the listening test in this paper. The first two samples are the ones analyzed in the fitness scape plots (Figure 7: the first sample in the top row, the second sample in the bottom row). In each sample, the prompt occupies roughly the first 5 seconds.
GPT2
output.gen_for_listening.gpt2-MN-lm-no-embeds.s66.t7619.prompt-64.1689246739.028448.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s141.t3899.prompt-64.1689246746.2421086.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s446.t3796.prompt-64.1689246753.1930459.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s591.t4406.prompt-64.1689246756.2442229.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s801.t4389.prompt-64.1689246759.9431634.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s880.t3558.prompt-64.1689246763.4680882.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s888.t4677.prompt-64.1689246767.1705449.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s992.t4495.prompt-64.1689246770.742311.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s1059.t5696.prompt-64.1689246773.986176.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s1104.t2417.prompt-64.1689246777.9444382.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s1124.t8701.prompt-64.1689246780.3382137.wav
output.gen_for_listening.gpt2-MN-lm-no-embeds.s1183.t4920.prompt-64.1689246786.181727.wav
GPT2-RE
output.gen_for_listening.gpt2-MN-lm-all-embeds-init-random.s66.t9619.prompt-64.1689246954.481215.wav
GPT2-SE
GT
output.gen_for_listening.GT.s66.t4675.prompt-64.1689247121.2217507.wav
output.gen_for_listening.GT.s141.t2555.prompt-64.1689247130.4810164.wav
output.gen_for_listening.GT.s446.t2099.prompt-64.1689247132.7888298.wav
output.gen_for_listening.GT.s591.t2056.prompt-64.1689247134.8781655.wav
output.gen_for_listening.GT.s801.t4983.prompt-64.1689247139.1909335.wav
output.gen_for_listening.GT.s880.t3659.prompt-64.1689247144.1085837.wav
output.gen_for_listening.GT.s888.t2192.prompt-64.1689247147.0149174.wav
output.gen_for_listening.GT.s992.t6129.prompt-64.1689247148.9846585.wav
output.gen_for_listening.GT.s1059.t2379.prompt-64.1689247153.4739404.wav
output.gen_for_listening.GT.s1104.t2639.prompt-64.1689247158.2874134.wav
output.gen_for_listening.GT.s1124.t5772.prompt-64.1689247160.6634374.wav
output.gen_for_listening.GT.s1183.t1657.prompt-64.1689247171.1830833.wav
Prompt: First 8 measures from ASAP dataset (Foscarin et al., 2020)
input.bach_fugue_bwv893.100.l0.gpt2-MN-lm-no-embeds.1677051380.389268.wav
Full Song: Piano performance from ASAP dataset (Foscarin et al., 2020)
bach_fugue_bwv893.1677051537.745666.wav
GPT2
output.bach_fugue_bwv893.100.l0.gpt2-MN-lm-no-embeds.1677051379.848664.wav
GPT2-RE
output.bach_fugue_bwv893.100.l0.gpt2-MN-lm-all-embeds-init-random.1677051414.538455.wav
GPT2-SE
output.bach_fugue_bwv893.100.l0.gpt2-MN-lm-all-embeds-init-time-pc.1677051443.5603802.wav
The following samples demonstrate three examples of step-by-step generation loops:
Single-loop generation from the given prompt using only GPT2-RE.
Three-loop generation from multiple models, where the first and third loops use GPT2-RE and the second loop uses GPT2-SE to enhance repetition in the middle of the song.
Three-loop generation where each GPT2-RE generation loop except the last is followed by human revision.
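The three loop types above share one control flow: each loop extends the current sequence, and an optional revision step runs between loops but not after the last one. The sketch below illustrates only that control flow and is not the authors' implementation. The function `run_loops`, the `Generator` and `Reviser` signatures, and the stand-in models `fake_re` and `fake_se` are all hypothetical names introduced for illustration; the real GPT2-RE and GPT2-SE models and the human revision step are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence

Tokens = List[int]
# Hypothetical signature: (context, n_new) -> newly generated tokens only.
Generator = Callable[[Tokens, int], Tokens]
# Hypothetical signature: full sequence in, revised full sequence out.
Reviser = Callable[[Tokens], Tokens]


def run_loops(
    prompt: Tokens,
    generators: Sequence[Generator],
    lengths: Sequence[int],
    reviser: Optional[Reviser] = None,
) -> Tokens:
    """Run one generation loop per generator, feeding each output to the next loop."""
    if len(generators) != len(lengths):
        raise ValueError("need exactly one length per generator")
    sequence = list(prompt)
    last = len(generators) - 1
    for i, (generate, n_new) in enumerate(zip(generators, lengths)):
        # Extend the sequence with the continuation produced by this loop's model.
        sequence = sequence + generate(sequence, n_new)
        # Revision (a stand-in for human editing) is skipped after the final loop.
        if reviser is not None and i != last:
            sequence = reviser(sequence)
    return sequence


# Stand-in "models" (assumptions): they emit a constant token so the flow is visible.
def fake_re(context: Tokens, n_new: int) -> Tokens:
    return [1] * n_new


def fake_se(context: Tokens, n_new: int) -> Tokens:
    return [2] * n_new


prompt = [0, 0]
# 1) Single loop with one model.
single = run_loops(prompt, [fake_re], [3])
# 2) Three loops, with a different model in the middle loop.
multi = run_loops(prompt, [fake_re, fake_se, fake_re], [2, 2, 2])
# 3) Three loops with a revision after every loop except the last.
revised = run_loops(
    prompt, [fake_re] * 3, [2, 2, 2], reviser=lambda s: s[:-1] + [9]
)
```

In this toy run the revision simply overwrites the last token with `9`, which makes it easy to see that the final loop's output is left unrevised.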
(“s.” in parentheses denotes seconds.)
Prompt (~ 5s.)
Generation Type
1st Output (< 15s.)
1st Revision (< 15s.)
2nd Output (< 30s.)
2nd Revision (< 30s.)
Final Output (< 60s.)
input.multi-model.s0.100.l0.1676961835.0617409.wav
Single-Loop