So it looks like newer generation diffsinger models now have linguistic models that take in tokens, word divisions and word durations where the output is encoder_out and x_masks which then feed to the duration.onnx model
Example below(please tell me the if zeroes are needed in the below example)
results = linguistic_model.run(None, {
"tokens":[[26, 1, 22, 35, 11]] ,
"word_div": [[3,2,0,0,0]],
"word_dur": [[48,24,0,0,0]]
})
Happy to get your thoughts, thank you!
So it looks like newer generation diffsinger models now have linguistic models that take in tokens, word divisions and word durations where the output is encoder_out and x_masks which then feed to the duration.onnx model
Example below(please tell me the if zeroes are needed in the below example)
results = linguistic_model.run(None, {
"tokens":[[26, 1, 22, 35, 11]] ,
"word_div": [[3,2,0,0,0]],
"word_dur": [[48,24,0,0,0]]
})
Happy to get your thoughts, thank you!