XU Liang, WANG Jing, YANG Wenjing, LUO Yiyu. Multi-feature fusion audio-visual joint speech separation algorithm based on Conv-TasNet[J]. JOURNAL OF SIGNAL PROCESSING, 2021, 37(10): 1799-1805. DOI: 10.16798/j.issn.1003-0530.2021.10.002

Multi-feature fusion audio-visual joint speech separation algorithm based on Conv-TasNet

  • Audio-visual multimodal modeling has been shown to be effective for speech separation. This paper proposes a speech separation model that improves on the existing time-domain audio-visual joint speech separation algorithm by strengthening the coupling between the audio and visual streams. To address the loose integration of existing audio-visual separation models, the authors propose an end-to-end model that fuses the audio features with the additional input visual features multiple times in the time domain, and adds vertical weight sharing. The model was trained and evaluated on the GRID dataset. Experiments show that, compared with an audio-only Conv-TasNet and an audio-visual Conv-TasNet, the proposed model improves performance by 1.2 dB and 0.4 dB, respectively.
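The abstract describes fusing the visual stream into the audio features more than once along the time-domain separator. The paper's exact architecture is not given on this page; below is a minimal, hypothetical sketch of the core fusion step, assuming NumPy arrays, made-up channel/frame dimensions, and nearest-neighbour upsampling of the slower video frame rate to the audio frame rate:

```python
import numpy as np

def fuse(audio_feat, visual_feat):
    """Concatenate a visual embedding onto audio features along the
    channel axis, after upsampling video frames to the audio frame rate.
    Hypothetical shapes: audio_feat (C_a, T_a), visual_feat (C_v, T_v)."""
    T_a = audio_feat.shape[1]
    # Nearest-neighbour index map from audio frames to video frames.
    idx = np.minimum((np.arange(T_a) * visual_feat.shape[1]) // T_a,
                     visual_feat.shape[1] - 1)
    visual_up = visual_feat[:, idx]                  # (C_v, T_a)
    return np.concatenate([audio_feat, visual_up], axis=0)

# Repeated fusion across separator stages (the paper's "vertical weight
# sharing" would reuse the same stage parameters; stages are omitted here).
audio = np.random.randn(64, 400)    # hypothetical encoder output
visual = np.random.randn(32, 25)    # hypothetical lip-embedding stream
x = audio
for _ in range(3):                  # re-inject visual features at each stage
    x = fuse(x[:64], visual)        # keep the audio channels, fuse again
print(x.shape)                      # (96, 400)
```

This only illustrates the repeated time-domain concatenation idea; the actual model presumably interleaves these fusions with temporal convolution blocks as in Conv-TasNet.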
