Basic Information

Place of birth: Taichung City
Education:
Taichung Municipal Zhongming Elementary School
Taichung Municipal Hankou Junior High School
National Taichung Home Economics and Commercial High School, Department of Data Processing
National Taiwan University of Science and Technology, Department of Computer Science and Information Engineering (B.S.)
National Taiwan University of Science and Technology, Graduate Institute of Computer Science and Information Engineering (M.S.)

Thesis Research

Advisor: Dr. 古鴻炎
Title (Chinese): 以聲學語言模型、全域變異數匹配及目標音框挑選作強化之語音轉換系統
Title (English): A Voice Conversion System Enhanced with an Acoustic Language Model, Global Variance Matching, and Target Frame Selection
Abstract: This thesis studies a combined method for voice conversion that enhances the performance of GMM-based voice conversion. The combined method comprises three processing modules: a PPM acoustic language model (ALM), target frame selection (TFS), and global variance (GV) matching. We implemented two combined conversion methods, ALM+TFS+GV and ALM+GV+TFS. In the training stage, the 128 mean vectors of the Gaussian mixtures of a trained GMM are used to build a binary classification tree of quasi-phonetic symbols, and this tree is then used to train the PPM ALM. In the conversion stage, the input frames are first segmented into quasi-phonetic units according to the probabilities estimated by the ALM; each frame's spectrum is then mapped with the single Gaussian mixture corresponding to its quasi-phonetic unit. Afterward, TFS and GV matching are applied to alleviate the problem of over-smoothed spectral envelopes. TFS takes the converted DCC (discrete cepstral coefficient) vector of each frame, searches the target speaker's training frames for the nearest DCC vector, and substitutes the found vector for the converted one; GV matching adjusts the variance of a sequence of converted DCC vectors to match the variance of the target speaker's training DCC vectors. Objective tests show that the average DCC error distance of our methods is larger than that of the baseline method, but the variance ratio (VR) improves. In addition, perceptual listening tests show that the proposed methods yield higher signal quality, and that the timbre of the converted speech is quite close to the target speaker's.
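The per-frame spectral mapping described above, restricted to a single Gaussian mixture, is the standard conditional-mean transform of joint-density GMM conversion. A minimal NumPy sketch of that one-mixture mapping follows; the function name and argument layout are illustrative, not taken from the thesis:

```python
import numpy as np

def map_frame(x, mu_x, mu_y, cov_yx, cov_xx):
    """Map a source DCC vector x with a single Gaussian mixture:
    y = mu_y + cov_yx @ inv(cov_xx) @ (x - mu_x),
    i.e. the conditional mean E[y | x] of a joint Gaussian over (x, y).
    """
    # solve(cov_xx, x - mu_x) avoids forming the explicit inverse
    return mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x)
```

In the combined methods, the mixture (mu_x, mu_y, cov_yx, cov_xx) applied to each frame is the one selected by the ALM's quasi-phonetic segmentation rather than a posterior-weighted sum over all 128 mixtures.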
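The TFS step described above is a nearest-neighbour substitution over the target speaker's training frames. A small sketch, assuming Euclidean distance on DCC vectors (the interface below is hypothetical):

```python
import numpy as np

def target_frame_selection(converted, target_frames):
    """Replace each converted DCC vector with its nearest neighbour
    (Euclidean distance) among the target speaker's training frames.

    converted     : (T, D) converted DCC vectors of one utterance
    target_frames : (N, D) DCC vectors from the target speaker's corpus
    """
    # squared Euclidean distance between every converted/target pair
    d2 = ((converted[:, None, :] - target_frames[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)        # index of the closest target frame
    return target_frames[nearest]      # substitute the found frames
```

Because the output frames are drawn verbatim from real target-speaker speech, their spectral detail is not smoothed by the statistical mapping, which is why TFS helps with the over-smoothing problem.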
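GV matching, as described, rescales the variance of a converted DCC sequence toward the target speaker's global variance. A minimal sketch, assuming a per-dimension affine rescaling about the utterance mean (names are illustrative):

```python
import numpy as np

def gv_match(converted, target_gv):
    """Rescale each DCC dimension of an utterance so that its variance
    matches the target speaker's global variance.

    converted : (T, D) converted DCC vectors of one utterance
    target_gv : (D,) per-dimension global variance of the target speaker
    """
    mean = converted.mean(axis=0)                        # per-dimension mean
    var = converted.var(axis=0)                          # per-dimension variance
    scale = np.sqrt(target_gv / np.maximum(var, 1e-12))  # guard against zero variance
    # shift to zero mean, rescale, shift back: variance becomes target_gv
    return mean + (converted - mean) * scale
```

Since statistical mapping tends to shrink variance (over-smoothing), pushing the variance back up toward the target speaker's GV restores spectral dynamics, which is consistent with the improved variance ratio (VR) reported in the objective tests.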