Tsan-Wei Wang (王讚緯)

Basic Information


Place of birth: Taipei City

Education:
  • Taipei Municipal Ximen Elementary School
  • Taipei Municipal Nanmen Junior High School
  • Taipei Municipal Fuxing Senior High School
  • Department of Computer Science, Soochow University (private)
  • Graduate Institute of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Thesis Research


Advisor:
Dr. Hung-Yan Gu (古鴻炎)

Chinese Title:
使用直方圖等化及目標音框挑選之語音轉換系統 (A Voice Conversion System Using Histogram Equalization and Target Frame Selection)

English Title:
A Voice Conversion System Using Histogram Equalization and Target Frame Selection

Chinese Abstract:
This thesis adopts linear multivariate regression (LMR) as the spectrum-mapping mechanism, and adds histogram equalization (HEQ) of the spectral coefficients and target frame selection (TFS), in order to alleviate the spectral over-smoothing problem commonly encountered in Gaussian mixture model (GMM) based spectrum mapping and thereby improve the quality of the converted speech. In addition, because parallel corpora are difficult to obtain, we studied a method for constructing an imitative parallel corpus from a non-parallel corpus, and then used a non-parallel corpus to implement four voice conversion systems: LMR, LMR+TFS, HEQ+LMR, and HEQ+LMR+TFS. In the training stage, segment-based frame alignment is used to build the imitative parallel corpus, which is then used to train the model parameters of the four systems. For histogram equalization, the discrete cepstral coefficients (DCC) are first transformed into principal component analysis (PCA) coefficients, and the PCA coefficients are then transformed into cumulative density function (CDF) coefficients. For target frame selection, a frame's segment-class number and the DCC vector mapped by LMR are used to search the frame set collected from the target speaker for the same segment class; the target-speaker DCC vector nearest to the mapped vector is found and used to replace it. In the testing stage, objective measurements of DCC error distance show that adding histogram equalization reduces the average DCC error distance, whereas adding target frame selection instead enlarges it. Nevertheless, the variance ratio (VR) and subjective listening tests show that target frame selection does improve the converted voice quality, and that a larger average DCC error distance does not imply worse voice quality.
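As a rough illustration of the HEQ chain just described (DCC to PCA coefficients, then to CDF coefficients), the following is a minimal Python/NumPy sketch; the function name and the rank-based estimate of the empirical CDF are assumptions made for illustration, not the thesis implementation.

    import numpy as np

    def heq_transform(dcc_frames):
        # Center the DCC vectors and project them onto the PCA axes.
        mean = dcc_frames.mean(axis=0)
        centered = dcc_frames - mean
        # Eigen-decomposition of the covariance matrix yields the PCA basis.
        _, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        pca_coeffs = centered @ eigvecs
        # Histogram equalization: replace each PCA coefficient by its
        # empirical cumulative density function (CDF) value, i.e. its
        # normalized rank within the training frames.
        n = pca_coeffs.shape[0]
        ranks = pca_coeffs.argsort(axis=0).argsort(axis=0)
        return (ranks + 1) / n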

English Abstract:
In this thesis, linear multivariate regression (LMR) is adopted for spectrum mapping. In addition, histogram equalization (HEQ) of spectral coefficients and target frame selection (TFS) are added to our system. We intend to alleviate the spectral over-smoothing problem encountered by the conventional GMM (Gaussian mixture model) based mapping mechanism, in order to improve the converted voice quality. Also, since parallel training sentences are hard to prepare, we study a method to construct an imitative parallel corpus from a non-parallel corpus. Next, we use a non-parallel corpus to build four voice conversion systems: LMR, LMR+TFS, HEQ+LMR, and HEQ+LMR+TFS. In the training stage, a segment-based frame alignment method is refined to construct the imitative parallel corpus. Then, the corpus is used to train the model parameters of the four voice conversion systems. In the HEQ module, discrete cepstral coefficients (DCC) are first transformed to principal component analysis (PCA) coefficients, and then to cumulative density function (CDF) coefficients. In the TFS module, a DCC vector obtained from LMR mapping and its segment-class number are used to search the frame set consisting of target-speaker frames belonging to the same segment class. The DCC vector of the frame in that set nearest to the LMR-mapped vector is then found and taken to replace the mapped vector. In the conversion stage, it is seen that the HEQ module decreases the average DCC error, whereas the TFS module increases it. However, according to the variance ratio measure and to subjective listening tests, the TFS module does improve the converted voice quality; hence an increased average DCC error does not indicate that the converted voice quality is worsened.
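The TFS step described above is essentially a nearest-neighbour lookup among the target speaker's frames of the same segment class. Below is a minimal Python/NumPy sketch under two assumptions not stated in the abstracts: the target-speaker frames are pre-grouped by segment class in a dictionary, and Euclidean distance is used; the names are hypothetical.

    import numpy as np

    def target_frame_selection(mapped_dcc, segment_class, target_frame_sets):
        # target_frame_sets: dict from a segment-class number to an
        # (num_frames, dim) array of target-speaker DCC vectors of that class.
        candidates = target_frame_sets[segment_class]
        # Distance from the LMR-mapped DCC vector to every candidate frame.
        dists = np.linalg.norm(candidates - mapped_dcc, axis=1)
        # Replace the mapped vector with the nearest target-speaker DCC vector.
        return candidates[dists.argmin()]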