|
Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/32615
|
Title: | 中文新聞標題自動生成之研究 A Study on the Automatic Generation for Headlines of Chinese News Articles |
Authors: | 江珮翎 Chiang, Pei-ling |
Contributors: | 劉吉軒 陳光華 Liu, Jyi-Shane Chen, Kuang-hua 江珮翎 Chiang, Pei-ling |
Keywords: | headline; automatic generation; natural language; generation; news headline |
Date: | 2002 |
Issue Date: | 2009-09-17 13:51:52 (UTC+8) |
Abstract: | As the number of digital documents on the Internet grows, analyzing and organizing them becomes increasingly important. This thesis proposes an approach to automatic headline generation, which adds value to documents and helps turn raw data into information. After reviewing the related English-language literature, we concluded that Chinese must be handled differently from English, so we propose a Chinese-specific preprocessing procedure and headline generation method.
We consider several features for headline generation: the weights of candidate words, trained headline-text word probabilities, headline length, and the gap between words. The work is divided into two phases. In the training phase, documents are preprocessed and segmented into words, and the probabilities of headline-text words and of headline lengths are estimated. In the execution phase, the candidate-word weights of a new document are analyzed and combined with the headline-text word and headline-length probability tables from training, and a headline is generated after taking the word gap into account. The training documents are selected from CIRB, a test collection for information retrieval: 84,211 Chinese news articles on a range of topics, published in five newspapers between 1998 and 1999. The test experiments are divided into an Outside Test and an Inside Test.
We conducted two evaluations. The first is an automatic evaluation that compares each generated headline with the headline written by the reporter and computes precision, recall, and F1. For the Outside Test, precision is 14.21%, recall is 11.43%, and F1 is 12.67%. For the Inside Test, precision is 15.84%, recall is 12.94%, and F1 is 14.21%. These results are comparable to those reported in the literature for English headline generation (F1 = 3.2% to 24%), but the generated headlines still differ considerably from the actual ones, so the accuracy is not yet good enough and substantial room for improvement remains. The second is a human assessment in which users read and scored the generated headlines; it shows that the generated headlines are reasonably fluent and human-readable. Overall, this research is at an early stage, and we hope future work will make the approach more mature. |
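The automatic evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the matching unit or how scores are averaged over documents, so the code below assumes word-level matching with clipped counts against a single reference headline. The function names `harmonic_f1` and `overlap_scores` and the example tokens are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overlap_scores(generated, reference):
    """Return (precision, recall, f1) for two lists of segmented words.

    Assumption: a generated word counts as correct if it also occurs in
    the reference headline, matched at most as often as it occurs there.
    """
    gen_counts = Counter(generated)
    ref_counts = Counter(reference)
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((gen_counts & ref_counts).values())
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall, harmonic_f1(precision, recall)


# Hypothetical segmented headlines, for illustration only.
generated = ["bank", "cuts", "rates"]
reference = ["bank", "cuts", "interest", "again", "today"]
precision, recall, f1 = overlap_scores(generated, reference)
print(precision, recall, f1)
```

As a consistency check, `harmonic_f1(14.21, 11.43)` rounds to 12.67, which agrees with the F1 reported for the Outside Test.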
Description: | Master's thesis, National Chengchi University, Department of Computer Science, 89753004, 91 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0089753004 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|