National Chengchi University Institutional Repository (NCCUR): Item 140.119/32615

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/32615

Title:	中文新聞標題自動生成之研究 (A Study on the Automatic Generation for Headlines of Chinese News Articles)
Authors:	Chiang, Pei-ling (江珮翎)
Contributors:	Liu, Jyi-Shane (劉吉軒); Chen, Kuang-hua (陳光華); Chiang, Pei-ling (江珮翎)
Keywords:	automatic headline generation; natural language generation; news headlines
Date:	2002
Issue Date:	2009-09-17 13:51:52 (UTC+8)
Abstract:	As the volume of information on the Internet grows rapidly, analyzing and organizing documents has become increasingly important. This thesis addresses automatic headline generation: producing a headline for a document so that raw data gains value as information. After reviewing the related English-language literature, the author concludes that Chinese text must be handled differently from English, and therefore proposes a Chinese-specific preprocessing procedure and headline generation method. Several features are considered: the weight of each candidate word, headline-text word pairs learned from training data, headline length, and the gap between word groups.
The work has two stages. In the training stage, documents are preprocessed and segmented into words; the probabilities of headline-text words and of headline length are then estimated. In the execution stage, the candidate words of a new document are scored, the trained headline-text word probabilities and the headline-length probability table are consulted, the gap between word groups is taken into account, and a headline is generated automatically. The training documents are selected from CIRB, a test collection for information retrieval: 84,211 Chinese news articles published in five newspapers between 1998 and 1999, covering a range of topics. The test documents are divided into an outside test and an inside test.
Two evaluations were conducted. The automatic evaluation compares each generated headline with the headline written by the journalist and reports precision, recall, and F1. For the outside test, precision is 14.21%, recall is 11.43%, and F1 is 12.67%. For the inside test, precision is 15.84%, recall is 12.94%, and F1 is 14.21%. These figures are comparable to those reported in the literature for English headline generation (F1 = 3.2% to 24%), but the generated headlines still differ considerably from the actual ones, which leaves substantial room for future work. In the human assessment, users rated the generated headlines after reading them, and the headlines were judged reasonably fluent and human-readable. Overall, the research is at an early stage, and further refinement and innovation are expected in future work.
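The automatic evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that a generated headline and a reference headline are each a list of segmented words and that matches are counted as multiset word overlap; the record does not state the exact matching rule used in the thesis, so the function names and the example tokens below are hypothetical.

```python
from collections import Counter


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def headline_prf(generated, reference):
    """Word-overlap precision, recall, and F1 between two token lists.

    Assumption: overlap is counted per word with multiplicity, which may
    differ from the matching rule actually used in the thesis.
    """
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(generated) & Counter(reference)).values())
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical segmented headlines, for illustration only.
generated = ["立法院", "通過", "預算"]
reference = ["立法院", "三讀", "通過", "總預算", "案"]

p, r, f = headline_prf(generated, reference)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.6667 0.4 0.5

# Harmonic mean of the reported outside-test precision and recall.
print(round(f1_score(0.1421, 0.1143), 4))  # 0.1267
```

The harmonic mean of the reported outside-test precision and recall reproduces the reported F1 of 12.67%. Reported figures may be averaged per document, so aggregate precision and recall need not reproduce a reported F1 exactly.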
Description:	Master's; National Chengchi University; Department of Computer Science; 89753004; 91
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0089753004
Data Type:	thesis
Appears in Collections:	[Department of Computer Science] Theses

Files in This Item:

File	Size	Format
75300401.pdf	83 KB	Adobe PDF
75300402.pdf	282 KB	Adobe PDF
75300403.pdf	183 KB	Adobe PDF
75300404.pdf	332 KB	Adobe PDF
75300405.pdf	80 KB	Adobe PDF
75300406.pdf	192 KB	Adobe PDF

All items in 政大典藏 are protected by copyright, with all rights reserved.

Copyright Announcement

1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial, public-interest uses such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.
