    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/32615


    Title: 中文新聞標題自動生成之研究
    A Study on the Automatic Generation for Headlines of Chinese News Articles
    Authors: 江珮翎
    Chiang, Pei-ling
    Contributors: 劉吉軒
    陳光華

    Liu, Jyi-Shane
    Chen, Kuang-hua

    江珮翎
    Chiang, Pei-ling
    Keywords: headline
    automatic generation
    natural language
    generation
    news headline
    Date: 2002
    Issue Date: 2009-09-17 13:51:52 (UTC+8)
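    Abstract: In an age of information explosion on the Internet, analyzing and organizing data is increasingly important. This thesis targets headline generation: it automatically generates headlines for documents, thereby adding value to the data and turning it into information. After reviewing and analyzing the related English-language literature, the researcher concluded that Chinese must be handled differently from English. The thesis therefore proposes Chinese preprocessing and an automatic headline generation method that differ from those used for English.
    The researcher proposes several features for automatic headline generation: candidate-term weights, trained headline-text vocabulary, headline length, and the gap between terms. The research has two phases. The first is the training phase, in which documents are preprocessed and word-segmented, the headline-text vocabulary is trained, and the probabilities of headline lengths are computed. The second is the execution phase, in which the candidate-term weights of a new document are analyzed, the headline-text vocabulary and the headline-length probability table from the training phase are consulted, and, after term gaps are taken into account, a headline is generated automatically for the document. The training collection consists of 84,211 documents from five newspapers published in 1998 and 1999, covering a range of topics. The experiments on test documents are divided into an Outside Test and an Inside Test.
    The researcher evaluated the results in two ways. The first is an automatic evaluation: the generated headlines are compared with the headlines written by journalists, and precision, recall, and F1 are calculated. For the Outside Test, precision is 14.21%, recall is 11.43%, and F1 is 12.67%. For the Inside Test, precision is 15.84%, recall is 12.94%, and F1 is 14.21%. These results are close to the headline generation results reported for English documents in other work (F1 = 3.2% to 24%), but they still fall short of the actual headlines, so much room remains for future work. The second is a human evaluation: users read the automatically generated headlines and rate them. The fluency of the generated headlines is fairly good. Overall, this research is still at an early stage, and the researcher hopes that it will mature and lead to further innovation and improvement.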
    As the number of digital documents on the Internet grows, the analysis and organization of documents become increasingly important. In this thesis, we propose an approach to headline generation for documents, which helps turn document data into information. We review the literature on related topics and present an approach for Chinese documents that differs from those used for English documents.
    We propose an approach to headline generation for Chinese documents. The work is divided into two steps: a training step and an execution step. In the training step, the documents are preprocessed, and we then estimate the probabilities of headline-text words and of headline lengths. In the execution step, we analyze the scores of headline candidates and the gaps between them, refer to the trained probabilities of headline-text words and headline lengths, and automatically generate a headline for each document. The training documents are selected from CIRB, a test collection for information retrieval; in total, 84,211 Chinese news articles published between 1998 and 1999 are used. The test documents fall into two parts, one for the outside test and the other for the inside test.
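The training step described above can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so this is only one plausible reading, under these assumptions: documents are already word-segmented, the "probability of headline-text words" is estimated as the fraction of documents containing a word in the body that also use it in the headline, and headline length is counted in words. The names `train`, `word_prob`, `length_prob`, and the toy corpus are hypothetical.

```python
from collections import Counter

def train(pairs):
    """Estimate headline-text word probabilities and headline-length probabilities.

    pairs: iterable of (headline_tokens, body_tokens), already word-segmented.
    This is an illustrative estimator, not the thesis's actual model.
    """
    in_text = Counter()   # number of documents whose body contains the word
    in_both = Counter()   # number of documents with the word in body and headline
    lengths = Counter()   # headline length in tokens -> number of documents
    n_docs = 0
    for headline, body in pairs:
        n_docs += 1
        lengths[len(headline)] += 1
        headline_set = set(headline)
        for w in set(body):
            in_text[w] += 1
            if w in headline_set:
                in_both[w] += 1
    # P(word appears in headline | word appears in body), per document
    word_prob = {w: in_both[w] / in_text[w] for w in in_text}
    # P(headline has n tokens)
    length_prob = {n: c / n_docs for n, c in lengths.items()}
    return word_prob, length_prob

# Toy corpus (hypothetical data, for illustration only)
corpus = [
    (["股市", "上漲"], ["今日", "股市", "大幅", "上漲"]),
    (["颱風", "來襲"], ["颱風", "今日", "來襲", "股市", "休市"]),
]
word_prob, length_prob = train(corpus)
print(word_prob["股市"], word_prob["今日"], length_prob[2])
```

In the execution step, tables like these would be combined with candidate-term weights and term gaps to score candidate headlines; how the thesis combines them is not stated in the abstract.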
    We conducted two evaluations: an automatic evaluation using the metrics of precision, recall, and F1, and a human assessment. In the outside test, precision is 14.21%, recall is 11.43%, and F1 is 12.67%. In the inside test, precision is 15.84%, recall is 12.94%, and F1 is 14.21%. The automatic evaluation shows that the accuracy is still not good enough, while the human assessment shows that our approach can produce human-readable headlines.
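For readers unfamiliar with the automatic metrics, the following sketch shows how precision, recall, and F1 might be computed for one generated headline against a reference headline. It assumes word-level matching on segmented tokens, which the abstract does not specify; the function names and the example headlines are hypothetical.

```python
from collections import Counter

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(generated, reference):
    """Word-overlap precision, recall, and F1 between two token lists."""
    # Multiset intersection: each reference token can be matched at most once
    overlap = sum((Counter(generated) & Counter(reference)).values())
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall, f1_score(precision, recall)

# Hypothetical example: 2 of 3 generated words appear in the 4-word reference
p, r, f = evaluate(["股市", "大幅", "上漲"], ["台北", "股市", "上漲", "百點"])
print(p, r, f)

# Consistency check on the reported outside-test figures
outside_f1 = round(f1_score(14.21, 11.43), 2)
print(outside_f1)
```

Applying the F1 formula to the reported outside-test precision and recall reproduces the reported 12.67%. For the inside test, the same formula gives about 14.24% rather than the reported 14.21%; the abstract does not say how scores were averaged across documents, so the difference may come from the averaging method.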
    Description: Master's
    National Chengchi University
    Department of Computer Science
    89753004
    91
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0089753004
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File          Size    Format
    75300401.pdf  83Kb    Adobe PDF
    75300402.pdf  282Kb   Adobe PDF
    75300403.pdf  183Kb   Adobe PDF
    75300404.pdf  332Kb   Adobe PDF
    75300405.pdf  80Kb    Adobe PDF
    75300406.pdf  192Kb   Adobe PDF

