    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/157880


Title: Building a Real-Time News Retrieval-Augmented Generation System with Large Language Models
Authors: Lin, Pei-Hsin
Contributors: Tsai, Yen-Lung
Lin, Pei-Hsin
Keywords: Retrieval-Augmented Generation (RAG)
Large Language Models (LLM)
News Retrieval
Generative AI
Open-Source Models
    Date: 2025
    Issue Date: 2025-07-01 15:49:00 (UTC+8)
Abstract: Large language models (LLMs) offer strong query understanding and text generation, effectively addressing the semantic challenges of traditional retrieval systems. However, their heavy reliance on pre-trained knowledge makes them prone to producing outdated or fabricated information, a weakness that is especially pronounced in real-time news retrieval.

Compared with other data types, news data is highly time-sensitive, semantically repetitive, and fragmented, which further deepens the challenges of retrieval and generation. This study therefore takes news data as its foundation and implements a retrieval-augmented generation system, framed around the varied needs of news professionals using an internal query system, to examine whether RAG can effectively reduce hallucination and improve the accuracy of the information returned.

The study first crawled ETtoday news articles as the knowledge source, then designed experiments comparing four model configurations: Llama3-8B, Llama3-8B with RAG, GPT-4, and GPT-4 with RAG. The test tasks covered factual question answering, event synthesis, and news summarization, and the generated content was scored on dimensions such as correctness and completeness. The results show that introducing RAG significantly improves the factual accuracy of model responses and effectively reduces hallucination.
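The comparison can be pictured as a simple evaluation loop over the four configurations and three task types. The sketch below is illustrative only: run_model and score_answer are hypothetical placeholders, and the thesis's actual prompts and scoring rubric are not reproduced here.

```python
# Sketch of the four-configuration comparison described above.
# run_model() and score_answer() are hypothetical placeholders; the
# thesis's actual prompting and scoring rubric are not shown here.

CONFIGS = ["Llama3-8B", "Llama3-8B+RAG", "GPT-4", "GPT-4+RAG"]
TASKS = {"factual_qa", "event_synthesis", "news_summary"}

def run_model(config: str, task: str, question: str) -> str:
    """Dispatch one question to one of the four model setups (stub)."""
    raise NotImplementedError

def score_answer(answer: str, reference: str) -> dict:
    """Rate an answer for correctness and completeness (stub rubric)."""
    raise NotImplementedError

def evaluate(test_items: list[tuple[str, str, str]]) -> dict:
    """test_items: (task, question, reference_answer) triples."""
    results = {config: [] for config in CONFIGS}
    for config in CONFIGS:
        for task, question, reference in test_items:
            if task not in TASKS:
                continue
            answer = run_model(config, task, question)
            results[config].append(score_answer(answer, reference))
    return results
```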

On the system side, the work covers data crawling, preprocessing, document chunking, and vector database design, culminating in an interactive interface built with Gradio. Throughout, the study also emphasizes retrieval-quality tracking and a feedback mechanism to ensure the reliability of the final generated answers.
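As a rough illustration of that pipeline, the following sketch chunks articles, indexes their embeddings, retrieves the most similar chunks for a query, and prompts a model with the retrieved context. It assumes hypothetical embed and generate functions standing in for the embedding model and the LLM API; none of these names come from the thesis itself.

```python
# Minimal RAG pipeline sketch: chunk documents, index embeddings,
# retrieve the top-k chunks for a query, and prompt an LLM with them.
# embed() and generate() are hypothetical stand-ins for an embedding
# model and an LLM API call; they are not the thesis's implementation.
import numpy as np

def chunk(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split one article into fixed-size, overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one embedding row vector per input text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call the underlying language model."""
    raise NotImplementedError

def build_index(articles: list[str]) -> tuple[list[str], np.ndarray]:
    """Embed all chunks and unit-normalise them for cosine search."""
    chunks = [c for article in articles for c in chunk(article)]
    vectors = embed(chunks)
    return chunks, vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def answer(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> str:
    """Retrieve the k most similar chunks and ground the LLM on them."""
    q = embed([query])[0]
    q = q / np.linalg.norm(q)
    top = np.argsort(vectors @ q)[-k:][::-1]      # cosine similarity ranking
    context = "\n\n".join(chunks[i] for i in top)
    return generate(f"Answer using only this context:\n{context}\n\nQ: {query}")
```

An interactive front end in the spirit of the one described could then wrap answer with Gradio, e.g. gr.Interface(fn=..., inputs="text", outputs="text").launch().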
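The retrieval-quality tracking and feedback mechanism mentioned above could, in its simplest form, be a similarity-threshold gate that declines to answer when retrieval confidence is low. A minimal sketch under that assumption, reusing the embed and generate placeholders from the previous snippet; the threshold value and fallback message are invented for illustration.

```python
# Sketch of a retrieval-quality gate: if even the best retrieved chunk
# is too dissimilar from the query, decline instead of letting the model
# answer from (possibly hallucinated) parametric memory. The threshold
# and the fallback message below are invented for illustration; embed()
# and generate() are the placeholders defined in the previous sketch.
import numpy as np

def answer_with_gate(query: str, chunks: list[str], vectors: np.ndarray,
                     k: int = 3, threshold: float = 0.35) -> str:
    q = embed([query])[0]
    q = q / np.linalg.norm(q)
    sims = vectors @ q
    top = np.argsort(sims)[-k:][::-1]
    if sims[top[0]] < threshold:                  # low-confidence retrieval
        return "No sufficiently relevant article was found; please rephrase."
    context = "\n\n".join(chunks[i] for i in top)
    return generate(f"Answer using only this context:\n{context}\n\nQ: {query}")
```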
Description: Master's thesis
National Chengchi University
Master's Program in Global Communication and Innovation Technology
111ZM1027
Source URI: http://thesis.lib.nccu.edu.tw/record/#G0111ZM1027
Data Type: thesis
Appears in Collections: [Master's Program in Global Communication and Innovation Technology] Theses

    Files in This Item:

File: index.html (0Kb, HTML)


All items in the NCCU Institutional Repository (政大典藏) are protected by copyright, with all rights reserved.



Copyright Announcement
1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for academic research, public education, and other non-commercial uses. Please use the content in a proper and reasonable manner and respect the rights of the copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made to prevent copyright infringement on this website. If you believe that any material on the website nevertheless infringes copyright, please notify the site maintainers (nccur@nccu.edu.tw); they will immediately take remedial measures such as removing the work from the repository.