Please use this permanent URL to cite or link to this document:
https://nccur.lib.nccu.edu.tw/handle/140.119/158708
|
Title: | 基於大型語言模型的論文主題分析與跨領域應用探索 (Exploring the interdisciplinary potential of research through LLM-driven subject area analysis) |
Author: | 羅延康 (Lo, In-Hong) |
Contributors: | 廖文宏 (Liao, Wen-Hung); 羅延康 (Lo, In-Hong) |
Keywords: | Large Language Models; Cross-domain Collaboration; Knowledge Graph; Keyword Co-occurrence |
Date: | 2025 |
Upload time: | 2025-08-04 15:10:04 (UTC+8) |
Abstract: | This study combines keyword co-occurrence-based knowledge graph construction with Large Language Models (LLMs) to assess the interdisciplinary character of academic papers and the cross-domain dynamics of researchers. Using Python, NLTK, py2neo, and Sentence-Transformers, we collected the titles and abstracts of papers affiliated with National Chengchi University (NCCU) from the Scopus database. After BERT-based semantic mining, synonym merging, and clustering, we built a keyword co-occurrence matrix and a knowledge graph. Neo4j was used for visualization and centrality analysis to identify influential keywords and “bridge nodes” across domains. To illustrate the method, several subgraph analyses were conducted. For example, the “Nation and Geopolitics” subgraph shows how keywords such as China, Taiwan, the United States, and Hong Kong co-occur in international-relations and economic contexts, which supports tracking research trends and spotting emerging topics.

For interdisciplinary evaluation, the study used the journal subject area labels provided by Scopus and also applied the GPT-4o LLM to assign multiple subject area labels to each paper under two input modes: (1) title only and (2) title plus abstract. The “At Least One Accuracy” metric measures whether the model correctly predicts at least one of a paper's subject areas. Results on this metric were stable under both input modes, indicating sound semantic understanding and suitability for preliminary classification and topic exploration. In the multi-label setting, raising the classification threshold improved precision but markedly reduced recall, a precision-recall trade-off. To capture the interdisciplinary character of researchers and institutions more fully, entropy was introduced as a diversity measure. It was used to analyze the degree of interdisciplinarity of colleges and individual researchers and how it changes, which helps identify potential collaborative groups and bridging authors and, through observation over time, reveals shifts in research focus and future collaboration opportunities.

Building on this framework, the study designed a collaborator recommendation system that combines LLM-based classification, journal subject areas, and symmetric KL divergence (SymKL). Taking the screening of potential collaborators for National Science and Technology Council (NSTC) projects as an example, an LLM semantically classifies project abstracts to determine their key application areas, which are then quantified as scores over Scopus's standard subject areas. SymKL then measures how similar each researcher's subject area distribution is to that of the project topic, helping higher-education and research institutions screen candidates who closely match the project. According to the study, this approach identifies interdisciplinary researchers quickly and objectively, supports building complementary, application-oriented teams, and substantially improves the efficiency and accuracy of the recommendation process, providing real-time decision support for principal investigators. |
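The abstract names three quantities without giving their formulas: entropy as a diversity measure, SymKL as a distribution similarity measure, and "At Least One Accuracy". The sketch below shows one common way to compute each. It is an illustration under stated assumptions, not the thesis's implementation: it assumes Shannon entropy with the natural logarithm, defines SymKL as KL(p||q) + KL(q||p) (some authors halve this sum), and adds a small smoothing constant `eps` so that zero entries do not make the divergence infinite. All function names, the subject area weights, and the researcher names are hypothetical.

```python
import math


def normalize(weights, eps=0.0):
    """Convert non-negative weights to probabilities, adding eps to every entry first."""
    smoothed = [w + eps for w in weights]
    total = sum(smoothed)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in smoothed]


def shannon_entropy(weights):
    """Shannon entropy (in nats) of a subject area distribution; higher means more diverse."""
    p = normalize(weights)
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) for two probability lists; q must be positive wherever p is positive."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def sym_kl(weights_a, weights_b, eps=1e-9):
    """Symmetric KL divergence, assumed here to be KL(p||q) + KL(q||p).

    eps is an assumed smoothing constant that keeps every probability positive.
    """
    if len(weights_a) != len(weights_b):
        raise ValueError("both distributions must cover the same subject areas")
    p = normalize(weights_a, eps)
    q = normalize(weights_b, eps)
    return kl_divergence(p, q) + kl_divergence(q, p)


def at_least_one_accuracy(true_labels, predicted_labels):
    """Fraction of papers for which at least one predicted label is a true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need one non-empty, equally long label list per side")
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if set(t) & set(p))
    return hits / len(true_labels)


# Hypothetical data: weights over three subject areas
# (for example Computer Science, Social Sciences, Economics).
project = [0.6, 0.3, 0.1]
researchers = {
    "researcher_a": [5, 3, 1],  # publication counts per subject area
    "researcher_b": [0, 1, 9],
}

# Smaller SymKL means the researcher's profile is closer to the project's.
ranked = sorted(researchers, key=lambda name: sym_kl(researchers[name], project))
print(ranked)
print({name: round(shannon_entropy(w), 3) for name, w in researchers.items()})
```

Ranking by ascending SymKL puts the researcher whose subject area profile is closest to the project's first, while entropy is reported separately because a close match and a broad profile are different properties.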
References: |
[BCB03] K. Börner, C. Chen, and K. W. Boyack, “Visualizing knowledge domains”, Annual Review of Information Science and Technology, vol. 37, pp. 179–255, 2003 (cited on p. 13).
[CCT+83] M. Callon, J. P. Courtial, W. A. Turner, and S. Bauin, “From translations to problematic networks: An introduction to co-word analysis”, Social Science Information, vol. 22, no. 2, pp. 191–235, 1983 (cited on p. 13).
[Che06] C. Chen, “CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature”, Journal of the American Society for Information Science and Technology, vol. 57, no. 3, pp. 359–377, 2006 (cited on p. 13).
[DCL+19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding”, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186 (cited on p. 2).
[Gro20] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT. Version v0.3.0, 2020 (cited on p. 31).
[Joh67] S. C. Johnson, “Hierarchical clustering schemes”, Psychometrika, vol. 32, no. 3, pp. 241–254, 1967 (cited on p. 34).
[KL51] S. Kullback and R. A. Leibler, “On information and sufficiency”, The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951 (cited on p. 8).
[LIX+23] Y. Liu, D. Iter, Y. Xu, et al., “G-Eval: NLG evaluation using GPT-4 with better human alignment”, in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., Singapore: Association for Computational Linguistics, Dec. 2023, pp. 2511–2522 (cited on p. 20).
[New10] M. E. J. Newman, Networks: An Introduction. Oxford University Press, 2010 (cited on p. 13).
[NJW02] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm”, in Advances in Neural Information Processing Systems, vol. 14, 2002, pp. 849–856 (cited on p. 9).
[PC85] A. L. Porter and D. E. Chubin, “An indicator of cross-disciplinary research”, Scientometrics, vol. 8, no. 3, pp. 161–176, 1985 (cited on p. 5).
[PR09] A. L. Porter and I. Rafols, “Is science becoming more interdisciplinary? Measuring and mapping six research fields over time”, Scientometrics, vol. 81, no. 3, pp. 719–745, 2009 (cited on p. 7).
[Raf20] I. Rafols, “On ‘measuring’ interdisciplinarity: From indicators to indicating”, Leiden Madtrics, Science & Society blog post, Nov. 2020. [Online]. Available: https://www.leidenmadtrics.nl/articles/on-measuring-interdisciplinarity-from-indicators-to-indicating (visited on Jun. 16, 2025) (cited on p. 5).
[RG19] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks”, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Nov. 2019 (cited on p. 33).
[RPL10] I. Rafols, A. L. Porter, and L. Leydesdorff, “Science overlay maps: A new tool for research policy and library management”, Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1568–1582, 2010 (cited on p. 7).
[Sha48] C. E. Shannon, “A mathematical theory of communication”, Bell System Technical Journal, vol. 27, pp. 379–423, 1948 (cited on p. 7).
[SM00] J. Shi and J. Malik, “Normalized cuts and image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, IEEE, 2000, pp. 888–905 (cited on p. 9).
[Von07] U. Von Luxburg, “A tutorial on spectral clustering”, Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007 (cited on p. 9).
[War63] J. H. Ward Jr., “Hierarchical grouping to optimize an objective function”, Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963 (cited on p. 34). |
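Description: | Master's degree; National Chengchi University; Department of Computer Science, In-service Master's Program (資訊科學系碩士在職專班); 111971011 |
Source: | http://thesis.lib.nccu.edu.tw/record/#G0111971011 |
Type: | thesis |
Appears in collections: | [Department of Computer Science, In-service Master's Program (資訊科學系碩士在職專班)] Theses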
|
Files in this document:
File | Description | Size | Format | Views |
101101.pdf | | 6163Kb | Adobe PDF | 0 |
|
All items in the NCCU Institutional Repository are protected by their original copyright.
|