Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/150166
Title: | An Exploration of Integrating Clustering and BERT Models for Document Recommendation (綜合分群技術與 BERT 模型於文件推薦的探索)
Authors: | Chen, Yun (陳筠)
Contributors: | Liu, Chao-Lin (劉昭麟); Chen, Yun (陳筠)
Keywords: | Deep learning; BERT; document embeddings; semi-supervised clustering
Date: | 2024 |
Issue Date: | 2024-03-01 13:41:20 (UTC+8) |
Abstract: | Selecting documents of interest from a large collection usually requires considerable human effort, either for browsing or for labeling data before classification. Clustering, which groups similar documents together, is a faster and more economical alternative. To find similar documents more efficiently for document recommendation, this study vectorizes documents with fine-tuned BERT models, clusters the embeddings with K-means, and experiments with "seed clustering", i.e., K-means initialized with appointed centroids, so that effective clusters can be obtained without labeled data and with only a few cues. The experiments show that clustering with fine-tuned BERT embeddings substantially outperforms clustering with embeddings from BERT without fine-tuning and with TF-IDF vectors. However, K-means over the BERT embeddings turns out to be extremely stable: repeated runs yield nearly identical partitions, which also limits the influence of the seeds, so the seed-clustering methods in this study improve the results only marginally. Future work may therefore build on fine-tuned BERT embeddings while exploring other clustering and seed-clustering schemes.
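The record itself contains no code; the sketch below illustrates, under stated assumptions, the pipeline the abstract describes: documents embedded with a BERT model, clustered with K-means, and re-clustered with "seed clustering" (K-means started from appointed centroids). The model name bert-base-chinese, the placeholder corpus, the mean-pooling step, and the seed document indices are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-chinese"  # assumption: stand-in for the thesis's fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts, max_length=512):
    """Return one vector per document via masked mean pooling of the last hidden states."""
    vectors = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, truncation=True, max_length=max_length,
                            return_tensors="pt")
            hidden = model(**enc).last_hidden_state             # (1, seq_len, hidden_dim)
            mask = enc["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            vectors.append(pooled.squeeze(0).numpy())
    return np.vstack(vectors)

# Placeholder corpus; the thesis used Wikipedia category articles and legal texts.
documents = ["文件一的內容……", "文件二的內容……", "文件三的內容……", "文件四的內容……"]
X = embed(documents)

# Ordinary K-means with k-means++ initialisation.
plain = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# "Seed clustering": initial centroids are the mean embeddings of a few documents
# whose desired categories are already known (hypothetical seed indices below).
seed_groups = {"cluster_a": [0], "cluster_b": [1]}
seed_centroids = np.vstack([X[idxs].mean(axis=0) for idxs in seed_groups.values()])
seeded = KMeans(n_clusters=len(seed_groups), init=seed_centroids, n_init=1).fit(X)

print("k-means++ labels:", plain.labels_)
print("seeded labels:   ", seeded.labels_)
```

Supplying an ndarray as init and setting n_init=1 makes scikit-learn's KMeans refine exactly those starting centroids, which is one straightforward way to realize the "appointed initial centroids" idea mentioned in the abstract.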
Description: | Master's thesis, Department of Computer Science, National Chengchi University, 109753140
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0109753140 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
Files in This Item:
File | Description | Size | Format
314001.pdf | | 8776Kb | Adobe PDF