Loading...
|
Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/56330
|
Title: | 在高度分散式環境下進行Top-k相似文件檢索 Similar Top-k documents retrieval in highly distributed environments |
Authors: | 王俊閎 Wang, Chun Hung |
Contributors: | 陳良弼 Chen, Arbee L.P. 王俊閎 Wang, Chun Hung |
Keywords: | 分散式環境 Tok-k 相似文件檢索 端對端網路 distributed environments similar top-k documents retrieval peer-to-peer network |
Date: | 2012 |
Issue Date: | 2012-12-03 11:27:23 (UTC+8) |
Abstract: | 在文件資料庫的查詢處理上,Top-k相似文件查詢主要是協助使用者可以從龐大的文件集合中,檢索出和查詢文件具有高度相關性的文件集合。將資料庫內的文件依據和查詢文件之相似度程度,選擇出相似度最高的前k篇文件回傳給使用者。然而過去集中式資料庫,因其覆蓋性和可擴充性的不足,使得這種排名傾向的文件查詢處理,需耗費大量時間及運算成本。近年來,使用端對端(Peer-to-peer, P2P)架構解決相關的文件檢索問題已成為一種趨勢,但在高度分散式環境下,支援排名傾向的相似文件查詢是困難的,因為缺乏全域資訊和適當的系統協調者。 在本研究中,我們先針對各節點資料庫作分群前處理,並提出一個利用區域切割的作法[1],將P2P環境劃分成數個子區塊後,建立特徵索引表。因此在查詢處理時,可透過索引表加快挑選出Top-k相似群集的速度,並且確保有適當數量的回傳結果。最後在實驗中,我們提出的方法會與傳統集中式搜尋引擎以及SON-based [1] 做比較,在高度分散式環境下,我們的方法在執行Top-k相似文件查詢時,會比上述兩種作法有較為優異的表現。 On query processing in a large database, similar top-k documents query is an important mechanism to retrieve the highly correlated document collection with query for users. It ranks documents with a similarity ranking function and reports the k documents with highest similarity. However, the former approach in web searching, i.e., centralized search engines, rises some issues such as lack of coverage and scalability, impact provides rank-based query become a costly operation. Recently, using Peer-to-peer (P2P) architectures to tackle above issues has emerged as a trend of solution, but due to the shortage of global knowledge and some appropriate central coordinators, support rank-based query in highly distributed environment has been difficulty. In this paper, we proposed a framework to solve these problems. First, we performed the local cluster pre-processing on each peer, followed by the zone creation process, forming sub-zones over P2P network, and then constructing the feature index table to improve the performance of selecting similar top-k cluster results. The experiments show that our approach performs similar top-k documents query outperforms than SON-based approach in highly distributed environment. |
Reference: | [1] Christos Doulkeridis, Kjetil Nørvåg, Michalis Vazirgiannis. 2008. Peer-to-peer similarity search over widely distributed document collections. LSDS-IR 35-42. [2] Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H. 2001. Chord : A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM 149-160. [3] Ratnasamy, S., Francis, P., Handley, M., Karp, R., Schenker, S. 2001. A scalable contentaddressable network. In Proceedings of the ACM SIGCOMM 161-172. [4] Rowstron, A., Druschel, P. 2001. Pastry : Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the Middleware [5] Chunqiang Tang, Zhichen Xu, Sandhya Dwarkadas. 2003. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In Proceedings of the ACM SIGCOMM 175-186. [6] BitTorrent. <http://bittorrent.com/>. [7] eMula. <http://www.emula-project.net/>. [8] Beverly Yang, Hector Garcia-Molina. 2003. Designing a Super-Peer Network. ICDE 49-60. [9] The Gnutella protocol specification v0.6. <http://rfcgnutella.sourceforge.net>. [10] KaZaA. <http://www.kazaa.com>. [11] Salton, G., Wong, A., Yang, C.S. 1975. A vector space model for automatic indexing. Communications of the ACM Volume 18 Issue 11 613-620. [12] Bernard J. Jansen, Soumen Chakrabarti. 2006. Mining the Web : Discovering Knowledge from Hypertext Data. Morgan-Kaufmann Publishers, 352 pp., ISBN: 1-55860-754-4. Inf. Process. Manage. (IPM) 42(1) 317-318. [13] Christos. Doulkeridis, Kjetil Nørvåg, and Michalis Vazirgiannis. 2007. DESENT: Decentralized and distributed semantic overlay generation in P2P networks. IEEE Journal on Selected Areas in Communications (J-SAC) 25(1) 25–34. [14] Hersh, W.R., Buckley, C., J.Leone, T., Hickam, D.H. 1994. Ohsumed: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the ACM SIGIR. 192–201 [15] GT-ITM : Georgia Tech Internetwork Topology Models. <http://www.cc.gatech.edu/projects/gtitm/>. [16] Wolf-Tilo Balke, Wolfgang Nejdl, Wolf Siberski, Uwe Thaden. 2005. Progressive Distributed Top k Retrieval in Peer-to-Peer Networks. ICDE 174-185. [17] Wolf-Tilo Balke. 2005. Supporting Information Retrieval in Peer-to-Peer Systems. Peer-to-Peer Systems and Applications 337-352. [18] C. Gkantsidis, M. Mihail, and A. Saberi. 2005. Hybrid search schemes for unstructured peer-to-peer networks. In Proceedings of INFOCOM. [19] Inderjit S. Dhillon, Dharmendra S. Modha. 2001. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning 42(1/2): 143-175. [20] Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Michalis Vazirgiannis. 2008. On efficient top-k query processing in highly distributed environments. SIGMOD 753-764. [21] Shiwei Zhu, Junjie Wu, Hui Xiong, Guoping Xia. 2011. Scaling up top-K cosine similarity search. Data Knowl. Eng. (DKE) 70(1) 60-83. [22] Aoying Zhou, Rong Zhang, Weining Qian, Quang Hieu Vu, Tianming Hu. 2008. Adaptive indexing for content-based search in P2P systems. Data Knowl. Eng. (DKE) 67(3) 381-398. |
Description: | 碩士 國立政治大學 資訊科學學系 99753034 101 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0099753034 |
Data Type: | thesis |
Appears in Collections: | [資訊科學系] 學位論文
|
Files in This Item:
File |
Size | Format | |
303401.pdf | 1496Kb | Adobe PDF2 | 804 | View/Open |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|