|
Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/60196
|
Title: | 資訊檢索之學術智慧 Research Intelligence Involving Information Retrieval |
Authors: | 杜逸寧 Tu, Yi-Ning |
Contributors: | 諶家蘭 林我聰 Seng, Jia-Lang Lin, Woo-Tsong 杜逸寧 Tu, Yi-Ning |
Keywords: | Topic discovery and tracking; data mining; information retrieval; academic intelligence; Bayesian estimation; novelty index; published volume index; citation analysis |
Date: | 2009 |
Issue Date: | 2013-09-04 16:55:05 (UTC+8) |
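Abstract: | Detecting emerging topics is an important problem for researchers: with limited time and resources, exploring an emerging topic in a field brings greater contribution and impact than working on an already mature one. This study aims to help researchers detect emerging research topics with future potential and to extract, from academic papers, research intelligence that is useful in doing research. In searching for topics with research potential, we assume that such topics are published by the more influential authors and publications in a field. We therefore use Bayesian estimation to estimate the influence of the relevant researchers and publications on the field, and use this information to identify candidate emerging topics. Moreover, to our knowledge the topic detection literature still lacks effective, widely used tools for judging whether a topic has matured or is novel and still has research potential. This study therefore attempts to develop effective measurement tools for assessing whether a topic, at its current stage in its life cycle, still merits further academic investment.
The study selects papers related to data mining and information retrieval from several major databases and verifies that the topics covered by conference papers lead similar topics in journal papers in the following years. It also combines several existing algorithms into a detection procedure that helps researchers detect leading trends in academic papers and uncover research intelligence. Using Bayesian estimation, it constructs the prior probability and likelihood function for the influence of authors and publications from publication and citation information, and computes the influence of the important authors and publications in a field. Papers published by these authors and publications are comparatively worth watching, and the candidate emerging topics they raise are then tested to see whether they become emerging topics. Although the important topics identified in this way narrow the search range, they may still be mature topics that every influential author and publication has to discuss, so indices or tools for assessing a topic's future potential are needed. Existing methods for assessing topic maturity, however, focus only on how often a topic appears and ignore novelty, which is also an important indicator; others look only for new topics without considering whether the topic has future potential. More importantly, a frequency curve used alone can confirm that a topic is important only after the topic has matured, which makes it a lagging indicator.
This study proposes indices that address these difficulties and develops them into a method for measuring the potential of emerging topics. The indices are the novelty index, the published volume index, and the detection point. Together with their curves, they provide more leading information for emerging topic detection and help researchers construct detection criteria for their own fields. The detection point is not the exact date on which a topic begins to emerge; it is the point in the topic's own life cycle at which it has the greatest research potential and value. The detection point therefore shifts later in time as the topic continues to flourish, which means the indices can detect the continuation of a topic's vitality. Whereas the traditional frequency distribution curve shows a topic's rise and decline, the published volume index takes a life-cycle view and shows the topic's development potential at each point in time. We hope that the research intelligence discovered through this process helps researchers construct topic detection criteria for their fields and saves considerable labor and time in exploring emerging topics. The proposed method not only addresses the shortcomings of the impact factor, but also uses the influence of authors and publications to assess the potential of a paper that has not yet accumulated any citations. This addresses a shortcoming of Google Scholar, which can establish a paper's importance only after the paper has been cited many times, whereas scholars always hope to discover important topics or papers ahead of others. We believe that our topic-oriented retrieval approach can more reliably meet researchers' needs when searching for topics or papers.
This research seeks to identify emerging topics for researchers and to extract research intelligence from academic papers. It aims to reveal the connection between the topics investigated in conference papers and those in journal papers, which can spare researchers much of the time and effort needed to survey all the academic papers. To detect emerging research topics, the study uses Bayesian estimation to estimate the impact that authors and publications have on a topic, and discovers candidate emerging topics by combining the high-impact authors and publications. 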
Finally, the research develops measurement tools that assess the research potential of these candidate topics in order to find the emerging ones. It selected a large number of papers on data mining and information retrieval from well-known databases and showed that the topics covered by conference papers in one year often lead to similar topics in journal papers in the subsequent year, and vice versa. The study also combines several existing algorithms into a new detection procedure that lets researchers detect new trends and obtain academic intelligence from conferences and journals. It uses Bayesian estimation and citation analysis to construct the prior distribution and likelihood function for the authors and publications in a topic. Because topics published by these authors and publications receive more attention and are more valuable than others, researchers can assess the potential of these candidate emerging topics. Although the recommended topics narrow the search space, they may be so popular that all of the high-impact authors and publications discuss them, so measurement tools or indices are needed. Current methods, however, either focus only on the frequency of subjects and ignore their novelty, which is critical and lies beyond frequency analysis, or focus on one of the two without considering the potential of the topics. Methods that rely only on the curve of publication frequency yield a lagging index. This research addresses these inadequacies by proposing a set of new indices for emerging topic detection: the novelty index (NI) and the published volume index (PVI). These indices are then used to determine the detection point (DP) of emerging topics. 
The detection point (DP) is not the actual time at which a topic starts to emerge; it represents the point in the topic's life cycle at which the topic has the highest research potential, in both novelty and hotness. Unlike the absolute frequency method, which can find the exact emerging period of a topic, the PVI uses the cumulative relative frequency and tries to detect the time of greatest research potential in the topic's life cycle. Following the detection points, the intersection decides the worthiness of a new topic. Readers who follow the algorithms presented in this thesis will be able to judge the novelty and life span of an emerging topic in their field. The proposed methods can overcome limitations of the impact factor proposed by ISI. In addition, because they use the impact of the authors and publications in a topic to gauge a paper's impact before the paper has actually become influential, they can overcome limitations of Google Scholar's approach. We suggest that the topic-oriented thinking behind our methods can help researchers solve their problems in searching for valuable topics. |
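The abstract states that the PVI is built on the cumulative relative frequency of publications rather than on absolute counts, but it does not give the PVI formula itself. The following is only a minimal sketch of that underlying quantity, not the thesis's index: the function name `cumulative_relative_frequency` and the yearly counts are hypothetical and chosen purely for illustration.

```python
from itertools import accumulate


def cumulative_relative_frequency(counts_by_year):
    """Map each year to the share of all papers published up to and including it.

    Assumption: `counts_by_year` is a dict {year: number of papers on one topic}.
    This is a generic cumulative relative frequency, not the thesis's PVI formula.
    """
    years = sorted(counts_by_year)
    total = sum(counts_by_year.values())
    if total == 0:
        raise ValueError("no publications recorded for this topic")
    # Running totals in chronological order, each divided by the grand total.
    running = accumulate(counts_by_year[year] for year in years)
    return {year: count / total for year, count in zip(years, running)}


# Hypothetical yearly paper counts for a single topic (illustrative data only).
example_counts = {2003: 2, 2004: 3, 2005: 5, 2006: 10}
crf = cumulative_relative_frequency(example_counts)

for year, share in crf.items():
    print(year, round(share, 2))
```

Because each value is a share of the topic's own total output, curves for topics of very different sizes can be compared on the same 0-to-1 scale, which is in line with the abstract's life-cycle framing.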
Description: | Doctoral; National Chengchi University; Graduate Institute of Management Information Systems; 94356509; 98 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0094356509 |
Data Type: | thesis |
Appears in Collections: | [Department of Management Information Systems] Theses
|
Files in This Item:
File | Description | Size | Format | |
650901.pdf | | 1918Kb | Adobe PDF2 | 482 |
650902.pdf | | 1918Kb | Adobe PDF2 | 408 |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|