Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/99555
|
Title: | 運用財經文本情感分析於台灣電子類股價指數趨勢預測之研究 Research of applying Sentimental Analysis on financial documents to predict Taiwan Electronic Sub-Index trend |
Authors: | 劉羿廷 |
Contributors: | 姜國輝 季延平 劉羿廷 |
Keywords: | Sentiment analysis; big data; LDA topic model; support vector machine (SVM); Taiwan Electronic Sub-Index trend |
Date: | 2015 |
Issue Date: | 2016-08-02 17:02:43 (UTC+8) |
Abstract: | The electronics industry is Taiwan's most competitive industry, and electronics stocks account for as much as 69.49% of trading on the centralized market, so fluctuations in electronics stocks can substantially affect the entire Taiwan stock market. Many studies indicate that text on the Internet spreads rapidly through social networks, shapes public sentiment, and in turn affects stock price movements. For investors, quickly analyzing large volumes of online financial text to infer investor sentiment and predict price trends can therefore improve returns. However, nearly a hundred financial documents are published every day, and traditional manual sampling analysis is too inefficient and labor-intensive to handle this volume of data.
Previous sentiment analysis research has shown that supervised learning can achieve good classification results through simple quantification. Its training data, however, must carry predefined known classes, so it cannot anticipate unknown classes and cannot identify unknown topics that may exist in the text. This study therefore proposes a sentiment analysis method for financial text that combines supervised and unsupervised learning. Unsupervised learning is first used to identify topics, compute a sentiment index, and label sentiment orientation for electronics-industry financial documents covering all of 2014. Visualization tools are then used to plot trend lines and find the topics whose sentiment index behaves as a leading indicator. Finally, supervised learning combines these sentiment indices with 24 indirect sentiment indicators (international indicators, macroeconomic indicators, Taiwan stock market indicators, and technical indicators) to build a classification model that predicts the trend of the Taiwan Electronic Sub-Index (TE).
For topic labeling, the experiments show that the number of documents far exceeds the number of topic terms, which makes the TF-IDF matrix too sparse and the TFIDF-Kmeans topic model perform poorly. Because documents cover multiple topics, the topic terms clustered by NPMI-Concor are too complex to summarize. The LDA topic model, in which all topics are shared by all documents, outperforms both TFIDF-Kmeans and NPMI-Concor in term clustering and topic classification, reaching a classification accuracy of 98%, so LDA is adopted for topic labeling. For sentiment orientation labeling, the expanded sentiment lexicon built in this study judges word polarity more accurately than NTUSD, and the trend line of the resulting sentiment index follows the Taiwan Electronic Sub-Index more closely than the MACD trend line commonly used by investors. Not every document's sentiment index is a leading indicator: only the sentiment indices of documents on enterprise operations and macroeconomics anticipate the trend of the index, so these two topics are used to build the classification model.
Comparing accuracy, the classification model that combines the sentiment index with technical indicators is 7% more accurate than the model using technical indicators alone. Further adding the indirect sentiment indicators raises accuracy to 71%. These results confirm that sentiment analysis effectively improves the accuracy of trend prediction for the Taiwan Electronic Sub-Index, which can in turn improve investors' returns. |
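The abstract describes two quantitative steps without giving formulas: computing a sentiment index from a polarity lexicon, and checking whether that index leads the stock index. The following is a minimal Python sketch of both ideas. It is not the thesis's implementation: the toy word lists, the (positive - negative) / (positive + negative) formula, and the function names `sentiment_index` and `lagged_correlation` are all assumptions chosen for illustration.

```python
from math import sqrt

# Hypothetical toy lexicon (assumption). The thesis uses an expanded
# sentiment lexicon based on NTUSD, which is not reproduced here.
POSITIVE = {"growth", "profit", "record"}
NEGATIVE = {"loss", "decline", "lawsuit"}


def sentiment_index(tokens):
    """Score a tokenized document in [-1, 1].

    Assumed formula: (pos - neg) / (pos + neg); returns 0.0 when the
    document contains no lexicon words.
    """
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlation(sentiment, index_moves, lag):
    """Correlate sentiment on day t with the index move on day t + lag.

    Both series must have the same length; lag must leave at least two
    overlapping points.
    """
    n = len(sentiment)
    if len(index_moves) != n:
        raise ValueError("series must have the same length")
    if lag < 0 or n - lag < 2:
        raise ValueError("lag leaves too few overlapping points")
    return pearson(sentiment[: n - lag], index_moves[lag:])


if __name__ == "__main__":
    print(sentiment_index(["profit", "growth", "loss"]))
    daily_sentiment = [1, -1, 1, -1, 1]
    daily_index_move = [0, 1, -1, 1, -1]  # made-up data, one day behind
    print(lagged_correlation(daily_sentiment, daily_index_move, lag=1))
```

Under these assumptions, a topic whose daily sentiment correlates more strongly with the index at a positive lag than at lag 0 would be a candidate leading indicator, which is the property the abstract reports for the enterprise operations and macroeconomics topics.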
Description: | Master's thesis, National Chengchi University, Department of Management Information Systems, 102356034 |
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G1023560341 |
Data Type: | thesis |
Appears in Collections: | [Department of Management Information Systems] Theses
|
Files in This Item:
File | Size | Format |
034101.pdf | 1918Kb | Adobe PDF2 | 184 | View/Open |
|
All items in 政大典藏 are protected by copyright, with all rights reserved.
|