National Chengchi University Institutional Repository (NCCUR): Item 140.119/113845

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/113845

Title:	Applying the MMB Algorithm to Clean Web Page Noise and Extract Web Page Metadata (應用MMB演算法清理網頁雜訊和擷取網頁Metadata)
Authors:	駱思安; 徐俊傑
Keywords:	Multimembership Bayesian Algorithm; Web Mining; Web Page Cleaning; Information Extraction; Metadata; TF/IDF; Entropy
Date:	2006
Issue Date:	2017-10-19 09:37:03 (UTC+8)
Abstract:	The mainstream methods for extracting important terms from Web pages are TF/IDF and Entropy. However, we find that a term with a higher TF value is not necessarily a more important term. Entropy discriminates well between terms, but its calculation is overly cumbersome. This study therefore proposes the MMB (Multimembership Bayesian) algorithm as a replacement for these two methods. Our experiments show that the MMB algorithm improves both the probability of identifying important terms and the accuracy of automatic Web page classification. A Web site contains a large number of terms distributed across its pages. Some of these terms describe the category to which a page belongs and can be used to classify it; the rest are noise with no relationship to that category. Effectively removing these noisy terms from a Web site therefore improves the performance of automatic classification of Chinese Web pages.
Relation:	Proceedings of TANET 2006 (台灣網際網路研討會, Taiwan Internet Conference), Internet Technology
Data Type:	conference
Appears in Collections:	[TANet Conference] Conference Papers
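
For readers unfamiliar with the two baseline weightings named in the abstract, the following is a minimal Python sketch of textbook TF/IDF and of per-term Shannon entropy over categories. It is not taken from the paper: the function names `tf_idf` and `term_entropy`, the toy documents, and the category labels are all illustrative assumptions, and the paper's exact formulas may differ. The MMB algorithm itself is not described in this record, so it is not sketched here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF * log(N/DF))."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def term_entropy(docs, labels):
    """Return the entropy (in bits) of each term's occurrences over categories."""
    per_term = {}
    for doc, label in zip(docs, labels):
        for term in doc:
            per_term.setdefault(term, Counter())[label] += 1
    entropy = {}
    for term, by_label in per_term.items():
        total = sum(by_label.values())
        entropy[term] = -sum(
            (c / total) * math.log2(c / total) for c in by_label.values()
        )
    return entropy


# Hypothetical toy pages: "home" and "login" stand in for site-wide noise terms.
docs = [
    ["stock", "market", "home", "login"],
    ["stock", "fund", "home", "login"],
    ["goal", "match", "home", "login"],
    ["team", "match", "home", "login"],
]
labels = ["finance", "finance", "sports", "sports"]

weights = tf_idf(docs)
entropy = term_entropy(docs, labels)
```

In this toy data, a term that appears on every page, such as "home", receives a TF/IDF weight of zero and the maximum entropy over the two categories, while a term confined to one category, such as "stock", has zero entropy. Low entropy is what marks a term as category-specific in this formulation.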

Files in This Item:

File	Description	Size	Format
598.pdf		390Kb	Adobe PDF

All items in NCCUR are protected by copyright, with all rights reserved.

Copyright Announcement

1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for non-commercial purposes such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact the maintenance staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.