National Chengchi University Institutional Repository (NCCUR): Item 140.119/113845


Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/113845

Title:	Applying the MMB Algorithm to Clean Web Page Noise and Extract Web Page Metadata (應用MMB演算法清理網頁雜訊和擷取網頁Metadata)
Authors:	駱思安; 徐俊傑
Keywords:	Multimembership Bayesian Algorithm; Web Mining; Web Page Cleaning; Information Extraction; Metadata; TF/IDF; Entropy
Date:	2006
Upload time:	2017-10-19 09:37:03 (UTC+8)
Abstract:	Traditional methods for extracting important terms from Web pages rely mainly on TF/IDF and entropy. We find, however, that a term with a high TF value is not necessarily an important one, and that although entropy discriminates well, its computation is cumbersome. This study therefore proposes the MMB algorithm as a replacement for both methods. Experiments show that the MMB algorithm improves both the probability of identifying important terms and the accuracy of automatic Web page classification. A Web site contains many terms distributed across its pages. Some of these terms describe the category a page belongs to; the rest are noise unrelated to that category. Effectively removing the noisy terms therefore improves the performance of automatic classification of Chinese Web pages.
Relation:	Proceedings of TANET 2006 台灣網際網路研討會 (Taiwan Internet Conference), Internet Technology
Data type:	conference
Appears in collections:	[TANET 台灣網際網路研討會] Conference Papers
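
For readers unfamiliar with the two baselines named in the abstract, the following is a minimal Python sketch of TF/IDF weighting and per-term entropy over a small set of tokenized pages. It is not the authors' MMB algorithm, which this record does not describe. The function names `tf_idf` and `term_entropy`, the sample pages, and the specific formulas (relative term frequency times natural-log IDF, and Shannon entropy in bits of a term's distribution over pages) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. The weight is (count / doc length)
    multiplied by ln(number of docs / number of docs containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def term_entropy(docs):
    """Return {term: Shannon entropy in bits} of each term's spread over docs.

    A term concentrated in one document has entropy 0; a term spread
    evenly over all documents has entropy log2(number of docs).
    """
    per_doc = [Counter(doc) for doc in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)

    entropy = {}
    for term, total in totals.items():
        h = 0.0
        for counts in per_doc:
            if counts[term]:
                p = counts[term] / total
                h -= p * math.log2(p)
        entropy[term] = h
    return entropy


# Hypothetical tokenized pages from one site; "home" is navigation noise.
pages = [
    ["home", "news", "sports"],
    ["home", "news", "finance"],
    ["home", "contact"],
]
weights = tf_idf(pages)
entropies = term_entropy(pages)
```

In this toy input, "home" occurs on every page, so its IDF factor is zero and its entropy is the maximum for three pages, while "sports" occurs on one page only and has zero entropy. That is the kind of signal both baselines use to separate site-wide noise from category-bearing terms.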

Files in this item:

File	Description	Size	Format	Views
598.pdf		390Kb	Adobe PDF2	185	View/Open

All items in NCCUR are protected by their original copyright.


Copyright Announcement

1. The digital content of this website is part of National Chengchi University Institutional Repository. It provides free access to academic research and public education for non-commercial use. Please utilize it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. NCCU Institutional Repository has made every effort to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will promptly take remedial measures, including removing the work from the repository.
