    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/149650


    Title: dataSDA: Data Sets for Symbolic Data Analysis in R
    Authors: Chen, Po-Wei (陳柏維)
    Contributors: Wu, Han-Ming (吳漢銘)
    Chen, Po-Wei (陳柏維)
    Keywords: interval-valued data
    histogram-valued data
    R package
    symbolic data analysis
    Date: 2023
    Issue Date: 2024-02-01 11:41:51 (UTC+8)
    Abstract: Traditional datasets describe each observation by a single value per variable. As data grow in volume and complexity, data collection has become larger and more varied. To consolidate and manage such data while preserving the essential information they contain, variables are increasingly recorded not as single values but as multivalued descriptions such as intervals, histograms, and probability distributions. Data described in this way are called "symbolic data." This representation captures the distribution, characteristics, and variability of the data more fully, which supports further analysis and interpretation. This study develops an R package named dataSDA. The package collects symbolic datasets across a range of research topics; reads, writes, and converts symbolic data between formats; and computes descriptive statistics for symbolic variables. It follows the data formats of two widely used symbolic data packages, RSDA and HistDAWass, and extends their functionality, for example by aggregating conventional data into symbolic data under different conditions. We use datasets from the dataSDA package to demonstrate and compare clustering, classification, and regression analyses in R. We believe that dataSDA, as a tool for collecting and processing symbolic data, can serve as an important source of symbolic data, help users enter the field of symbolic data analysis, and support the further development of analytical methods for symbolic data. The dataSDA package is available on the Comprehensive R Archive Network (CRAN).
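    The abstract mentions two operations: aggregating conventional data into symbolic data, and computing descriptive statistics of symbolic variables. The sketch below illustrates both ideas in Python for interval-valued data. It is not the dataSDA API (the package is written in R and its functions are not described on this page). It assumes min/max aggregation within each group and one common definition of the sample mean of an interval-valued variable, the average of the interval midpoints. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def aggregate_to_intervals(rows, group_key, variables):
    """Collapse classical rows into one (min, max) interval per group and variable."""
    grouped = defaultdict(lambda: {v: [] for v in variables})
    for row in rows:
        for v in variables:
            grouped[row[group_key]][v].append(row[v])
    return {
        group: {v: (min(vals), max(vals)) for v, vals in cols.items()}
        for group, cols in grouped.items()
    }


def interval_mean(intervals):
    """Sample mean of an interval-valued variable: the average of the midpoints."""
    return sum((lo + hi) / 2 for lo, hi in intervals) / len(intervals)


# Hypothetical toy data: single-valued temperature readings for two cities.
rows = [
    {"city": "A", "temp": 18.0},
    {"city": "A", "temp": 24.0},
    {"city": "B", "temp": 10.0},
    {"city": "B", "temp": 16.0},
    {"city": "B", "temp": 13.0},
]

# Each city becomes one symbolic observation with an interval-valued "temp".
symbolic = aggregate_to_intervals(rows, "city", ["temp"])
temp_intervals = [symbolic[city]["temp"] for city in symbolic]

print(symbolic)
print(interval_mean(temp_intervals))
```

    Each group of single-valued rows becomes one symbolic observation, so the interval keeps the within-group spread that a single summary value such as the group mean would discard.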
    Description: Master's thesis
    National Chengchi University
    Department of Statistics
    111354013
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0111354013
    Data Type: thesis
    Appears in Collections: [Department of Statistics] Theses

    Files in This Item:

    File: 401301.pdf
    Size: 651 Kb
    Format: Adobe PDF

