National Chengchi University Institutional Repository (NCCUR): Item 140.119/123696
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/123696


    Title: Realistic data synthesis using enhanced generative adversarial networks
    Authors: Baowaly, Mrinal Kanti (包諾克)
    Contributors: Chen, Sheng-Wei (陳昇瑋)
    Liu, Chao-Lin (劉昭麟)
    Baowaly, Mrinal Kanti (包諾克)
    Keywords: Electronic health records
    Synthetic data generation
    Data synthesis
    Generative adversarial networks
    Wasserstein GANs with Gradient Penalty
    Boundary-seeking GANs
    Date: 2019
    Issue Date: 2019-06-03 13:08:37 (UTC+8)
    Abstract: Real data are often unavailable, or too costly to obtain in both time and money, because of privacy and confidentiality concerns. In such situations, synthetic data are a viable alternative. The primary objective of this study is to generate realistic synthetic electronic health records (EHRs) that people can use freely to advance research in healthcare and related fields. We propose two synthetic data generation models, medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compare their performance with that of an existing method, medical GAN (medGAN). The proposed models build on two enhanced variants of generative adversarial networks (GANs): Wasserstein GAN with gradient penalty (WGAN-GP) and boundary-seeking GAN (BGAN). We perform data synthesis on three aggregated EHR datasets with discrete features (e.g., binary and count) in the medical domain: MIMIC-III, extended MIMIC-III, and the National Health Insurance Research Database (NHIRD), Taiwan. We first train the models and use the trained models to generate synthetic EHR data. We then analyze and compare the models' performance using statistical methods (dimension-wise average and the Kolmogorov–Smirnov test) and two machine learning tasks (association rule mining and prediction). The analysis shows that the proposed models generate more realistic synthetic EHR data than medGAN.
    Our models are not limited to the medical domain. To demonstrate their generality, we also study an aggregated crime dataset from the Los Angeles Police Department, and the results confirm that the models work across a wide range of applications. We show that the proposed models can produce high-quality synthetic data with discrete features that are statistically sound and good enough for machine learning tasks. We believe the proposed models will be useful in industry and academic research as a better means of generating realistic synthetic data. This study will help remove barriers such as limited access to confidential data and thus accelerate the development of medical informatics, healthcare, and related fields.
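    The two statistical checks named in the abstract, dimension-wise average and the Kolmogorov–Smirnov test, can be illustrated with a minimal sketch. This is not the thesis's code: the function names, the toy records, and the choice to compute only the two-sample KS statistic (no p-value) are assumptions made for illustration, and the record above does not describe the thesis's exact evaluation procedure.

```python
from bisect import bisect_right


def dimension_wise_average(rows):
    """Mean of each feature (column) over a list of equal-length records."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical toy data: four records with three binary features each.
real = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
synthetic = [[1, 0, 1], [0, 0, 0], [0, 1, 1], [1, 1, 1]]

real_means = dimension_wise_average(real)
synthetic_means = dimension_wise_average(synthetic)

# Compare each feature's real and synthetic distributions separately.
ks_per_feature = [
    ks_statistic(r, s) for r, s in zip(zip(*real), zip(*synthetic))
]

print(real_means, synthetic_means, ks_per_feature)
```

    Under these assumptions, per-feature means that lie close together and KS statistics near zero would indicate that the synthetic data track the real data feature by feature; neither check captures dependencies between features, which is presumably why the abstract also lists association rule mining and prediction.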
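    Description: Doctoral thesis
    National Chengchi University
    International Graduate Doctoral Program in Social Networks and Human-Centered Computing (TIGP)
    104761507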
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0104761507
    Data Type: thesis
    DOI: 10.6814/DIS.NCCU.TIGP.002.2019.B02
    Appears in Collections: [Taiwan International Graduate Program] Theses

    Files in This Item:

    File: 150701.pdf
    Size: 3767Kb
    Format: Adobe PDF


    All items in 政大典藏 are protected by copyright, with all rights reserved.

