Reference: | [1] T. H. Davenport and J. Dyché, "Big data in big companies," International Institute for Analytics, 2013. [2] R. Kabacoff, R in action: data analysis and graphics with R: Manning Publications Co., 2015. [3] F. Pedregosa, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011. [4] L. Buitinck, et al., "API design for machine learning software: experiences from the scikit-learn project," arXiv preprint arXiv:1309.0238, 2013. [5] D. Agrawal, et al., "Big data and cloud computing: current state and future opportunities," in Proceedings of the 14th International Conference on Extending Database Technology, 2011, pp. 530-533. [6] K.-w. Fu, et al., "Assessing censorship on microblogs in China: Discriminatory keyword analysis and the real-name registration policy," Internet Computing, IEEE, vol. 17, pp. 42-50, 2013. [7] A. R. Jagdale, et al., "Data Mining and Data Pre-processing for Big Data." [8] D. Borthakur, "HDFS architecture guide," HADOOP APACHE PROJECT http://hadoop. apache. org/common/docs/current/hdfs design. pdf, 2008. [9] K. Shvachko, et al., "The hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1-10. [10] H. Karau, et al., Learning Spark: Lightning-Fast Big Data Analysis: " O`Reilly Media, Inc.", 2015. [11] M. Armbrust, et al., "Spark sql: Relational data processing in spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383-1394. [12] R. S. Xin, et al., "Graphx: A resilient distributed graph system on spark," in First International Workshop on Graph Data Management Experiences and Systems, 2013, p. 2. [13] M. Zaharia, et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, 2012, pp. 2-2. [14] N. Rana and S. Deshmukh, "Shuffle Performance in Apache Spark," in International Journal of Engineering Research and Technology, 2015. [15] S. Kotsiantis, et al., "Data preprocessing for supervised leaning," International Journal of Computer Science, vol. 1, pp. 111-117, 2006. [16] S. Landset, et al., "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, pp. 1-36, 2015. [17] S. Mathew, "Overview of amazon web services," Amazon Whitepapers, 2014. [18] P. Pääkkönen and D. Pakkala, "Reference architecture and classification of technologies, products and services for big data systems," Big Data Research, vol. 2, pp. 166-186, 2015. [19] P. Gupta, et al., "Wtf: The who to follow service at twitter," in Proceedings of the 22nd international conference on World Wide Web, 2013, pp. 505-514. [20] A. Thusoo, et al., "Data warehousing and analytics infrastructure at facebook," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, pp. 1013-1020. [21] G. Mishne, et al., "Fast data in the era of big data: Twitter`s real-time related query suggestion architecture," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 1147-1158. [22] M. Busch, et al., "Earlybird: Real-time search at twitter," in 2012 IEEE 28th International Conference on Data Engineering, 2012, pp. 1360-1369. [23] M. Zaharia, et al., "Spark: Cluster Computing with Working Sets," HotCloud, vol. 10, pp. 10-10, 2010. [24] C. Engle, et al., "Shark: fast data analysis using coarse-grained distributed memory," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 689-692. [25] R. Sumbaly, et al., "The big data ecosystem at linkedin," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 1125-1134. [26] J. Lin and D. Ryaboy, "Scaling big data mining infrastructure: the twitter experience," ACM SIGKDD Explorations Newsletter, vol. 14, pp. 6-19, 2013. [27] X. Meng, et al., "Mllib: Machine learning in apache spark," arXiv preprint arXiv:1505.06807, 2015. [28] L. C. Freeman, "Centrality in social networks conceptual clarification," Social networks, vol. 1, pp. 215-239, 1978. [29] S. Ryza, "Advanced analytics with Spark. ed," by Ann Spencer. O’Reilly, 2014. [30] L. Breiman, "Bagging predictors," Machine learning, vol. 24, pp. 123-140, 1996. [31] L. Breiman, "Random forests," Machine learning, vol. 45, pp. 5-32, 2001. [32] R. Genuer, et al., "Random Forests for Big Data," arXiv preprint arXiv:1511.08327, 2015. [33] Y. Liu, "Random forest algorithm in big data environment," CMNT, vol. 18, pp. 147-51, 2014. [34] K. Singh, et al., "Big data analytics framework for peer-to-peer botnet detection using random forests," Information Sciences, vol. 278, pp. 488-497, 2014. [35] T. Fawcett, "An introduction to ROC analysis," Pattern recognition letters, vol. 27, pp. 861-874, 2006. [36] S. Venkataraman, et al., "SparkR: Scaling R Programs with Spark." [37] M. Armbrust, et al., "Scaling spark in the real world: performance and usability," Proceedings of the VLDB Endowment, vol. 8, pp. 1840-1843, 2015. |