Włodzimierz Funika, Paweł Koperek


Organizations across the globe gather more and more data, encouraged by easy-to-use and cheap cloud storage services. Large datasets require new approaches to analysis and processing, including methods based on machine learning. In particular, symbolic regression can provide many useful insights. Unfortunately, because of its high resource requirements, using this method to analyze large-scale datasets may be infeasible. In this paper, we analyze a bottleneck in an open-source implementation of this method, which we call hubert. We identify the evaluation of individuals as the most costly operation. To address this problem, we propose a new evaluation service based on the Apache Spark framework, which attempts to speed up computations by executing them in a distributed manner on a cluster of machines. We analyze the performance of the service by comparing the time needed to evaluate a number of samples under both implementations. Finally, we draw conclusions and outline plans for further research.
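To illustrate why the evaluation of individuals lends itself to distribution, the following minimal sketch splits a dataset into partitions, computes each individual's squared error per partition, and then sums the partial results. It is not code from hubert or from the proposed service: the function names, the mean-squared-error fitness, and the single-variable individuals are assumptions made for illustration, and plain Python `map` and `reduce` stand in for the corresponding Apache Spark operations.

```python
from functools import reduce
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[float, float]                # one (x, y) observation
Individual = Callable[[float], float]    # candidate formula y = f(x)


def split(rows: Sequence[Row], n_parts: int) -> List[Sequence[Row]]:
    """Cut the dataset into roughly equal partitions (hypothetical helper)."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_errors(part: Sequence[Row],
                   population: Dict[str, Individual]) -> Dict[str, float]:
    """'Map' step: summed squared error of every individual on one partition."""
    return {name: sum((f(x) - y) ** 2 for x, y in part)
            for name, f in population.items()}


def merge(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """'Reduce' step: add the per-partition errors of each individual."""
    return {name: a[name] + b[name] for name in a}


def evaluate(rows: Sequence[Row],
             population: Dict[str, Individual],
             n_parts: int = 4) -> Dict[str, float]:
    """Mean squared error per individual (assumed fitness; lower is better)."""
    partials = map(lambda part: partial_errors(part, population),
                   split(rows, n_parts))
    totals = reduce(merge, partials)
    return {name: total / len(rows) for name, total in totals.items()}


# Synthetic data generated from y = 2x + 1, plus three candidate formulas.
rows = [(float(x), 2.0 * x + 1.0) for x in range(100)]
population = {
    "2x+1": lambda x: 2.0 * x + 1.0,
    "x*x": lambda x: x * x,
    "x": lambda x: x,
}

scores = evaluate(rows, population)
best = min(scores, key=scores.get)
print(best, scores[best])
```

Because each partition is processed independently and the partial errors are combined by simple addition, the map step could in principle run on separate machines, which is the property a distributed evaluation service relies on.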


distributed systems; evolutionary programming; symbolic regression; scaling; Apache Spark





