DEVELOPING ARTIFICIAL INTELLIGENCE IN THE CLOUD: THE AI INFN PLATFORM
DOI: https://doi.org/10.7494/csci.2025.26.SI.7071

Abstract
The INFN CSN5-funded project AI INFN (“Artificial Intelligence at INFN”) aims to promote the adoption of machine learning (ML) and artificial intelligence (AI) within INFN by providing comprehensive support, including state-of-the-art hardware and cloud-native solutions within INFN Cloud. This enables efficient sharing of hardware accelerators without hindering the institute’s diverse research activities. AI INFN is evolving from a Virtual-Machine-based model to a flexible Kubernetes-based platform, offering features such as JWT-based authentication, a multi-tenant JupyterHub interface, a distributed file system, customizable conda environments, and dedicated monitoring and accounting systems. It also supports virtual nodes in the cluster, offloading computing payloads to remote resources through the Virtual Kubelet technology, with InterLink as the provider. This setup can manage workflows across multiple providers and hardware types, which is crucial for scientific use cases that require dedicated infrastructures for different parts of the workload. Results of initial tests validating the platform’s applicability in production, together with emerging case studies and integration scenarios, are presented.
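As an illustration of the offloading path described above, the following minimal sketch submits a GPU payload to a Kubernetes cluster so that it may be scheduled on a Virtual Kubelet node exposed by InterLink. It uses the official Kubernetes Python client; the namespace, node name, taint key, and container image are illustrative assumptions, not values prescribed by the AI INFN platform.

# Minimal sketch (illustrative only): submit a GPU payload that may be scheduled
# on an InterLink-backed Virtual Kubelet node. The namespace, node label, taint
# key, and image below are assumptions, not platform defaults.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="offloaded-training", namespace="ai-infn"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Pin the payload to the virtual node exposed by the InterLink provider.
        node_selector={"kubernetes.io/hostname": "interlink-virtual-node"},
        # Tolerate the taint that keeps ordinary workloads off the virtual node.
        tolerations=[
            client.V1Toleration(
                key="virtual-node.interlink/no-schedule",
                operator="Exists",
                effect="NoSchedule",
            )
        ],
        containers=[
            client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:2.15.0-gpu",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ai-infn", body=pod)

From the user's perspective the submission is identical to any other pod; the Virtual Kubelet forwards it to the remote resource and reports its status back to the cluster.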
References
[1] AMD Vitis™ AI Software, 2024. https://www.amd.com/en/products/software/vitis-ai.html. Accessed: 15/09/2024.
[2] Apache Airflow, 2024. https://airflow.apache.org/. Accessed: 12/12/2024.
[3] Borg, 2024. https://borgbackup.readthedocs.io/en/stable/#. Accessed: 15/09/2024.
[4] Conda, 2024. https://conda.io. Accessed: 15/09/2024.
[5] Docker Swarm Mode, 2024. https://docs.docker.com/engine/swarm/. Accessed:
12/12/2024.
[6] ELOG, 2024. https://elog.psi.ch/elog/. Accessed: 15/09/2024.
[7] Intel® Distribution of OpenVINO™ Toolkit, 2024. https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html. Accessed: 15/09/2024.
[8] InterLink, 2024. https://intertwin-eu.github.io/interLink/. Accessed: 15/09/2024.
[9] InterTwin, 2024. https://www.intertwin.eu/. Accessed: 15/09/2024.
[10] JuiceFS, 2024. https://juicefs.com/en/. Accessed: 15/09/2024.
[11] Jupyter Server Proxy, 2024. https://jupyter-server-proxy.readthedocs.io/en/latest/. Accessed: 15/09/2024.
[12] Kueue, 2024. https://kueue.sigs.k8s.io/. Accessed: 15/09/2024.
[13] Leonardo, 2024. https://leonardo-supercomputer.cineca.eu/. Accessed: 15/09/2024.
[14] MinIO, 2024. https://min.io/. Accessed: 15/09/2024.
[15] Multi-instance GPU, 2024. https://www.nvidia.com/it-it/technologies/multi-instance-gpu/. Accessed: 15/09/2024.
[16] Nomad by HashiCorp, 2024. https://www.nomadproject.io/. Accessed: 12/12/2024.
[17] NVIDIA DCGM Exporter, 2024. https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html. Accessed: 15/09/2024.
[18] NVIDIA GPU Operator, 2024. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html. Accessed: 15/09/2024.
[19] PostgreSQL, 2024. https://www.postgresql.org/. Accessed: 15/09/2024.
[20] Python venv, 2024. https://docs.python.org/3/library/venv.html. Accessed:
15/09/2024.
[21] Rados Gateway, 2024. https://docs.ceph.com/en/reef/radosgw/. Accessed:
15/09/2024.
[22] Squashfs, 2024. https://www.kernel.org/doc/Documentation/filesystems/squashfs.txt. Accessed: 15/09/2024.
[23] Virtual Kubelet, 2024. https://virtual-kubelet.io/. Accessed: 15/09/2024.
[24] Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G.S.,
et al.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems,
2015. https://www.tensorflow.org/. Software available from tensorflow.org.
[25] Adamec M., Attebury G., Bloom K., Bockelman B., Lundstedt C., Shadura O.,
Thiltges J.: Coffea-casa: an analysis facility prototype, EPJ Web Conferences,
vol. 251, 02061, 2021. doi: 10.1051/epjconf/202125102061.
[26] Anderlini L., Boccali T., Dal Pra S., Duma D., Giommi L., Spiga D., Vino G.:
ML_INFN project: Status report and future perspectives, EPJ Web of Confer-
ences, vol. 295, 2024. doi: 10.1051/epjconf/202429508013.
[27] Antonacci M., Salomoni D.: Leveraging TOSCA orchestration to enable fully
automated cloud-based research environments on federated heterogeneous e-
infrastructures, PoS, vol. ISGC&HEPiX2023, 020, 2023. doi: 10.22323 /
1.434.0020.
[28] Bergholm V., Izaac J., Schuld M., Gogolin C., Ahmed S., Ajith V., Alam M.S.,
et al.: PennyLane: Automatic differentiation of hybrid quantum-classical com-
putations, 2022. https://arxiv.org/abs/1811.04968.
[29] Bockelman B., Livny M., Lin B., Prelz F.: Principles, technologies, and time:
The translational journey of the HTCondor-CE, Journal of Computational Sci-
ence, vol. 52, 101213, 2021. doi: 10.1016/j.jocs.2020.101213. Case Studies in
Translational Computer Science.
[30] Ceccanti A., Hardt M., Wegh B., Millar A., Caberletti M., Vianello E., Lice-
hammer S.: The INDIGO-Datacloud Authentication and Authorization Infras-
tructure, Journal of Physics: Conference Series, vol. 898(10), 102016, 2017.
doi: 10.1088/1742-6596/898/10/102016.
[31] Chen S., Glioti A., Panico G., Wulzer A.: Boosting likelihood learning with event
reweighting, Journal of High Energy Physics, vol. 2024, 117, 2024. doi: 10.1007/
JHEP03(2024)117.
[32] Chollet F., et al.: Keras, https://keras.io, 2015.
[33] Ciangottini D.: rclone, 2022. https://github.com/DODAS-TS/rclone.
[34] Eddelbuettel D.: A Brief Introduction to Redis, 2022. https://arxiv.org/abs/2203.06559.
[35] FastML Team: fastmachinelearning/hls4ml, 2023. doi: 10.5281/zenodo.1201549.
[36] Grafana Labs: Grafana Documentation, 2018. https://grafana.com/docs/.
[37] Grant T., Karau H., Lublinsky B., Liu R., Filonenko I.: Kubeflow for Machine Learning, O’Reilly Media, 2020. https://books.google.it/books?id=YLICEAAAQBAJ.
[38] Janssens D., Brunbauer F., Flöthner K., Lisowska M., Muller H., Oliveri E.,
Orlandini G., et al.: Studying signals in particle detectors with resistive ele-
ments such as the 2D resistive strip bulk MicroMegas, Journal of Instrumenta-
tion, vol. 18(08), C08010, 2023. doi: 10.1088/1748-0221/18/08/C08010.
[39] Kluyver T., Ragan-Kelley B., Pérez F., Granger B., Bussonnier M., Frederic J.,
Kelley K., et al.: Jupyter Notebooks – a publishing format for reproducible
computational workflows. In: F. Loizides, B. Schmidt (eds.), Positioning and
Power in Academic Publishing: Players, Agents and Agendas, pp. 87–90, IOS
Press, 2016.
[40] Lizzi F., Postuma I., Brero F., Cabini R., Fantacci M., Oliva P., Rinaldi L., et al.:
Quantification of pulmonary involvement in COVID-19 pneumonia: an upgrade
of the LungQuant software for lung CT segmentation, The European Physical
Journal Plus, vol. 138, 2023. doi: 10.1140/epjp/s13360-023-03896-4.
[41] Mariani S., Anderlini L., Di Nezza P., Franzoso E., Graziani G., Pappalardo L.L.:
A neural-network-defined Gaussian mixture model for particle identification ap-
plied to the LHCb fixed-target programme, Journal of Physics: Conference Se-
ries, vol. 2438(1), 012107, 2023. doi: 10.1088/1742-6596/2438/1/012107.
[42] Mariotti M., Magalotti D., Spiga D., Storchi L.: The BondMachine, a moldable
computer architecture, Parallel Computing, vol. 109, 102873, 2022. doi: 10.1016/
j.parco.2021.102873.
[43] NVIDIA, Vingelmann P., Fitzek F.H.: CUDA, release: 10.2.89, 2020. https://developer.nvidia.com/cuda-toolkit.
[44] Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., et al.: PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035, Curran Associates, Inc., 2019. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[45] Salomoni D., Campos I., Gaido L., de Lucas J.M., Solagna P., Gomes J.,
Matyska L., et al.: INDIGO-DataCloud: a Platform to Facilitate Seamless Access
to E-Infrastructures, Journal of Grid Computing, vol. 16(3), pp. 381–408, 2018.
doi: 10.1007/s10723-018-9453-3.
[46] Schneppenheim M.: Kube Eagle, 2020. https://github.com/cloudworkz/kube-eagle.
[47] Stetzler S., Jurić M., Boone K., Connolly A., Slater C.T., Zečević P.: The Astronomy Commons Platform: A Deployable Cloud-based Analysis Platform for Astronomy, The Astronomical Journal, vol. 164(2), 68, 2022. doi: 10.3847/1538-3881/ac77fb.
[48] Tejedor E., Bocchi E., Castro D., Gonzalez H., Lamanna M., Mato P., Moscicki J., et al.: Facilitating Collaborative Analysis in SWAN, EPJ Web Conferences, vol. 214, 07022, 2019. doi: 10.1051/epjconf/201921407022.
[49] Weil S.A., Brandt S.A., Miller E.L., Long D.D.E., Maltzahn C.: Ceph: a scal-
able, high-performance distributed file system. In: Proceedings of the 7th Sympo-
sium on Operating Systems Design and Implementation, pp. 307–320, OSDI ’06,
USENIX Association, USA, 2006.
[50] Winikoff M., Padgham L.: The Prometheus Methodology. In: F. Bergenti, M.P.
Gleizes, F. Zambonelli (eds.), Methodologies and Software Engineering for Agent
Systems, pp. 217–234, Springer, Boston, 2004. doi: 10.1007/1-4020-8058-1_14.
[51] Yoo A.B., Jette M.A., Grondona M.: SLURM: Simple Linux Utility for Re-
source Management. In: D. Feitelson, L. Rudolph, U. Schwiegelshohn (eds.), Job
Scheduling Strategies for Parallel Processing, pp. 44–60, Springer Berlin Heidel-
berg, Berlin, Heidelberg, 2003.
License
Copyright (c) 2025 Computer Science

This work is licensed under a Creative Commons Attribution 4.0 International License.