DEVELOPING ARTIFICIAL INTELLIGENCE IN THE CLOUD: THE AI INFN PLATFORM
DOI: https://doi.org/10.7494/csci.2025.26.SI.7071

Abstract
The INFN CSN5-funded project AI INFN (“Artificial Intelligence at INFN”) aims to promote the adoption of machine learning (ML) and artificial intelligence (AI) within INFN by providing comprehensive support, including state-of-the-art hardware and cloud-native solutions within INFN Cloud. This enables efficient sharing of hardware accelerators without hindering the institute’s diverse research activities. AI INFN is evolving from a Virtual-Machine-based model to a flexible Kubernetes-based platform, offering features such as JWT-based authentication, a multi-tenant JupyterHub interface, a distributed file system, customizable conda environments, and dedicated monitoring and accounting systems. It also supports virtual nodes in the cluster, offloading computing payloads to remote resources through the Virtual Kubelet technology, with InterLink as the provider. This setup can manage workflows across multiple providers and hardware types, which is crucial for scientific use cases that require dedicated infrastructures for different parts of the workload. Results of initial tests validating the platform’s applicability in production, together with emerging case studies and integration scenarios, are presented.
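As an illustration of the offloading path described above, the following minimal sketch submits a GPU payload to a Kubernetes cluster so that it may be scheduled on a Virtual Kubelet node exposed by InterLink. It uses the official Kubernetes Python client; the namespace, node name, taint key, and container image are illustrative assumptions, not values prescribed by the AI INFN platform.

# Minimal sketch (illustrative only): submit a GPU payload that may be scheduled
# on an InterLink-backed Virtual Kubelet node. The namespace, node label, taint
# key, and image below are assumptions, not platform defaults.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="offloaded-training", namespace="ai-infn"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Pin the payload to the virtual node exposed by the InterLink provider.
        node_selector={"kubernetes.io/hostname": "interlink-virtual-node"},
        # Tolerate the taint that keeps ordinary workloads off the virtual node.
        tolerations=[
            client.V1Toleration(
                key="virtual-node.interlink/no-schedule",
                operator="Exists",
                effect="NoSchedule",
            )
        ],
        containers=[
            client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:2.15.0-gpu",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ai-infn", body=pod)

From the user's perspective the submission is identical to any other pod; the Virtual Kubelet forwards it to the remote resource and reports its status back to the cluster.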
References
[1] AMD Vitis™ AI Software, 2024. https://www.amd.com/en/products/software/vitis-ai.html. Accessed: 15/09/2024.
[2] Apache Airflow, 2024. https://airflow.apache.org/. Accessed: 12/12/2024.
[3] Borg, 2024. https://borgbackup.readthedocs.io/en/stable/#. Accessed: 15/09/2024.
[4] Conda, 2024. https://conda.io. Accessed: 15/09/2024.
[5] Docker Swarm Mode, 2024. https://docs.docker.com/engine/swarm/. Accessed:
12/12/2024.
[6] ELOG, 2024. https://elog.psi.ch/elog/. Accessed: 15/09/2024.
[7] Intel® Distribution of OpenVINO™ Toolkit, 2024. https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html. Accessed: 15/09/2024.
[8] InterLink, 2024. https://intertwin-eu.github.io/interLink/. Accessed: 15/09/2024.
[9] InterTwin, 2024. https://www.intertwin.eu/. Accessed: 15/09/2024.
[10] JuiceFS, 2024. https://juicefs.com/en/. Accessed: 15/09/2024.
[11] Jupyter Server Proxy, 2024. https://jupyter-server-proxy.readthedocs.io/en/latest/. Accessed: 15/09/2024.
[12] Kueue, 2024. https://kueue.sigs.k8s.io/. Accessed: 15/09/2024.
[13] Leonardo, 2024. https://leonardo-supercomputer.cineca.eu/. Accessed: 15/09/2024.
[14] MinIO, 2024. https://min.io/. Accessed: 15/09/2024.
[15] Multi-instance GPU, 2024. https://www.nvidia.com/it-it/technologies/multi-instance-gpu/. Accessed: 15/09/2024.
[16] Nomad by HashiCorp, 2024. https://www.nomadproject.io/. Accessed: 12/12/2024.
[17] NVIDIA DCGM Exporter, 2024. https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html. Accessed: 15/09/2024.
[18] NVIDIA GPU Operator, 2024. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html. Accessed: 15/09/2024.
[19] PostgreSQL, 2024. https://www.postgresql.org/. Accessed: 15/09/2024.
[20] Python venv, 2024. https://docs.python.org/3/library/venv.html. Accessed:
15/09/2024.
[21] Rados Gateway, 2024. https://docs.ceph.com/en/reef/radosgw/. Accessed:
15/09/2024.
[22] Squashfs, 2024. https://www.kernel.org/doc/Documentation/filesystems/squashfs.txt. Accessed: 15/09/2024.
[23] Virtual Kubelet, 2024. https://virtual-kubelet.io/. Accessed: 15/09/2024.
[24] Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G.S.,
et al.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems,
2015. https://www.tensorflow.org/. Software available from tensorflow.org.
[25] Adamec M., Attebury G., Bloom K., Bockelman B., Lundstedt C., Shadura O.,
Thiltges J.: Coffea-casa: an analysis facility prototype, EPJ Web Conferences,
vol. 251, 02061, 2021. doi: 10.1051/epjconf/202125102061.
[26] Anderlini L., Boccali T., Dal Pra S., Duma D., Giommi L., Spiga D., Vino G.:
ML_INFN project: Status report and future perspectives, EPJ Web of Confer-
ences, vol. 295, 2024. doi: 10.1051/epjconf/202429508013.
[27] Antonacci M., Salomoni D.: Leveraging TOSCA orchestration to enable fully
automated cloud-based research environments on federated heterogeneous e-
infrastructures, PoS, vol. ISGC&HEPiX2023, 020, 2023. doi: 10.22323 /
1.434.0020.
[28] Bergholm V., Izaac J., Schuld M., Gogolin C., Ahmed S., Ajith V., Alam M.S.,
et al.: PennyLane: Automatic differentiation of hybrid quantum-classical com-
putations, 2022. https://arxiv.org/abs/1811.04968.
[29] Bockelman B., Livny M., Lin B., Prelz F.: Principles, technologies, and time:
The translational journey of the HTCondor-CE, Journal of Computational Sci-
ence, vol. 52, 101213, 2021. doi: 10.1016/j.jocs.2020.101213. Case Studies in
Translational Computer Science.
[30] Ceccanti A., Hardt M., Wegh B., Millar A., Caberletti M., Vianello E., Lice-
hammer S.: The INDIGO-Datacloud Authentication and Authorization Infras-
tructure, Journal of Physics: Conference Series, vol. 898(10), 102016, 2017.
doi: 10.1088/1742-6596/898/10/102016.
[31] Chen S., Glioti A., Panico G., Wulzer A.: Boosting likelihood learning with event
reweighting, Journal of High Energy Physics, vol. 2024, 117, 2024. doi: 10.1007/
JHEP03(2024)117.
[32] Chollet F., et al.: Keras, https://keras.io, 2015.
[33] Ciangottini D.: rclone, 2022. https://github.com/DODAS-TS/rclone.
[34] Eddelbuettel D.: A Brief Introduction to Redis, 2022. https://arxiv.org/abs/2203.06559.
[35] FastML Team: fastmachinelearning/hls4ml, 2023. doi: 10.5281/zenodo.1201549.
[36] Grafana Labs: Grafana Documentation, 2018. https://grafana.com/docs/.
[37] Grant T., Karau H., Lublinsky B., Liu R., Filonenko I.: Kubeflow for Machine Learning, O’Reilly Media, 2020. https://books.google.it/books?id=YLICEAAAQBAJ.
[38] Janssens D., Brunbauer F., Flöthner K., Lisowska M., Muller H., Oliveri E.,
Orlandini G., et al.: Studying signals in particle detectors with resistive ele-
ments such as the 2D resistive strip bulk MicroMegas, Journal of Instrumenta-
tion, vol. 18(08), C08010, 2023. doi: 10.1088/1748-0221/18/08/C08010.
[39] Kluyver T., Ragan-Kelley B., Pérez F., Granger B., Bussonnier M., Frederic J.,
Kelley K., et al.: Jupyter Notebooks – a publishing format for reproducible
computational workflows. In: F. Loizides, B. Schmidt (eds.), Positioning and
Power in Academic Publishing: Players, Agents and Agendas, pp. 87–90, IOS
Press, 2016.
[40] Lizzi F., Postuma I., Brero F., Cabini R., Fantacci M., Oliva P., Rinaldi L., et al.:
Quantification of pulmonary involvement in COVID-19 pneumonia: an upgrade
of the LungQuant software for lung CT segmentation, The European Physical
Journal Plus, vol. 138, 2023. doi: 10.1140/epjp/s13360-023-03896-4.
[41] Mariani S., Anderlini L., Di Nezza P., Franzoso E., Graziani G., Pappalardo L.L.:
A neural-network-defined Gaussian mixture model for particle identification ap-
plied to the LHCb fixed-target programme, Journal of Physics: Conference Se-
ries, vol. 2438(1), 012107, 2023. doi: 10.1088/1742-6596/2438/1/012107.
[42] Mariotti M., Magalotti D., Spiga D., Storchi L.: The BondMachine, a moldable
computer architecture, Parallel Computing, vol. 109, 102873, 2022. doi: 10.1016/
j.parco.2021.102873.
[43] NVIDIA, Vingelmann P., Fitzek F.H.: CUDA, release: 10.2.89, 2020. https://developer.nvidia.com/cuda-toolkit.
[44] Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., et al.: PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035, Curran Associates, Inc., 2019. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[45] Salomoni D., Campos I., Gaido L., de Lucas J.M., Solagna P., Gomes J.,
Matyska L., et al.: INDIGO-DataCloud: a Platform to Facilitate Seamless Access
to E-Infrastructures, Journal of Grid Computing, vol. 16(3), pp. 381–408, 2018.
doi: 10.1007/s10723-018-9453-3.
[46] Schneppenheim M.: Kube Eagle, 2020. https://github.com/cloudworkz/kube-eagle.
[47] Stetzler S., Jurić M., Boone K., Connolly A., Slater C.T., Zečević P.: The Astronomy Commons Platform: A Deployable Cloud-based Analysis Platform for Astronomy, The Astronomical Journal, vol. 164(2), 68, 2022. doi: 10.3847/1538-3881/ac77fb.
[48] Tejedor E., Bocchi E., Castro D., Gonzalez H., Lamanna M., Mato P., Moscicki J., et al.: Facilitating Collaborative Analysis in SWAN, EPJ Web Conferences, vol. 214, 07022, 2019. doi: 10.1051/epjconf/201921407022.
[49] Weil S.A., Brandt S.A., Miller E.L., Long D.D.E., Maltzahn C.: Ceph: a scal-
able, high-performance distributed file system. In: Proceedings of the 7th Sympo-
sium on Operating Systems Design and Implementation, pp. 307–320, OSDI ’06,
USENIX Association, USA, 2006.
[50] Winikoff M., Padgham L.: The Prometheus Methodology. In: F. Bergenti, M.P.
Gleizes, F. Zambonelli (eds.), Methodologies and Software Engineering for Agent
Systems, pp. 217–234, Springer, Boston, 2004. doi: 10.1007/1-4020-8058-1_14.
[51] Yoo A.B., Jette M.A., Grondona M.: SLURM: Simple Linux Utility for Re-
source Management. In: D. Feitelson, L. Rudolph, U. Schwiegelshohn (eds.), Job
Scheduling Strategies for Parallel Processing, pp. 44–60, Springer Berlin Heidel-
berg, Berlin, Heidelberg, 2003.
License
Copyright (c) 2025 Computer Science

This work is licensed under a Creative Commons Attribution 4.0 International License.