The benefits of Open Science (OS) and the FAIR foundational principles (Findable, Accessible, Interoperable and Reusable) are increasingly valued by academia, although what OS and FAIR entail in practice is largely misunderstood. Even once researchers grasp the OS and FAIR principles, they often run into practical difficulties. The European Open Science Cloud (EOSC) is the main initiative in Europe for providing a federated and open multi-disciplinary environment where European researchers, innovators, companies and citizens can share, publish, find and re-use data, tools and services for research, innovation and educational purposes. One of the goals of the EOSC is to co-design with communities the tools and services that are useful for their day-to-day research work, to facilitate collaboration and to foster wider adoption of Open Science practices.
The Pangeo (https://pangeo.io/) community is a worldwide community of scientists and developers that strives to facilitate the deployment of ready-to-use, community-driven platforms for big data geoscience. While a number of services based on Jupyter Notebooks were already available, no public Pangeo deployments providing fast access to large amounts of data and compute resources were accessible on EOSC. Most existing cloud-based Pangeo deployments are USA-based, and members of the Pangeo community in Europe did not have a shared platform where scientists or technologists could exchange know-how and experiences. Pangeo teamed up with two EOSC projects, namely EGI-ACE (https://www.egi.eu/project/egi-ace/) and C-SCALE (https://c-scale.eu/), to demonstrate how to deploy and use Pangeo on EOSC and to emphasise the benefits for the European community.
The Pangeo Europe Community together with EGI deployed a DaskHub, composed of a Dask Gateway (https://gateway.dask.org/) and JupyterHub (https://jupyter.org/hub), with a Kubernetes cluster backend on EOSC using the infrastructure of the EGI Federation (https://www.egi.eu/egi-federation/). The Pangeo EOSC JupyterHub deployment makes use of 1) the EGI Check-In to enable user registration (and thereby authenticated and authorised access to the Pangeo JupyterHub portal and to the underlying distributed compute infrastructure); and 2) the EGI Cloud Compute and the cloud-based EGI Online Storage (to distribute the computational tasks to a scalable compute platform and to store intermediate results produced by the user jobs).
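From a user's point of view, working on such a DaskHub typically means opening a notebook on the JupyterHub and asking the Dask Gateway for a cluster. The following is only a minimal sketch of that workflow, not the configuration of the Pangeo EOSC deployment itself: it assumes that the `dask_gateway` and `dask` packages are installed in the notebook environment and that the hub preconfigures the gateway address and authentication, so that `Gateway()` can be called without arguments. The helper names `start_dask_cluster` and `run_demo` and the worker limits are hypothetical and chosen purely for illustration.

```python
def start_dask_cluster(gateway, min_workers=1, max_workers=4):
    """Request a new cluster from a Dask Gateway and let it scale adaptively.

    `gateway` is expected to behave like a dask_gateway.Gateway instance.
    The worker limits are illustrative values, not limits of any real deployment.
    """
    cluster = gateway.new_cluster()
    # Adaptive scaling: workers are added or removed depending on the load.
    cluster.adapt(minimum=min_workers, maximum=max_workers)
    client = cluster.get_client()
    return cluster, client


def run_demo():
    """Illustrative session, meant to be called from a notebook on a DaskHub."""
    # Imported here so the helper above can be read and tested without a hub.
    from dask_gateway import Gateway
    import dask.array as da

    # Assumption: address and authentication come from the hub's configuration.
    gateway = Gateway()
    cluster, client = start_dask_cluster(gateway)
    try:
        # A toy computation that is split into chunks and run on the workers.
        data = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
        return float(data.mean().compute())
    finally:
        # Release the cloud resources once the work is done.
        cluster.shutdown()
```

Calling `run_demo()` from a notebook on a hub configured this way would start a cluster, run the toy computation on its workers and shut the cluster down again; on a real deployment, the available worker resources and limits depend on how the gateway has been set up by its operators.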
To facilitate future Pangeo deployments on top of a wide range of cloud providers (AWS, Google Cloud, Microsoft Azure, EGI Cloud Computing, OpenNebula, OpenStack, and many more), the Pangeo EOSC JupyterHub can now be deployed through the Infrastructure Manager (IM) Dashboard (https://im.egi.eu/). All the computing and storage resources are currently supplied by CESNET (https://www.cesnet.cz/?lang=en) within the framework of the EGI-ACE project (https://www.egi.eu/project/egi-ace/). Several deployments have been made to serve the geoscience community, both for teaching and for research work. One major advantage of these deployments for teaching and on-boarding researchers is the possibility to train them on realistic, large and complex data analysis problems that are similar to, or directly part of, their research work. Participants learn to use Xarray and Dask and, more generally, how to efficiently access and analyse large online datasets. With this approach, attendees have the opportunity to ask questions, collaborate with other researchers as well as Research Software Engineers, and apply Open Science practices without the burden of trying and (sometimes) failing alone, and without having to build their own infrastructure. To date, more than 100 researchers have been trained on Pangeo@EOSC deployments. As the community grew, discussions arose about the need to define clear practices for writing and publishing FAIR Jupyter Notebooks that can be reused and built upon for new research. This is where the community of