CERN's know-how and experience with ‘big data’ analysis for high-energy physics and for the control of systems used in the LHC.
- CERN's experiments probing the fundamental nature of the universe create 1 PB/sec of data, roughly four times the amount held in the US Library of Congress
- About 1 million CPU cores worldwide are used to process and analyse all data from the LHC, using advanced data analytics
- Online and offline analysis is also performed on the data acquired from each of the 20,000 devices that monitor and control the CERN complex
Facts & Figures
- >10 PB/month of data selected by trigger mechanisms and stored in the CERN Data Center
- 170 Data Centers worldwide at which LHC data analysis is carried out
- >250 PB held in the CERN Data Center, which stores all physics data for analysis
- >1000 PB of ROOT data
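As a rough illustration of how strongly the trigger mechanisms reduce the data, the production figure (1 PB/sec) can be compared with the stored figure (>10 PB/month). The sketch below assumes, purely for the arithmetic, that data is produced continuously over a 30-day month. This overstates the real raw volume, since the LHC does not run without interruption. The function name and constants are illustrative, and the result should be read as an order of magnitude only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def reduction_factor(raw_pb_per_sec, stored_pb_per_month, days=30):
    """Ratio of raw data produced to data stored over one month.

    Assumes continuous production for `days` days (a simplification).
    """
    raw_pb_per_month = raw_pb_per_sec * SECONDS_PER_DAY * days
    return raw_pb_per_month / stored_pb_per_month

# Figures quoted on this page: 1 PB/sec produced, >10 PB/month stored.
factor = reduction_factor(raw_pb_per_sec=1, stored_pb_per_month=10)
print(f"about 1 PB kept out of every {factor:,.0f} PB produced")
```

Under these assumptions only a few millionths of the raw data is kept, which is why selection has to happen before storage.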
Designing Data Analytics Infrastructure
To process and analyse the vast amounts of data generated by the experiments at CERN, a data infrastructure was designed for distributed analytics. This infrastructure is made up of several layers and allows 1000 clients to access the data for analysis, handling >5 million data transactions per day. With its unique know-how in structuring big data sets, CERN can design efficient analyses.
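To put these figures in perspective, the average load they imply can be worked out directly. The sketch below is a back-of-the-envelope calculation only: it treats the quoted ">5 million" as exactly 5 million, so the results are lower bounds, and it assumes an even spread over the day and across clients, which real traffic will not follow. The function name is illustrative.

```python
def average_load(transactions_per_day, clients):
    """Return (transactions per second, transactions per client per day)."""
    seconds_per_day = 24 * 60 * 60
    per_second = transactions_per_day / seconds_per_day
    per_client_per_day = transactions_per_day / clients
    return per_second, per_client_per_day

per_second, per_client = average_load(transactions_per_day=5_000_000, clients=1000)
print(f"{per_second:.1f} transactions/s on average, "
      f"{per_client:.0f} per client per day")
```

Averages like these hide peaks, so an infrastructure sized for this workload would need headroom well above the mean rate.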
Components used for big data and related analytics
- User Interface: Notebooks, SWAN (developed by CERN)
- Data analysis: ROOT / TMVA (developed by CERN)
- Apache Hadoop clusters with YARN and HDFS (also HBase, Impala, Hive,...)
- Apache Spark for analytics and Apache Kafka for streaming
Data Analysis for Control Systems
CERN analyses data from its large industrial infrastructure, for monitoring, control and predictive maintenance purposes. This includes data from accelerators, detectors, cryogenic systems, data centers and log files from the Worldwide LHC Computing Grid and others.
- Online monitoring (analysis of logs, alarms, loads)
- Fault analysis (root cause analysis / fault detection)
- Predictive maintenance
- Input for new engineering designs
Robust Big Data Analysis Framework
ROOT / TMVA is a modular big data software framework that provides the functionality needed for statistical analysis, visualisation and storage of big data. It is mainly written in C++ and integrates with other languages such as Python and R. It includes an integrated machine learning environment, for which Python bindings are provided. It is well suited to the analysis of extremely large sets of structured data and is used in industry, physics, biology, finance and insurance fraud analysis. Possible applications include the processing and analysis of large medical datasets, for example genomics data, EEG/ECG data and biosensor data.
Interactive Data Analysis in the Cloud
SWAN (Service for Web-based Analysis) offers an integrated environment for data analysis in the CERN cloud, where users can find all the experiment and user data together with rich stacks of scientific software. The service is accessed through a Jupyter notebook interface. The approach suits any service that lets users perform interactive data analysis in the cloud following a "software as a service" model, especially cloud-based analysis of very large datasets by many users working with different analytics tools and programming languages.