Computational aspects of AI for environmental sciences
Course Overview
This lecture provides an introduction to modern techniques for large-scale data handling and data analysis. It focuses primarily on Earth system data, but the principles described here also apply to many types of data from other scientific disciplines. The course begins with an easy, general introduction and leads to advanced data management concepts and design patterns towards the end. The notion of design patterns is borrowed from software development, where it describes reusable solutions to recurring problems. While reusing large-scale data workflows is difficult, there are nevertheless some overarching principles and techniques that are useful to know and understand. Examples of such techniques include concepts from database design, such as sharding, and modern programming paradigms for massive data analysis on distributed clouds, such as map and reduce. At the end of this lecture you should have a basic understanding of the main challenges of large-scale data handling and of several important techniques you can use to address these challenges.
Intro Part I - Data science and big data analytics
General introduction to data science
Web accessible data & data publications
Many datasets from environmental models or observations are now available through web services. Here, I explain how you can work with them.
Python's requests library
The requests library is an important cornerstone for interacting with web services from within a Python program. We therefore dive a little deeper into it here.
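As a first impression, here is a minimal sketch of how a query to a web service might look with the requests library. The endpoint `https://example.org/api/data` and the parameters `variable` and `year` are hypothetical placeholders, not a real data service, and the helper names `build_url` and `fetch_json` are chosen for illustration only.

```python
import requests

# Hypothetical endpoint, used for illustration only.
BASE_URL = "https://example.org/api/data"


def build_url(base_url, params):
    """Return the full URL that requests would call, without sending anything."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


def fetch_json(base_url, params, timeout=10):
    """Send a GET request and decode the JSON body of the response."""
    response = requests.get(base_url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on HTTP 4xx/5xx status codes
    return response.json()


# Inspect the URL first; this step does not need a network connection.
url = build_url(BASE_URL, {"variable": "temperature", "year": 2020})
```

Building the URL separately from sending the request makes it easy to check what will be asked of the server before any data is transferred. Calling `fetch_json` only makes sense against a real service that returns JSON.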
Some hints for good data management
Before you get lost in massive amounts of data, it is useful if you understand some good practices for data management. This part of the lecture shall help you with that.
The netCDF file format
NetCDF is an important format for storing environmental data; it is used primarily for gridded model output and input. Here, I explain the netCDF data model and show you how to work with data in this format.
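The core of the netCDF data model can be summarized as named dimensions, variables defined on those dimensions, and attributes attached either to a variable or to the whole file. The following sketch mimics that structure with plain Python dataclasses. It is not the API of any netCDF library; the class names `Variable` and `Dataset` and the example variable `t2m` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """A named array defined on an ordered tuple of dimensions."""
    name: str
    dimensions: tuple  # names of the dimensions, in storage order
    dtype: str         # e.g. "float32"
    attributes: dict = field(default_factory=dict)


@dataclass
class Dataset:
    """Container for dimensions, variables and global attributes."""
    dimensions: dict = field(default_factory=dict)  # name -> length
    variables: dict = field(default_factory=dict)   # name -> Variable
    attributes: dict = field(default_factory=dict)  # global attributes

    def add_variable(self, var):
        # Every dimension a variable uses must be declared in the dataset.
        missing = [d for d in var.dimensions if d not in self.dimensions]
        if missing:
            raise ValueError(f"undefined dimensions: {missing}")
        self.variables[var.name] = var

    def shape(self, name):
        """Shape of a variable, derived from its dimension lengths."""
        return tuple(self.dimensions[d] for d in self.variables[name].dimensions)


# Hypothetical example: twelve monthly fields on a 1 degree global grid.
ds = Dataset(
    dimensions={"time": 12, "lat": 180, "lon": 360},
    attributes={"title": "Hypothetical monthly temperature field"},
)
ds.add_variable(
    Variable("t2m", ("time", "lat", "lon"), "float32",
             {"units": "K", "long_name": "2 metre temperature"})
)
```

Note that a variable does not store its own shape: the shape follows from the shared dimensions, which is what keeps several variables in one file consistent with each other.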
The role of metadata
Data are useless if you don't know what they represent, in what units the variables are stored, or where the data come from. Here, I introduce a number of fundamental aspects of environmental metadata.
Work with netCDF data in Python
This final section of lecture part 1 introduces some advanced Python tools and libraries for efficient and user-friendly work with netCDF data.
Intro Part II - Data science and big data analytics
Introduction to part 2 of this course.
Types of data in Earth system science
Earth system data comes in many different formats and shapes. This part provides a general overview of Earth system data types and their key properties.
The 5 "V"s of Earth system data types
Here, we explore what characterizes "large" data and provide some examples. The first important aspect to investigate is data volume.
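To get a feeling for data volume, it helps to estimate the uncompressed size of a gridded dataset from its dimensions. The sketch below does this for an assumed example: one variable on a 0.25 degree global grid (1440 x 721 points) with 37 vertical levels, stored hourly for one year as 4-byte floats. These numbers are illustrative assumptions, not a description of a specific dataset.

```python
def gridded_size_bytes(n_lon, n_lat, n_levels, n_times, bytes_per_value=4):
    """Uncompressed size of one gridded variable in bytes."""
    return n_lon * n_lat * n_levels * n_times * bytes_per_value


# Assumed example: 0.25 degree global grid, 37 levels, hourly for 365 days.
size = gridded_size_bytes(n_lon=1440, n_lat=721, n_levels=37, n_times=24 * 365)
size_tb = size / 1e12  # decimal terabytes

print(f"{size} bytes = {size_tb:.2f} TB")
```

Under these assumptions, a single variable for a single year already exceeds one terabyte before compression.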
How to cope with > 1 TByte of data
This final section of part 2 provides a glimpse of tools and techniques for working with really large datasets.
Intro Part III - Data science and big data analytics
Background information on large-scale data handling
Challenges of large-scale data analysis and data system architectures
This part of the lecture describes different types of data storage systems and discusses some implications for the management of data. It covers simple file systems, databases and data warehouses, hierarchical storage architectures on HPC systems, and complex client-server architectures.
Data structures, data models & data patterns
Data come in many different forms and formats. The following data types are relevant for Earth sciences: unstructured data, point clouds, series and time series, tree structures, relational tables, graphs, gridded data, images and videos. Data structures and formats influence access patterns and access speed.
Classic design patterns
This section explains a number of classic data handling patterns and techniques. It starts with the extract-transform-load pattern, describes some aspects of chunking and tiling, and introduces index tables and memory mapping.
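As one illustration of these patterns, the following sketch shows memory mapping with Python's standard `mmap` module: a single value is read from a binary file at a computed byte offset, without loading the whole file into memory. The file layout (a flat sequence of little-endian 8-byte floats) and the helper names `write_values` and `read_value` are assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

VALUE_FORMAT = "<d"                         # little-endian 8-byte float
VALUE_SIZE = struct.calcsize(VALUE_FORMAT)  # bytes per stored value


def write_values(path, values):
    """Write the values as a flat binary sequence."""
    with open(path, "wb") as f:
        for value in values:
            f.write(struct.pack(VALUE_FORMAT, value))


def read_value(path, index):
    """Read one value through a memory map instead of reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The byte offset follows directly from the fixed record size.
            return struct.unpack_from(VALUE_FORMAT, mm, index * VALUE_SIZE)[0]


def demo():
    """Write 1000 values to a temporary file and read back element 750."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "values.bin")
        write_values(path, [i * 0.5 for i in range(1000)])
        return read_value(path, 750)


value = demo()
```

The operating system loads only the pages of the file that are actually touched, which is why this pattern scales to files that are much larger than the available memory.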
Modern design patterns
Here, we discuss some modern design patterns for data management, with a particular focus on distributed architectures. Key concepts introduced in this section, with examples, are asynchronous processing, caching, messaging, and sharding. These concepts are important for parallel data processing in heterogeneous environments.
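To make the idea of sharding more concrete, here is a minimal sketch of hash-based sharding, one of several possible sharding strategies. Each record key is assigned to one of a fixed number of shards via a stable hash, so that every process computes the same assignment. The station identifiers and the number of shards are hypothetical.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a key to a shard index in [0, n_shards) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def distribute(keys, n_shards):
    """Group the keys into n_shards lists according to their shard index."""
    shards = [[] for _ in range(n_shards)]
    for key in keys:
        shards[shard_for(key, n_shards)].append(key)
    return shards


# Hypothetical station identifiers spread over four shards.
station_ids = [f"station-{i:04d}" for i in range(1000)]
shards = distribute(station_ids, n_shards=4)

for index, shard in enumerate(shards):
    print(f"shard {index}: {len(shard)} stations")
```

The sketch uses `hashlib` rather than Python's built-in `hash()` because the latter is randomized per process for strings, which would send the same key to different shards on different machines.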
Hadoop & MapReduce
Hadoop and MapReduce are two examples of modern, sophisticated designs for the asynchronous parallel processing of massive amounts of data. This section of the lecture provides an overview of the Hadoop architecture and describes the MapReduce algorithm with an example.
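The three phases of a map-reduce computation (map, shuffle, reduce) can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not the Hadoop API; the input format (`station,temperature` text lines) and the task (maximum temperature per station) are assumptions chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (key, value) pairs."""
    station, temperature = line.split(",")
    yield station, float(temperature)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, max(values)


def map_reduce(lines):
    """Run the map, shuffle and reduce phases one after the other."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input records: "station,temperature"
lines = ["A,12.5", "B,7.0", "A,15.1", "C,-3.2", "B,9.4"]
result = map_reduce(lines)
```

In a distributed setting, the map calls run in parallel on different parts of the input and the reduce calls run in parallel on different keys. This works because neither function depends on any state outside its arguments.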