Computational aspects of AI for environmental sciences

Interpretability & Analysis

Design patterns

large-scale data management

large-scale data analysis

data science

storage systems

data management

Course Overview

This lecture introduces modern techniques for large-scale data handling and data analysis. It focuses primarily on Earth system data, but the principles also apply to many other types of data from other scientific disciplines. The course begins with an easy, general introduction and leads to advanced data management concepts and design patterns towards the end. The notion of design patterns is borrowed from software development, where it describes reusable solutions to recurring problems. While large-scale data workflows are difficult to reuse, there are nevertheless some overarching principles and techniques that are useful to know and understand. Examples include concepts from database design, such as sharding, and modern programming paradigms for massive data analysis on distributed clouds, such as MapReduce. At the end of this lecture you should have a basic understanding of the main challenges of large-scale data handling and of several important techniques you can use to address them.

Details

Lessons: 17

Course Length: 1 h 57 min

Lecturer: PD Dr. Martin Schultz

Intro Part I - Data science and big data analytics

Web accessible data & data publications

Python's requests library

Some hints for good data management

The netCDF file format

The role of metadata

Work with netCDF data in Python

Intro Part II - Data science and big data analytics

Types of data in Earth system science

5 "V" of Earth system data types

How to cope with > 1 TByte of data

Intro Part III - Data science and big data analytics

Challenges of large-scale data analysis and data system architectures

Data structures, data models & data patterns

Classic design patterns

Modern design patterns

Hadoop & MapReduce

Intro Part I - Data science and big data analytics

General introduction to data science

2:35

Web accessible data & data publications

Many datasets from environmental models or observations are now available through web services. Here, I explain how you can work with them.

8:58

Python's requests library

Python's requests library is an important cornerstone for interacting with web services from within a Python program. We therefore dive a little deeper into it here.

7:27
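As a minimal sketch of the kind of interaction described here, the following example assembles a query to a web data service with requests. The endpoint URL and the parameter names are hypothetical placeholders, not a real service. The request is only prepared, not sent, so the example runs without network access; the helper `fetch_json` shows how an actual call might look.

```python
import requests

# Hypothetical endpoint and query parameters, for illustration only.
BASE_URL = "https://example.org/api/v1/timeseries"
params = {"station": "DE0001", "variable": "o3", "format": "json"}

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", BASE_URL, params=params).prepare()
print(prepared.url)


def fetch_json(url, params, timeout=30):
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on HTTP 4xx/5xx
    return response.json()
```

Passing the query as a `params` dictionary lets requests handle the URL encoding, and setting a `timeout` avoids a program that hangs indefinitely on an unresponsive server.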

Some hints for good data management

Before you get lost in massive amounts of data, it helps to understand some good practices for data management. This part of the lecture introduces them.

10:41

The netCDF file format

netCDF is an important format for storing environmental data and is primarily used for gridded model output and input. Here, I explain the netCDF data model and show how you can work with data in this format.

6:39
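The core of the netCDF data model (named dimensions, variables defined on those dimensions, and attributes attached to variables or to the whole dataset) can be sketched in plain Python. The classes below are a simplified stand-in written for illustration. They are not the API of any netCDF library, and the variable name `t2m` is just an example.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    dimensions: tuple                               # dimension names, in order
    attributes: dict = field(default_factory=dict)  # e.g. units, long_name


@dataclass
class Dataset:
    dimensions: dict = field(default_factory=dict)  # name -> length
    variables: dict = field(default_factory=dict)   # name -> Variable
    attributes: dict = field(default_factory=dict)  # global attributes

    def add_variable(self, var):
        # Every dimension a variable uses must be defined in the dataset.
        for dim in var.dimensions:
            if dim not in self.dimensions:
                raise KeyError(f"undefined dimension: {dim}")
        self.variables[var.name] = var

    def shape(self, name):
        """Shape of a variable, derived from its dimension names."""
        var = self.variables[name]
        return tuple(self.dimensions[d] for d in var.dimensions)


ds = Dataset(dimensions={"time": 24, "lat": 180, "lon": 360},
             attributes={"title": "Toy example"})
ds.add_variable(Variable("t2m", ("time", "lat", "lon"),
                         {"units": "K", "long_name": "2 metre temperature"}))
print(ds.shape("t2m"))
```

Because variables refer to dimensions by name, several variables can share the same grid, and the shape of each array follows from the dimension definitions.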

The role of metadata

Data are useless if you don't know what they represent, in what units the variables are stored, or where the data come from. Here, I introduce a number of fundamental aspects of environmental metadata.

4:00
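One way to make such metadata requirements concrete is a small check that reports missing attributes. The lists of required attributes below are an assumed minimum chosen for illustration, not an official standard, and the example dataset is made up.

```python
# Assumed minimum sets of attributes, for illustration only.
REQUIRED_GLOBAL_ATTRS = ("title", "source")
REQUIRED_VARIABLE_ATTRS = ("units", "long_name")


def missing_metadata(global_attrs, variables):
    """Return human-readable descriptions of missing attributes."""
    problems = []
    for key in REQUIRED_GLOBAL_ATTRS:
        if not global_attrs.get(key):
            problems.append(f"global attribute '{key}' is missing")
    for var_name, attrs in variables.items():
        for key in REQUIRED_VARIABLE_ATTRS:
            if not attrs.get(key):
                problems.append(f"variable '{var_name}' lacks '{key}'")
    return problems


# Made-up dataset description: no 'source', and 'temp' has no units.
global_attrs = {"title": "Toy ozone dataset"}
variables = {
    "o3": {"units": "nmol mol-1", "long_name": "ozone mole fraction"},
    "temp": {"long_name": "air temperature"},
}

for problem in missing_metadata(global_attrs, variables):
    print(problem)
```

Running a check like this before data are shared or archived catches gaps while the person who knows the answer is still available.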

Work with netCDF data in Python

This final section of lecture part 1 introduces some advanced Python tools and libraries for efficient and user-friendly work with netCDF data.

6:32

Intro Part II - Data science and big data analytics

Introduction to part 2 of this course.

1:07

Types of data in Earth system science

Earth system data comes in many different formats and shapes. This part provides a general overview of Earth system data types and their key properties.

6:29

5 "V" of Earth system data types

Here, we explore what characterizes "large" data and provide some examples. The first important aspect to investigate is data volume.

10:56
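To get a feeling for data volume, it helps to do the arithmetic for a gridded dataset. The numbers below describe a hypothetical model output and are chosen only to illustrate the calculation. They do not refer to a specific dataset from the lecture.

```python
# Hypothetical dataset dimensions, chosen only to illustrate the arithmetic.
n_lon, n_lat = 1440, 721      # 0.25 degree global grid
n_levels = 137                # vertical levels
n_variables = 5               # number of 3-D variables stored
n_times = 24 * 365            # hourly output for one (non-leap) year
bytes_per_value = 4           # 32-bit floating point

values = n_lon * n_lat * n_levels * n_variables * n_times
total_bytes = values * bytes_per_value
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"{values:,} values, about {total_tb:.1f} TB uncompressed")
```

Under these assumptions a single year of output already amounts to roughly 25 TB before compression, which is far more than fits into the memory of a typical workstation.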

How to cope with > 1 TByte of data

This final section of part 2 provides a glimpse of tools and techniques for working with really large datasets.

1:33

Intro Part III - Data science and big data analytics

Background information on large-scale data handling

1:50

Challenges of large-scale data analysis and data system architectures

This part of the lecture describes different types of data storage systems and discusses some implications for the management of data. It covers simple file systems, databases and data warehouses, hierarchical storage architectures on HPC systems, and complex client-server architectures.

6:15

Data structures, data models & data patterns

Data come in many different forms and formats. The following data types are relevant for Earth sciences: unstructured data, point clouds, series and time series, tree structures, relational tables, graphs, gridded data, images and videos. Data structures and formats influence access patterns and access speed.

4:45

Classic design patterns

This section explains a number of classic data handling patterns and techniques. It starts with the extract-transform-load pattern, describes some aspects of chunking and tiling, and introduces index tables and memory mapping.

14:10
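Memory mapping and chunked access can be illustrated with the Python standard library alone. The sketch below writes a series of 32-bit floats to a binary file and then reads a small chunk through a memory map instead of loading the whole file. The file name and the helper functions are hypothetical. The fixed record size acts as a very simple index, because the byte offset of a value is its position times the item size.

```python
import mmap
import os
import struct
import tempfile
from array import array

ITEM = struct.calcsize("f")  # bytes per 32-bit float in native format


def write_series(path, values):
    """Store the values as raw 32-bit floats in native byte order."""
    with open(path, "wb") as f:
        array("f", values).tofile(f)


def read_chunk(path, start, count):
    """Read values[start:start + count] through a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the requested byte range is copied out of the map.
            raw = mm[start * ITEM:(start + count) * ITEM]
    return list(struct.unpack(f"{count}f", raw))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "series.bin")
    write_series(path, [i * 0.5 for i in range(10_000)])
    chunk = read_chunk(path, start=1000, count=3)

print(chunk)
```

The operating system pages in only the parts of the file that are touched, so the same access pattern works for files much larger than the available memory.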

Modern design patterns

Here, we discuss some modern design patterns for data management with a particular focus on distributed architectures. The key concepts introduced in this section, with examples, are asynchronous processing, caching, messaging, and sharding. These concepts are important for parallel data processing in heterogeneous environments.

15:20
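Of the concepts named here, sharding is the easiest to sketch in a few lines. The example below assigns records to shards by hashing their key. The station names and the number of shards are hypothetical, and hash-based assignment is only one of several possible sharding strategies.

```python
import hashlib
from collections import defaultdict


def shard_for(key, num_shards):
    """Map a key to a shard number using a stable hash."""
    # hashlib is used instead of the built-in hash(), whose result for
    # strings can differ between Python processes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical record keys, for illustration only.
stations = [f"station_{i:03d}" for i in range(12)]

shards = defaultdict(list)
for station in stations:
    shards[shard_for(station, num_shards=4)].append(station)

for shard_id in sorted(shards):
    print(shard_id, shards[shard_id])
```

Because the shard number depends only on the key, any worker can work out where a record lives without consulting a central directory. The drawback of this simple scheme is that changing the number of shards moves most keys to a different shard.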

Hadoop & MapReduce

Hadoop and MapReduce are two examples of modern, sophisticated designs for the asynchronous parallel processing of massive amounts of data. This section of the lecture provides an overview of the Hadoop architecture and describes the map-reduce algorithm with an example.

6:29
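The map-reduce idea can be sketched in a single Python process. The example below computes the maximum temperature per station from made-up records. It illustrates the map, shuffle, and reduce phases only. It does not use the Hadoop API, and it is not necessarily the example used in the lecture.

```python
from collections import defaultdict
from functools import reduce

# Made-up (station, year, temperature in degrees C) records.
records = [
    ("A", 2020, 29.5),
    ("B", 2020, 33.1),
    ("A", 2021, 31.2),
    ("B", 2021, 30.4),
    ("A", 2022, 30.0),
]


def map_phase(record):
    """Emit a (key, value) pair for each input record."""
    station, _year, temperature = record
    return station, temperature


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values of one key into a single result."""
    return key, reduce(max, values)


mapped = map(map_phase, records)
grouped = shuffle(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped.items())
print(result)
```

In a distributed framework, the map calls run in parallel on the nodes that hold the data, the shuffle step moves the intermediate pairs across the network, and the reduce calls run in parallel per key.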
