Operational tools for data management in biosciences

Institution: ECOLE DU NUMERIQUE

Language: English

Period: S3



  • Python programming knowledge (intermediate level).

  • Understanding of databases: Familiarity with SQL and basic database management would be helpful.

  • Basic biology: Given the bioscience focus, students should have foundational knowledge in life sciences to understand biological datasets and their context.

  • Familiarity with command-line interfaces: Necessary for bash scripting and using tools like git.



By the end of the course, students should be able to:



  • Understand the principles of data management, including planning, acquisition, processing, and sharing of scientific data.

  • Implement basic data pipelines in accordance with ETL principles.

  • Work with large-scale biological datasets and employ modern software tools to analyze, organize, and document research.

  • Collaborate effectively in research using ELN tools and version control systems to ensure reproducibility and transparency.

  • Deploy APIs and containerized applications for sharing and managing data in an open-access framework.

Chapter 1. Introduction to data management



  1. What is data management?

  2. The importance of data management

  3. Data management in science

    1. Practical work: Research data management at the ETH


    2. The data lifecycle

      1. Managing the data

      2. Example of a data lifecycle in the biomedical field

      3. The main characteristics of well-managed data

        1. The FAIR data principles





Chapter 2. Planning for data management



  1. Creating a data management plan

  2. Data policies (Quick overview, details in course B0908 European environment and policies in life sciences and public health)

  3. Case studies


Chapter 3. Data acquisition and pre-processing



  1. Preparing data for analysis

  2. The Extract-Transform-Load and Extract-Load-Transform processes

    1. Examples of tools used for ETL

    2. Practical work: ETL with Python on Cedrus data

      1. The JSON data format

      2. Manipulating a JSON file using bash and Python

      3. Using SQLAlchemy to manage a SQL database from Python
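As an illustration of the Extract-Transform-Load pattern covered in this chapter, here is a minimal sketch in Python. It is not the course's practical work: the records, field names, and table name are hypothetical, and it uses the standard-library `sqlite3` module instead of SQLAlchemy so that it runs without extra dependencies.

```python
import json
import sqlite3

# Hypothetical raw records standing in for a JSON export (not real course data).
RAW_JSON = """
[
  {"sample_id": "S1", "gene": "BRCA1", "expression": "12.5"},
  {"sample_id": "S2", "gene": "TP53", "expression": null},
  {"sample_id": "S3", "gene": "brca1", "expression": "7.25"}
]
"""


def extract(text):
    """Extract: parse the JSON text into a list of dictionaries."""
    return json.loads(text)


def transform(records):
    """Transform: drop records with missing values, normalise names and types."""
    rows = []
    for rec in records:
        if rec["expression"] is None:
            continue  # skip incomplete measurements
        rows.append((rec["sample_id"], rec["gene"].upper(), float(rec["expression"])))
    return rows


def load(rows, conn):
    """Load: write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS expression "
        "(sample_id TEXT PRIMARY KEY, gene TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_JSON)), conn)
result = conn.execute(
    "SELECT sample_id, gene, value FROM expression ORDER BY sample_id"
).fetchall()
print(result)
```

Keeping the three steps as separate functions makes each one testable on its own. In an Extract-Load-Transform variant, the raw records would be loaded first and the cleaning would happen inside the database.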




Chapter 4. Data aggregation and data integration



  1. Data aggregation

  2. Data integration

  3. Practical work with mixOmics


Chapter 5. Data analysis



  1. Raw vs analysed data

  2. Gold, Silver, and Bronze Levels of Data

  3. Managing research code

  4. Workflow systems

  5. Practical work with Nextflow/Circos


Chapter 6. Organization



  1. File organization

  2. Naming convention

  3. Databases

  4. Storage and backups

  5. Version control systems (File versioning)

  6. Practical work: collaborative Python coding using git


Chapter 7. Managing sensitive data



  1. Types of sensitive data

  2. Keeping data secure (Quick overview, details in course B0906 Mechanisms of data protection)

  3. Anonymizing data (Quick overview, details in course B0906 Mechanisms of data protection)

    1. Practical work: data anonymization with Python
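To give a flavour of what anonymizing data can look like in Python, here is a minimal sketch under stated assumptions. The record fields, the key, and the helper names are hypothetical. Strictly speaking, the technique shown is pseudonymization (a keyed hash of the identifier) combined with generalization (age bands). It does not by itself guarantee that a dataset is anonymous.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(identifier, key=SECRET_KEY):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256), truncated for readability."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def age_band(age):
    """Generalize an exact age into a ten-year band, e.g. 47 becomes '40-49'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


def anonymize_record(record):
    """Return a copy of the record without the name and with a coarser age."""
    return {
        "patient": pseudonymize(record["name"]),
        "age_band": age_band(record["age"]),
        "diagnosis": record["diagnosis"],
    }


# Hypothetical example record, invented for illustration only.
record = {"name": "Jane Doe", "age": 47, "diagnosis": "asthma"}
safe = anonymize_record(record)
print(safe["age_band"])
```

Using a keyed hash rather than a plain hash means that someone without the key cannot recompute the pseudonym from a list of candidate names, while the same person still maps to the same pseudonym across files.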



Chapter 8. Documentation



  1. Notes, Laboratory Information Management Systems (LIMS), and Electronic Laboratory Notebooks (ELN)

  2. Methods

  3. Other useful documentation formats

  4. Metadata

    1. Metadata management

    2. Standards

    3. Practical work: collaborative management of RNA sequencing data using eLabFTW: from experimental design to results sharing



Chapter 9. Reproducibility and interoperability



  1. Introduction (Quick overview, details in course B0909 Responsible Research and Innovation)

  2. Application programming interfaces (APIs)

    1. HTTP status codes

    2. Containerization

    3. Practical work: building an API with the Python package FastAPI on top of an open-access biological database and deploying it in a Docker container
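The role of HTTP status codes in an API can be sketched without any web framework. The example below is an assumption-laden illustration, not the practical work itself: the `get_gene` handler and the tiny in-memory `GENES` table are hypothetical, and only the standard-library `http.HTTPStatus` enumeration is used. A framework such as FastAPI would attach these codes to real HTTP responses.

```python
from http import HTTPStatus

# Tiny illustrative lookup table standing in for a biological database.
GENES = {
    "BRCA1": {"chromosome": "17"},
    "TP53": {"chromosome": "17"},
}


def get_gene(symbol):
    """Return a (status, body) pair, as an API endpoint would."""
    if not symbol.isalnum():
        # Malformed request: the client sent something that cannot be a gene symbol.
        return HTTPStatus.BAD_REQUEST, {"detail": "invalid gene symbol"}
    record = GENES.get(symbol.upper())
    if record is None:
        # Well-formed request, but the resource does not exist.
        return HTTPStatus.NOT_FOUND, {"detail": "gene not found"}
    return HTTPStatus.OK, record


for query in ("brca1", "XYZ", "??"):
    status, body = get_gene(query)
    print(query, status.value, status.phrase, body)
```

Distinguishing a malformed request (400) from a missing resource (404) lets client code decide whether to fix its query or to stop looking.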



Chapter 10. Conclusion



  1. Data sharing (Quick overview, details in course B0909 Responsible Research and Innovation)

    1. Open Access and its Green and Gold routes

    2. Data archiving

    3. Restarting the data life cycle