Operational tools for data management in biosciences
Institution: ECOLE DU NUMERIQUE
Language: English
Program(s) in which the course appears:
Period: S3
Prerequisites:
- Python programming knowledge (intermediate level).
- Understanding of databases: Familiarity with SQL and basic database management would be helpful.
- Basic biology: Given the bioscience focus, students should have foundational knowledge in life sciences to understand biological datasets and their context.
- Familiarity with command line interfaces: Necessary for bash scripting and using tools like git.
By the end of the course, students should be able to:
- Understand the principles of data management, including planning, acquisition, processing, and sharing of scientific data.
- Implement basic data pipelines in accordance with ETL principles.
- Work with large-scale biological datasets and employ modern software tools to analyze, organize, and document research.
- Collaborate effectively in research using ELN tools and version control systems to ensure reproducibility and transparency.
- Deploy APIs and containerized applications for sharing and managing data in an open-access framework.
Chapter 1. Introduction to data management
- What is data management?
- The importance of data management
- Data management in science
- Practical work: Research data management at the ETH
- The data lifecycle
- Managing the data
- Example of a data lifecycle in the biomedical field
- The main characteristics of well-managed data
- The FAIR data principles
Chapter 2. Planning for data management
- Creating a data management plan
- Data policies (Quick overview, details in course B0908 European environment and policies in life sciences and public health)
- Case studies
Chapter 3. Data acquisition and pre-processing
- Preparing data for analysis
- The Extract-Transform-Load and Extract-Load-Transform processes
- Examples of tools used for ETL
- Practical work: ETL with Python on Cedrus data
- The JSON data format
- Manipulating a JSON file using bash and Python
- Using SQLAlchemy to manage an SQL database from Python
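As an illustration of the Extract-Transform-Load pattern covered in this chapter, here is a minimal sketch, not the course's practical work. The records, field names, and table name are hypothetical, and the standard-library `sqlite3` module stands in for SQLAlchemy so that the example runs without extra dependencies.

```python
import json
import sqlite3

# Hypothetical raw records, standing in for the content of a JSON file.
RAW_JSON = """
[
  {"sample_id": "S1", "gene": "BRCA1", "expression": "12.5"},
  {"sample_id": "S2", "gene": "tp53", "expression": "7.0"},
  {"sample_id": "S3", "gene": "EGFR", "expression": null}
]
"""


def extract(text):
    """Extract: parse the JSON text into a list of dictionaries."""
    return json.loads(text)


def transform(records):
    """Transform: drop incomplete rows, normalize gene names, cast values."""
    rows = []
    for rec in records:
        if rec["expression"] is None:
            continue  # skip records with a missing measurement
        rows.append(
            (rec["sample_id"], rec["gene"].upper(), float(rec["expression"]))
        )
    return rows


def load(rows, conn):
    """Load: write the cleaned rows into an SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS expression "
        "(sample_id TEXT, gene TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_JSON)), conn)
stored = conn.execute(
    "SELECT sample_id, gene, value FROM expression ORDER BY sample_id"
).fetchall()
print(stored)
```

Keeping the three steps in separate functions makes each one testable on its own. In an Extract-Load-Transform variant, the raw records would be loaded first and cleaned afterwards inside the database.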
Chapter 4. Data aggregation and data integration
- Data aggregation
- Data integration
- Practical work with mixOmics
Chapter 5. Data analysis
- Raw vs. analyzed data
- Gold, Silver, and Bronze Levels of Data
- Managing research code
- Workflow systems
- Practical work with Nextflow/Circos
Chapter 6. Organization
- File organization
- Naming convention
- Databases
- Storage and backups
- Version control systems (File versioning)
- Practical work: collaborative Python coding using git
Chapter 7. Managing sensitive data
- Types of sensitive data
- Keeping data secure (Quick overview, details in course B0906 Mechanisms of data protection)
- Anonymizing data (Quick overview, details in course B0906 Mechanisms of data protection)
- Practical work: data anonymization with Python
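As a small illustration of the anonymization topic above, the sketch below replaces a direct identifier with a keyed hash, drops the name, and generalizes the birth date to a year. The record layout, field names, and key are hypothetical. Strictly speaking this is pseudonymization: anyone holding the key can recompute the mapping, so the key must be stored separately from the data.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it is kept outside the dataset.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"


def pseudonymize(identifier, key=SECRET_KEY):
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability


def anonymize_record(record):
    """Return a copy of the record without direct identifiers."""
    return {
        "patient": pseudonymize(record["patient_id"]),
        "birth_year": record["birth_date"][:4],  # generalize date to year
        "diagnosis": record["diagnosis"],
    }


# Hypothetical input record.
record = {
    "patient_id": "P-0001",
    "name": "Jane Doe",
    "birth_date": "1984-06-12",
    "diagnosis": "T2D",
}
safe = anonymize_record(record)
print(safe)
```

The same identifier always yields the same pseudonym, so records belonging to one patient can still be linked across tables after the direct identifiers are removed.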
Chapter 8. Documentation
- Notes, Laboratory Information Management Systems (LIMS), and Electronic Laboratory Notebooks (ELN)
- Methods
- Other useful documentation formats
- Metadata
- Metadata management
- Standards
- Practical work: collaborative management of RNA sequencing data using eLabFTW: from experimental design to results sharing
Chapter 9. Reproducibility and interoperability
- Introduction (Quick overview, details in course B0909 Responsible Research and Innovation)
- Application programming interfaces (APIs)
- HTTP status codes
- Containerization
- Practical work: building an API with the Python package FastAPI on an open-access biological database and deploying it in a Docker container
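To illustrate how an API endpoint relates to HTTP status codes, here is a framework-free sketch using only the standard library. It is not a FastAPI application: the `GENES` table and the `get_gene` function are hypothetical stand-ins for a database lookup and a route handler.

```python
import json
from http import HTTPStatus

# Hypothetical in-memory table standing in for an open-access database.
GENES = {
    "BRCA1": {"symbol": "BRCA1", "chromosome": "17"},
    "TP53": {"symbol": "TP53", "chromosome": "17"},
}


def get_gene(symbol):
    """Return (status, JSON body) for a gene lookup, as an endpoint would."""
    if not symbol.isalnum():
        # Malformed request: the client sent an invalid symbol.
        body = {"detail": "invalid gene symbol"}
        return HTTPStatus.BAD_REQUEST, json.dumps(body)
    gene = GENES.get(symbol.upper())
    if gene is None:
        # Well-formed request, but the resource does not exist.
        body = {"detail": "gene not found"}
        return HTTPStatus.NOT_FOUND, json.dumps(body)
    return HTTPStatus.OK, json.dumps(gene)


for query in ("brca1", "XYZ", "a b"):
    status, body = get_gene(query)
    print(query, int(status), status.phrase, body)
```

In a web framework the same branching would sit inside a route function, with the framework turning the status and body into an HTTP response.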
Chapter 10. Conclusion
- Data sharing (Quick overview, details in course B0909 Responsible Research and Innovation)
- Open Access and its Green and Gold routes
- Data archiving
- Restarting the data life cycle