
Operational tools for data management in biosciences
Institution: ECOLE DU NUMERIQUE
Language: English
Programme(s) in which the course appears:
- Master Data Management in Biosciences [ECTS: 4.00]
Period: S3
Prerequisites:
- Python programming knowledge (intermediate level).
- Understanding of databases: Familiarity with SQL and basic database management would be helpful.
- Basic biology: Given the bioscience focus, students should have foundational knowledge in life sciences to understand biological datasets and their context.
- Familiarity with command-line interfaces: necessary for Bash scripting and for using tools such as Git.
By the end of the course, students should be able to:
- Understand the principles of data management, including planning, acquisition, processing, and sharing of scientific data.
- Implement basic data pipelines in accordance with ETL principles.
- Work with large-scale biological datasets and employ modern software tools to analyze, organize, and document research.
- Collaborate effectively in research using ELN tools and version control systems to ensure reproducibility and transparency.
- Deploy APIs and containerized applications for sharing and managing data in an open-access framework.
1. Introduction to data management
– Course content
-> The importance of data management
-> Data management in sciences
-> The data lifecycle
–> Example of a data lifecycle in the biomedical field
-> The main characteristics of well-managed data
–> The FAIR data principles
– Practical works
-> Case study: Research data management at ETH
2. Planning for data management
– Course content
-> Creating a data management plan
–> Examples of data management plans in biosciences projects
-> Data policies
-> Case studies
– Comments
Data policies: quick overview only; details are covered in course B0908 European environment and policies in life sciences and public health.
3. Data acquisition and pre-processing
– Course content
-> Raw vs analysed data: Gold, Silver, and Bronze Levels of Data
-> Data integration and data aggregation
–> The Extract-Transform-Load and Extract-Load-Transform processes
–> Examples of tools used for ETL
– Tutorials
-> The JSON data format; manipulating JSON files using Python
-> REST APIs; querying APIs using Python (requests package)
– Practical works
-> ETL with Python on Cedrus data (SQLAlchemy package)
-> Data scraping with Python (BeautifulSoup and requests packages)
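
As an illustration of the Extract-Transform-Load process listed in this section, here is a minimal sketch that uses only the Python standard library. It relies on sqlite3 rather than SQLAlchemy (the package named for the practical work) so that it stays self-contained. The sample records, field names, and table name are hypothetical and are not Cedrus data.

```python
import json
import sqlite3

# Hypothetical raw records, standing in for a JSON payload returned by an API.
RAW_JSON = """
[
  {"sample_id": "S1", "species": " Homo sapiens ", "length_bp": "1200"},
  {"sample_id": "S2", "species": "mus musculus", "length_bp": "950"},
  {"sample_id": "S3", "species": "Homo sapiens", "length_bp": null}
]
"""


def extract(raw_text):
    """Extract: parse the JSON text into a list of dictionaries."""
    return json.loads(raw_text)


def transform(records):
    """Transform: normalise species names, cast lengths, drop incomplete rows."""
    cleaned = []
    for record in records:
        if record["length_bp"] is None:
            continue  # skip rows with a missing measurement
        species = record["species"].strip().capitalize()
        cleaned.append((record["sample_id"], species, int(record["length_bp"])))
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(sample_id TEXT PRIMARY KEY, species TEXT, length_bp INTEGER)"
    )
    connection.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")  # in-memory database for the sketch
load(transform(extract(RAW_JSON)), connection)
rows = connection.execute(
    "SELECT sample_id, species, length_bp FROM samples ORDER BY sample_id"
).fetchall()
print(rows)
```

Keeping the three steps in separate functions lets each one be tested or replaced independently. In an Extract-Load-Transform variant, the raw records would be loaded first and cleaned inside the database.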
4. File and Versioning Best Practices
– Course content
-> File organisation
-> Naming conventions
–> Example of a workflow respecting a file naming scheme
-> Format conventions
–> Example: PEP8 in Python
-> Version control systems (file versioning)
– Tutorials
-> Tutorial on Git
– Practical works
-> Collaborative work with Git: building an API on a static biological database with the Python package FastAPI
5. Scalable and Reproducible Data Workflows
– Course content
-> Scalability, reproducibility & interoperability in data analysis
-> Workflow systems
-> Package management environments
-> Containerization
– Practical works
-> Building a workflow with Nextflow, Conda, and Docker
6. Documentation
– Course content
-> Data and metadata
-> Metadata standards
-> Documentation formats: notes, Laboratory Information Management System (LIMS) and Electronic Laboratory Notebook (ELN).
-> Other useful documentation formats
– Practical works
-> Collaborative management of biological data using eLabFTW: from experimental design to results sharing
7. Managing sensitive data
– Course content
-> Types of sensitive data
-> Keeping data secure
-> Anonymizing data
– Practical works
-> Data anonymization with Python
– Comments
Quick overview only; details are covered in course B0906 Mechanisms of data protection.
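
To give a flavour of the data anonymization topic in this section, the sketch below shows two common de-identification steps in Python: replacing a direct identifier with a keyed hash (pseudonymization) and generalizing an exact age into a range. The records, field names, key, and helper functions are hypothetical. Note that keyed hashing is pseudonymization rather than full anonymization, since anyone holding the key can re-link the records.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(identifier, key=SECRET_KEY):
    """Return a stable keyed hash (HMAC-SHA256) of a direct identifier."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical patient records.
records = [
    {"patient_id": "P-0001", "name": "Alice Example", "age": 34, "glucose": 5.4},
    {"patient_id": "P-0002", "name": "Bob Example", "age": 61, "glucose": 7.1},
]

deidentified = [
    {
        "pseudo_id": pseudonymize(r["patient_id"]),
        "age_range": generalize_age(r["age"]),
        "glucose": r["glucose"],  # measurement kept as-is; the name is dropped
    }
    for r in records
]
print(deidentified)
```

The same identifier always maps to the same pseudonym, so records can still be joined across tables without exposing the original identifier.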
8. Data lifecycle: storage to reuse
– Course content
-> Storage and backups
-> Data sharing
–> Open access and its green and gold routes
-> Data archiving
-> Restarting the data life cycle
– Comments
Data sharing: quick overview only; details are covered in course B0909 Responsible Research and Innovation.