Operational tools for data management in biosciences

Institution: ECOLE DU NUMERIQUE

Language: English

Period: S3

  • Python programming knowledge (intermediate level).
  • Understanding of databases: Familiarity with SQL and basic database management would be helpful.
  • Basic biology: Given the bioscience focus, students should have foundational knowledge in life sciences to understand biological datasets and their context.
  • Familiarity with command-line interfaces: necessary for Bash scripting and for using tools like Git.

    By the end of the course, students should be able to:

    • Understand the principles of data management, including planning, acquisition, processing, and sharing of scientific data.
    • Implement basic data pipelines in accordance with ETL principles.
    • Work with large-scale biological datasets and employ modern software tools to analyze, organize, and document research.
    • Collaborate effectively in research using ELN tools and version control systems to ensure reproducibility and transparency.
    • Deploy APIs and containerized applications for sharing and managing data in an open-access framework.

    1. Introduction to data management
    – Course content
    -> The importance of data management
    -> Data management in sciences
    -> The data lifecycle
    –> Example of a data lifecycle in the biomedical field
    -> The main characteristics of well-managed data
    –> The FAIR data principles
    – Practical works
    -> Case study: Research data management at ETH
    2. Planning for data management
    – Course content
    -> Creating a data management plan
    –> Examples of data management plans in biosciences projects
    -> Data policies
    -> Case studies
    – Comments
    Data policies: quick overview only; details are covered in course B0908 European environment and policies in life sciences and public health
    3. Data acquisition and pre-processing
    – Course content
    -> Raw vs analysed data: Gold, Silver, and Bronze Levels of Data
    -> Data integration and data aggregation
    –> The Extract-Transform-Load and Extract-Load-Transform processes
    –> Examples of tools used for ETL
    – Tutorials
    -> The JSON data format; manipulating JSON files using Python
    -> REST APIs; manipulating APIs using Python (requests package)
    – Practical works
    -> ETL with Python on Cedrus data (SQLAlchemy package)
    -> Data scraping with Python (BeautifulSoup and requests packages)
    4. File and Versioning Best Practices
    – Course content
    -> File organisation
    -> Naming conventions
    –> Example of a workflow respecting a file naming scheme
    -> Format conventions
    –> Example: PEP8 in Python
    -> Version control systems (file versioning)
    – Tutorials
    -> Tutorial on Git
    – Practical works
    -> Collaborative work on Git to construct an API with the Python package FastAPI on a biological (static) database
    5. Scalable and Reproducible Data Workflows
    – Course content
    -> Scalability, reproducibility & interoperability in data analysis
    -> Workflow systems
    -> Package management environments
    -> Containerization
    – Practical works
    -> Constructing a workflow with Nextflow/conda/docker
    6. Documentation
    – Course content
    -> Data and metadata
    -> Metadata standards
    -> Documentation formats: notes, Laboratory Information Management System (LIMS) and Electronic Laboratory Notebook (ELN).
    -> Other useful documentation formats
    – Practical works
    -> Collaborative management of biological data using eLabFTW: from experimental design to results sharing
    7. Managing sensitive data
    – Course content
    -> Types of sensitive data
    -> Keeping data secure
    -> Anonymizing data
    – Practical works
    -> Data anonymization with Python
    – Comments
    Quick overview only; details are covered in course B0906 Mechanisms of data protection
    8. Data lifecycle: storage to reuse
    – Course content
    -> Storage and backups
    -> Data sharing
    –> Open access and its green and gold routes
    -> Data archiving
    -> Restarting the data life cycle
    – Comments
    Data sharing: quick overview only; details are covered in course B0909 Responsible Research and Innovation