geopephub

Automatic uploader of GEO metadata projects to PEPhub.

This repository contains the geopephub CLI, which uploads GEO projects to PEPhub automatically, either selected by date or on a schedule using GitHub Actions. The CLI also includes a download command that retrieves projects from a specified namespace directly from the PEPhub database. This is particularly helpful for downloading all GEO projects at once.


Documentation: https://pep.databio.org/pephub

Source Code: https://github.com/pepkit/geopephub


Installation

To install geopephub, run:

pip install git+https://github.com/pepkit/geopephub.git

Overview:

geopephub consists of four main components:

1) Queuer: Scans GEO for new projects, creates a new cycle for the current run, and logs details for each GEO project. It sets each project's status to queued and adds it to the database.
2) Uploader: Checks the cycle_status table for queued cycles. It retrieves the list of queued projects, runs geofetch to download them, and uploads the results to the PEPhub database using pepdbagent. geopephub updates the project upload status at each step, so that a failed upload can be diagnosed later.
3) Checker: Examines previous cycles, verifies their status, and determines whether they were executed. If a cycle was not executed or was unsuccessful, it triggers a rerun. If only individual projects in a cycle were unsuccessful, it attempts to upload them again. If the cycle does not exist, it creates one using the queuer and uploads the files using the uploader.
4) Downloader: Retrieves projects from the specified namespace, filters them by upload or update date, and optionally sorts them by name or date. It also allows setting a limit on the number of downloaded projects. Projects can be downloaded locally or to a specified S3 bucket.

For more information, use the geopephub --help command.
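The Downloader's filter, sort, and limit steps can be illustrated with a small in-memory sketch. It does not use the geopephub or PEPhub APIs: `ProjectRecord`, `select_projects`, and the sample names are hypothetical and only show the order of operations described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ProjectRecord:
    """Hypothetical stand-in for a project entry; not a PEPhub type."""
    name: str
    last_update: date


def select_projects(records, start=None, end=None, sort_by=None, limit=None):
    """Filter by update date, optionally sort, then apply an optional limit."""
    selected = [
        r for r in records
        if (start is None or r.last_update >= start)
        and (end is None or r.last_update <= end)
    ]
    if sort_by == "name":
        selected.sort(key=lambda r: r.name)
    elif sort_by == "date":
        selected.sort(key=lambda r: r.last_update)
    return selected if limit is None else selected[:limit]


# Made-up records for illustration only.
records = [
    ProjectRecord("gse300", date(2024, 3, 1)),
    ProjectRecord("gse100", date(2024, 1, 15)),
    ProjectRecord("gse200", date(2024, 2, 10)),
]
recent = select_projects(records, start=date(2024, 2, 1), sort_by="name", limit=5)
print([r.name for r in recent])
```

Applying the limit last means it caps the already filtered and sorted list, which is one reasonable reading of the description; the real command may order these steps differently.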

More information about these processes can be found in the flowcharts and overview below.

[Image: geopephub overview diagram]

Queuer Flowchart:

%%{init: {'theme':'forest'}}%%
stateDiagram-v2
    s1 --> s2 
    s2 --> s3
    s3 --> s4
    s4 --> s5
    s1: Create a new cycle
    s2: Find GEO updated projects with geofetch Finder
    s3: Add projects to the queue in sample status table
    s4: Change cycle status to queued
    s5: Exit
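As a rough illustration of the Queuer steps in the flowchart above, the following sketch keeps the cycle and its project statuses in memory. The `Cycle` class, the `"initial"` status, `find_updated_projects`, and the accession values are assumptions made for the example; the real tool stores this state in database tables and uses the geofetch Finder to discover projects.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Cycle:
    """Hypothetical in-memory cycle record (the real tool uses DB tables)."""
    namespace: str
    run_date: date
    status: str = "initial"  # assumed starting status
    projects: dict = field(default_factory=dict)  # GSE accession -> status


def find_updated_projects(run_date):
    """Placeholder for the geofetch Finder lookup; returns made-up accessions."""
    return ["GSE000001", "GSE000002"]


def run_queuer(namespace, run_date, finder=find_updated_projects):
    cycle = Cycle(namespace, run_date)   # s1: create a new cycle
    for gse in finder(run_date):         # s2: find updated GEO projects
        cycle.projects[gse] = "queued"   # s3: add each project to the queue
    cycle.status = "queued"              # s4: mark the cycle as queued
    return cycle                         # s5: exit


cycle = run_queuer("geo", date(2024, 1, 1))
print(cycle.status, sorted(cycle.projects))
```

Passing the finder as a parameter keeps the sketch runnable without network access and makes the GEO lookup easy to replace in tests.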

Uploader Flowchart:

%%{init: {'theme':'forest'}}%%
stateDiagram-v2
    s1 --> s2 
    s2 --> s3
    s3 --> s4
    s4 --> s5
    s5 --> s6
    s6 --> s7
    s7 --> s8

    s7 --> s2
    s6 --> s3

    s1: Get queued cycles by specifying namespace
    s2: Change status of the cycle
    s2: Get each element from the list of queued cycles
    s3: Get each project (GSE) from one cycle
    s4: Change status of the project in project_status_table
    s5: Get specified project by running Geofetcher
    s6: Using pepdbagent add project to the DB
    s6: Change status of the project in project_status_table
    s7: Change status of cycle in cycle_status_table
    s8: Exit
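The Uploader loop above can be sketched as two nested loops with a status update around each step. This is a simplified model, not the geopephub implementation: cycles are plain dictionaries, `fetch` and `upload` are injected callables standing in for geofetch and pepdbagent, and the status strings other than `queued` are assumed. It also assumes a cycle is marked `success` once every project has been attempted, with failures recorded per project.

```python
def run_uploader(cycles, fetch, upload):
    """Process every queued cycle, recording a status for each project."""
    for cycle in cycles:                           # s1/s2: walk queued cycles
        if cycle["status"] != "queued":
            continue
        cycle["status"] = "processing"
        for gse in list(cycle["projects"]):        # s3: each project (GSE)
            cycle["projects"][gse] = "processing"  # s4: update project status
            try:
                project = fetch(gse)               # s5: download the project
                upload(gse, project)               # s6: add it to the DB
                cycle["projects"][gse] = "success"
            except Exception as err:
                # Keep the reason so a later check can see why it failed.
                cycle["projects"][gse] = f"failure: {err}"
        cycle["status"] = "success"                # s7: cycle was processed
    return cycles


def fake_fetch(gse):
    """Stand-in downloader that fails for one made-up accession."""
    if gse == "GSE000002":
        raise RuntimeError("download failed")
    return {"name": gse}


stored = {}  # stand-in for the database
cycles = [{"status": "queued",
           "projects": {"GSE000001": "queued", "GSE000002": "queued"}}]
run_uploader(cycles, fake_fetch, stored.__setitem__)
print(cycles[0]["projects"])
```

Catching the exception per project lets one failed download leave the rest of the cycle unaffected, which is what allows failed projects to be retried individually later.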

Checker Flowchart:

graph TD
    A[Choose cycle to check] --> B{Did it run?}
    B -->|Yes| C{Was it successful?}
    B -->|No| D[Run Queuer for the cycle]
    C -->|Yes| E{Did all samples succeed?}
    C -->|No| D

    D --> D1[Run Uploader for the cycle]
    D1 --> K

    E --> |Yes| K[Exit]
    E --> |No| G[Retrieve failed samples]

    G --> H[Run Queuer for samples]
    H --> F[Run Uploader for queued samples]

    F --> I[Change samples status in the table]

    I --> J[Change cycle status in the table]

    J --> K[Exit]
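The decision logic in the Checker flowchart above can be written as a small pure function. The dictionary layout (`ran`, `successful`, `projects`) and the returned action names are hypothetical, chosen only to mirror the branches of the diagram.

```python
def plan_check(cycle):
    """Return (action, failed_projects) for one cycle, following the flowchart.

    `cycle` is None when the cycle does not exist, otherwise a dict with
    the assumed keys "ran", "successful", and "projects" (GSE -> status).
    """
    # Missing, not executed, or unsuccessful: rerun queuer and uploader.
    if cycle is None or not cycle["ran"] or not cycle["successful"]:
        return ("rerun_cycle", [])
    # Cycle succeeded: look for individual projects that did not.
    failed = sorted(g for g, s in cycle["projects"].items() if s != "success")
    if failed:
        return ("retry_projects", failed)
    return ("done", [])


example = {
    "ran": True,
    "successful": True,
    "projects": {"GSE000001": "success", "GSE000002": "failure"},
}
print(plan_check(example))
```

Keeping the decision separate from the actions makes each branch easy to test without touching the database or GEO.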