geofetch tutorial for processed data
The GSE185701 data set has about 355 Mb of processed data that contains 57 Supplementary files, so it's a quick download for a test case. Let's take a quick peek at the geofetch version:
geofetch --version
geofetch 0.10.1
To see your CLI options, invoke geofetch -h.
Calling geofetch will do 4 tasks:
- download all or filtered processed files from GSE##### into your geo folder.
- download all metadata from GEO and store in your metadata folder.
- produce a PEP-compatible sample table, PROJECT_NAME_sample_processed.csv and PROJECT_NAME_series_processed.csv, in your metadata folder.
- produce a PEP-compatible project configuration file, PROJECT_NAME_sample_processed.yaml and PROJECT_NAME_series_processed.yaml, in your metadata folder.
Complete details about geofetch outputs are cataloged in the metadata outputs reference.
Download the data
First, create the metadata for processed data (by adding --processed and --just-metadata):
geofetch -i GSE185701 --processed -n bright_test --just-metadata
Metadata folder: /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test
Trying GSE185701 (not a file) as accession...
Skipped 0 accessions. Starting now.
Processing accession 1 of 1: 'GSE185701'
--2022-07-08 12:34:57-- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gse&acc=GSE185701&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_GSE.soft’
/home/bnt4me/Virgin [ <=> ] 2.82K --.-KB/s in 0s
2022-07-08 12:34:57 (973 MB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_GSE.soft’ saved [2885]
--2022-07-08 12:34:57-- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gsm&acc=GSE185701&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_GSM.soft’
/home/bnt4me/Virgin [ <=> ] 39.51K 132KB/s in 0.3s
2022-07-08 12:34:58 (132 KB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_GSM.soft’ saved [40454]
--2022-07-08 12:34:58-- ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE185nnn/GSE185701/suppl/filelist.txt
=> ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_file_list.txt’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::10, 2607:f220:41e:250::7, 165.112.9.229, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /geo/series/GSE185nnn/GSE185701/suppl ... done.
==> SIZE filelist.txt ... 794
==> EPSV ... done. ==> RETR filelist.txt ... done.
Length: 794 (unauthoritative)
filelist.txt 100%[===================>] 794 --.-KB/s in 0s
2022-07-08 12:34:58 (219 MB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_file_list.txt’ saved [794]
0
Total number of processed SAMPLES files found is: 8
Total number of processed SERIES files found is: 1
Expanding metadata list...
Expanding metadata list...
Finished processing 1 accession(s)
Unifying and saving of metadata...
File /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/PEP_samples/GSE185701_samples.csv has been saved successfully
Config file: /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/PEP_samples/GSE185701_samples.yaml
ls bright_test
GSE185701_file_list.txt GSE185701_GSE.soft GSE185701_GSM.soft PEP_samples
The .soft files are the direct output from GEO and contain all the metadata as stored by GEO, for both the experiment (_GSE) and the individual samples (_GSM). Geofetch also produces a csv file with the SRA metadata. The filtered version (ending in _filt) would contain only the specified subset of the samples if we didn't request them all, but in this case, since we only gave an accession, it is identical to the complete file. Additionally, file_list.txt is downloaded; it contains information about the size, type and creation date of all sample files.
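To get a feel for what the .soft files hold, here is a minimal Python sketch that parses SOFT-style text into entities and their attributes. It assumes the usual SOFT layout, where a line starting with ^ opens an entity and lines starting with ! carry "key = value" attributes. The sample text is a hand-written stand-in with placeholder titles, not the real contents of GSE185701_GSM.soft, and parse_soft is a hypothetical helper, not part of geofetch.

```python
from collections import defaultdict


def parse_soft(text):
    """Parse SOFT-style text into {entity_id: {attribute: [values]}}."""
    entities = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("^"):
            # A line such as "^SAMPLE = GSM5621756" opens a new entity.
            _, _, entity_id = line.partition("=")
            current = defaultdict(list)
            entities[entity_id.strip()] = current
        elif line.startswith("!") and current is not None:
            # Attribute lines look like "!Sample_title = some text".
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line[1:].partition("=")
            current[key.strip()].append(value.strip())
    return entities


# Hand-written stand-in for a GSM .soft file (titles are placeholders).
EXAMPLE = """\
^SAMPLE = GSM5621756
!Sample_title = placeholder title A
!Sample_taxid_ch1 = 9606
^SAMPLE = GSM5621758
!Sample_title = placeholder title B
!Sample_taxid_ch1 = 9606
"""

records = parse_soft(EXAMPLE)
for gsm, attrs in records.items():
    print(gsm, attrs["Sample_taxid_ch1"])
```

Values are kept as lists because, in this sketch, a key that appears several times within one entity simply accumulates its values.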
Finally, there are the 2 files that make up the PEP, both in the PEP_samples folder: the project configuration file (GSE185701_samples.yaml) and the sample table (GSE185701_samples.csv). Let's see what's in these files now.
cat bright_test/PEP_samples/GSE185701_samples.yaml
# Autogenerated by geofetch
pep_version: 2.1.0
project_name: GSE185701
sample_table: GSE185701_samples.csv
sample_modifiers:
  append:
    output_file_path: FILES
    sample_growth_protocol_ch1: Huh 7 was cultured in Dulbecco’s modified Eagle’s medium (DMEM) (Invitrogen, Carlsbad, CA, USA) containing 10% fetal bovine serum (FBS) (HyClone, Logan, UT, USA) and antibiotics (penicillin and streptomycin, Invitrogen) at 37 °C in 5% CO2.
  derive:
    attributes: [output_file_path]
    sources:
      FILES: /{gse}/{file}
There are a few important things to note in this file:
- First, sample_table points to the csv file produced by geofetch.
- Second, output_file_path is the location of all the files.
- Third, sample_growth_protocol_ch1 is a constant sample attribute (identical for every sample) that is longer than 50 characters, so it is removed from the csv file and stored once in the append section of sample_modifiers. For large projects this can significantly reduce the size of the metadata.
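The third point can be sketched in a few lines of Python. This is only an illustration of the idea, not geofetch's actual implementation: split_constant_columns is a hypothetical helper, the 50-character threshold is taken from the text above, and the growth protocol string is abbreviated.

```python
def split_constant_columns(rows, min_length=50):
    """Separate long constant columns from a list of sample dicts.

    Returns (trimmed_rows, constants), where constants maps each removed
    column to the single value it held in every row.
    """
    if not rows:
        return [], {}
    constants = {}
    for column in rows[0]:
        values = {row[column] for row in rows}
        if len(values) == 1:
            value = next(iter(values))
            # Only long values are worth moving out of the table.
            if len(value) > min_length:
                constants[column] = value
    trimmed = [
        {key: val for key, val in row.items() if key not in constants}
        for row in rows
    ]
    return trimmed, constants


# Abbreviated version of the growth protocol text shown in the yaml file.
growth = (
    "Huh 7 was cultured in Dulbecco's modified Eagle's medium (DMEM) "
    "containing 10% fetal bovine serum (FBS) and antibiotics."
)
samples = [
    {"sample_geo_accession": "GSM5621756", "sample_taxid_ch1": "9606",
     "sample_growth_protocol_ch1": growth},
    {"sample_geo_accession": "GSM5621758", "sample_taxid_ch1": "9606",
     "sample_growth_protocol_ch1": growth},
]

trimmed, constants = split_constant_columns(samples)
print(sorted(constants))   # columns moved to the append section
print(sorted(trimmed[0]))  # columns that stay in the csv
```

Note that sample_taxid_ch1 is also constant here, but it is short, so in this sketch it stays in the table.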
Now let's look at the first 100 characters of each line of the csv file:
cut -c -100 bright_test/PEP_samples/GSE185701_samples.csv
sample_taxid_ch1,sample_geo_accession,sample_channel_count,sample_instrument_model,biosample,supplem
9606,GSM5621756,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223730,wig files were gen
9606,GSM5621756,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223730,wig files were gen
9606,GSM5621758,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223732,wig files were gen
9606,GSM5621758,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223732,wig files were gen
9606,GSM5621760,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223728,wig files were gen
9606,GSM5621760,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223728,wig files were gen
9606,GSM5621761,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223729,wig files were gen
9606,GSM5621761,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223729,wig files were gen
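Each accession appears twice in the output above, presumably one row per processed file (an assumption based on the 8 sample files reported for 4 samples). A small Python sketch using only the standard library can count rows per sample. The inline text reuses the first five columns of the output above; rows_per_sample is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# First five columns of the csv output shown above (remaining columns omitted).
CSV_TEXT = """\
sample_taxid_ch1,sample_geo_accession,sample_channel_count,sample_instrument_model,biosample
9606,GSM5621756,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223730
9606,GSM5621756,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223730
9606,GSM5621758,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223732
9606,GSM5621758,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223732
"""


def rows_per_sample(handle):
    """Count how many table rows each GEO sample accession has."""
    reader = csv.DictReader(handle)
    return Counter(row["sample_geo_accession"] for row in reader)


counts = rows_per_sample(io.StringIO(CSV_TEXT))
for accession, n_rows in counts.items():
    print(accession, n_rows)
```

To run this on the real table, pass open("bright_test/PEP_samples/GSE185701_samples.csv", newline="") instead of the in-memory text.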
Now let's download the actual data. This time we will download the processed files from the GSE185701 data set.
Let's add a few more arguments:
- --geo-folder (required): path to the folder where processed files will be saved
- --filter: download only bed files (--filter ".bed.gz$")
- --data-source: download files only from the sample location (--data-source samples)
geofetch -i GSE185701 --processed -n bright_test --filter ".bed.gz$" --data-source samples \
--geo-folder /home/bnt4me/Virginia/for_docs/geo
Metadata folder: /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter
Trying GSE185701 (not a file) as accession...
Skipped 0 accessions. Starting now.
Processing accession 1 of 1: 'GSE185701'
--2022-07-08 12:36:16-- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gse&acc=GSE185701&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_GSE.soft’
/home/bnt4me/Virgin [ <=> ] 2.82K --.-KB/s in 0s
2022-07-08 12:36:16 (245 MB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_GSE.soft’ saved [2885]
--2022-07-08 12:36:16-- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gsm&acc=GSE185701&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_GSM.soft’
/home/bnt4me/Virgin [ <=> ] 39.51K --.-KB/s in 0.1s
2022-07-08 12:36:16 (269 KB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_GSM.soft’ saved [40454]
--2022-07-08 12:36:16-- ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE185nnn/GSE185701/suppl/filelist.txt
=> ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_file_list.txt’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::12, 2607:f220:41e:250::13, 130.14.250.13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /geo/series/GSE185nnn/GSE185701/suppl ... done.
==> SIZE filelist.txt ... 794
==> EPSV ... done. ==> RETR filelist.txt ... done.
Length: 794 (unauthoritative)
filelist.txt 100%[===================>] 794 --.-KB/s in 0s
2022-07-08 12:36:17 (2.55 MB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_file_list.txt’ saved [794]
0
Total number of processed SAMPLES files found is: 8
Total number of files after filter is: 4
Total number of processed SERIES files found is: 1
Total number of files after filter is: 0
Expanding metadata list...
Expanding metadata list...
--2022-07-08 12:36:17-- ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5621nnn/GSM5621756/suppl/GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz
=> ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::13, 2607:f220:41e:250::12, 165.112.9.229, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::13|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /geo/samples/GSM5621nnn/GSM5621756/suppl ... done.
==> SIZE GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz ... 785486
==> EPSV ... done. ==> RETR GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz ... done.
Length: 785486 (767K) (unauthoritative)
GSM5621756_ChIPseq_ 100%[===================>] 767.08K 1.64MB/s in 0.5s
2022-07-08 12:36:19 (1.64 MB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz’ saved [785486]
0
File /home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz has been downloaded successfully
--2022-07-08 12:36:19-- ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5621nnn/GSM5621758/suppl/GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz
=> ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::13, 2607:f220:41e:250::12, 165.112.9.229, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::13|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /geo/samples/GSM5621nnn/GSM5621758/suppl ... done.
==> SIZE GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz ... 784432
==> EPSV ... done. ==> RETR GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz ... done.
Length: 784432 (766K) (unauthoritative)
GSM5621758_ChIPseq_ 100%[===================>] 766.05K 1.03MB/s in 0.7s
2022-07-08 12:36:20 (1.03 MB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz’ saved [784432]
0
File /home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz has been downloaded successfully
--2022-07-08 12:36:21-- ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5621nnn/GSM5621760/suppl/GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz
=> ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::13, 2607:f220:41e:250::12, 165.112.9.229, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::13|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /geo/samples/GSM5621nnn/GSM5621760/suppl ... done.
==> SIZE GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz ... 163441
==> EPSV ... done. ==> RETR GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz ... done.
Length: 163441 (160K) (unauthoritative)
GSM5621760_CUTTag_H 100%[===================>] 159.61K 816KB/s in 0.2s
2022-07-08 12:36:21 (816 KB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz’ saved [163441]
0
File /home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz has been downloaded successfully
--2022-07-08 12:36:22-- ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5621nnn/GSM5621761/suppl/GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz
=> ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::13, 2607:f220:41e:250::12, 165.112.9.229, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::13|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /geo/samples/GSM5621nnn/GSM5621761/suppl ... done.
==> SIZE GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz ... 117250
==> EPSV ... done. ==> RETR GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz ... done.
Length: 117250 (115K) (unauthoritative)
GSM5621761_CUTTag_H 100%[===================>] 114.50K 318KB/s in 0.4s
2022-07-08 12:36:23 (318 KB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz’ saved [117250]
0
File /home/bnt4me/Virginia/for_docs/geo/GSE185701/GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz has been downloaded successfully
Finished processing 1 accession(s)
Unifying and saving of metadata...
File /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/PEP_samples/GSE185701_samples.csv has been saved successfully
Config file: /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/PEP_samples/GSE185701_samples.yaml
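The log above reports 8 processed sample files before filtering and 4 after. The sketch below shows how such a pattern behaves, assuming it is applied as a regular-expression search over file names (geofetch's internals are not shown here, and filter_files is a hypothetical helper). Two of the names come from the log; the other two are made up for illustration.

```python
import re


def filter_files(names, pattern):
    """Keep only the names in which the regular expression is found."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


names = [
    "GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz",  # from the log
    "GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz",          # from the log
    "GSM5621756_example_signal.wig.gz",                     # hypothetical name
    "GSM5621756_example_xbedygz",                           # hypothetical name
]

# An unescaped "." matches any character, so this pattern is looser
# than it looks and also accepts the last, made-up name.
loose = filter_files(names, ".bed.gz$")

# Escaping the dots restricts the match to a literal ".bed.gz" suffix.
strict = filter_files(names, r"\.bed\.gz$")

print(len(loose), len(strict))
```

For this data set both patterns would select the same four files; the escaped form only matters when other names happen to end in something like "xbedygz".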
Now let's list the folder to see what data is there, and then look at the PEP files again.
ls /home/bnt4me/Virginia/for_docs/geo/GSE185701
GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz
GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz
GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz
GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz
cut -c -100 PEP_samples/GSE185701_samples.csv
sample_platform_id,sample_library_strategy,sample_contact_country,sample_contact_name,sample_contact
GPL20795,ChIP-Seq,China,"Xianghuo,,He",Shanghai,HCC,"transfected with siNC using Lipofectamine RNAiM
GPL20795,ChIP-Seq,China,"Xianghuo,,He",Shanghai,HCC,"transfected with siDHX37 using Lipofectamine RN
GPL20795,OTHER,China,"Xianghuo,,He",Shanghai,HCC,"transfected with Flag-DHX37 lentivirus, renew the
GPL20795,OTHER,China,"Xianghuo,,He",Shanghai,HCC,untreated,SRA,Huh 7,hg38,Homo sapiens,HiSeq X Ten,h
cat PEP_samples/GSE185701_samples.yaml
# Autogenerated by geofetch
pep_version: 2.1.0
project_name: GSE185701
sample_table: GSE185701_samples.csv
sample_modifiers:
  append:
    output_file_path: FILES
    sample_growth_protocol_ch1: Huh 7 was cultured in Dulbecco’s modified Eagle’s medium (DMEM) (Invitrogen, Carlsbad, CA, USA) containing 10% fetal bovine serum (FBS) (HyClone, Logan, UT, USA) and antibiotics (penicillin and streptomycin, Invitrogen) at 37 °C in 5% CO2.
  derive:
    attributes: [output_file_path]
    sources:
      FILES: /home/bnt4me/Virginia/for_docs/geo/{gse}/{file}
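Notice that the FILES source now starts with the geo folder we passed on the command line. To see how the derive modifier turns the FILES key into a real path, here is a rough Python sketch of the substitution. It mimics the idea with plain string formatting and is not peppy's actual code; resolve_derived is a hypothetical helper, and it assumes each sample carries gse and file attributes, as the template suggests.

```python
def resolve_derived(sample, attribute, sources):
    """Return a copy of sample with `attribute` expanded from its source template."""
    resolved = dict(sample)
    source_key = sample[attribute]   # e.g. "FILES"
    template = sources[source_key]   # e.g. ".../{gse}/{file}"
    # Placeholders in braces are filled in from the sample's own attributes.
    resolved[attribute] = template.format(**sample)
    return resolved


sources = {"FILES": "/home/bnt4me/Virginia/for_docs/geo/{gse}/{file}"}
sample = {
    "gse": "GSE185701",
    "file": "GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz",
    "output_file_path": "FILES",  # value set by the append modifier
}

resolved = resolve_derived(sample, "output_file_path", sources)
print(resolved["output_file_path"])
```

The resulting path is the same location that the download log reports for this file.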
We can now easily access this data in further analysis with the peppy package in Python or pepr in R.