
NGSTk API Documentation

Overview

NGSTk (Next-Generation Sequencing Toolkit) is a toolkit class that provides helper functions for building command strings used in NGS pipelines. It can be configured with a YAML configuration file to specify custom tool paths, or it will use tools from the system PATH.

Key Features

  • Command Building: Generate command strings for common NGS tools
  • Configuration Management: Use custom tool paths via YAML config
  • Tool Integration: Built-in support for common tools like samtools, bedtools, etc.
  • Pipeline Integration: Works seamlessly with PipelineManager

Installation

NGSTk is included with pypiper:

pip install pypiper

Quick Example

from pypiper.ngstk import NGSTk

# Initialize NGSTk
tk = NGSTk()

# Generate a command
cmd = tk.samtools_index("sample.bam")
# Returns: "samtools index sample.bam"

API Reference

NGSTk Class

NGSTk

NGSTk(config_file=None, pm=None)

Bases: AttMapEcho

Class to hold functions to build command strings used during pipeline runs. Object can be instantiated with a string of a path to a yaml pipeline config file. Since NGSTk inherits from AttMapEcho, the passed config file and its elements will be accessible through the NGSTk object as attributes under config (e.g. NGSTk.tools.java). In case no config_file argument is passed, all commands will be returned assuming the tool is in the user's $PATH.
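The echo behavior matters for the no-config case: a missing attribute evaluates to its own name, so `tk.tools.samtools` becomes the bare string `samtools`, which is then resolved via `$PATH`. A minimal stand-in class (not the real AttMapEcho, just a sketch of the idea) illustrates this:

```python
class EchoMap(dict):
    """Toy stand-in for AttMapEcho: unknown attributes echo their own name."""

    def __getattr__(self, name):
        # Return the configured value if present; otherwise echo the key
        # itself, which yields a plain tool name resolved via $PATH.
        return self.get(name, name)


tools = EchoMap({"samtools": "/home/.local/samtools/bin/samtools"})
configured = tools.samtools   # value from the "config"
fallback = tools.bedtools     # no entry, so the name echoes back
```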

:param str config_file: Path to pipeline yaml config file (optional).
:param pypiper.PipelineManager pm: A PipelineManager with which to associate this toolkit instance; that is, essentially a source from which to grab paths to tools, resources, etc.

:Example:

from pypiper.ngstk import NGSTk
tk = NGSTk()
tk.samtools_index("sample.bam")
# returns: samtools index sample.bam

# Using a configuration file (custom executable location):
from pypiper.ngstk import NGSTk
tk = NGSTk("pipeline_config_file.yaml")
tk.samtools_index("sample.bam")
# returns: /home/.local/samtools/bin/samtools index sample.bam
Source code in pypiper/ngstk.py
def __init__(self, config_file=None, pm=None):
    # parse yaml into the project's attributes
    # self.add_entries(**config)
    super(NGSTk, self).__init__(
        None if config_file is None else load_yaml(config_file)
    )

    # Keep a link to the pipeline manager, if one is provided.
    # if None is provided, instantiate "tools" and "parameters" with empty AttMaps
    # this allows the usage of the same code for a command with and without using a pipeline manager
    if pm is not None:
        self.pm = pm
        if hasattr(pm.config, "tools"):
            self.tools = self.pm.config.tools
        else:
            self.tools = AttMapEcho()
        if hasattr(pm.config, "parameters"):
            self.parameters = self.pm.config.parameters
        else:
            self.parameters = AttMapEcho()
    else:
        self.tools = AttMapEcho()
        self.parameters = AttMapEcho()

    # If pigz is available, use that. Otherwise, default to gzip.
    if (
        hasattr(self.pm, "cores")
        and self.pm.cores > 1
        and self.check_command("pigz")
    ):
        self.ziptool_cmd = "pigz -f -p {}".format(self.pm.cores)
    else:
        self.ziptool_cmd = "gzip -f"

ziptool property

ziptool

Returns the command to use for compressing/decompressing.

:return str: Either 'gzip' or 'pigz' if installed and multiple cores
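The selection logic from the constructor can be sketched as a small pure function; `choose_ziptool` and its injectable `command_exists` checker are hypothetical names used here so the decision can be shown (and tested) without pigz installed:

```python
def choose_ziptool(cores, command_exists):
    """Mirror of the constructor's compression choice: prefer parallel pigz
    when multiple cores are available and pigz is callable, else gzip.
    `command_exists` is injected so the logic is testable anywhere."""
    if cores is not None and cores > 1 and command_exists("pigz"):
        return "pigz -f -p {}".format(cores)
    return "gzip -f"
```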

bam2fastq

bam2fastq(input_bam, output_fastq, output_fastq2=None, unpaired_fastq=None)

Create command to convert BAM(s) to FASTQ(s).

:param str input_bam: Path to sequencing reads file to convert
:param output_fastq: Path to FASTQ to write
:param output_fastq2: Path to (R2) FASTQ to write
:param unpaired_fastq: Path to unpaired FASTQ to write
:return str: Command to convert BAM(s) to FASTQ(s)

Source code in pypiper/ngstk.py
def bam2fastq(
    self, input_bam, output_fastq, output_fastq2=None, unpaired_fastq=None
):
    """
    Create command to convert BAM(s) to FASTQ(s).

    :param str input_bam: Path to sequencing reads file to convert
    :param output_fastq: Path to FASTQ to write
    :param output_fastq2: Path to (R2) FASTQ to write
    :param unpaired_fastq: Path to unpaired FASTQ to write
    :return str: Command to convert BAM(s) to FASTQ(s)
    """
    self._ensure_folders(output_fastq, output_fastq2, unpaired_fastq)
    cmd = self.tools.java + " -Xmx" + self.pm.javamem
    cmd += " -jar " + self.tools.picard + " SamToFastq"
    cmd += " INPUT={0}".format(input_bam)
    cmd += " FASTQ={0}".format(output_fastq)
    if output_fastq2 is not None and unpaired_fastq is not None:
        cmd += " SECOND_END_FASTQ={0}".format(output_fastq2)
        cmd += " UNPAIRED_FASTQ={0}".format(unpaired_fastq)
    return cmd

bam_conversions

bam_conversions(bam_file, depth=True)

Sort and index bam files for later use.

:param bool depth: also calculate coverage over each position

Source code in pypiper/ngstk.py
def bam_conversions(self, bam_file, depth=True):
    """
    Sort and index bam files for later use.

    :param bool depth: also calculate coverage over each position
    """
    cmd = (
        self.tools.samtools
        + " view -h "
        + bam_file
        + " > "
        + bam_file.replace(".bam", ".sam")
        + "\n"
    )
    cmd += (
        self.tools.samtools
        + " sort "
        + bam_file
        + " -o "
        + bam_file.replace(".bam", "_sorted.bam")
        + "\n"
    )
    cmd += (
        self.tools.samtools
        + " index "
        + bam_file.replace(".bam", "_sorted.bam")
        + "\n"
    )
    if depth:
        cmd += (
            self.tools.samtools
            + " depth "
            + bam_file.replace(".bam", "_sorted.bam")
            + " > "
            + bam_file.replace(".bam", "_sorted.depth")
            + "\n"
        )
    return cmd
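The sibling output paths are derived purely by replacing the `.bam` suffix. A sketch of just that derivation (the helper name `derived_names` is made up for illustration):

```python
def derived_names(bam_file):
    """Sketch of the filename derivations bam_conversions performs
    via str.replace on the .bam suffix."""
    return {
        "sam": bam_file.replace(".bam", ".sam"),
        "sorted_bam": bam_file.replace(".bam", "_sorted.bam"),
        "depth": bam_file.replace(".bam", "_sorted.depth"),
    }


names = derived_names("sample.bam")
```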

bam_to_bigwig

bam_to_bigwig(input_bam, output_bigwig, genome_sizes, genome, tagmented=False, normalize=False, norm_factor=1000)

Convert a BAM file to a bigWig file.

:param str input_bam: path to BAM file to convert
:param str output_bigwig: path to which to write file in bigwig format
:param str genome_sizes: path to file with chromosome size information
:param str genome: name of genomic assembly
:param bool tagmented: flag related to read-generating protocol
:param bool normalize: whether to normalize coverage
:param int norm_factor: number of bases to use for normalization
:return list[str]: sequence of commands to execute

Source code in pypiper/ngstk.py
def bam_to_bigwig(
    self,
    input_bam,
    output_bigwig,
    genome_sizes,
    genome,
    tagmented=False,
    normalize=False,
    norm_factor=1000,
):
    """
    Convert a BAM file to a bigWig file.

    :param str input_bam: path to BAM file to convert
    :param str output_bigwig: path to which to write file in bigwig format
    :param str genome_sizes: path to file with chromosome size information
    :param str genome: name of genomic assembly
    :param bool tagmented: flag related to read-generating protocol
    :param bool normalize: whether to normalize coverage
    :param int norm_factor: number of bases to use for normalization
    :return list[str]: sequence of commands to execute
    """
    # TODO:
    # adjust fragment length dependent on read size and real fragment size
    # (right now it assumes 50bp reads with 180bp fragments)
    cmds = list()
    transient_file = os.path.abspath(re.sub(r"\.bigWig", "", output_bigwig))
    cmd1 = self.tools.bedtools + " bamtobed -i {0} |".format(input_bam)
    if not tagmented:
        cmd1 += (
            " "
            + self.tools.bedtools
            + " slop -i stdin -g {0} -s -l 0 -r 130 |".format(genome_sizes)
        )
        cmd1 += " fix_bedfile_genome_boundaries.py {0} |".format(genome)
    cmd1 += (
        " "
        + self.tools.genomeCoverageBed
        + " {0}-bg -g {1} -i stdin > {2}.cov".format(
            "-5 " if tagmented else "", genome_sizes, transient_file
        )
    )
    cmds.append(cmd1)
    if normalize:
        cmds.append(
            """awk 'NR==FNR{{sum+= $4; next}}{{ $4 = ($4 / sum) * {1}; print}}' {0}.cov {0}.cov | sort -k1,1 -k2,2n > {0}.normalized.cov""".format(
                transient_file, norm_factor
            )
        )
    cmds.append(
        self.tools.bedGraphToBigWig
        + " {0}{1}.cov {2} {3}".format(
            transient_file,
            ".normalized" if normalize else "",
            genome_sizes,
            output_bigwig,
        )
    )
    # remove tmp files
    cmds.append("if [[ -s {0}.cov ]]; then rm {0}.cov; fi".format(transient_file))
    if normalize:
        cmds.append(
            "if [[ -s {0}.normalized.cov ]]; then rm {0}.normalized.cov; fi".format(
                transient_file
            )
        )
    cmds.append("chmod 755 {0}".format(output_bigwig))
    return cmds
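The awk normalization step scales each coverage value to its fraction of total coverage, times `norm_factor`. The same arithmetic in plain Python (a sketch; `normalize_coverage` is not part of NGSTk):

```python
def normalize_coverage(values, norm_factor=1000):
    """Python equivalent of the awk step:
    scaled = (value / sum_of_values) * norm_factor."""
    total = sum(values)
    return [(v / total) * norm_factor for v in values]


scaled = normalize_coverage([2, 3, 5], norm_factor=1000)
```

After scaling, the values always sum to `norm_factor`, which is what makes tracks comparable across libraries of different depth.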

bam_to_fastq

bam_to_fastq(bam_file, out_fastq_pre, paired_end)

Build command to convert BAM file to FASTQ file(s) (R1/R2).

:param str bam_file: path to BAM file with sequencing reads
:param str out_fastq_pre: path prefix for output FASTQ file(s)
:param bool paired_end: whether the given file contains paired-end or single-end sequencing reads
:return str: file conversion command, ready to run

Source code in pypiper/ngstk.py
def bam_to_fastq(self, bam_file, out_fastq_pre, paired_end):
    """
    Build command to convert BAM file to FASTQ file(s) (R1/R2).

    :param str bam_file: path to BAM file with sequencing reads
    :param str out_fastq_pre: path prefix for output FASTQ file(s)
    :param bool paired_end: whether the given file contains paired-end
        or single-end sequencing reads
    :return str: file conversion command, ready to run
    """
    self.make_sure_path_exists(os.path.dirname(out_fastq_pre))
    cmd = self.tools.java + " -Xmx" + self.pm.javamem
    cmd += " -jar " + self.tools.picard + " SamToFastq"
    cmd += " I=" + bam_file
    cmd += " F=" + out_fastq_pre + "_R1.fastq"
    if paired_end:
        cmd += " F2=" + out_fastq_pre + "_R2.fastq"
    cmd += " INCLUDE_NON_PF_READS=true"
    cmd += " QUIET=true"
    cmd += " VERBOSITY=ERROR"
    cmd += " VALIDATION_STRINGENCY=SILENT"
    return cmd

bam_to_fastq_awk

bam_to_fastq_awk(bam_file, out_fastq_pre, paired_end, zipmode=False)

This converts a bam file to fastq files using awk. As of 2016, this is much faster than the standard way of doing this using Picard, and also much faster than the bedtools implementation; however, it does no sanity checks and assumes the reads (for paired data) are all paired (no singletons), in the correct order.

:param bool zipmode: Should the output be zipped?

Source code in pypiper/ngstk.py
def bam_to_fastq_awk(self, bam_file, out_fastq_pre, paired_end, zipmode=False):
    """
    This converts bam file to fastq files, but using awk. As of 2016, this is much faster
    than the standard way of doing this using Picard, and also much faster than the
    bedtools implementation as well; however, it does no sanity checks and assumes the reads
    (for paired data) are all paired (no singletons), in the correct order.
    :param bool zipmode: Should the output be zipped?
    """
    self.make_sure_path_exists(os.path.dirname(out_fastq_pre))
    fq1 = out_fastq_pre + "_R1.fastq"
    fq2 = out_fastq_pre + "_R2.fastq"

    if zipmode:
        fq1 = fq1 + ".gz"
        fq2 = fq2 + ".gz"
        fq1_target = ' | "' + self.ziptool + " -c  > " + fq1 + '"'
        fq2_target = ' | "' + self.ziptool + " -c  > " + fq2 + '"'
    else:
        fq1_target = ' > "' + fq1 + '"'
        fq2_target = ' > "' + fq2 + '"'

    if paired_end:
        cmd = self.tools.samtools + " view " + bam_file + " | awk '"
        cmd += r'{ if (NR%2==1) print "@"$1"/1\n"$10"\n+\n"$11' + fq1_target + ";"
        cmd += r' else print "@"$1"/2\n"$10"\n+\n"$11' + fq2_target + "; }"
        cmd += "'"  # end the awk command
    else:
        fq2 = None
        cmd = self.tools.samtools + " view " + bam_file + " | awk '"
        cmd += r'{ print "@"$1"\n"$10"\n+\n"$11' + fq1_target + "; }"
        cmd += "'"
    return cmd, fq1, fq2
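The awk command dispatches on `NR%2`: odd-numbered records (1-based) go to R1, even-numbered records to R2. The same pairing logic in Python (a sketch with a made-up helper name, carrying the same no-singletons assumption as the awk version):

```python
def split_interleaved(records):
    """Mimic the awk NR%2 dispatch: odd records (1-based) -> R1,
    even records -> R2. Assumes strictly paired, ordered input."""
    r1 = [rec for i, rec in enumerate(records, start=1) if i % 2 == 1]
    r2 = [rec for i, rec in enumerate(records, start=1) if i % 2 == 0]
    return r1, r2


r1, r2 = split_interleaved(["read1/1", "read1/2", "read2/1", "read2/2"])
```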

bam_to_fastq_bedtools

bam_to_fastq_bedtools(bam_file, out_fastq_pre, paired_end)

Converts bam to fastq; A version using bedtools

Source code in pypiper/ngstk.py
def bam_to_fastq_bedtools(self, bam_file, out_fastq_pre, paired_end):
    """
    Converts bam to fastq; A version using bedtools
    """
    self.make_sure_path_exists(os.path.dirname(out_fastq_pre))
    fq1 = out_fastq_pre + "_R1.fastq"
    fq2 = None
    cmd = (
        self.tools.bedtools
        + " bamtofastq -i "
        + bam_file
        + " -fq "
        + fq1
    )
    if paired_end:
        fq2 = out_fastq_pre + "_R2.fastq"
        cmd += " -fq2 " + fq2

    return cmd, fq1, fq2

calc_frip

calc_frip(input_bam, input_bed, threads=4)

Calculate fraction of reads in peaks.

A file with a pool of sequencing reads and a file with peak-call regions define the operation that will be performed. The thread count for samtools can be specified as well.

:param str input_bam: sequencing reads file
:param str input_bed: file with called peak regions
:param int threads: number of threads samtools may use
:return float: fraction of reads in peaks defined in given peaks file

Source code in pypiper/ngstk.py
def calc_frip(self, input_bam, input_bed, threads=4):
    """
    Calculate fraction of reads in peaks.

    A file with a pool of sequencing reads and a file with peak call
    regions define the operation that will be performed. Thread count
    for samtools can be specified as well.

    :param str input_bam: sequencing reads file
    :param str input_bed: file with called peak regions
    :param int threads: number of threads samtools may use
    :return float: fraction of reads in peaks defined in given peaks file
    """
    cmd = self.simple_frip(input_bam, input_bed, threads)
    return subprocess.check_output(cmd, shell=True).decode().strip()
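The quantity computed by the underlying pipeline is simple: reads overlapping peak regions divided by all reads. A sketch of the arithmetic (the `frip` helper is illustrative, not part of NGSTk, which does the counting on real files via samtools):

```python
def frip(reads_in_peaks, total_reads):
    """Fraction of reads in peaks: peak-overlapping reads / all reads."""
    return reads_in_peaks / total_reads


value = round(frip(2_500_000, 10_000_000), 2)
```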

check_command

check_command(command)

Check if command can be called.

Source code in pypiper/ngstk.py
def check_command(self, command):
    """
    Check if command can be called.
    """

    # Use `command` to see if command is callable, store exit code
    code = os.system(
        "command -v {0} >/dev/null 2>&1 || {{ exit 1; }}".format(command)
    )

    # If exit code is not 0, report which command failed and return False, else return True
    if code != 0:
        print("Command is not callable: {0}".format(command))
        return False
    else:
        return True
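`check_command` shells out to the POSIX `command -v` builtin. In plain Python the same callability test can be written with the standard library's `shutil.which` (a sketch of an equivalent, not the method's actual implementation):

```python
import shutil


def command_callable(command):
    """Portable equivalent of `command -v X`: True when the command
    resolves to an executable on $PATH."""
    return shutil.which(command) is not None
```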

check_fastq

check_fastq(input_files, output_files, paired_end)

Returns a follow sanity-check function to be run after a fastq conversion. Run following a command that will produce the fastq files.

This function will make sure any input files have the same number of reads as the output files.

Source code in pypiper/ngstk.py
def check_fastq(self, input_files, output_files, paired_end):
    """
    Returns a follow sanity-check function to be run after a fastq conversion.
    Run following a command that will produce the fastq files.

    This function will make sure any input files have the same number of reads as the
    output files.
    """

    # Define a temporary function which we will return, to be called by the
    # pipeline.
    # Must define default parameters here based on the parameters passed in. This locks
    # these values in place, so that the variables will be defined when this function
    # is called without parameters as a follow function by pm.run.

    # This is AFTER merge, so if there are multiple files it means the
    # files were split into read1/read2; therefore I must divide by number
    # of files for final reads.
    def temp_func(
        input_files=input_files, output_files=output_files, paired_end=paired_end
    ):
        if type(input_files) != list:
            input_files = [input_files]
        if type(output_files) != list:
            output_files = [output_files]

        n_input_files = len(list(filter(bool, input_files)))
        n_output_files = len(list(filter(bool, output_files)))

        total_reads = sum(
            [
                int(self.count_reads(input_file, paired_end))
                for input_file in input_files
            ]
        )
        raw_reads = int(total_reads / n_input_files)
        self.pm.pipestat.report(values={"Raw_reads": str(raw_reads)})

        total_fastq_reads = sum(
            [
                int(self.count_reads(output_file, paired_end))
                for output_file in output_files
            ]
        )
        fastq_reads = int(total_fastq_reads / n_output_files)

        self.pm.pipestat.report(values={"Fastq_reads": fastq_reads})
        input_ext = self.get_input_ext(input_files[0])
        # We can only assess pass filter reads in bam files with flags.
        if input_ext == ".bam":
            num_failed_filter = sum(
                [int(self.count_fail_reads(f, paired_end)) for f in input_files]
            )
            pf_reads = int(raw_reads) - num_failed_filter
            self.pm.pipestat.report(values={"PF_reads": str(pf_reads)})
        if fastq_reads != int(raw_reads):
            raise Exception(
                "Fastq conversion error? Number of input reads "
                "doesn't match number of output reads."
            )

        return fastq_reads

    return temp_func
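The consistency check reduces to averaging read counts per file and comparing input to output. A sketch of that arithmetic (the `reads_per_file` helper is hypothetical; the real follow function counts reads from the files themselves):

```python
def reads_per_file(per_file_counts):
    """Sketch of check_fastq's averaging: sum read counts across
    (split) files, then divide by the number of non-empty files."""
    counts = [c for c in per_file_counts if c]
    return int(sum(counts) / len(counts))


raw = reads_per_file([1000, 1000])        # e.g. two split input files
converted = reads_per_file([1000, 1000])  # the fastq outputs
consistent = converted == raw             # what the follow function asserts
```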

check_trim

check_trim(trimmed_fastq, paired_end, trimmed_fastq_R2=None, fastqc_folder=None)

Build function to evaluate read trimming, and optionally run fastqc.

This is useful to construct an argument for the 'follow' parameter of a PipelineManager's 'run' method.

:param str trimmed_fastq: Path to trimmed reads file.
:param bool paired_end: Whether the processing is being done with paired-end sequencing data.
:param str trimmed_fastq_R2: Path to read 2 file for the paired-end case.
:param str fastqc_folder: Path to folder within which to place fastqc output files; if unspecified, fastqc will not be run.
:return callable: Function to evaluate read trimming and possibly run fastqc.

Source code in pypiper/ngstk.py
def check_trim(
    self, trimmed_fastq, paired_end, trimmed_fastq_R2=None, fastqc_folder=None
):
    """
    Build function to evaluate read trimming, and optionally run fastqc.

    This is useful to construct an argument for the 'follow' parameter of
    a PipelineManager's 'run' method.

    :param str trimmed_fastq: Path to trimmed reads file.
    :param bool paired_end: Whether the processing is being done with
        paired-end sequencing data.
    :param str trimmed_fastq_R2: Path to read 2 file for the paired-end case.
    :param str fastqc_folder: Path to folder within which to place fastqc
        output files; if unspecified, fastqc will not be run.
    :return callable: Function to evaluate read trimming and possibly run
        fastqc.
    """

    def temp_func():
        print("Evaluating read trimming")

        if paired_end and not trimmed_fastq_R2:
            print("WARNING: specified paired-end but no R2 file")

        n_trim = float(self.count_reads(trimmed_fastq, paired_end))
        self.pm.pipestat.report(values={"Trimmed_reads": int(n_trim)})
        try:
            rr = float(self.pm.pipestat.retrieve("Raw_reads"))
        except:
            print("Can't calculate trim loss rate without raw read result.")
        else:
            self.pm.report_result(
                "Trim_loss_rate", round((rr - n_trim) * 100 / rr, 2)
            )

        # Also run a fastqc (if installed/requested)
        if fastqc_folder:
            if fastqc_folder and os.path.isabs(fastqc_folder):
                self.make_sure_path_exists(fastqc_folder)
            cmd = self.fastqc(trimmed_fastq, fastqc_folder)
            self.pm.run(cmd, lock_name="trimmed_fastqc", nofail=True)
            fname, ext = os.path.splitext(os.path.basename(trimmed_fastq))
            fastqc_html = os.path.join(fastqc_folder, fname + "_fastqc.html")
            self.pm.pipestat.report(
                values={
                    "FastQC_report_R1": {
                        "path": fastqc_html,
                        "title": "FastQC report R1",
                    }
                }
            )

            if paired_end and trimmed_fastq_R2:
                cmd = self.fastqc(trimmed_fastq_R2, fastqc_folder)
                self.pm.run(cmd, lock_name="trimmed_fastqc_R2", nofail=True)
                fname, ext = os.path.splitext(os.path.basename(trimmed_fastq_R2))
                fastqc_html = os.path.join(fastqc_folder, fname + "_fastqc.html")
                self.pm.pipestat.report(
                    values={
                        "FastQC_report_R2": {
                            "path": fastqc_html,
                            "title": "FastQC report R2",
                        }
                    }
                )

    return temp_func
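The reported `Trim_loss_rate` is the percentage of raw reads removed by trimming, rounded to two places. The formula, extracted into a standalone sketch:

```python
def trim_loss_rate(raw_reads, trimmed_reads):
    """Percentage of reads removed by trimming, as computed in check_trim:
    round((raw - trimmed) * 100 / raw, 2)."""
    return round((raw_reads - trimmed_reads) * 100 / raw_reads, 2)


loss = trim_loss_rate(1_000_000, 925_000)
```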

count_concordant

count_concordant(aligned_bam)

Count only reads that "aligned concordantly exactly 1 time."

:param str aligned_bam: File for which to count mapped reads.

Source code in pypiper/ngstk.py
def count_concordant(self, aligned_bam):
    """
    Count only reads that "aligned concordantly exactly 1 time."

    :param str aligned_bam: File for which to count mapped reads.
    """
    cmd = self.tools.samtools + " view " + aligned_bam + " | "
    cmd += "grep 'YT:Z:CP'" + " | uniq -u | wc -l | sed -E 's/^[[:space:]]+//'"

    return subprocess.check_output(cmd, shell=True).decode().strip()

count_fail_reads

count_fail_reads(file_name, paired_end)

Counts the number of reads that failed platform/vendor quality checks.

:param paired_end: This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.

Source code in pypiper/ngstk.py
def count_fail_reads(self, file_name, paired_end):
    """
    Counts the number of reads that failed platform/vendor quality checks.
    :param paired_end: This parameter is ignored; samtools automatically correctly responds depending
    on the data in the bamfile. We leave the option here just for consistency, since all the other
    counting functions require the parameter. This makes it easier to swap counting functions during
    pipeline development.
    """
    return int(self.count_flag_reads(file_name, 512, paired_end))

count_flag_reads

count_flag_reads(file_name, flag, paired_end)

Counts the number of reads with the specified flag.

:param str file_name: name of reads file
:param str flag: sam flag value to be read
:param bool paired_end: This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.

Source code in pypiper/ngstk.py
def count_flag_reads(self, file_name, flag, paired_end):
    """
    Counts the number of reads with the specified flag.

    :param str file_name: name of reads file
    :param str flag: sam flag value to be read
    :param bool paired_end: This parameter is ignored; samtools automatically correctly responds depending
        on the data in the bamfile. We leave the option here just for consistency, since all the other
        counting functions require the parameter. This makes it easier to swap counting functions during
        pipeline development.
    """

    param = " -c -f" + str(flag)
    if file_name.endswith("sam"):
        param += " -S"
    return self.samtools_view(file_name, param=param)
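The flag values used by the counting helpers come from the SAM FLAG field: `samtools view -f` keeps reads with all given bits set, while `-F` rejects reads with any of them set (so `-F4` keeps mapped reads, `-f512` keeps QC failures). A sketch of that bit test; the names and helper below are illustrative, with bit values taken from the SAM spec:

```python
# SAM FLAG bits relevant to the counting helpers (per the SAM spec).
UNMAPPED, SECONDARY, QC_FAIL = 4, 256, 512


def matches(read_flag, require=0, exclude=0):
    """samtools-style filtering: -f requires all bits set,
    -F rejects reads with any of the given bits set."""
    return (read_flag & require) == require and (read_flag & exclude) == 0


qc_failed = matches(QC_FAIL, require=QC_FAIL)   # counted by count_fail_reads
mapped = matches(0, exclude=UNMAPPED)           # kept by a -F4 mapped count
```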

count_lines

count_lines(file_name)

Uses the command-line utility wc to count the number of lines in a file. For MacOS, must strip leading whitespace from wc.

:param str file_name: name of file whose lines are to be counted

Source code in pypiper/ngstk.py
def count_lines(self, file_name):
    """
    Uses the command-line utility wc to count the number of lines in a file. For MacOS, must strip leading whitespace from wc.

    :param str file_name: name of file whose lines are to be counted
    """
    x = subprocess.check_output(
        "wc -l " + file_name + " | sed -E 's/^[[:space:]]+//' | cut -f1 -d' '",
        shell=True,
    )
    return x.decode().strip()

count_lines_zip

count_lines_zip(file_name)

Uses the command-line utility wc to count the number of lines in a compressed file. For MacOS, must strip leading whitespace from wc.

:param str file_name: name of compressed file whose lines are to be counted

Source code in pypiper/ngstk.py
def count_lines_zip(self, file_name):
    """
    Uses the command-line utility wc to count the number of lines in a file. For MacOS, must strip leading whitespace from wc.
    For compressed files.
    :param str file_name: name of compressed file whose lines are to be counted
    """
    x = subprocess.check_output(
        self.ziptool
        + " -d -c "
        + file_name
        + " | wc -l | sed -E 's/^[[:space:]]+//' | cut -f1 -d' '",
        shell=True,
    )
    return x.decode().strip()

count_mapped_reads

count_mapped_reads(file_name, paired_end)

Mapped reads are not in fastq format, so this function doesn't need to accommodate fastq; because it only uses samtools view, it also doesn't strictly require a paired-end parameter. The parameter is accepted (and discarded) only for consistency with the other counting functions.

:param str file_name: File for which to count mapped reads.
:param bool paired_end: This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.
:return int: Either return code from samtools view command, or -1 to indicate an error state.

Source code in pypiper/ngstk.py
def count_mapped_reads(self, file_name, paired_end):
    """
    Mapped_reads are not in fastq format, so this one doesn't need to accommodate fastq,
    and therefore, doesn't require a paired-end parameter because it only uses samtools view.
    Therefore, it's ok that it has a default parameter, since this is discarded.

    :param str file_name: File for which to count mapped reads.
    :param bool paired_end: This parameter is ignored; samtools automatically correctly responds depending
        on the data in the bamfile. We leave the option here just for consistency, since all the other
        counting functions require the parameter. This makes it easier to swap counting functions during
        pipeline development.
    :return int: Either return code from samtools view command, or -1 to indicate an error state.
    """
    if file_name.endswith("bam"):
        return self.samtools_view(file_name, param="-c -F4")
    if file_name.endswith("sam"):
        return self.samtools_view(file_name, param="-c -F4 -S")
    return -1

count_multimapping_reads

count_multimapping_reads(file_name, paired_end)

Counts the number of reads that mapped to multiple locations. Warning: currently, if the alignment software includes the reads at multiple locations, this function will count those more than once. This function is for software that randomly assigns, but flags reads as multimappers.

:param str file_name: name of reads file
:param paired_end: This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.

Source code in pypiper/ngstk.py
def count_multimapping_reads(self, file_name, paired_end):
    """
    Counts the number of reads that mapped to multiple locations. Warning:
    currently, if the alignment software includes the reads at multiple locations, this function
    will count those more than once. This function is for software that randomly assigns,
    but flags reads as multimappers.

    :param str file_name: name of reads file
    :param paired_end: This parameter is ignored; samtools automatically correctly responds depending
        on the data in the bamfile. We leave the option here just for consistency, since all the other
        counting functions require the parameter. This makes it easier to swap counting functions during
        pipeline development.
    """
    return int(self.count_flag_reads(file_name, 256, paired_end))

count_reads

count_reads(file_name, paired_end)

Count reads in a file.

Paired-end reads count as 2 in this function. For paired-end reads, this function assumes that the reads are split into 2 files, so it divides line count by 2 instead of 4. This will thus give an incorrect result if your paired-end fastq files are in only a single file (you must divide by 2 again).

:param str file_name: Name/path of file whose reads are to be counted.
:param bool paired_end: Whether the file contains paired-end reads.

Source code in pypiper/ngstk.py
def count_reads(self, file_name, paired_end):
    """
    Count reads in a file.

    Paired-end reads count as 2 in this function.
    For paired-end reads, this function assumes that the reads are split
    into 2 files, so it divides line count by 2 instead of 4.
    This will thus give an incorrect result if your paired-end fastq files
    are in only a single file (you must divide by 2 again).

    :param str file_name: Name/path of file whose reads are to be counted.
    :param bool paired_end: Whether the file contains paired-end reads.
    """

    _, ext = os.path.splitext(file_name)
    if not (is_sam_or_bam(file_name) or is_fastq(file_name)):
        # TODO: make this an exception and force caller to handle that
        # rather than relying on knowledge of possibility of negative value.
        return -1

    if is_sam_or_bam(file_name):
        param_text = "-c" if ext == ".bam" else "-c -S"
        return self.samtools_view(file_name, param=param_text)
    else:
        num_lines = (
            self.count_lines_zip(file_name)
            if is_gzipped_fastq(file_name)
            else self.count_lines(file_name)
        )
        divisor = 2 if paired_end else 4
        return int(num_lines) // divisor
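
The line-count arithmetic above can be sketched on its own. A FASTQ record spans 4 lines; with paired-end reads split across two files, each record stands for a pair, hence the divisor of 2:

```python
def reads_from_line_count(num_lines, paired_end):
    # Each FASTQ record spans 4 lines. For paired-end data split across
    # two files, dividing one file's line count by 2 instead of 4 counts
    # each pair as 2 reads (mirrors count_reads above).
    divisor = 2 if paired_end else 4
    return num_lines // divisor

print(reads_from_line_count(4000, paired_end=False))  # 1000
print(reads_from_line_count(4000, paired_end=True))   # 2000
```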

count_unique_mapped_reads

count_unique_mapped_reads(file_name, paired_end)

For a bam or sam file with paired- or single-end reads, returns the number of mapped reads, counting each read only once, even if it appears mapped at multiple locations.

:param str file_name: name of reads file
:param bool paired_end: True/False paired end data
:return int: Number of uniquely mapped reads.

Source code in pypiper/ngstk.py
def count_unique_mapped_reads(self, file_name, paired_end):
    """
    For a bam or sam file with paired- or single-end reads, returns the
    number of mapped reads, counting each read only once, even if it appears
    mapped at multiple locations.

    :param str file_name: name of reads file
    :param bool paired_end: True/False paired end data
    :return int: Number of uniquely mapped reads.
    """

    _, ext = os.path.splitext(file_name)
    ext = ext.lower()

    if ext == ".sam":
        param = "-S -F4"
    elif ext == ".bam":
        param = "-F4"
    else:
        raise ValueError("Not a SAM or BAM: '{}'".format(file_name))

    if paired_end:
        r1 = self.samtools_view(
            file_name,
            param=param + " -f64",
            postpend=" | cut -f1 | sort -k1,1 -u | wc -l | sed -E 's/^[[:space:]]+//'",
        )
        r2 = self.samtools_view(
            file_name,
            param=param + " -f128",
            postpend=" | cut -f1 | sort -k1,1 -u | wc -l | sed -E 's/^[[:space:]]+//'",
        )
    else:
        r1 = self.samtools_view(
            file_name,
            param=param + "",
            postpend=" | cut -f1 | sort -k1,1 -u | wc -l | sed -E 's/^[[:space:]]+//'",
        )
        r2 = 0

    return int(r1) + int(r2)
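
The `cut -f1 | sort -k1,1 -u | wc -l` pipeline above counts distinct QNAMEs per mate. The same idea in pure Python (read names made up for illustration):

```python
# QNAMEs of first-mate alignments; duplicates arise when a read is
# reported at multiple mapped locations
r1_names = ["readA", "readA", "readB", "readC"]

# Each distinct read name counts exactly once, like sort -u | wc -l
unique_r1 = len(set(r1_names))
print(unique_r1)  # 3
```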

count_unique_reads

count_unique_reads(file_name, paired_end)

Sometimes alignment software puts multiple locations for a single read; if you just count those reads, you will get an inaccurate count. This is not the same as multimapping reads, which may or may not be actually duplicated in the bam file (depending on the alignment software). This function counts each read only once. This accounts for paired end or not for free because pairs have the same read name. In this function, a paired-end read would count as 2 reads.

Source code in pypiper/ngstk.py
def count_unique_reads(self, file_name, paired_end):
    """
    Sometimes alignment software puts multiple locations for a single read; if you just count
    those reads, you will get an inaccurate count. This is _not_ the same as multimapping reads,
    which may or may not be actually duplicated in the bam file (depending on the alignment
    software).
    This function counts each read only once.
    This accounts for paired end or not for free because pairs have the same read name.
    In this function, a paired-end read would count as 2 reads.
    """
    if file_name.endswith("sam"):
        param = "-S"
    elif file_name.endswith("bam"):
        param = ""
    else:
        raise ValueError("Not a SAM or BAM: '{}'".format(file_name))
    if paired_end:
        r1 = self.samtools_view(
            file_name,
            param=param + " -f64",
            postpend=" | cut -f1 | sort -k1,1 -u | wc -l | sed -E 's/^[[:space:]]+//'",
        )
        r2 = self.samtools_view(
            file_name,
            param=param + " -f128",
            postpend=" | cut -f1 | sort -k1,1 -u | wc -l | sed -E 's/^[[:space:]]+//'",
        )
    else:
        r1 = self.samtools_view(
            file_name,
            param=param + "",
            postpend=" | cut -f1 | sort -k1,1 -u | wc -l | sed -E 's/^[[:space:]]+//'",
        )
        r2 = 0
    return int(r1) + int(r2)

count_uniquelymapping_reads

count_uniquelymapping_reads(file_name, paired_end)

Counts the number of reads that mapped to a unique position.

:param str file_name: name of reads file
:param bool paired_end: This parameter is ignored.

Source code in pypiper/ngstk.py
def count_uniquelymapping_reads(self, file_name, paired_end):
    """
    Counts the number of reads that mapped to a unique position.

    :param str file_name: name of reads file
    :param bool paired_end: This parameter is ignored.
    """
    param = " -c -F256"
    if file_name.endswith("sam"):
        param += " -S"
    return self.samtools_view(file_name, param=param)

fastqc

fastqc(file, output_dir)

Create command to run fastqc on a FASTQ file

:param str file: Path to file with sequencing reads
:param str output_dir: Path to folder in which to place output
:return str: Command with which to run fastqc

Source code in pypiper/ngstk.py
def fastqc(self, file, output_dir):
    """
    Create command to run fastqc on a FASTQ file

    :param str file: Path to file with sequencing reads
    :param str output_dir: Path to folder in which to place output
    :return str: Command with which to run fastqc
    """
    # You can find the fastqc help with fastqc --help
    try:
        pm = self.pm
    except AttributeError:
        # Do nothing, this is just for path construction.
        pass
    else:
        if not os.path.isabs(output_dir) and pm is not None:
            output_dir = os.path.join(pm.outfolder, output_dir)
    self.make_sure_path_exists(output_dir)
    return "{} --noextract --outdir {} {}".format(
        self.tools.fastqc, output_dir, file
    )
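
A sketch of the command string this produces, assuming no config file so that `fastqc` resolves from `$PATH` (the paths below are illustrative):

```python
fastqc_tool = "fastqc"     # would come from self.tools.fastqc
output_dir = "fastqc_out"  # resolved against pm.outfolder when relative
reads_file = "sample.fastq"

# Same template as the return statement in fastqc() above
cmd = "{} --noextract --outdir {} {}".format(fastqc_tool, output_dir, reads_file)
print(cmd)  # fastqc --noextract --outdir fastqc_out sample.fastq
```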

fastqc_rename

fastqc_rename(input_bam, output_dir, sample_name)

Create pair of commands to run fastqc and organize files.

The first command returned is the one that actually runs fastqc when it's executed; the second moves the output files to the output folder for the sample indicated.

:param str input_bam: Path to file for which to run fastqc.
:param str output_dir: Path to folder in which fastqc output will be written, and within which the sample's output folder lives.
:param str sample_name: Sample name, which determines subfolder within output_dir for the fastqc files.
:return list[str]: Pair of commands, to run fastqc and then move the files to their intended destination based on sample name.

Source code in pypiper/ngstk.py
def fastqc_rename(self, input_bam, output_dir, sample_name):
    """
    Create pair of commands to run fastqc and organize files.

    The first command returned is the one that actually runs fastqc when
    it's executed; the second moves the output files to the output
    folder for the sample indicated.

    :param str input_bam: Path to file for which to run fastqc.
    :param str output_dir: Path to folder in which fastqc output will be
        written, and within which the sample's output folder lives.
    :param str sample_name: Sample name, which determines subfolder within
        output_dir for the fastqc files.
    :return list[str]: Pair of commands, to run fastqc and then move the files to
        their intended destination based on sample name.
    """
    cmds = list()
    initial = os.path.splitext(os.path.basename(input_bam))[0]
    cmd1 = self.fastqc(input_bam, output_dir)
    cmds.append(cmd1)
    cmd2 = "if [[ ! -s {1}_fastqc.html ]]; then mv {0}_fastqc.html {1}_fastqc.html; mv {0}_fastqc.zip {1}_fastqc.zip; fi".format(
        os.path.join(output_dir, initial), os.path.join(output_dir, sample_name)
    )
    cmds.append(cmd2)
    return cmds

filter_reads

filter_reads(input_bam, output_bam, metrics_file, paired=False, cpus=16, Q=30)

Remove duplicates, filter for mapping quality above Q, and remove multiply-mapping reads. For paired-end reads, keep only proper pairs.

Source code in pypiper/ngstk.py
def filter_reads(
    self, input_bam, output_bam, metrics_file, paired=False, cpus=16, Q=30
):
    """
    Remove duplicates, filter for mapping quality above Q, and remove
    multiply-mapping reads. For paired-end reads, keep only proper pairs.
    """
    nodups = re.sub(r"\.bam$", "", output_bam) + ".nodups.nofilter.bam"
    cmd1 = (
        self.tools.sambamba
        + " markdup -t {0} -r --compression-level=0 {1} {2} 2> {3}".format(
            cpus, input_bam, nodups, metrics_file
        )
    )
    cmd2 = self.tools.sambamba + " view -t {0} -f bam --valid".format(cpus)
    if paired:
        cmd2 += ' -F "not (unmapped or mate_is_unmapped) and proper_pair'
    else:
        cmd2 += ' -F "not unmapped'
    cmd2 += ' and not (secondary_alignment or supplementary) and mapping_quality >= {0}"'.format(
        Q
    )
    cmd2 += " {0} |".format(nodups)
    cmd2 += self.tools.sambamba + " sort -t {0} /dev/stdin -o {1}".format(
        cpus, output_bam
    )
    cmd3 = "if [[ -s {0} ]]; then rm {0}; fi".format(nodups)
    cmd4 = "if [[ -s {0} ]]; then rm {0}; fi".format(nodups + ".bai")
    return [cmd1, cmd2, cmd3, cmd4]
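
The sambamba filter expression assembled in `cmd2` can be factored out on its own; a sketch mirroring the branches above:

```python
def sambamba_filter_expr(paired, Q=30):
    # Mirrors the -F expression filter_reads builds for sambamba view
    if paired:
        expr = "not (unmapped or mate_is_unmapped) and proper_pair"
    else:
        expr = "not unmapped"
    return expr + " and not (secondary_alignment or supplementary) and mapping_quality >= {}".format(Q)

print(sambamba_filter_expr(paired=False))
# not unmapped and not (secondary_alignment or supplementary) and mapping_quality >= 30
```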

get_chrs_from_bam

get_chrs_from_bam(file_name)

Uses samtools to grab the chromosomes from the header that are contained in this bam file.

Source code in pypiper/ngstk.py
def get_chrs_from_bam(self, file_name):
    """
    Uses samtools to grab the chromosomes from the header that are contained
    in this bam file.
    """
    x = subprocess.check_output(
        self.tools.samtools
        + " view -H "
        + file_name
        + " | grep '^@SQ' | cut -f2| sed s'/SN://'",
        shell=True,
    )
    # Chromosomes will be separated by newlines; split into list to return
    return x.decode().split()

get_file_size

get_file_size(filenames)

Get size of all files in string (space-separated) in megabytes (MB).

:param str filenames: a space-separated string of filenames

Source code in pypiper/ngstk.py
def get_file_size(self, filenames):
    """
    Get size of all files in string (space-separated) in megabytes (MB).

    :param str filenames: a space-separated string of filenames
    """
    # use (1024 ** 3) for gigabytes
    # equivalent to: stat -Lc '%s' filename

    # If given a list, recurse through it.
    if type(filenames) is list:
        return sum([self.get_file_size(filename) for filename in filenames])

    return round(
        sum([float(os.stat(f).st_size) for f in filenames.split(" ")]) / (1024**2),
        4,
    )
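
The unit conversion is plain arithmetic: bytes divided by 1024² gives megabytes, rounded to 4 decimals as above:

```python
def bytes_to_mb(total_bytes):
    # 1024 ** 2 bytes per megabyte, as in get_file_size above
    return round(total_bytes / (1024 ** 2), 4)

print(bytes_to_mb(1048576))   # 1.0
print(bytes_to_mb(1500000))   # 1.4305
```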

get_frip

get_frip(sample)

Calculates the fraction of reads in peaks for a given sample.

:param pipelines.Sample sample: Sample object with "peaks" attribute.

Source code in pypiper/ngstk.py
def get_frip(self, sample):
    """
    Calculates the fraction of reads in peaks for a given sample.

    :param pipelines.Sample sample: Sample object with "peaks" attribute.
    """
    import pandas as pd

    with open(sample.frip, "r") as handle:
        content = handle.readlines()
    reads_in_peaks = int(re.sub(r"\D", "", content[0]))
    mapped_reads = sample["readCount"] - sample["unaligned"]
    return pd.Series(reads_in_peaks / mapped_reads, index=["FRiP"])
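
FRiP itself is a simple ratio of reads in peaks to mapped reads; a self-contained sketch of the computation (the counts are illustrative):

```python
reads_in_peaks = 1_200_000
mapped_reads = 10_000_000  # total read count minus unaligned reads, as above

frip = reads_in_peaks / mapped_reads
print(frip)  # 0.12
```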

get_input_ext

get_input_ext(input_file)

Get the extension of the input_file. Assumes you're using either .bam or .fastq/.fq or .fastq.gz/.fq.gz.

Source code in pypiper/ngstk.py
def get_input_ext(self, input_file):
    """
    Get the extension of the input_file. Assumes you're using either
    .bam or .fastq/.fq or .fastq.gz/.fq.gz.
    """
    if input_file.endswith(".bam"):
        input_ext = ".bam"
    elif input_file.endswith(".fastq.gz") or input_file.endswith(".fq.gz"):
        input_ext = ".fastq.gz"
    elif input_file.endswith(".fastq") or input_file.endswith(".fq"):
        input_ext = ".fastq"
    else:
        errmsg = (
            "'{}'; this pipeline can only deal with .bam, .fastq, "
            "or .fastq.gz files".format(input_file)
        )
        raise UnsupportedFiletypeException(errmsg)
    return input_ext
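
A standalone sketch of the suffix matching, including the `.fq` to `.fastq` normalization the function performs:

```python
def input_ext(name):
    # Mirrors get_input_ext: .fq aliases normalize to .fastq
    if name.endswith(".bam"):
        return ".bam"
    if name.endswith((".fastq.gz", ".fq.gz")):
        return ".fastq.gz"
    if name.endswith((".fastq", ".fq")):
        return ".fastq"
    raise ValueError("unsupported file type: {}".format(name))

print(input_ext("sample.fq.gz"))  # .fastq.gz
```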

get_mitochondrial_reads

get_mitochondrial_reads(bam_file, output, cpus=4)
Builds commands to slice out reads on the mitochondrial chromosome (chrM) and mark duplicates among them, writing duplication metrics to the given output file.

Source code in pypiper/ngstk.py
def get_mitochondrial_reads(self, bam_file, output, cpus=4):
    """
    Build commands to slice out reads on the mitochondrial chromosome
    (chrM) and mark duplicates among them, writing duplication metrics
    to `output`.
    """
    tmp_bam = bam_file + "tmp_rmMe"
    cmd1 = self.tools.sambamba + " index -t {0} {1}".format(cpus, bam_file)
    cmd2 = (
        self.tools.sambamba
        + " slice {0} chrM | {1} markdup -t 4 /dev/stdin {2} 2> {3}".format(
            bam_file, self.tools.sambamba, tmp_bam, output
        )
    )
    cmd3 = "rm {}".format(tmp_bam)
    return [cmd1, cmd2, cmd3]

get_peak_number

get_peak_number(sample)

Counts number of peaks from a sample's peak file.

:param pipelines.Sample sample: Sample object with "peaks" attribute.

Source code in pypiper/ngstk.py
def get_peak_number(self, sample):
    """
    Counts number of peaks from a sample's peak file.

    :param pipelines.Sample sample: Sample object with "peaks" attribute.
    """
    proc = subprocess.Popen(["wc", "-l", sample.peaks], stdout=subprocess.PIPE)
    out, err = proc.communicate()
    sample["peakNumber"] = re.sub(r"\D.*", "", out.decode())
    return sample

get_read_type

get_read_type(bam_file, n=10)

Gets the read type (single, paired) and length of bam file.

:param str bam_file: Bam file to determine read attributes.
:param int n: Number of lines to read from bam file.
:return str, int: tuple of read type and read length

Source code in pypiper/ngstk.py
def get_read_type(self, bam_file, n=10):
    """
    Gets the read type (single, paired) and length of bam file.
    :param str bam_file: Bam file to determine read attributes.
    :param int n: Number of lines to read from bam file.
    :return str, int: tuple of read type and read length
    """

    from collections import Counter

    try:
        p = subprocess.Popen(
            [self.tools.samtools, "view", bam_file], stdout=subprocess.PIPE
        )
        # Count paired alignments among the first n records
        paired = 0
        total = n
        read_length = Counter()
        while n > 0:
            line = next(p.stdout).decode().split("\t")
            flag = int(line[1])
            read_length[len(line[9])] += 1
            if 1 & flag:  # FLAG bit 0x1 marks a paired read
                paired += 1
            n -= 1
        p.kill()
    except IOError:
        raise IOError("Cannot read provided bam file.")
    # Get the most abundant read length
    read_length = read_length.most_common(1)[0][0]
    # If at least half of the sampled reads are paired, call it paired-end
    if paired > (total / 2.0):
        return "PE", read_length
    else:
        return "SE", read_length
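
The read-length tally uses `collections.Counter`; selecting the most abundant length looks like this (lengths are illustrative):

```python
from collections import Counter

lengths = Counter()
for length in [50, 50, 50, 75]:  # lengths sampled from SEQ fields
    lengths[length] += 1

# most_common(1) returns [(value, count)] for the top entry
most_common_length = lengths.most_common(1)[0][0]
print(most_common_length)  # 50
```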

input_to_fastq

input_to_fastq(input_file, sample_name, paired_end, fastq_folder, output_file=None, multiclass=False, zipmode=False)

Builds a command to convert input file to fastq, for various inputs.

Takes either .bam, .fastq.gz, or .fastq input and returns commands that will create the .fastq file, regardless of input type. This is useful for making your pipeline easily accept any of these input types seamlessly, standardizing on fastq, which is still the most common format for adapter trimmers, etc. You can specify whether you want the output zipped or not.

Commands will place the output fastq file in given fastq_folder.

:param str input_file: filename of input you want to convert to fastq
:param bool multiclass: Are both read1 and read2 included in a single file? User should not need to set this; it will be inferred and used in recursive calls, based on number of files, and the paired_end arg.
:param bool zipmode: Should the output be .fastq.gz? Otherwise, just fastq
:return str: A command (to be run with PipelineManager) that will ensure your fastq file exists.

Source code in pypiper/ngstk.py
def input_to_fastq(
    self,
    input_file,
    sample_name,
    paired_end,
    fastq_folder,
    output_file=None,
    multiclass=False,
    zipmode=False,
):
    """
    Builds a command to convert input file to fastq, for various inputs.

    Takes either .bam, .fastq.gz, or .fastq input and returns
    commands that will create the .fastq file, regardless of input type.
    This is useful for making your pipeline easily accept any of these
    input types seamlessly, standardizing on fastq, which is still the
    most common format for adapter trimmers, etc. You can specify whether
    you want the output zipped or not.

    Commands will place the output fastq file in given `fastq_folder`.

    :param str input_file: filename of input you want to convert to fastq
    :param bool multiclass: Are both read1 and read2 included in a single
        file? User should not need to set this; it will be inferred and used
        in recursive calls, based on number files, and the paired_end arg.
    :param bool zipmode: Should the output be .fastq.gz? Otherwise, just fastq
    :return str: A command (to be run with PipelineManager) that will ensure
        your fastq file exists.
    """

    fastq_prefix = os.path.join(fastq_folder, sample_name)
    self.make_sure_path_exists(fastq_folder)

    # this expects a list; if it gets a string, wrap it in a list.
    if type(input_file) != list:
        input_file = [input_file]

    # If multiple files were provided, recurse on each file individually
    if len(input_file) > 1:
        cmd = []
        output_file = []
        for in_i, in_arg in enumerate(input_file):
            output = fastq_prefix + "_R" + str(in_i + 1) + ".fastq"
            result_cmd, uf, result_file = self.input_to_fastq(
                in_arg,
                sample_name,
                paired_end,
                fastq_folder,
                output,
                multiclass=True,
                zipmode=zipmode,
            )
            cmd.append(result_cmd)
            output_file.append(result_file)

    else:
        # There was only 1 input class.
        # Convert back into a string
        input_file = input_file[0]
        if not output_file:
            output_file = fastq_prefix + "_R1.fastq"
        if zipmode:
            output_file = output_file + ".gz"

        input_ext = self.get_input_ext(input_file)  # handles .fq or .fastq

        if input_ext == ".bam":
            print("Found .bam file")
            # cmd = self.bam_to_fastq(input_file, fastq_prefix, paired_end)
            cmd, fq1, fq2 = self.bam_to_fastq_awk(
                input_file, fastq_prefix, paired_end, zipmode
            )
            # pm.run(cmd, output_file, follow=check_fastq)
            if fq2:
                output_file = [fq1, fq2]
            else:
                output_file = fq1
        elif input_ext == ".fastq.gz":
            print("Found .fastq.gz file")
            if paired_end and not multiclass:
                if zipmode:
                    raise NotImplementedError(
                        "Can't use zipmode on interleaved fastq data."
                    )
                # For paired-end reads in one fastq file, we must split the
                # file into 2. The pipeline author will need to include this
                # python script in the scripts directory.
                # TODO: make this self-contained in pypiper. This is a rare
                # use case these days, as fastq files are almost never
                # interleaved anymore.
                script_path = os.path.join(self.tools.scripts_dir, "fastq_split.py")
                cmd = self.tools.python + " -u " + script_path
                cmd += " -i " + input_file
                cmd += " -o " + fastq_prefix
                # Must also return the set of output files
                output_file = [
                    fastq_prefix + "_R1.fastq",
                    fastq_prefix + "_R2.fastq",
                ]
            else:
                if zipmode:
                    # we do nothing!
                    cmd = "ln -sf " + input_file + " " + output_file
                    print("Found .fq.gz file; no conversion necessary")
                else:
                    # For single-end reads, we just unzip the fastq.gz file.
                    # or, paired-end reads that were already split.
                    cmd = (
                        self.ziptool + " -d -c " + input_file + " > " + output_file
                    )
                    # a non-shell version
                    # cmd1 = "gunzip --force " + input_file
                    # cmd2 = "mv " + os.path.splitext(input_file)[0] + " " + output_file
                    # cmd = [cmd1, cmd2]
        elif input_ext == ".fastq":
            if zipmode:
                cmd = self.ziptool + " -c " + input_file + " > " + output_file
            else:
                cmd = "ln -sf " + input_file + " " + output_file
                print("Found .fastq file; no conversion necessary")

    return [cmd, fastq_prefix, output_file]
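
The recursion above names its outputs with an `_R1`/`_R2` convention rooted at `fastq_folder/sample_name`; a sketch of that naming (paths are illustrative):

```python
import os

fastq_folder = "fastq"
sample_name = "sample1"
fastq_prefix = os.path.join(fastq_folder, sample_name)

# One output per input file, numbered _R1, _R2, ... as in the loop above
outputs = [fastq_prefix + "_R" + str(i + 1) + ".fastq" for i in range(2)]
print(outputs)
```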

macs2_call_peaks

macs2_call_peaks(treatment_bams, output_dir, sample_name, genome, control_bams=None, broad=False, paired=False, pvalue=None, qvalue=None, include_significance=None)

Use MACS2 to call peaks.

:param str | Iterable[str] treatment_bams: Paths to files with data to regard as treatment.
:param str output_dir: Path to output folder.
:param str sample_name: Name for the sample involved.
:param str genome: Name of the genome assembly to use.
:param str | Iterable[str] control_bams: Paths to files with data to regard as control.
:param bool broad: Whether to do broad peak calling.
:param bool paired: Whether reads are paired-end.
:param float | NoneType pvalue: Statistical significance measure to pass as --pvalue to peak calling with MACS.
:param float | NoneType qvalue: Statistical significance measure to pass as --qvalue to peak calling with MACS.
:param bool | NoneType include_significance: Whether to pass a statistical significance argument to peak calling with MACS; if omitted, this will be True if the peak calling is broad or if either p-value or q-value is specified; the default significance specification is a p-value of 0.00001 if a significance is to be specified but no value is provided for p-value or q-value.
:return str: Command to run.

Source code in pypiper/ngstk.py
def macs2_call_peaks(
    self,
    treatment_bams,
    output_dir,
    sample_name,
    genome,
    control_bams=None,
    broad=False,
    paired=False,
    pvalue=None,
    qvalue=None,
    include_significance=None,
):
    """
    Use MACS2 to call peaks.

    :param str | Iterable[str] treatment_bams: Paths to files with data to
        regard as treatment.
    :param str output_dir: Path to output folder.
    :param str sample_name: Name for the sample involved.
    :param str genome: Name of the genome assembly to use.
    :param str | Iterable[str] control_bams: Paths to files with data to
        regard as control
    :param bool broad: Whether to do broad peak calling.
    :param bool paired: Whether reads are paired-end
    :param float | NoneType pvalue: Statistical significance measure to
        pass as --pvalue to peak calling with MACS
    :param float | NoneType qvalue: Statistical significance measure to
        pass as --qvalue to peak calling with MACS
    :param bool | NoneType include_significance: Whether to pass a
        statistical significance argument to peak calling with MACS; if
        omitted, this will be True if the peak calling is broad or if
        either p-value or q-value is specified; default significance
        specification is a p-value of 0.00001 if a significance is to be
        specified but no value is provided for p-value or q-value.
    :return str: Command to run.
    """
    sizes = {
        "hg38": 2.7e9,
        "hg19": 2.7e9,
        "mm10": 1.87e9,
        "dr7": 1.412e9,
        "mm9": 1.87e9,
    }

    # Whether to specify to MACS2 a value for statistical significance
    # can be either directly indicated, but if not, it's determined by
    # whether the mark is associated with broad peaks. By default, we
    # specify a significance value to MACS2 for a mark associated with a
    # broad peak.
    if include_significance is None:
        include_significance = broad

    cmd = self.tools.macs2 + " callpeak -t {0}".format(
        treatment_bams if type(treatment_bams) is str else " ".join(treatment_bams)
    )

    if control_bams is not None:
        cmd += " -c {0}".format(
            control_bams if type(control_bams) is str else " ".join(control_bams)
        )

    if paired:
        cmd += " -f BAMPE "

    # Additional settings based on whether the marks is associated with
    # broad peaks
    if broad:
        cmd += " --broad --nomodel --extsize 73"
    else:
        cmd += " --fix-bimodal --extsize 180 --bw 200"

    if include_significance:
        # Allow significance specification via either p- or q-value,
        # giving preference to q-value if both are provided but falling
        # back on a default p-value if neither is provided but inclusion
        # of statistical significance measure is desired.
        if qvalue is not None:
            cmd += " --qvalue {}".format(qvalue)
        else:
            cmd += " --pvalue {}".format(pvalue or 0.00001)
    cmd += " -g {0} -n {1} --outdir {2}".format(
        sizes[genome], sample_name, output_dir
    )

    return cmd
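
The significance-flag selection (q-value preferred, p-value fallback, default p-value of 0.00001) can be isolated; a sketch mirroring the branch above:

```python
def significance_args(pvalue=None, qvalue=None):
    # Mirrors macs2_call_peaks: qvalue wins if given, else pvalue,
    # else the default p-value of 0.00001
    if qvalue is not None:
        return " --qvalue {}".format(qvalue)
    return " --pvalue {}".format(pvalue or 0.00001)

print(significance_args(qvalue=0.05))  #  --qvalue 0.05
```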

make_dir

make_dir(path)

Forge path to directory, creating intermediates as needed.

:param str path: Path to create.

Source code in pypiper/ngstk.py
def make_dir(self, path):
    """
    Forge path to directory, creating intermediates as needed.

    :param str path: Path to create.
    """
    try:
        os.makedirs(path)
    except OSError as exception:
        if exception.errno != errno.EEXIST:
            raise

make_sure_path_exists

make_sure_path_exists(path)

Alias for make_dir

Source code in pypiper/ngstk.py
def make_sure_path_exists(self, path):
    """Alias for make_dir"""
    self.make_dir(path)

merge_bams

merge_bams(input_bams, merged_bam, in_sorted='TRUE', tmp_dir=None)

Combine multiple files into one.

The tmp_dir parameter is important because on poorly configured systems, the default can sometimes fill up.

:param Iterable[str] input_bams: Paths to files to combine
:param str merged_bam: Path to which to write combined result.
:param bool | str in_sorted: Whether the inputs are sorted
:param str tmp_dir: Path to temporary directory.

Source code in pypiper/ngstk.py
def merge_bams(self, input_bams, merged_bam, in_sorted="TRUE", tmp_dir=None):
    """
    Combine multiple files into one.

    The tmp_dir parameter is important because on poorly configured
    systems, the default can sometimes fill up.

    :param Iterable[str] input_bams: Paths to files to combine
    :param str merged_bam: Path to which to write combined result.
    :param bool | str in_sorted: Whether the inputs are sorted
    :param str tmp_dir: Path to temporary directory.
    """
    if not len(input_bams) > 1:
        print("No merge required")
        return 0

    outdir, _ = os.path.split(merged_bam)
    if outdir and not os.path.exists(outdir):
        print("Creating path to merge file's folder: '{}'".format(outdir))
        os.makedirs(outdir)

    # Handle more intuitive boolean argument.
    if in_sorted in [False, True]:
        in_sorted = "TRUE" if in_sorted else "FALSE"

    input_string = " INPUT=" + " INPUT=".join(input_bams)
    cmd = self.tools.java + " -Xmx" + self.pm.javamem
    cmd += " -jar " + self.tools.picard + " MergeSamFiles"
    cmd += input_string
    cmd += " OUTPUT=" + merged_bam
    cmd += " ASSUME_SORTED=" + str(in_sorted)
    cmd += " CREATE_INDEX=TRUE"
    cmd += " VALIDATION_STRINGENCY=SILENT"
    if tmp_dir:
        cmd += " TMP_DIR=" + tmp_dir

    return cmd
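
Picard's `MergeSamFiles` expects a repeated `INPUT=` argument per file, built above with a join; a sketch with made-up paths:

```python
input_bams = ["a.bam", "b.bam", "c.bam"]

# Leading " INPUT=" plus join produces one INPUT= per bam file
input_string = " INPUT=" + " INPUT=".join(input_bams)
print(input_string)  #  INPUT=a.bam INPUT=b.bam INPUT=c.bam
```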

merge_fastq

merge_fastq(inputs, output, run=False, remove_inputs=False)

Merge FASTQ files (zipped or not) into one.

:param Iterable[str] inputs: Collection of paths to files to merge.
:param str output: Path to single output file.
:param bool run: Whether to run the command.
:param bool remove_inputs: Whether to keep the original files.
:return NoneType | str: Null if running the command, otherwise the command itself
:raise ValueError: Raise ValueError if the call is such that inputs are to be deleted but command is not run.

Source code in pypiper/ngstk.py
def merge_fastq(self, inputs, output, run=False, remove_inputs=False):
    """
    Merge FASTQ files (zipped or not) into one.

    :param Iterable[str] inputs: Collection of paths to files to merge.
    :param str output: Path to single output file.
    :param bool run: Whether to run the command.
    :param bool remove_inputs: Whether to keep the original files.
    :return NoneType | str: Null if running the command, otherwise the
        command itself
    :raise ValueError: Raise ValueError if the call is such that
        inputs are to be deleted but command is not run.
    """
    if remove_inputs and not run:
        raise ValueError("Can't delete files if command isn't run")
    cmd = "cat {} > {}".format(" ".join(inputs), output)
    if run:
        subprocess.check_call(cmd, shell=True)  # shell needed for ">" redirection
        if remove_inputs:
            cmd = "rm {}".format(" ".join(inputs))
            subprocess.check_call(cmd, shell=True)
    else:
        return cmd

merge_or_link

merge_or_link(input_args, raw_folder, local_base='sample')

Standardizes various input possibilities by converting either .bam, .fastq, or .fastq.gz files into a local file; merging those if multiple files given.

:param list input_args: This is a list of arguments, each one is a class of inputs (which can in turn be a string or a list). Typically, input_args is a list with 2 elements: first a list of read1 files; second an (optional!) list of read2 files.
:param str raw_folder: Name/path of folder for the merge/link.
:param str local_base: Usually the sample name. This (plus file extension) will be the name of the local file linked (or merged) by this function.

Source code in pypiper/ngstk.py, lines 291-398
def merge_or_link(self, input_args, raw_folder, local_base="sample"):
    """
    Standardizes various input possibilities by converting either .bam,
    .fastq, or .fastq.gz files into a local file; merging those if multiple
    files given.

    :param list input_args: This is a list of arguments, each one is a
        class of inputs (which can in turn be a string or a list).
        Typically, input_args is a list with 2 elements: first a list of
        read1 files; second an (optional!) list of read2 files.
    :param str raw_folder: Name/path of folder for the merge/link.
    :param str local_base: Usually the sample name. This (plus file
        extension) will be the name of the local file linked (or merged)
        by this function.
    """
    self.make_sure_path_exists(raw_folder)

    if not isinstance(input_args, list):
        raise TypeError("Input must be a list")

    if any(isinstance(i, list) for i in input_args):
        # We have a list of lists. Process each individually.
        local_input_files = list()
        n_input_files = len(list(filter(bool, input_args)))
        print("Number of input file sets: " + str(n_input_files))

        for input_i, input_arg in enumerate(input_args):
            # Count how many non-null items there are in the list;
            # we only append _R1 (etc.) if there are multiple input files.
            if n_input_files > 1:
                local_base_extended = local_base + "_R" + str(input_i + 1)
            else:
                local_base_extended = local_base
            if input_arg:
                out = self.merge_or_link(input_arg, raw_folder, local_base_extended)

                print("Local input file: '{}'".format(out))
                # Make sure file exists:
                if not os.path.isfile(out):
                    print("Not a file: '{}'".format(out))

                local_input_files.append(out)

        return local_input_files

    else:
        # We have a list of individual arguments. Merge them.

        if len(input_args) == 1:
            # Only one argument in this list. A single input file; we just link
            # it, regardless of file type:
            # Pull the value out of the list
            input_arg = input_args[0]
            input_ext = self.get_input_ext(input_arg)

            # Convert to absolute path
            if not os.path.isabs(input_arg):
                input_arg = os.path.abspath(input_arg)

            # Link it to into the raw folder
            local_input_abs = os.path.join(raw_folder, local_base + input_ext)
            self.pm.run(
                "ln -sf " + input_arg + " " + local_input_abs,
                target=local_input_abs,
                shell=True,
            )
            # return the local (linked) filename absolute path
            return local_input_abs

        else:
            # Otherwise, there are multiple inputs.
            # If more than 1 input file is given, then these are to be merged
            # if they are in bam format.
            if all([self.get_input_ext(x) == ".bam" for x in input_args]):
                sample_merged = local_base + ".merged.bam"
                output_merge = os.path.join(raw_folder, sample_merged)
                cmd = self.merge_bams_samtools(input_args, output_merge)
                self.pm.debug("cmd: {}".format(cmd))
                self.pm.run(cmd, output_merge)
                cmd2 = self.validate_bam(output_merge)
                self.pm.run(cmd2, output_merge, nofail=True)
                return output_merge

            # if multiple fastq
            if all([self.get_input_ext(x) == ".fastq.gz" for x in input_args]):
                sample_merged_gz = local_base + ".merged.fastq.gz"
                output_merge_gz = os.path.join(raw_folder, sample_merged_gz)
                # cmd1 = self.ziptool + "-d -c " + " ".join(input_args) + " > " + output_merge
                # cmd2 = self.ziptool + " " + output_merge
                # self.pm.run([cmd1, cmd2], output_merge_gz)
                # you can save yourself the decompression/recompression:
                cmd = "cat " + " ".join(input_args) + " > " + output_merge_gz
                self.pm.run(cmd, output_merge_gz)
                return output_merge_gz

            if all([self.get_input_ext(x) == ".fastq" for x in input_args]):
                sample_merged = local_base + ".merged.fastq"
                output_merge = os.path.join(raw_folder, sample_merged)
                cmd = "cat " + " ".join(input_args) + " > " + output_merge
                self.pm.run(cmd, output_merge)
                return output_merge

            # At this point, we don't recognize the input file types or they
            # do not match.
            raise NotImplementedError(
                "Input files must be of the same type, and can only "
                "merge bam or fastq."
            )
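The merge branch dispatches purely on a shared file extension; inputs of mixed types are rejected. A standalone sketch of that dispatch logic (the `get_ext` helper here is a simplified stand-in for `NGSTk.get_input_ext`):

```python
import os

def get_ext(path):
    # Simplified stand-in for NGSTk.get_input_ext.
    if path.endswith(".fastq.gz"):
        return ".fastq.gz"
    return os.path.splitext(path)[1]

def merge_strategy(input_args):
    # Mirror of merge_or_link's dispatch: all inputs must share one type.
    for ext, action in [(".bam", "samtools merge"),
                        (".fastq.gz", "cat"),
                        (".fastq", "cat")]:
        if all(get_ext(x) == ext for x in input_args):
            return action
    raise NotImplementedError(
        "Input files must be of the same type, and can only merge bam or fastq.")

print(merge_strategy(["a.bam", "b.bam"]))            # samtools merge
print(merge_strategy(["a.fastq.gz", "b.fastq.gz"]))  # cat
```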

parse_bowtie_stats

parse_bowtie_stats(stats_file)

Parses Bowtie2 stats file, returns series with values.

:param str stats_file: Bowtie2 output file with alignment statistics.

Source code in pypiper/ngstk.py, lines 2026-2090
def parse_bowtie_stats(self, stats_file):
    """
    Parses Bowtie2 stats file, returns series with values.
    :param str stats_file: Bowtie2 output file with alignment statistics.
    """
    import pandas as pd

    stats = pd.Series(
        index=[
            "readCount",
            "unpaired",
            "unaligned",
            "unique",
            "multiple",
            "alignmentRate",
        ]
    )
    try:
        with open(stats_file) as handle:
            content = handle.readlines()  # list of strings per line
    except OSError:
        return stats
    # total reads
    try:
        line = [
            i for i in range(len(content)) if " reads; of these:" in content[i]
        ][0]
        stats["readCount"] = re.sub(r"\D.*", "", content[line])
        if 7 > len(content) > 2:
            line = [
                i
                for i in range(len(content))
                if "were unpaired; of these:" in content[i]
            ][0]
            stats["unpaired"] = re.sub(
                r"\D", "", re.sub(r"\(.*", "", content[line])
            )
        else:
            line = [
                i
                for i in range(len(content))
                if "were paired; of these:" in content[i]
            ][0]
            stats["unpaired"] = int(stats["readCount"]) - int(
                re.sub(r"\D", "", re.sub(r"\(.*", "", content[line]))
            )
        line = [i for i in range(len(content)) if "aligned 0 times" in content[i]][
            0
        ]
        stats["unaligned"] = re.sub(r"\D", "", re.sub(r"\(.*", "", content[line]))
        line = [
            i for i in range(len(content)) if "aligned exactly 1 time" in content[i]
        ][0]
        stats["unique"] = re.sub(r"\D", "", re.sub(r"\(.*", "", content[line]))
        line = [i for i in range(len(content)) if "aligned >1 times" in content[i]][
            0
        ]
        stats["multiple"] = re.sub(r"\D", "", re.sub(r"\(.*", "", content[line]))
        line = [
            i for i in range(len(content)) if "overall alignment rate" in content[i]
        ][0]
        stats["alignmentRate"] = re.sub(r"\%.*", "", content[line]).strip()
    except IndexError:
        pass
    return stats
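Each field is pulled out with a substring match followed by regex cleanup. A self-contained illustration of the same extraction on a small single-end Bowtie2 log (the log text is a typical example, not output from any particular run); a plain dict stands in for the pandas Series:

```python
import re

# Example single-end Bowtie2 alignment summary (illustrative values).
log = """10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    500 (5.00%) aligned 0 times
    8500 (85.00%) aligned exactly 1 time
    1000 (10.00%) aligned >1 times
95.00% overall alignment rate
"""
content = log.splitlines(True)

def grab(marker):
    # First line containing the marker, with "(...)" tail and non-digits removed.
    line = next(l for l in content if marker in l)
    return re.sub(r"\D", "", re.sub(r"\(.*", "", line))

stats = {
    "readCount": re.sub(r"\D.*", "", content[0]),
    "unaligned": grab("aligned 0 times"),
    "unique": grab("aligned exactly 1 time"),
    "multiple": grab("aligned >1 times"),
    "alignmentRate": re.sub(r"%.*", "", content[-1]).strip(),
}
print(stats)
# {'readCount': '10000', 'unaligned': '500', 'unique': '8500',
#  'multiple': '1000', 'alignmentRate': '95.00'}
```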

parse_duplicate_stats

parse_duplicate_stats(stats_file)

Parses sambamba markdup output, returns series with values.

:param str stats_file: sambamba output file with duplicate statistics.

Source code in pypiper/ngstk.py, lines 2092-2133
def parse_duplicate_stats(self, stats_file):
    """
    Parses sambamba markdup output, returns series with values.

    :param str stats_file: sambamba output file with duplicate statistics.
    """
    import pandas as pd

    series = pd.Series()
    try:
        with open(stats_file) as handle:
            content = handle.readlines()  # list of strings per line
    except OSError:
        return series
    try:
        line = [
            i
            for i in range(len(content))
            if "single ends (among them " in content[i]
        ][0]
        series["single-ends"] = re.sub(
            r"\D", "", re.sub(r"\(.*", "", content[line])
        )
        line = [
            i
            for i in range(len(content))
            if " end pairs...   done in " in content[i]
        ][0]
        series["paired-ends"] = re.sub(
            r"\D", "", re.sub(r"\.\.\..*", "", content[line])
        )
        line = [
            i
            for i in range(len(content))
            if " duplicates, sorting the list...   done in " in content[i]
        ][0]
        series["duplicates"] = re.sub(
            r"\D", "", re.sub(r"\.\.\..*", "", content[line])
        )
    except IndexError:
        pass
    return series
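The same substring-plus-regex pattern applies to sambamba markdup's log. A standalone sketch on example log lines (the text is modeled on sambamba's log format, with made-up counts; a dict stands in for the pandas Series):

```python
import re

# Example sambamba markdup log lines (illustrative values).
log_lines = [
    "  2000 single ends (among them 40 unmatched pairs)\n",
    "  sorted 150000 end pairs...   done in 10 ms\n",
    "  found 5000 duplicates, sorting the list...   done in 2 ms\n",
]

def count_before(marker, cut_pattern):
    # First line containing the marker, with the tail and non-digits removed.
    line = next(l for l in log_lines if marker in l)
    return re.sub(r"\D", "", re.sub(cut_pattern, "", line))

series = {
    "single-ends": count_before("single ends (among them ", r"\(.*"),
    "paired-ends": count_before(" end pairs...   done in ", r"\.\.\..*"),
    "duplicates": count_before(" duplicates, sorting the list...   done in ",
                               r"\.\.\..*"),
}
print(series)
# {'single-ends': '2000', 'paired-ends': '150000', 'duplicates': '5000'}
```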

parse_qc

parse_qc(qc_file)

Parse phantompeakqualtools (spp) QC table and return quality metrics.

:param str qc_file: Path to phantompeakqualtools output file, which contains sample quality measurements.

Source code in pypiper/ngstk.py, lines 2135-2155
def parse_qc(self, qc_file):
    """
    Parse phantompeakqualtools (spp) QC table and return quality metrics.

    :param str qc_file: Path to phantompeakqualtools output file, which
        contains sample quality measurements.
    """
    import pandas as pd

    series = pd.Series()
    try:
        with open(qc_file) as handle:
            line = (
                handle.readlines()[0].strip().split("\t")
            )  # list of strings per line
        series["NSC"] = line[-3]
        series["RSC"] = line[-2]
        series["qualityTag"] = line[-1]
    except:
        pass
    return series
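phantompeakqualtools writes one tab-separated row per sample, and the last three fields are NSC, RSC, and the quality tag. An illustration with a made-up row:

```python
# Example spp/phantompeakqualtools output row (values are made up).
row = "sample.bam\t15000000\t120,0,-10\t0.52,0.50,0.49\t1.10\t1.05\t1\n"
line = row.strip().split("\t")

# The metrics are always the last three tab-separated fields.
qc = {"NSC": line[-3], "RSC": line[-2], "qualityTag": line[-1]}
print(qc)  # {'NSC': '1.10', 'RSC': '1.05', 'qualityTag': '1'}
```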

plot_atacseq_insert_sizes

plot_atacseq_insert_sizes(bam, plot, output_csv, max_insert=1500, smallest_insert=30)

Plot the ATAC-seq insert (fragment) size distribution and fit a mixture of four Gaussian (nucleosomal) components plus one exponential (nucleosome-free) component. Heavy inspiration from: https://github.com/dbrg77/ATAC/blob/master/ATAC_seq_read_length_curve_fitting.ipynb

Source code in pypiper/ngstk.py, lines 1449-1612
def plot_atacseq_insert_sizes(
    self, bam, plot, output_csv, max_insert=1500, smallest_insert=30
):
    """
    Heavy inspiration from here:
    https://github.com/dbrg77/ATAC/blob/master/ATAC_seq_read_length_curve_fitting.ipynb
    """
    try:
        import matplotlib
        import matplotlib.mlab as mlab
        import numpy as np
        import pysam
        from scipy.integrate import simps
        from scipy.optimize import curve_fit

        matplotlib.use("Agg")
        import matplotlib.pyplot as plt
    except ImportError:
        print("Necessary Python modules couldn't be loaded.")
        return

    try:
        import seaborn as sns

        sns.set_style("whitegrid")
    except ImportError:
        pass

    def get_fragment_sizes(bam, max_insert=1500):
        frag_sizes = list()

        bam = pysam.Samfile(bam, "rb")

        for i, read in enumerate(bam):
            if read.tlen < max_insert:
                frag_sizes.append(read.tlen)
        bam.close()

        return np.array(frag_sizes)

    def mixture_function(x, *p):
        """
        Mixture function to model four gaussian (nucleosomal)
        and one exponential (nucleosome-free) distributions.
        """
        m1, s1, w1, m2, s2, w2, m3, s3, w3, m4, s4, w4, q, r = p
        nfr = expo(x, 2.9e-02, 2.8e-02)
        nfr[:smallest_insert] = 0

        return (
            mlab.normpdf(x, m1, s1) * w1
            + mlab.normpdf(x, m2, s2) * w2
            + mlab.normpdf(x, m3, s3) * w3
            + mlab.normpdf(x, m4, s4) * w4
            + nfr
        )

    def expo(x, q, r):
        """
        Exponential function.
        """
        return q * np.exp(-r * x)

    # get fragment sizes
    frag_sizes = get_fragment_sizes(bam)

    # bin
    numBins = np.linspace(0, max_insert, max_insert + 1)
    y, scatter_x = np.histogram(frag_sizes, numBins, density=1)
    # get the mid-point of each bin
    x = (scatter_x[:-1] + scatter_x[1:]) / 2

    # Parameters are empirical, need to check
    paramGuess = [
        200,
        50,
        0.7,  # gaussians
        400,
        50,
        0.15,
        600,
        50,
        0.1,
        800,
        55,
        0.045,
        2.9e-02,
        2.8e-02,  # exponential
    ]

    try:
        popt3, pcov3 = curve_fit(
            mixture_function,
            x[smallest_insert:],
            y[smallest_insert:],
            p0=paramGuess,
            maxfev=100000,
        )
    except:
        print("Nucleosomal fit could not be found.")
        return

    m1, s1, w1, m2, s2, w2, m3, s3, w3, m4, s4, w4, q, r = popt3

    # Plot
    plt.figure(figsize=(12, 12))

    # Plot distribution
    plt.hist(frag_sizes, numBins, histtype="step", ec="k", normed=1, alpha=0.5)

    # Plot nucleosomal fits
    plt.plot(x, mlab.normpdf(x, m1, s1) * w1, "r-", lw=1.5, label="1st nucleosome")
    plt.plot(x, mlab.normpdf(x, m2, s2) * w2, "g-", lw=1.5, label="2nd nucleosome")
    plt.plot(x, mlab.normpdf(x, m3, s3) * w3, "b-", lw=1.5, label="3rd nucleosome")
    plt.plot(x, mlab.normpdf(x, m4, s4) * w4, "c-", lw=1.5, label="4th nucleosome")

    # Plot nucleosome-free fit
    nfr = expo(x, 2.9e-02, 2.8e-02)
    nfr[:smallest_insert] = 0
    plt.plot(x, nfr, "k-", lw=1.5, label="nucleosome-free")

    # Plot sum of fits
    ys = mixture_function(x, *popt3)
    plt.plot(x, ys, "k--", lw=3.5, label="fit sum")

    plt.legend()
    plt.xlabel("Fragment size (bp)")
    plt.ylabel("Density")
    plt.savefig(plot, bbox_inches="tight")

    # Integrate curves and get areas under curve
    areas = [
        ["fraction", "area under curve", "max density"],
        ["Nucleosome-free fragments", simps(nfr), max(nfr)],
        [
            "1st nucleosome",
            simps(mlab.normpdf(x, m1, s1) * w1),
            max(mlab.normpdf(x, m1, s1) * w1),
        ],
        [
            "2nd nucleosome",
            simps(mlab.normpdf(x, m2, s2) * w2),
            max(mlab.normpdf(x, m2, s2) * w2),
        ],
        [
            "3rd nucleosome",
            simps(mlab.normpdf(x, m3, s3) * w3),
            max(mlab.normpdf(x, m3, s3) * w3),
        ],
        [
            "4th nucleosome",
            simps(mlab.normpdf(x, m4, s4) * w4),
            max(mlab.normpdf(x, m4, s4) * w4),
        ],
    ]

    try:
        import csv

        with open(output_csv, "w") as f:
            writer = csv.writer(f)
            writer.writerows(areas)
    except:
        pass
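The fitted model is a sum of four Gaussian densities (mono- through tetra-nucleosomal fragments) and an exponential truncated below `smallest_insert` (nucleosome-free fragments). A pure-Python sketch evaluating that mixture at a single fragment size, using the empirical initial-guess parameters from the code above:

```python
import math

def normpdf(x, m, s):
    # Gaussian probability density (what the mlab.normpdf calls compute).
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def expo(x, q, r):
    # Exponential decay for the nucleosome-free fraction.
    return q * math.exp(-r * x)

def mixture(x, params, smallest_insert=30):
    m1, s1, w1, m2, s2, w2, m3, s3, w3, m4, s4, w4, q, r = params
    # Nucleosome-free term is zeroed below the smallest plausible insert.
    nfr = expo(x, q, r) if x >= smallest_insert else 0.0
    return (normpdf(x, m1, s1) * w1 + normpdf(x, m2, s2) * w2
            + normpdf(x, m3, s3) * w3 + normpdf(x, m4, s4) * w4 + nfr)

# Initial-guess parameters from plot_atacseq_insert_sizes.
guess = [200, 50, 0.7, 400, 50, 0.15, 600, 50, 0.1,
         800, 55, 0.045, 2.9e-02, 2.8e-02]
print(mixture(200, guess))  # density near the mono-nucleosome peak
```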

run_spp

run_spp(input_bam, output, plot, cpus)

Run the SPP read peak analysis tool.

:param str input_bam: Path to reads file
:param str output: Path to output file
:param str plot: Path to plot file
:param int cpus: Number of processors to use
:return str: Command with which to run SPP

Source code in pypiper/ngstk.py, lines 1419-1433
def run_spp(self, input_bam, output, plot, cpus):
    """
    Run the SPP read peak analysis tool.

    :param str input_bam: Path to reads file
    :param str output: Path to output file
    :param str plot: Path to plot file
    :param int cpus: Number of processors to use
    :return str: Command with which to run SPP
    """
    base = "{} {} -rf".format(self.tools.Rscript, self.tools.spp)
    cmd = base + " -savp={} -s=0:5:500 -c={} -out={} -p={}".format(
        plot, input_bam, output, cpus
    )
    return cmd

sam_conversions

sam_conversions(sam_file, depth=True)

Convert sam files to bam files, then sort and index them for later use.

:param str sam_file: Path to SAM file to convert.
:param bool depth: also calculate coverage over each position

Source code in pypiper/ngstk.py, lines 994-1031
def sam_conversions(self, sam_file, depth=True):
    """
    Convert sam files to bam files, then sort and index them for later use.

    :param str sam_file: Path to SAM file to convert.
    :param bool depth: also calculate coverage over each position
    """
    cmd = (
        self.tools.samtools
        + " view -bS "
        + sam_file
        + " > "
        + sam_file.replace(".sam", ".bam")
        + "\n"
    )
    cmd += (
        self.tools.samtools
        + " sort "
        + sam_file.replace(".sam", ".bam")
        + " -o "
        + sam_file.replace(".sam", "_sorted.bam")
        + "\n"
    )
    cmd += (
        self.tools.samtools
        + " index "
        + sam_file.replace(".sam", "_sorted.bam")
        + "\n"
    )
    if depth:
        cmd += (
            self.tools.samtools
            + " depth "
            + sam_file.replace(".sam", "_sorted.bam")
            + " > "
            + sam_file.replace(".sam", "_sorted.depth")
            + "\n"
        )
    return cmd
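With no config file, `tools.samtools` resolves to `samtools` from `$PATH`, so the returned multi-line command follows a fixed naming convention derived from the input name. A standalone sketch reproducing that string construction for a hypothetical `sample.sam`:

```python
def sam_conversion_cmds(sam_file, samtools="samtools", depth=True):
    # Mirrors the name derivation in sam_conversions.
    bam = sam_file.replace(".sam", ".bam")
    sorted_bam = sam_file.replace(".sam", "_sorted.bam")
    cmds = [
        "{} view -bS {} > {}".format(samtools, sam_file, bam),
        "{} sort {} -o {}".format(samtools, bam, sorted_bam),
        "{} index {}".format(samtools, sorted_bam),
    ]
    if depth:
        cmds.append("{} depth {} > {}".format(
            samtools, sorted_bam, sam_file.replace(".sam", "_sorted.depth")))
    return "\n".join(cmds) + "\n"

print(sam_conversion_cmds("sample.sam"))
```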

samtools_index

samtools_index(bam_file)

Index a bam file.

Source code in pypiper/ngstk.py, lines 1120-1123
def samtools_index(self, bam_file):
    """Index a bam file."""
    cmd = self.tools.samtools + " index {0}".format(bam_file)
    return cmd

samtools_view

samtools_view(file_name, param, postpend='')

Run samtools view, with flexible parameters and post-processing.

This is used internally to implement the various count_reads functions.

:param str file_name: Path to input file
:param str param: String of parameters to pass to samtools view
:param str postpend: String to append to the samtools command; useful to add cut, sort, wc operations to the samtools view output.

Note that unlike most NGSTk methods, which return command strings, this one executes the command immediately via subprocess.check_output and returns its decoded output.

Source code in pypiper/ngstk.py, lines 916-930
def samtools_view(self, file_name, param, postpend=""):
    """
    Run samtools view, with flexible parameters and post-processing.

    This is used internally to implement the various count_reads functions.

    :param str file_name: Path to input file
    :param str param: String of parameters to pass to samtools view
    :param str postpend: String to append to the samtools command;
        useful to add cut, sort, wc operations to the samtools view output.
    """
    cmd = "{} view {} {} {}".format(self.tools.samtools, param, file_name, postpend)
    # in python 3, check_output returns a byte string which causes issues.
    # with python 3.6 we could use argument: "encoding='UTF-8'""
    return subprocess.check_output(cmd, shell=True).decode().strip()

skewer

skewer(input_fastq1, output_prefix, output_fastq1, log, cpus, adapters, input_fastq2=None, output_fastq2=None)

Create commands with which to run skewer.

:param str input_fastq1: Path to input (read 1) FASTQ file
:param str output_prefix: Prefix for output FASTQ file names
:param str output_fastq1: Path to (read 1) output FASTQ file
:param str log: Path to file to which to write logging information
:param int | str cpus: Number of processing cores to allow
:param str adapters: Path to file with sequencing adapters
:param str input_fastq2: Path to read 2 input FASTQ file
:param str output_fastq2: Path to read 2 output FASTQ file
:return list[str]: Sequence of commands to run to trim reads with skewer and rename files as desired.

Source code in pypiper/ngstk.py, lines 1235-1290
def skewer(
    self,
    input_fastq1,
    output_prefix,
    output_fastq1,
    log,
    cpus,
    adapters,
    input_fastq2=None,
    output_fastq2=None,
):
    """
    Create commands with which to run skewer.

    :param str input_fastq1: Path to input (read 1) FASTQ file
    :param str output_prefix: Prefix for output FASTQ file names
    :param str output_fastq1: Path to (read 1) output FASTQ file
    :param str log: Path to file to which to write logging information
    :param int | str cpus: Number of processing cores to allow
    :param str adapters: Path to file with sequencing adapters
    :param str input_fastq2: Path to read 2 input FASTQ file
    :param str output_fastq2: Path to read 2 output FASTQ file
    :return list[str]: Sequence of commands to run to trim reads with
        skewer and rename files as desired.
    """

    pe = input_fastq2 is not None
    mode = "pe" if pe else "any"
    cmds = list()
    cmd1 = self.tools.skewer + " --quiet"
    cmd1 += " -f sanger"
    cmd1 += " -t {0}".format(cpus)
    cmd1 += " -m {0}".format(mode)
    cmd1 += " -x {0}".format(adapters)
    cmd1 += " -o {0}".format(output_prefix)
    cmd1 += " {0}".format(input_fastq1)
    if input_fastq2 is not None:
        cmd1 += " {0}".format(input_fastq2)
    cmds.append(cmd1)
    if input_fastq2 is None:
        cmd2 = "mv {0} {1}".format(output_prefix + "-trimmed.fastq", output_fastq1)
        cmds.append(cmd2)
    else:
        cmd2 = "mv {0} {1}".format(
            output_prefix + "-trimmed-pair1.fastq", output_fastq1
        )
        cmds.append(cmd2)
        cmd3 = "mv {0} {1}".format(
            output_prefix + "-trimmed-pair2.fastq", output_fastq2
        )
        cmds.append(cmd3)
    cmd4 = "mv {0} {1}".format(output_prefix + "-trimmed.log", log)
    cmds.append(cmd4)
    return cmds
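Skewer writes to fixed names derived from `output_prefix`, so the returned command list always ends with `mv` steps that rename those outputs to the caller's requested paths. A sketch of the output names involved in single-end vs. paired-end mode (the names follow skewer's conventions as used in the code above):

```python
def skewer_outputs(output_prefix, paired):
    # Fixed output names skewer produces, which the mv commands then rename.
    if paired:
        fastqs = [output_prefix + "-trimmed-pair1.fastq",
                  output_prefix + "-trimmed-pair2.fastq"]
    else:
        fastqs = [output_prefix + "-trimmed.fastq"]
    return fastqs, output_prefix + "-trimmed.log"

fastqs, log = skewer_outputs("results/sample", paired=True)
print(fastqs)
# ['results/sample-trimmed-pair1.fastq', 'results/sample-trimmed-pair2.fastq']
print(log)  # results/sample-trimmed.log
```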

spp_call_peaks

spp_call_peaks(treatment_bam, control_bam, treatment_name, control_name, output_dir, broad, cpus, qvalue=None)

Build command for R script to call peaks with SPP.

:param str treatment_bam: Path to file with data for treatment sample.
:param str control_bam: Path to file with data for control sample.
:param str treatment_name: Name for the treatment sample.
:param str control_name: Name for the control sample.
:param str output_dir: Path to folder for output.
:param str | bool broad: Whether to specify broad peak calling mode.
:param int cpus: Number of cores the script may use.
:param float qvalue: FDR, as decimal value
:return str: Command to run.

Source code in pypiper/ngstk.py, lines 1899-1938
def spp_call_peaks(
    self,
    treatment_bam,
    control_bam,
    treatment_name,
    control_name,
    output_dir,
    broad,
    cpus,
    qvalue=None,
):
    """
    Build command for R script to call peaks with SPP.

    :param str treatment_bam: Path to file with data for treatment sample.
    :param str control_bam: Path to file with data for control sample.
    :param str treatment_name: Name for the treatment sample.
    :param str control_name: Name for the control sample.
    :param str output_dir: Path to folder for output.
    :param str | bool broad: Whether to specify broad peak calling mode.
    :param int cpus: Number of cores the script may use.
    :param float qvalue: FDR, as decimal value
    :return str: Command to run.
    """
    broad = "TRUE" if broad else "FALSE"
    cmd = (
        self.tools.Rscript
        + " `which spp_peak_calling.R` {0} {1} {2} {3} {4} {5} {6}".format(
            treatment_bam,
            control_bam,
            treatment_name,
            control_name,
            broad,
            cpus,
            output_dir,
        )
    )
    if qvalue is not None:
        cmd += " {}".format(qvalue)
    return cmd

validate_bam

validate_bam(input_bam)

Wrapper for Picard's ValidateSamFile.

:param str input_bam: Path to file to validate.
:return str: Command to run for the validation.

Source code in pypiper/ngstk.py, lines 658-668
def validate_bam(self, input_bam):
    """
    Wrapper for Picard's ValidateSamFile.

    :param str input_bam: Path to file to validate.
    :return str: Command to run for the validation.
    """
    cmd = self.tools.java + " -Xmx" + self.pm.javamem
    cmd += " -jar " + self.tools.picard + " ValidateSamFile"
    cmd += " INPUT=" + input_bam
    return cmd