Tutorial
============

This tutorial introduces the command-line usage and workflow of GenoKit.
GenoKit is designed for GFF and GTF files, while GenBankExtract is suited to GenBank files.


`GenoKit` is `tested <https://github.com/SitaoZ/GenoKit>`_ with Python 3.8.

Quick-start
-----------

Locate the GenoKit executable on the command line and list the commands it provides.

.. code-block:: bash
    :linenos:
     
    # find the GenoKit executable path
    which GenoKit
    # /opt/anaconda3/bin/GenoKit
    # show help info
    GenoKit -h
    
    #    ______                 __ __ _ __ 
    #   / ____/__  ____  ____  / //_/(_) /_
    #  / / __/ _ \/ __ \/ __ \/ ,<  / / __/
    # / /_/ /  __/ / / / /_/ / /| |/ / /_  
    # \____/\___/_/ /_/\____/_/ |_/_/\__/ 
    #     
    #   GenoKit - Genome and Gene ToolKit v0.2.6
    #   Contact: Sitao Zhu <zhusitao1990@163.com>
    # 
    #   Usage:
    #     GenoKit <command> [parameters]
    # 
    #   Database:
    #     create      Create GFF/GTF database
    #     stat        Database statistics
    # 
    #   Extract:
    #     gene        Extract gene sequence
    #     mrna        Extract mRNA sequence
    #     transcript  Extract transcript sequence
    #     exon        Extract exon sequence
    #     intron      Extract intron sequence
    #     cds         Extract CDS sequence
    #     utr         Extract 5'/3'UTR sequence
    #     uorf        Extract uORF sequence
    #     dorf        Extract dORF sequence
    #     promoter    Extract promoter sequence
    #     terminator  Extract terminator sequence
    #     igr         Extract intergenic region
    # 
    #   Design:
    #     primer      Primer design
    #     sgrna       Single guide RNA design
    #     sirna       Small interfering RNA design
    #     motif       Motif search
    # 
    #   Visualize:
    #     vision      Snapshot gene structure
    #     circos      Circlize genome structure
    # 
 
Main functional commands
------------------------
This section introduces the commands provided by the GenoKit module,
how to get help information for each of them, and how to use them. Each subcommand
prints its help documentation when it is invoked with ``-h`` on the command line.

Create
~~~~~~
The create command uses `gffutils <https://github.com/daler/gffutils>`_ to build a SQLite database, which
stores the genome and gene structure needed by the other functional commands in GenoKit.

.. code-block:: bash
    :linenos:
    
    GenoKit create -h 
    #    usage: GenoKit create [-h] [-f {GFF,GTF}] -g GENOMEFEATURE -o OUTPUT_PREFIX
    #    optional arguments:
    #        -h, --help            show this help message and exit
    #        -f {GFF,GTF}, --file_type {GFF,GTF}
    #                      genome annotation file
    #        -g GENOMEFEATURE, --genomefeature GENOMEFEATURE
    #                      genome annotation file
    #        -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
    #                      database absolute path
    
    # GFF  
    GenoKit create -f GFF -g ath.gff3 -o ath
    # or GTF 
    GenoKit create -f GTF -g ath.gtf  -o ath
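
For orientation, GFF3 and GTF are tab-delimited text formats with nine columns per feature line.
The sketch below parses a single GFF3 line with the Python standard library only. It is an
illustration of the input format, not how GenoKit or gffutils read the file; the helper name
``parse_gff3_line`` and the sample record are assumptions made up for this example.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; attributes become a sub-dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    record["attributes"] = attributes
    return record


# Made-up record for illustration (not taken from a real annotation file).
line = "Chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=geneA;Name=demo"
gene = parse_gff3_line(line)
print(gene["type"], gene["start"], gene["end"], gene["attributes"]["ID"])
# prints: gene 1000 2000 geneA
```

In practice the database built by ``GenoKit create`` replaces this kind of manual parsing.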

Promoter
~~~~~~~~
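Promoter sequences are essential for gene functional research. The promoter command extracts a region of
configurable length upstream of the transcription start site (TSS), optionally extended into the 5' UTR.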

.. code-block:: bash
    :linenos:
    
    GenoKit promoter -h 
    #usage: GenoKit promoter [-h] -d DATABASE -f GENOME [-g GENE]
    #                           [-l PROMOTER_LENGTH] [-u UTR5_UPPER_LENGTH]
    #                           [-o OUTPUT] [-t {csv,fasta}] [-p]
    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta path
    #  -g GENE, --gene GENE  specific gene; if not given, return whole genes
    #  -l PROMOTER_LENGTH, --promoter_length PROMOTER_LENGTH
    #                        promoter length before TSS (default 100 nt)
    #  -u UTR5_UPPER_LENGTH, --utr5_upper_length UTR5_UPPER_LENGTH
    #                        5' utr length after TSS (default 10 nt)
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -t {csv,fasta}, --output_format {csv,fasta}
    #                        output format
    #  -p, --print           output to stdout
    
    # All genes in the whole genome
    GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -o promoter.fa --output_format fasta
    # A given gene
    GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -g AT1G01010 -p --output_format fasta
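
To make the ``-l`` and ``-u`` options concrete, the following sketch computes the genomic window
they describe. It is an assumption about the coordinate arithmetic, not GenoKit's actual code:
the function name ``promoter_window`` is hypothetical, the TSS is taken to be the gene start on
the ``+`` strand and the gene end on the ``-`` strand, and coordinates are 1-based and inclusive
as in GFF.

```python
def promoter_window(gene_start, gene_end, strand,
                    upstream=100, downstream=10, chrom_length=None):
    """Return (start, end) of the promoter window, 1-based inclusive.

    upstream:   bases before the TSS (like -l)
    downstream: bases from the TSS onward (like -u)
    """
    if strand == "+":
        tss = gene_start
        start = tss - upstream
        end = tss + downstream - 1
    elif strand == "-":
        tss = gene_end
        start = tss - downstream + 1
        end = tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clamp the window to the chromosome boundaries.
    start = max(1, start)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


print(promoter_window(1000, 2000, "+", upstream=200, downstream=100))  # (800, 1099)
print(promoter_window(1000, 2000, "-", upstream=200, downstream=100))  # (1901, 2200)
```

For a gene on the ``-`` strand, the sequence in this window would still need to be
reverse-complemented to read 5' to 3'.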
    
 
UTR
~~~
UTR (untranslated region) sequences are essential for gene functional research. In molecular genetics,
a UTR refers to either of two sections, one on each side of the coding sequence on a strand of mRNA.
If it is found on the 5' side, it is called the 5' UTR (or leader sequence); if it is found on
the 3' side, it is called the 3' UTR (or trailer sequence).

.. code-block:: bash
    :linenos:
    
    GenoKit UTR -h 
    #usage: GenoKit UTR [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                     [-o OUTPUT] [-p] [-s {GFF,GTF}]
    #
    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta file
    #  -i TRANSCRIPT, --transcript TRANSCRIPT
    #                        specific transcript id; if not given, whole transcript
    #                        will return
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -p, --print           output to stdout
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database or GFF database
    #
    # All transcripts in the whole genome
    GenoKit UTR -d ath.GFF -f ath.fa -o utr.csv -s GFF
    # A given transcript
    GenoKit UTR -d ath.GFF -f ath.fa -i AT1G01010.1 -p -s GFF

uORF
~~~~~
A uORF (upstream open reading frame) is an open reading frame (ORF) within the 5' untranslated region (5' UTR) of an mRNA.
uORFs can regulate eukaryotic gene expression and can repress expression of the downstream primary ORF.
    
.. code-block:: bash
    :linenos:
    
    GenoKit uORF -h 

    #usage: GenoKit uORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                      [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
    #                      [-s {GFF,GTF}]
    
    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -i TRANSCRIPT, --transcript TRANSCRIPT
    #                        specific transcript id; if not given, whole transcript
    #                        will return
    #  -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
    #                        output format
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -m, --schematic_without_intron
    #                        schematic figure file for uORF, CDS and transcript
    #                        without intron
    #  -n, --schematic_with_intron
    #                        schematic figure file for uORF, CDS and transcript
    #                        with intron
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database or GFF database
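
The definition above can be sketched as a simple scan of a 5' UTR sequence. This is a minimal
illustration, not GenoKit's algorithm: it assumes a uORF starts at an ``ATG`` and ends at the first
in-frame stop codon that lies entirely inside the 5' UTR, so uORFs that run into the main CDS are
ignored. The function name ``find_uorfs`` and the example sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5):
    """Return (start, end, sequence) for each ATG..stop ORF inside a 5' UTR.

    Coordinates are 0-based, half-open offsets into the UTR sequence.
    """
    seq = utr5.upper()
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the reading frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3, seq[start:pos + 3]))
                break
    return uorfs


utr5 = "GGATGAAATAGCCATGCCCTGACC"  # made-up 5' UTR
for start, end, orf in find_uorfs(utr5):
    print(start, end, orf)
# prints:
# 2 11 ATGAAATAG
# 13 22 ATGCCCTGA
```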
    
CDS
~~~~
The CDS (coding sequence) is the portion of a gene's DNA or RNA that codes for protein.

.. code-block:: bash
    :linenos:
    
    GenoKit CDS -h 

    #usage: GenoKit CDS [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                      [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
    #                      [-s {GFF,GTF}]

    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -i TRANSCRIPT, --transcript TRANSCRIPT
    #                        specific transcript id; if not given, whole transcript
    #                        will return
    #  -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
    #                        output format
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -m, --schematic_without_intron
    #                        schematic figure file for uORF, CDS and transcript
    #                        without intron
    #  -n, --schematic_with_intron
    #                        schematic figure file for uORF, CDS and transcript
    #                        with intron
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database or GFF database

dORF
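~~~~
A dORF (downstream open reading frame) is an open reading frame (ORF) within the 3' untranslated region (3' UTR) of an mRNA.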

.. code-block:: bash
    :linenos:
    
    GenoKit dORF -h
    #usage: GenoKit dORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                      [-t {csv,fasta,gff}] [-o OUTPUT]
    #                      [-m SCHEMATIC_WITHOUT_INTRON]
    #                      [-n SCHEMATIC_WITH_INTRON] [-s {GFF,GTF}]

    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -i TRANSCRIPT, --transcript TRANSCRIPT
    #                        specific transcript id; if not given, whole transcript
    #                        will return
    #  -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
    #                        output format
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -m SCHEMATIC_WITHOUT_INTRON, --schematic_without_intron SCHEMATIC_WITHOUT_INTRON
    #                        schematic figure file for dORF, CDS and transcript
    #                        without intron
    #  -n SCHEMATIC_WITH_INTRON, --schematic_with_intron SCHEMATIC_WITH_INTRON
    #                        schematic figure file for dORF, CDS and transcript
    #                        with intron
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database or GFF database

cDNA/mRNA
~~~~~~~~~
In genetics, complementary DNA (cDNA) is DNA synthesized from a single-stranded RNA (e.g., messenger RNA (mRNA) or microRNA (miRNA)) template in a reaction catalyzed by the enzyme reverse transcriptase.

.. code-block:: bash
    :linenos:
    
    GenoKit cdna -h
    #usage: GenoKit cdna [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                      [-o OUTPUT] [-t {csv,fasta}] [-p] [-u]
    #                      [-s {GFF,GTF}]

    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -i TRANSCRIPT, --transcript TRANSCRIPT
    #                        specific transcript; if not given, return whole
    #                        transcripts
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -t {csv,fasta}, --output_format {csv,fasta}
    #                        output format
    #  -p, --print           output to stdout
    #  -u, --upper           upper CDS and lower utr
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database or GFF database
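
The ``-u`` option is described as "upper CDS and lower utr". One way to picture that output is the
sketch below, which is an assumption about the formatting rather than GenoKit's implementation.
The helper ``mark_cds`` is hypothetical and takes the CDS position as 0-based, half-open offsets
on the spliced transcript.

```python
def mark_cds(transcript, cds_start, cds_end):
    """Return the transcript with UTRs in lower case and the CDS in upper case."""
    if not 0 <= cds_start <= cds_end <= len(transcript):
        raise ValueError("CDS offsets must lie within the transcript")
    return (transcript[:cds_start].lower()
            + transcript[cds_start:cds_end].upper()
            + transcript[cds_end:].lower())


# Made-up transcript: 3 nt 5' UTR, 9 nt CDS, 2 nt 3' UTR.
print(mark_cds("GGCATGAAATAGTT", 3, 12))  # ggcATGAAATAGtt
```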
    
Gene
~~~~
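The gene command extracts the genomic sequence of a specific gene, or of all genes when no gene ID is given.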

.. code-block:: bash 
    :linenos:
    
    GenoKit gene -h
    #usage: GenoKit gene [-h] -d DATABASE -f GENOME [-g GENE] [-o OUTPUT]
    #                      [-p]

    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -g GENE, --gene GENE  specific gene; if not given, return whole genes
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -p, --print           output to stdout
    
Exon
~~~~
An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing.

.. code-block:: bash
    :linenos:
    
    GenoKit exon -h
    #usage: GenoKit exon [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                      [-o OUTPUT] [-p] [-s {GFF,GTF}]

    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -i TRANSCRIPT, --transcript TRANSCRIPT
    #                        specific transcript id; needed
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -p, --print           output to stdout
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database or GFF database    

Intron
~~~~~~
An intron is any nucleotide sequence within a gene that is removed by RNA processing during production of the final RNA product.

.. code-block:: bash
    :linenos:
    
    GenoKit intron -h
    #usage: GenoKit intron [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                        [-o OUTPUT] [-p] [-s {GFF,GTF}]

    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -i TRANSCRIPT, --transcript TRANSCRIPT
    #                        specific transcript id; needed
    #  -o OUTPUT, --output OUTPUT
    #                        output file path
    #  -p, --print           output to stdout
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database or GFF database
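
Annotation files usually list exons rather than introns, so intron coordinates are typically
derived as the gaps between consecutive exons of one transcript. The sketch below shows that
derivation under stated assumptions: coordinates are 1-based and inclusive, all exons belong to
the same transcript, and the function name ``introns_from_exons`` is hypothetical. It is not
taken from GenoKit's source.

```python
def introns_from_exons(exons):
    """Return intron (start, end) pairs lying between consecutive exons.

    exons: iterable of (start, end) tuples, 1-based inclusive, in any order.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit an intron if at least one base separates the two exons.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


exons = [(501, 650), (100, 200), (301, 400)]  # made-up exon coordinates
print(introns_from_exons(exons))  # [(201, 300), (401, 500)]
```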

IGR
~~~
An IGR (intergenic region) is a stretch of DNA sequences located between genes.

.. code-block:: bash
    :linenos:
    
    GenoKit IGR -h
    #usage: GenoKit IGR [-h] -d DATABASE -f GENOME [-l IGR_LENGTH]
    #                         [-o OUTPUT] [-p] [-s {GFF,GTF}]
    
    #optional arguments:
    #  -h, --help            show this help message and exit
    #  -d DATABASE, --database DATABASE
    #                        database generated by subcommand create
    #  -f GENOME, --genome GENOME
    #                        genome fasta
    #  -l IGR_LENGTH, --IGR_length IGR_LENGTH
    #                        IGR length threshold
    #  -o OUTPUT, --output OUTPUT
    #                        output fasta file path
    #  -p, --print           output to stdout
    #  -s {GFF,GTF}, --style {GFF,GTF}
    #                        GTF database only contain protein genes, while GFF
    #                        database contain protein genes and nocoding genes
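
Intergenic regions can be pictured as the gaps left after all gene intervals on a chromosome are
laid out. The following sketch computes those gaps and applies a length filter. It assumes that
the threshold is a minimum length, which the help text does not state, and that coordinates are
1-based and inclusive; the function name ``intergenic_regions`` and the gene list are made up,
and GenoKit's own logic may differ.

```python
def intergenic_regions(genes, min_length=1):
    """Return gaps between genes on one chromosome as (start, end) pairs.

    genes: iterable of (start, end) tuples, 1-based inclusive; overlaps are allowed.
    min_length: keep only gaps at least this long (assumed meaning of the threshold).
    """
    regions = []
    covered_end = None  # right-most position covered by any gene seen so far
    for start, end in sorted(genes):
        if covered_end is not None and start > covered_end + 1:
            gap_start, gap_end = covered_end + 1, start - 1
            if gap_end - gap_start + 1 >= min_length:
                regions.append((gap_start, gap_end))
        covered_end = end if covered_end is None else max(covered_end, end)
    return regions


genes = [(100, 500), (400, 900), (1500, 2000), (2010, 2500)]  # made-up genes
print(intergenic_regions(genes))                  # [(901, 1499), (2001, 2009)]
print(intergenic_regions(genes, min_length=100))  # [(901, 1499)]
```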
    
Requirements
------------

* python >= 3.7.6 `python <https://www.python.org/>`_
* pandas >= 1.2.4 `pandas <https://pandas.pydata.org/docs/>`_
* gffutils >= 0.10.1 `gffutils <https://pythonhosted.org/gffutils/>`_
* setuptools >= 49.2.0 `setuptools <https://pypi.org/project/setuptools/>`_
* BioPython >= 1.78 `biopython <https://biopython.org/wiki/Documentation/>`_

Install them all with `conda`::

    conda install --channel conda-forge --channel bioconda python pandas gffutils setuptools biopython