Tutorial

This tutorial introduces the command-line usage and workflow of GenoKit. GenoKit is designed for GFF and GTF files, while GenBankExtract is suited to GenBank files.

GenoKit is tested with Python 3.8.

Quick-start

Locate GenoKit on the command line and list the commands it provides.

# find the GenoKit exec path
which GenoKit
/opt/anaconda3/bin/GenoKit
# show help info
GenoKit -h

#    ______                 __ __ _ __
#   / ____/__  ____  ____  / //_/(_) /_
#  / / __/ _ \/ __ \/ __ \/ ,<  / / __/
# / /_/ /  __/ / / / /_/ / /| |/ / /_
# \____/\___/_/ /_/\____/_/ |_/_/\__/
#
#   GenoKit - Genome and Gene ToolKit v0.2.6
#   Contact: Sitao Zhu <zhusitao1990@163.com>
#
#   Usage:
#     GenoKit <command> [parameters]
#
#   Database:
#     create      Create GFF/GTF database
#     stat        Database statistics
#
#   Extract:
#     gene        Extract gene sequence
#     mrna        Extract mRNA sequence
#     transcript  Extract transcript sequence
#     exon        Extract exon sequence
#     intron      Extract intron sequence
#     cds         Extract CDS sequence
#     utr         Extract 5'/3'UTR sequence
#     uorf        Extract uORF sequence
#     dorf        Extract dORF sequence
#     promoter    Extract promoter sequence
#     terminator  Extract terminator sequence
#     igr         Extract intergenic region
#
#   Design:
#     primer      Primer design
#     sgrna       Single guide RNA design
#     sirna       Small interfering RNA design
#     motif       Motif search
#
#   Visualize:
#     vision      Snapshot gene structure
#     circos      Circlize genome structure
#

Main functional commands

This section introduces the commands provided by the GenoKit module, how to get their help information, and how to use them. Each subcommand prints its help text when it is invoked with -h on the command line.

Create

The create command uses gffutils to build a SQLite database, which stores the genome and gene structure needed by the other functional commands in GenoKit.

GenoKit create -h
#    usage: GenoKit create [-h] [-f {GFF,GTF}] -g GENOMEFEATURE -o OUTPUT_PREFIX
#    optional arguments:
#        -h, --help            show this help message and exit
#        -f {GFF,GTF}, --file_type {GFF,GTF}
#                      genome annotation file
#        -g GENOMEFEATURE, --genomefeature GENOMEFEATURE
#                      genome annotation file
#        -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
#                      database absolute path

# GFF
GenoKit create -f GFF -g ath.gff3 -o ath
# or GTF
GenoKit create -f GTF -g ath.gtf  -o ath
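
To give a feel for what such an annotation database holds, here is a minimal sketch that uses only Python's standard sqlite3 module. It does not reproduce the gffutils schema or GenoKit's own code: the table layout, the function names (parse_gff3_line, build_db, children) and the toy records are all assumptions made for illustration.

```python
import sqlite3

# Toy GFF3 records (hypothetical IDs and coordinates, for illustration only).
GFF3_LINES = [
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=geneA",
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "chr1\tdemo\texon\t100\t300\t.\t+\t.\tID=geneA.1.exon1;Parent=geneA.1",
]


def parse_gff3_line(line):
    """Split one GFF3 feature line into the fields stored below."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if item)
    return (attrs.get("ID"), cols[2], cols[0], int(cols[3]), int(cols[4]),
            cols[6], attrs.get("Parent"))


def build_db(lines):
    """Load feature lines into an in-memory SQLite table (simplified layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (id TEXT, featuretype TEXT, seqid TEXT, "
        "feat_start INTEGER, feat_end INTEGER, strand TEXT, parent TEXT)"
    )
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?)",
        [parse_gff3_line(line) for line in lines if not line.startswith("#")],
    )
    return conn


def children(conn, parent_id):
    """Return (id, featuretype) of the direct children of a feature."""
    rows = conn.execute(
        "SELECT id, featuretype FROM features WHERE parent = ? "
        "ORDER BY feat_start",
        (parent_id,),
    )
    return rows.fetchall()


conn = build_db(GFF3_LINES)
print(children(conn, "geneA"))
```

Once the parent and child relations are stored this way, commands that need the exons or CDS of one transcript can look them up by ID instead of re-reading the annotation file.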

Promoter

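Promoter sequences are essential for gene functional research. The promoter command extracts the region upstream of the transcription start site (TSS), whose length is set with -l, and can extend it into the 5’ UTR downstream of the TSS with -u.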

GenoKit promoter -h
#usage: GenoKit promoter [-h] -d DATABASE -f GENOME [-g GENE]
#                           [-l PROMOTER_LENGTH] [-u UTR5_UPPER_LENGTH]
#                           [-o OUTPUT] [-t {csv,fasta}] [-p]
#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta path
#  -g GENE, --gene GENE  specific gene; if not given, return whole genes
#  -l PROMOTER_LENGTH, --promoter_length PROMOTER_LENGTH
#                        promoter length before TSS (default 100 nt)
#  -u UTR5_UPPER_LENGTH, --utr5_upper_length UTR5_UPPER_LENGTH
#                        5' utr length after TSS (default 10 nt)
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -t {csv,fasta}, --output_format {csv,fasta}
#                        output format
#  -p, --print           output to stdout

# All genes in the whole genome
GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -o promoter.fasta --output_format fasta
# A given gene
GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -g AT1G01010 -p --output_format fasta
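
The -l and -u options describe a window around the TSS. The sketch below shows one plausible way to turn them into genomic coordinates on either strand. It is an assumption about the geometry, not GenoKit's actual code: the function name promoter_window, the 1-based inclusive coordinates and the choice to count the TSS base as the first base of the 5' UTR part are all hypothetical.

```python
def promoter_window(tss, strand, promoter_length=100, utr5_length=10):
    """Return the 1-based inclusive (start, end) of the promoter window.

    The window covers `promoter_length` bases upstream of the TSS plus
    `utr5_length` bases beginning at the TSS itself. On the minus strand,
    upstream means larger genomic coordinates.
    """
    if strand == "+":
        start = tss - promoter_length
        end = tss + utr5_length - 1
    elif strand == "-":
        start = tss - utr5_length + 1
        end = tss + promoter_length
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the first base of the chromosome.
    return max(1, start), end


print(promoter_window(1000, "+", 200, 100))
print(promoter_window(5000, "-", 200, 100))
```

For a minus-strand gene the extracted genomic slice would still need to be reverse-complemented so that the promoter reads 5' to 3' relative to the gene.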

UTR

UTR (untranslated region) sequences are essential for gene functional research. In molecular genetics, a UTR is either of the two sections that flank the coding sequence on a strand of mRNA. The section on the 5’ side is called the 5’ UTR (or leader sequence), and the section on the 3’ side is called the 3’ UTR (or trailer sequence).

GenoKit UTR -h
#usage: GenoKit UTR [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                     [-o OUTPUT] [-p] [-s {GFF,GTF}]
#
#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta file
#  -i TRANSCRIPT, --transcript TRANSCRIPT
#                        specific transcript id; if not given, whole transcript
#                        will return
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -p, --print           output to stdout
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database or GFF database

# All transcripts in the whole genome
GenoKit UTR -d ath.GFF -f ath.fa -o utr.csv -s GFF
# A given transcript
GenoKit UTR -d ath.GFF -f ath.fa -i AT1G01010.1 -p -s GFF

uORF

uORF (upstream open reading frame), is an open reading frame (ORF) within the 5’ untranslated region (5’UTR) of an mRNA. uORFs can regulate eukaryotic gene expression and repress downstream expression of the primary ORF.

GenoKit uORF -h

#usage: GenoKit uORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                      [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
#                      [-s {GFF,GTF}]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -i TRANSCRIPT, --transcript TRANSCRIPT
#                        specific transcript id; if not given, whole transcript
#                        will return
#  -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
#                        output format
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -m, --schematic_without_intron
#                        schematic figure file for uORF, CDS and transcript
#                        without intron
#  -n, --schematic_with_intron
#                        schematic figure file for uORF, CDS and transcript
#                        with intron
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database or GFF database
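
As an illustration of what a uORF search involves, the sketch below scans a 5' UTR sequence for ATG codons followed by an in-frame stop codon. It is a simplified sketch, not GenoKit's algorithm: the function name find_uorfs is hypothetical, only uORFs that lie entirely inside the UTR are reported, and uORFs that run into the main CDS are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5):
    """Return (start, end, sequence) for every ATG...stop ORF inside utr5.

    Coordinates are 0-based with an exclusive end, relative to the UTR.
    """
    seq = utr5.upper()
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3, seq[start:pos + 3]))
                break
    return uorfs


print(find_uorfs("GGATGAAATAGCC"))
```

In the example sequence, the TGA that overlaps the ATG is in a different reading frame, so it does not terminate the uORF; the in-frame TAG does.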

CDS

The CDS (coding sequence), is the portion of a gene’s DNA or RNA that codes for protein.

GenoKit CDS -h

#usage: GenoKit uORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                      [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
#                      [-s {GFF,GTF}]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -i TRANSCRIPT, --transcript TRANSCRIPT
#                        specific transcript id; if not given, whole transcript
#                        will return
#  -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
#                        output format
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -m, --schematic_without_intron
#                        schematic figure file for uORF, CDS and transcript
#                        without intron
#  -n, --schematic_with_intron
#                        schematic figure file for uORF, CDS and transcript
#                        with intron
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database or GFF database
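
A CDS is usually annotated as several segments that have to be joined, and reverse-complemented for genes on the minus strand. The following sketch shows that general idea under stated assumptions: the function names and the toy chromosome string are hypothetical, coordinates are 1-based and inclusive as in GFF, and this is not taken from GenoKit's source.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, segments, strand):
    """Join 1-based inclusive (start, end) segments into one sequence.

    Segments are joined in ascending genomic order; for the minus strand
    the joined sequence is reverse-complemented so it reads 5' to 3'.
    """
    joined = "".join(chrom_seq[start - 1:end] for start, end in sorted(segments))
    return reverse_complement(joined) if strand == "-" else joined


# Toy chromosome with two CDS segments separated by an intron.
chrom = "CCATGGTAAGTTTAGGCTAACC"
print(spliced_sequence(chrom, [(3, 5), (15, 20)], "+"))
print(spliced_sequence(chrom, [(3, 5), (15, 20)], "-"))
```

The same joining step applies to exons when building a transcript sequence; only the set of segments differs.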

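dORF

A dORF (downstream open reading frame) is an open reading frame (ORF) within the 3’ untranslated region (3’UTR) of an mRNA.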

GenoKit dORF -h
#usage: GenoKit dORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                      [-t {csv,fasta,gff}] [-o OUTPUT]
#                      [-m SCHEMATIC_WITHOUT_INTRON]
#                      [-n SCHEMATIC_WITH_INTRON] [-s {GFF,GTF}]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -i TRANSCRIPT, --transcript TRANSCRIPT
#                        specific transcript id; if not given, whole transcript
#                        will return
#  -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
#                        output format
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -m SCHEMATIC_WITHOUT_INTRON, --schematic_without_intron SCHEMATIC_WITHOUT_INTRON
#                        schematic figure file for dORF, CDS and transcript
#                        without intron
#  -n SCHEMATIC_WITH_INTRON, --schematic_with_intron SCHEMATIC_WITH_INTRON
#                        schematic figure file for dORF, CDS and transcript
#                        with intron
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database or GFF database

cDNA/mRNA

In genetics, complementary DNA (cDNA) is DNA synthesized from a single-stranded RNA (e.g., messenger RNA (mRNA) or microRNA (miRNA)) template in a reaction catalyzed by the enzyme reverse transcriptase.

GenoKit cdna -h
#usage: GenoKit cdna [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                      [-o OUTPUT] [-t {csv,fasta}] [-p] [-u]
#                      [-s {GFF,GTF}]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -i TRANSCRIPT, --transcript TRANSCRIPT
#                        specific transcript; if not given, return whole
#                        transcripts
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -t {csv,fasta}, --output_format {csv,fasta}
#                        output format
#  -p, --print           output to stdout
#  -u, --upper           upper CDS and lower utr
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database or GFF database
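
The -u option is described as "upper CDS and lower utr". A minimal sketch of that kind of case marking is shown below, assuming the CDS position inside the spliced transcript is already known. The function name mark_cds and its 0-based, end-exclusive coordinates are hypothetical and not part of GenoKit.

```python
def mark_cds(transcript_seq, cds_start, cds_end):
    """Uppercase the CDS and lowercase the flanking UTRs.

    cds_start and cds_end are 0-based, end-exclusive offsets of the CDS
    within the spliced transcript sequence.
    """
    return (
        transcript_seq[:cds_start].lower()
        + transcript_seq[cds_start:cds_end].upper()
        + transcript_seq[cds_end:].lower()
    )


print(mark_cds("GGATGAAATAGCC", 2, 11))
```

Case marking keeps the sequence in a single record while still showing where translation starts and stops.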

Gene

The gene command extracts the genomic sequence of a given gene, or of all genes when -g is not given.

GenoKit gene -h
#usage: GenoKit gene [-h] -d DATABASE -f GENOME [-g GENE] [-o OUTPUT]
#                      [-p]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -g GENE, --gene GENE  specific gene; if not given, return whole genes
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -p, --print           output to stdout

Exon

An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing.

GenoKit exon -h
#usage: GenoKit exon [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                      [-o OUTPUT] [-p] [-s {GFF,GTF}]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -i TRANSCRIPT, --transcript TRANSCRIPT
#                        specific transcript id; needed
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -p, --print           output to stdout
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database or GFF database

Intron

An intron is any nucleotide sequence within a gene that is removed by RNA processing during production of the final RNA product.

GenoKit intron -h
#usage: GenoKit intron [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                        [-o OUTPUT] [-p] [-s {GFF,GTF}]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -i TRANSCRIPT, --transcript TRANSCRIPT
#                        specific transcript id; needed
#  -o OUTPUT, --output OUTPUT
#                        output file path
#  -p, --print           output to stdout
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database or GFF database
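
Annotation files often list exons but not introns, so intron coordinates are commonly derived as the gaps between consecutive exons of one transcript. The sketch below illustrates that derivation. The function name introns_from_exons is hypothetical, coordinates are assumed to be 1-based and inclusive, and GenoKit may compute introns differently.

```python
def introns_from_exons(exons):
    """Return the (start, end) gaps between consecutive exons.

    `exons` holds 1-based inclusive (start, end) pairs for one transcript;
    they are sorted here, so input order does not matter.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Directly adjacent exons leave no intron between them.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


print(introns_from_exons([(100, 200), (301, 400), (501, 600)]))
```

The result is given in genomic order; for a minus-strand transcript the list would be reversed to number introns from the 5' end.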

IGR

An IGR (intergenic region) is a stretch of DNA sequences located between genes.

GenoKit IGR -h
#usage: GenoKit IGR [-h] -d DATABASE -f GENOME [-l IGR_LENGTH]
#                         [-o OUTPUT] [-p] [-s {GFF,GTF}]

#optional arguments:
#  -h, --help            show this help message and exit
#  -d DATABASE, --database DATABASE
#                        database generated by subcommand create
#  -f GENOME, --genome GENOME
#                        genome fasta
#  -l IGR_LENGTH, --IGR_length IGR_LENGTH
#                        IGR length threshold
#  -o OUTPUT, --output OUTPUT
#                        output fasta file path
#  -p, --print           output to stdout
#  -s {GFF,GTF}, --style {GFF,GTF}
#                        GTF database only contain protein genes, while GFF
#                        database contain protein genes and nocoding genes
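
Intergenic regions can be derived from gene coordinates as the gaps left once overlapping genes are merged. The sketch below illustrates this for one chromosome. It rests on assumptions: the function name intergenic_regions is hypothetical, the length threshold is treated as a minimum (the help text does not say whether -l is a minimum or a maximum), and the regions before the first gene and after the last gene are not reported.

```python
def intergenic_regions(genes, min_length=1):
    """Return gaps between genes on one chromosome.

    `genes` holds 1-based inclusive (start, end) pairs. Overlapping or
    nested genes are handled by tracking the furthest gene end seen so far.
    Only gaps of at least `min_length` bases are returned.
    """
    regions = []
    furthest_end = None
    for start, end in sorted(genes):
        if furthest_end is not None:
            gap = start - furthest_end - 1
            if gap > 0 and gap >= min_length:
                regions.append((furthest_end + 1, start - 1))
        furthest_end = end if furthest_end is None else max(furthest_end, end)
    return regions


genes = [(100, 200), (150, 300), (500, 600), (610, 700)]
print(intergenic_regions(genes, min_length=50))
```

With the toy genes above, the 9-base gap between the last two genes falls under the threshold of 50 and is dropped, while the gap after the two overlapping genes is kept.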

Install the dependencies with conda:

conda install --channel conda-forge --channel python pandas gffutils setuptools biopython