Tutorial¶
This tutorial introduces the command-line usage and workflow of GenoKit. GenoKit is designed for GFF and GTF files, while GenBankExtract is suited to GenBank files.
GenoKit is tested with Python 3.8.
Quick-start¶
Locate the GenoKit executable on the command line and list the commands it provides.
# find the GenoKit exec path
which GenoKit
/opt/anaconda3/bin/GenoKit
# show help info
GenoKit -h

# ______ __ __ _ __
# / ____/__ ____ ____ / //_/(_) /_
# / / __/ _ \/ __ \/ __ \/ ,< / / __/
# / /_/ / __/ / / / /_/ / /| |/ / /_
# \____/\___/_/ /_/\____/_/ |_/_/\__/
#
# GenoKit - Genome and Gene ToolKit v0.2.6
# Contact: Sitao Zhu <zhusitao1990@163.com>
#
# Usage:
#   GenoKit <command> [parameters]
#
# Database:
#   create       Create GFF/GTF database
#   stat         Database statistics
#
# Extract:
#   gene         Extract gene sequence
#   mrna         Extract mRNA sequence
#   transcript   Extract transcript sequence
#   exon         Extract exon sequence
#   intron       Extract intron sequence
#   cds          Extract CDS sequence
#   utr          Extract 5'/3'UTR sequence
#   uorf         Extract uORF sequence
#   dorf         Extract dORF sequence
#   promoter     Extract promoter sequence
#   terminator   Extract terminator sequence
#   igr          Extract intergenic region
#
# Design:
#   primer       Primer design
#   sgrna        Single guide RNA design
#   sirna        Small interfering RNA design
#   motif        Motif search
#
# Visualize:
#   vision       Snapshot gene structure
#   circos       Circlize genome structure
#
Main functional commands¶
This section introduces the commands provided by the GenoKit module, how to get help for each of them, and how to use them. Every subcommand prints its help text when it is called with -h on the command line.
Create¶
The create command uses gffutils to build an SQLite database, which stores the genome and gene structure needed by the other functional commands in GenoKit.
GenoKit create -h
# usage: GenoKit create [-h] [-f {GFF,GTF}] -g GENOMEFEATURE -o OUTPUT_PREFIX
# optional arguments:
#   -h, --help            show this help message and exit
#   -f {GFF,GTF}, --file_type {GFF,GTF}
#                         genome annotation file
#   -g GENOMEFEATURE, --genomefeature GENOMEFEATURE
#                         genome annotation file
#   -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
#                         database absolute path

# GFF
GenoKit create -f GFF -g ath.gff3 -o ath
# or GTF
GenoKit create -f GTF -g ath.gtf -o ath
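To give a feel for what an annotation database holds, the following sketch loads GFF3 feature lines into an in-memory SQLite table using only the Python standard library. It is not how gffutils or GenoKit store their data: the table layout, the `load_gff3` helper, and the two sample lines are assumptions made for illustration.

```python
import sqlite3

# Hypothetical two-feature GFF3 snippet (tab-separated, nine columns).
GFF3_TEXT = (
    "Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010\n"
    "Chr1\tTAIR10\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010\n"
)


def load_gff3(text):
    """Parse GFF3 text into an in-memory SQLite table (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (id TEXT, seqid TEXT, type TEXT, "
        "start_pos INTEGER, end_pos INTEGER, strand TEXT, parent TEXT)"
    )
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        seqid, _src, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")
        # Column 9 holds semicolon-separated key=value pairs.
        attr = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
        conn.execute(
            "INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?)",
            (attr.get("ID"), seqid, ftype, int(start), int(end), strand, attr.get("Parent")),
        )
    return conn


conn = load_gff3(GFF3_TEXT)
rows = conn.execute(
    "SELECT id, type, start_pos, end_pos FROM features ORDER BY type"
).fetchall()
print(rows)
```

Once the features sit in a table, looking up the children of a gene is a single query on the `parent` column, which is the kind of lookup the extraction commands depend on.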
Promoter¶
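Promoter sequences are essential for gene function research. The promoter command extracts the region upstream of the transcription start site (TSS), optionally extended into the transcribed region downstream of the TSS.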
GenoKit promoter -h
# usage: GenoKit promoter [-h] -d DATABASE -f GENOME [-g GENE]
#                         [-l PROMOTER_LENGTH] [-u UTR5_UPPER_LENGTH]
#                         [-o OUTPUT] [-t {csv,fasta}] [-p]
# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta path
#   -g GENE, --gene GENE  specific gene; if not given, return whole genes
#   -l PROMOTER_LENGTH, --promoter_length PROMOTER_LENGTH
#                         promoter length before TSS (default 100 nt)
#   -u UTR5_UPPER_LENGTH, --utr5_upper_length UTR5_UPPER_LENGTH
#                         5' utr length after TSS (default 10 nt)
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -t {csv,fasta}, --output_format {csv,fasta}
#                         output format
#   -p, --print           output to stdout

# All genes in the whole genome
GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -o promoter.csv --output_format fasta
# A given gene
GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -g AT1G01010 -p --output_format fasta
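The -l and -u options describe a window around the TSS. The sketch below shows one way such a window could be computed from gene coordinates; it is not GenoKit's actual implementation. It assumes 1-based inclusive coordinates, takes the gene start as the TSS on the plus strand and the gene end on the minus strand, and counts the TSS base itself as part of the downstream stretch. The function name `promoter_interval` and its parameters are hypothetical.

```python
def promoter_interval(start, end, strand, promoter_length=100,
                      utr5_upper_length=10, chrom_length=None):
    """Return the (left, right) genomic window around the TSS, 1-based inclusive.

    Assumption: the TSS is the gene start on '+' and the gene end on '-'.
    """
    if strand == "+":
        tss = start
        left = tss - promoter_length            # upstream lies to the left
        right = tss + utr5_upper_length - 1     # downstream includes the TSS base
    elif strand == "-":
        tss = end
        left = tss - utr5_upper_length + 1      # downstream lies to the left
        right = tss + promoter_length           # upstream lies to the right
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    left = max(left, 1)                         # clamp at the chromosome start
    if chrom_length is not None:
        right = min(right, chrom_length)        # clamp at the chromosome end
    return left, right


print(promoter_interval(3631, 5899, "+", 200, 100))  # (3431, 3730)
print(promoter_interval(1000, 2000, "-", 200, 100))  # (1901, 2200)
```

For a minus-strand gene the window returned here is still in forward-strand coordinates, so the extracted sequence would need to be reverse-complemented to read 5' to 3'.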
UTR¶
UTR (untranslated region) sequences are essential for gene function research. In molecular genetics, a UTR is either of two sections, one on each side of the coding sequence on a strand of mRNA. The section on the 5’ side is called the 5’ UTR (or leader sequence), and the section on the 3’ side is called the 3’ UTR (or trailer sequence).
GenoKit UTR -h
# usage: GenoKit UTR [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                    [-o OUTPUT] [-p] [-s {GFF,GTF}]
#
# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta file
#   -i TRANSCRIPT, --transcript TRANSCRIPT
#                         specific transcript id; if not given, whole transcript
#                         will return
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -p, --print           output to stdout
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database or GFF database
#
# All transcripts in the whole genome
GenoKit UTR -d ath.GFF -f ath.fa -o utr.csv -s GFF
# A given transcript
GenoKit UTR -d ath.GFF -f ath.fa -i AT1G01010.1 -p -s GFF
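When an annotation lists only exons and CDS segments, the UTRs can be derived as the exonic parts that lie outside the CDS range. The following is a minimal sketch of that arithmetic, not GenoKit's actual implementation; it assumes 1-based inclusive coordinates, and the function name `split_utrs` is hypothetical.

```python
def split_utrs(exons, cds_start, cds_end, strand):
    """Split exonic sequence outside the CDS range into 5' and 3' UTR intervals.

    exons: list of (start, end) tuples, 1-based inclusive genomic coordinates.
    cds_start, cds_end: leftmost and rightmost genomic CDS positions.
    Returns (utr5_intervals, utr3_intervals).
    """
    left, right = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Exonic bases to the left of the CDS
            left.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Exonic bases to the right of the CDS
            right.append((max(start, cds_end + 1), end))
    # On the minus strand the 5' end of the transcript is on the right.
    return (left, right) if strand == "+" else (right, left)


exons = [(100, 200), (300, 400), (500, 600)]
utr5, utr3 = split_utrs(exons, cds_start=150, cds_end=550, strand="+")
print(utr5, utr3)  # [(100, 149)] [(551, 600)]
```

The strand only decides which side is labelled 5' and which 3'; the interval arithmetic itself is identical for both strands.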
uORF¶
A uORF (upstream open reading frame) is an open reading frame (ORF) within the 5’ untranslated region (5’ UTR) of an mRNA. uORFs can regulate eukaryotic gene expression, typically by repressing translation of the downstream primary ORF.
GenoKit uORF -h

# usage: GenoKit uORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                     [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
#                     [-s {GFF,GTF}]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -i TRANSCRIPT, --transcript TRANSCRIPT
#                         specific transcript id; if not given, whole transcript
#                         will return
#   -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
#                         output format
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -m, --schematic_without_intron
#                         schematic figure file for uORF, CDS and transcript
#                         without intron
#   -n, --schematic_with_intron
#                         schematic figure file for uORF, CDS and transcript
#                         with intron
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database or GFF database
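As an illustration of what a uORF search involves, the sketch below scans a 5’ UTR sequence for ATG codons and follows each reading frame to the first in-frame stop codon. It is a simplified stand-in, not GenoKit's algorithm: it assumes only ORFs that start and end inside the UTR are of interest, and the function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf_sequence) for every ATG...stop ORF inside seq.

    Positions are 0-based, end-exclusive offsets into seq.
    ORFs without an in-frame stop inside seq are ignored (an assumption).
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon to the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                break
    return orfs


utr5 = "CCATGAAATAGGGATGCCCTGACC"  # made-up example sequence
for orf in find_orfs(utr5):
    print(orf)
# (2, 11, 'ATGAAATAG')
# (13, 22, 'ATGCCCTGA')
```

The same scan applied to a 3’ UTR sequence would yield dORF candidates.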
CDS¶
The CDS (coding sequence) is the portion of a gene’s DNA or RNA that codes for protein.
GenoKit CDS -h

# usage: GenoKit CDS [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                    [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
#                    [-s {GFF,GTF}]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -i TRANSCRIPT, --transcript TRANSCRIPT
#                         specific transcript id; if not given, whole transcript
#                         will return
#   -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
#                         output format
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -m, --schematic_without_intron
#                         schematic figure file for uORF, CDS and transcript
#                         without intron
#   -n, --schematic_with_intron
#                         schematic figure file for uORF, CDS and transcript
#                         with intron
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database or GFF database
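A coding sequence is usually annotated as several genomic segments that have to be joined, and reverse-complemented for genes on the minus strand. The sketch below shows that step on a toy chromosome; it is not GenoKit's implementation. Coordinates are assumed to be 1-based and inclusive, and the helper names `reverse_complement` and `spliced_sequence` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, segments, strand):
    """Join genomic segments (1-based inclusive) into one transcript-oriented sequence."""
    # Slice each segment in genomic order; convert 1-based inclusive to Python slices.
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(segments)]
    joined = "".join(pieces)
    # Minus-strand features read 5'->3' on the opposite strand.
    return reverse_complement(joined) if strand == "-" else joined


chrom = "GGATGAAACCCCTTTTAAGG"          # toy chromosome
cds_segments = [(3, 8), (13, 18)]       # two made-up CDS segments
print(spliced_sequence(chrom, cds_segments, "+"))  # ATGAAATTTTAA
print(spliced_sequence(chrom, cds_segments, "-"))  # TTAAAATTTCAT
```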
dORF¶
A dORF (downstream open reading frame) is an open reading frame within the 3’ untranslated region (3’ UTR) of an mRNA.
GenoKit dORF -h
# usage: GenoKit dORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                     [-t {csv,fasta,gff}] [-o OUTPUT]
#                     [-m SCHEMATIC_WITHOUT_INTRON]
#                     [-n SCHEMATIC_WITH_INTRON] [-s {GFF,GTF}]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -i TRANSCRIPT, --transcript TRANSCRIPT
#                         specific transcript id; if not given, whole transcript
#                         will return
#   -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
#                         output format
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -m SCHEMATIC_WITHOUT_INTRON, --schematic_without_intron SCHEMATIC_WITHOUT_INTRON
#                         schematic figure file for dORF, CDS and transcript
#                         without intron
#   -n SCHEMATIC_WITH_INTRON, --schematic_with_intron SCHEMATIC_WITH_INTRON
#                         schematic figure file for dORF, CDS and transcript
#                         with intron
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database or GFF database
cDNA/mRNA¶
In genetics, complementary DNA (cDNA) is DNA synthesized from a single-stranded RNA (e.g., messenger RNA (mRNA) or microRNA (miRNA)) template in a reaction catalyzed by the enzyme reverse transcriptase.
GenoKit cdna -h
# usage: GenoKit cdna [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                     [-o OUTPUT] [-t {csv,fasta}] [-p] [-u]
#                     [-s {GFF,GTF}]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -i TRANSCRIPT, --transcript TRANSCRIPT
#                         specific transcript; if not given, return whole
#                         transcripts
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -t {csv,fasta}, --output_format {csv,fasta}
#                         output format
#   -p, --print           output to stdout
#   -u, --upper           upper CDS and lower utr
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database or GFF database
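The -u option is described as "upper CDS and lower utr", that is, a transcript sequence in which letter case marks the coding region. A minimal sketch of such formatting follows; the function name `case_by_region` and the use of 0-based, end-exclusive CDS offsets within the spliced transcript are assumptions for illustration, not GenoKit's internals.

```python
def case_by_region(mrna, cds_start, cds_end):
    """Lower-case the UTRs and upper-case the CDS of a spliced transcript.

    cds_start and cds_end are 0-based, end-exclusive offsets into mrna.
    """
    if not 0 <= cds_start <= cds_end <= len(mrna):
        raise ValueError("CDS offsets must lie within the transcript")
    return (
        mrna[:cds_start].lower()            # 5' UTR
        + mrna[cds_start:cds_end].upper()   # coding sequence
        + mrna[cds_end:].lower()            # 3' UTR
    )


print(case_by_region("GGCATGAAATAGCC", 3, 12))  # ggcATGAAATAGcc
```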
Gene¶
The gene command extracts the genomic sequence of a gene.
GenoKit gene -h
# usage: GenoKit gene [-h] -d DATABASE -f GENOME [-g GENE] [-o OUTPUT]
#                     [-p]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -g GENE, --gene GENE  specific gene; if not given, return whole genes
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -p, --print           output to stdout
Exon¶
An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing.
GenoKit exon -h
# usage: GenoKit exon [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                     [-o OUTPUT] [-p] [-s {GFF,GTF}]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -i TRANSCRIPT, --transcript TRANSCRIPT
#                         specific transcript id; needed
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -p, --print           output to stdout
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database or GFF database
Intron¶
An intron is any nucleotide sequence within a gene that is removed by RNA processing during production of the final RNA product.
GenoKit intron -h
# usage: GenoKit intron [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
#                       [-o OUTPUT] [-p] [-s {GFF,GTF}]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -i TRANSCRIPT, --transcript TRANSCRIPT
#                         specific transcript id; needed
#   -o OUTPUT, --output OUTPUT
#                         output file path
#   -p, --print           output to stdout
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database or GFF database
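Annotation files often list exons but not introns, in which case the introns of a transcript can be inferred as the gaps between consecutive exons. The sketch below illustrates this with 1-based inclusive coordinates; the function name `introns_from_exons` is hypothetical, and this is not necessarily how GenoKit derives introns.

```python
def introns_from_exons(exons):
    """Return intron intervals as the gaps between consecutive exons.

    exons: list of (start, end) tuples, 1-based inclusive, for one transcript.
    """
    ordered = sorted(exons)
    introns = []
    # Pair each exon with the next one and take the gap between them.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


print(introns_from_exons([(100, 200), (300, 400), (500, 600)]))
# [(201, 299), (401, 499)]
```

A transcript with n exons yields at most n - 1 introns, so a single-exon transcript returns an empty list.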
IGR¶
An IGR (intergenic region) is a stretch of DNA located between genes.
GenoKit IGR -h
# usage: GenoKit IGR [-h] -d DATABASE -f GENOME [-l IGR_LENGTH]
#                    [-o OUTPUT] [-p] [-s {GFF,GTF}]

# optional arguments:
#   -h, --help            show this help message and exit
#   -d DATABASE, --database DATABASE
#                         database generated by subcommand create
#   -f GENOME, --genome GENOME
#                         genome fasta
#   -l IGR_LENGTH, --IGR_length IGR_LENGTH
#                         IGR length threshold
#   -o OUTPUT, --output OUTPUT
#                         output fasta file path
#   -p, --print           output to stdout
#   -s {GFF,GTF}, --style {GFF,GTF}
#                         GTF database only contain protein genes, while GFF
#                         database contain protein genes and nocoding genes
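Intergenic regions can be computed by walking along the genes of one chromosome in coordinate order and reporting the gaps between them. The sketch below does this while tolerating overlapping genes. It is an illustration, not GenoKit's implementation: the function name `intergenic_regions` is hypothetical, coordinates are 1-based inclusive, and the length threshold is assumed to be a minimum length, which the help text does not state.

```python
def intergenic_regions(genes, min_length=1):
    """Return gaps between genes on one chromosome, 1-based inclusive.

    genes: list of (start, end) tuples; overlapping genes are allowed.
    min_length: assumed meaning of the length threshold (shorter gaps are dropped).
    """
    regions = []
    covered_end = None  # rightmost position covered by any gene seen so far
    for start, end in sorted(genes):
        if covered_end is not None and start - covered_end - 1 >= min_length:
            regions.append((covered_end + 1, start - 1))
        covered_end = end if covered_end is None else max(covered_end, end)
    return regions


genes = [(100, 200), (150, 300), (500, 600), (605, 700)]
print(intergenic_regions(genes, min_length=10))  # [(301, 499)]
print(intergenic_regions(genes))                 # [(301, 499), (601, 604)]
```

Tracking the rightmost covered position, rather than only the previous gene's end, keeps nested or overlapping genes from producing spurious gaps.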
GenoKit depends on the following packages (conda package names in parentheses):
python >= 3.7.6 (python)
pandas >= 1.2.4 (pandas)
gffutils >= 0.10.1 (gffutils)
setuptools >= 49.2.0 (setuptools)
Biopython >= 1.78 (biopython)
Install them all with conda:
conda install --channel conda-forge --channel bioconda python pandas gffutils setuptools biopython
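After installation, the following sketch reports which of the listed dependencies are visible to the current Python interpreter. It uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the package names listed above.

```python
from importlib import metadata

# Distribution names taken from the dependency list above.
REQUIRED = ["pandas", "gffutils", "setuptools", "biopython"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```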