Tutorial
============

This tutorial introduces the command-line usage and workflow of GenoKit. GenoKit is designed for GFF and GTF files, whereas GenBankExtract is suited to GenBank files.

`GenoKit` is tested with Python 3.8.

Quick-start
-----------

Locate the GenoKit executable on the command line and list the commands it provides.

.. code-block:: bash
    :linenos:

    # find the GenoKit exec path
    which GenoKit
    # /opt/anaconda3/bin/GenoKit

    # show help info
    GenoKit -h
    #    ______                 __ __ _ __
    #   / ____/__  ____  ____  / //_/(_) /_
    #  / / __/ _ \/ __ \/ __ \/ ,<  / / __/
    # / /_/ /  __/ / / / /_/ / /| |/ / /_
    # \____/\___/_/ /_/\____/_/ |_/_/\__/
    #
    # GenoKit - Genome and Gene ToolKit v0.2.6
    # Contact: Sitao Zhu
    #
    # Usage:
    #   GenoKit [parameters]
    #
    # Database:
    #   create        Create GFF/GTF database
    #   stat          Database statistics
    #
    # Extract:
    #   gene          Extract gene sequence
    #   mrna          Extract mRNA sequence
    #   transcript    Extract transcript sequence
    #   exon          Extract exon sequence
    #   intron        Extract intron sequence
    #   cds           Extract CDS sequence
    #   utr           Extract 5'/3'UTR sequence
    #   uorf          Extract uORF sequence
    #   dorf          Extract dORF sequence
    #   promoter      Extract promoter sequence
    #   terminator    Extract terminator sequence
    #   igr           Extract intergenic region
    #
    # Design:
    #   primer        Primer design
    #   sgrna         Single guide RNA design
    #   sirna         Small interfering RNA design
    #   motif         Motif search
    #
    # Visualize:
    #   vision        Snapshot gene structure
    #   circos        Circlize genome structure
    #

Main functional commands
------------------------

This section introduces the commands provided by the GenoKit module, how to get their help information, and how to use them. Each subcommand prints its help text when it is invoked with `-h` on the command line.

Create
~~~~~~

The create command uses gffutils to build a SQLite database, which stores the genome and gene structure needed by the other functional commands in GenoKit.

.. code-block:: bash
    :linenos:

    GenoKit create -h
    # usage: GenoKit create [-h] [-f {GFF,GTF}] -g GENOMEFEATURE -o OUTPUT_PREFIX
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -f {GFF,GTF}, --file_type {GFF,GTF}
    #                         genome annotation file
    #   -g GENOMEFEATURE, --genomefeature GENOMEFEATURE
    #                         genome annotation file
    #   -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
    #                         database absolute path

    # GFF
    GenoKit create -f GFF -g ath.gff3 -o ath
    # or GTF
    GenoKit create -f GTF -g ath.gtf -o ath

Promoter
~~~~~~~~

Promoter sequences are essential for gene functional research. The promoter command extracts the region around the transcription start site (TSS): `-l` sets the promoter length before the TSS and `-u` sets the length of 5' UTR included after the TSS.

.. code-block:: bash
    :linenos:

    GenoKit promoter -h
    # usage: GenoKit promoter [-h] -d DATABASE -f GENOME [-g GENE]
    #                         [-l PROMOTER_LENGTH] [-u UTR5_UPPER_LENGTH]
    #                         [-o OUTPUT] [-t {csv,fasta}] [-p]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta path
    #   -g GENE, --gene GENE  specific gene; if not given, return whole genes
    #   -l PROMOTER_LENGTH, --promoter_length PROMOTER_LENGTH
    #                         promoter length before TSS (default 100 nt)
    #   -u UTR5_UPPER_LENGTH, --utr5_upper_length UTR5_UPPER_LENGTH
    #                         5' utr length after TSS (default 10 nt)
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -t {csv,fasta}, --output_format {csv,fasta}
    #                         output format
    #   -p, --print           output to stdout

    # All genes in the whole genome
    GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -o promoter.fasta --output_format fasta
    # A given gene
    GenoKit promoter -d ath.GFF -f ath.fa -l 200 -u 100 -g AT1G01010 -p --output_format fasta

UTR
~~~

UTR (untranslated region) sequences are essential for gene functional research. In molecular genetics, a UTR refers to either of two sections, one on each side of a coding sequence on a strand of mRNA. The section on the 5' side is called the 5' UTR (or leader sequence), and the section on the 3' side is called the 3' UTR (or trailer sequence).

.. code-block:: bash
    :linenos:

    GenoKit UTR -h
    # usage: GenoKit UTR [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                    [-o OUTPUT] [-p] [-s {GFF,GTF}]
    #
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta file
    #   -i TRANSCRIPT, --transcript TRANSCRIPT
    #                         specific transcript id; if not given, whole transcript
    #                         will return
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -p, --print           output to stdout
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database or GFF database
    #

    # All transcripts in the whole genome
    GenoKit UTR -d ath.GFF -f ath.fa -o utr.csv -s GFF
    # A given transcript
    GenoKit UTR -d ath.GFF -f ath.fa -i AT1G01010.1 -p -s GFF

uORF
~~~~~

A uORF (upstream open reading frame) is an open reading frame (ORF) within the 5' untranslated region (5' UTR) of an mRNA. uORFs can regulate eukaryotic gene expression and repress downstream expression of the primary ORF.

.. code-block:: bash
    :linenos:

    GenoKit uORF -h
    # usage: GenoKit uORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                     [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
    #                     [-s {GFF,GTF}]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -i TRANSCRIPT, --transcript TRANSCRIPT
    #                         specific transcript id; if not given, whole transcript
    #                         will return
    #   -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
    #                         output format
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -m, --schematic_without_intron
    #                         schematic figure file for uORF, CDS and transcript
    #                         without intron
    #   -n, --schematic_with_intron
    #                         schematic figure file for uORF, CDS and transcript
    #                         with intron
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database or GFF database

CDS
~~~~

The CDS (coding sequence) is the portion of a gene's DNA or RNA that codes for protein.

.. code-block:: bash
    :linenos:

    GenoKit CDS -h
    # usage: GenoKit CDS [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                    [-t {csv,fasta,gff}] [-o OUTPUT] [-m] [-n]
    #                    [-s {GFF,GTF}]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -i TRANSCRIPT, --transcript TRANSCRIPT
    #                         specific transcript id; if not given, whole transcript
    #                         will return
    #   -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
    #                         output format
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -m, --schematic_without_intron
    #                         schematic figure file for uORF, CDS and transcript
    #                         without intron
    #   -n, --schematic_with_intron
    #                         schematic figure file for uORF, CDS and transcript
    #                         with intron
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database or GFF database

dORF
~~~~

A dORF (downstream open reading frame) is an open reading frame within the 3' untranslated region (3' UTR) of an mRNA.

.. code-block:: bash
    :linenos:

    GenoKit dORF -h
    # usage: GenoKit dORF [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                     [-t {csv,fasta,gff}] [-o OUTPUT]
    #                     [-m SCHEMATIC_WITHOUT_INTRON]
    #                     [-n SCHEMATIC_WITH_INTRON] [-s {GFF,GTF}]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -i TRANSCRIPT, --transcript TRANSCRIPT
    #                         specific transcript id; if not given, whole transcript
    #                         will return
    #   -t {csv,fasta,gff}, --output_format {csv,fasta,gff}
    #                         output format
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -m SCHEMATIC_WITHOUT_INTRON, --schematic_without_intron SCHEMATIC_WITHOUT_INTRON
    #                         schematic figure file for dORF, CDS and transcript
    #                         without intron
    #   -n SCHEMATIC_WITH_INTRON, --schematic_with_intron SCHEMATIC_WITH_INTRON
    #                         schematic figure file for dORF, CDS and transcript
    #                         with intron
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database or GFF database

cDNA/mRNA
~~~~~~~~~

In genetics, complementary DNA (cDNA) is DNA synthesized from a single-stranded RNA (e.g., messenger RNA (mRNA) or microRNA (miRNA)) template in a reaction catalyzed by the enzyme reverse transcriptase.

.. code-block:: bash
    :linenos:

    GenoKit cdna -h
    # usage: GenoKit cdna [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                     [-o OUTPUT] [-t {csv,fasta}] [-p] [-u]
    #                     [-s {GFF,GTF}]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -i TRANSCRIPT, --transcript TRANSCRIPT
    #                         specific transcript; if not given, return whole
    #                         transcripts
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -t {csv,fasta}, --output_format {csv,fasta}
    #                         output format
    #   -p, --print           output to stdout
    #   -u, --upper           upper CDS and lower utr
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database or GFF database

Gene
~~~~

The gene command extracts the sequence of a given gene, or of all genes when no gene is specified.

.. code-block:: bash
    :linenos:

    GenoKit gene -h
    # usage: GenoKit gene [-h] -d DATABASE -f GENOME [-g GENE] [-o OUTPUT]
    #                     [-p]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -g GENE, --gene GENE  specific gene; if not given, return whole genes
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -p, --print           output to stdout

Exon
~~~~

An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing.

.. code-block:: bash
    :linenos:

    GenoKit exon -h
    # usage: GenoKit exon [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                     [-o OUTPUT] [-p] [-s {GFF,GTF}]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -i TRANSCRIPT, --transcript TRANSCRIPT
    #                         specific transcript id; needed
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -p, --print           output to stdout
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database or GFF database

Intron
~~~~~~

An intron is any nucleotide sequence within a gene that is removed by RNA processing during production of the final RNA product.

.. code-block:: bash
    :linenos:

    GenoKit intron -h
    # usage: GenoKit intron [-h] -d DATABASE -f GENOME [-i TRANSCRIPT]
    #                       [-o OUTPUT] [-p] [-s {GFF,GTF}]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -i TRANSCRIPT, --transcript TRANSCRIPT
    #                         specific transcript id; needed
    #   -o OUTPUT, --output OUTPUT
    #                         output file path
    #   -p, --print           output to stdout
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database or GFF database

IGR
~~~

An IGR (intergenic region) is a stretch of DNA sequence located between genes.

.. code-block:: bash
    :linenos:

    GenoKit IGR -h
    # usage: GenoKit IGR [-h] -d DATABASE -f GENOME [-l IGR_LENGTH]
    #                    [-o OUTPUT] [-p] [-s {GFF,GTF}]
    # optional arguments:
    #   -h, --help            show this help message and exit
    #   -d DATABASE, --database DATABASE
    #                         database generated by subcommand create
    #   -f GENOME, --genome GENOME
    #                         genome fasta
    #   -l IGR_LENGTH, --IGR_length IGR_LENGTH
    #                         IGR length threshold
    #   -o OUTPUT, --output OUTPUT
    #                         output fasta file path
    #   -p, --print           output to stdout
    #   -s {GFF,GTF}, --style {GFF,GTF}
    #                         GTF database only contain protein genes, while GFF
    #                         database contain protein genes and nocoding genes

GenoKit requires the following packages:

* python >= 3.7.6
* pandas >= 1.2.4
* gffutils >= 0.10.1
* setuptools >= 49.2.0
* BioPython >= 1.78

Install them all with `conda`::

    conda install --channel conda-forge --channel python pandas gffutils setuptools biopython
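
The `-l` and `-u` options of the promoter command describe a window around the transcription start site (TSS). As a rough illustration only, the Python sketch below computes such a window from a gene's coordinates. It is not GenoKit's implementation: the function name `promoter_window`, the 1-based inclusive coordinates (as used in GFF), the use of the gene start (or the gene end on the minus strand) as the TSS, and the clamping to chromosome bounds are all assumptions made for this example.

```python
def promoter_window(gene_start, gene_end, strand,
                    promoter_length=100, utr5_upper_length=10,
                    chrom_length=None):
    """Return a 1-based inclusive (start, end) window around the TSS.

    Hypothetical helper, not part of GenoKit. The window covers
    `promoter_length` nt upstream of the TSS plus `utr5_upper_length` nt
    starting at the TSS and running into the gene.
    """
    if strand == "+":
        tss = gene_start                      # assumed TSS on the plus strand
        start = tss - promoter_length         # upstream means smaller coordinates
        end = tss + utr5_upper_length - 1
    elif strand == "-":
        tss = gene_end                        # assumed TSS on the minus strand
        start = tss - utr5_upper_length + 1
        end = tss + promoter_length           # upstream means larger coordinates
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")

    # Clamp to the chromosome so genes near an end do not yield
    # coordinates outside the sequence.
    start = max(start, 1)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


# Illustrative coordinates with -l 200 -u 100:
print(promoter_window(3631, 5899, "+", 200, 100))   # (3431, 3730)
print(promoter_window(1000, 2000, "-", 200, 100))   # (1901, 2200)
```

Under these assumptions both windows are 300 nt long: 200 nt of promoter plus 100 nt beginning at the TSS. On the minus strand the extracted sequence would additionally need to be reverse-complemented, which this sketch leaves out.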