Commit 41f05f09 authored by wdyrka

a protein sequences example added

parent fde8973e
......@@ -20,6 +20,8 @@ To compile this repository, you will need package `libboost-program-options-dev`
# Usage instructions
## Help
Enter in a terminal:
bin/pcfg_scan --help
......@@ -30,6 +32,73 @@ or
to print the full list of program options.
## Scanning example
Directory `./example` contains files:
- `hets.lex.wcfg` (lexical part of grammar)
- `hets.struct.wcfg` (structural part of grammar)
- `hets.fasta` (example sequences to be parsed)
- `hets.cmap` (list of contact pairs)
- `hets-scan.conf` (configuration file used in this example)
- `hets-evolve.conf` (configuration file used in the next example)
- `hets-galib.conf` (configuration file also used in the next example)
Let's run the scanner on these files:
bin/pcfg_scan --conf ./example/hets-scan.conf --out trees.txt --tree
Here is the meaning of the options used above:
- `--conf ...` specifies the configuration file
- `--out ...` specifies the name of the output file
- `--tree` tells the scanner to print parse trees
The remaining options are loaded from the configuration file.
Every option except `--conf` can be set either on the command line or in the configuration file.
- `clist=...` specifies the file containing contact pairs
- `fasta=...` specifies the file containing input protein sequences
- `lex=...` specifies the file containing the lexical rules of the grammar
- `struct=...` specifies the file containing the structural part of the grammar
- `null=...` chooses the null model
- `winmax=...` sets the maximum window size
- `winmin=...` sets the minimum window size
- `outdir=...` specifies the directory where output file(s) are created
The scanner should then produce the output file `trees.txt` in directory `./example`.
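To inspect a configuration file before running the scanner, a few lines of Python are enough. The sketch below is not part of this repository. It assumes the `key=value` layout with `#` comments that the example configuration files in this commit use, and `parse_conf` is a hypothetical helper name.

```python
def parse_conf(text):
    """Parse `key=value` lines; text after `#` is treated as a comment (assumption)."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {raw!r}")
        options[key.strip()] = value.strip()
    return options


# A few lines in the style of ./example/hets-scan.conf
SAMPLE = """\
clist=./example/hets.cmap
fasta=./example/hets.fasta
null=p0
winmax=21
winmin=21
"""

options = parse_conf(SAMPLE)
window = (int(options["winmin"]), int(options["winmax"]))
print(window)
```

Values are kept as strings and converted only where needed, because the sketch does not know the type each option expects.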
## Learning example
Please read the scanning example first.
This example uses the same files as the previous one,
except that the configuration file is now `./example/hets-evolve.conf`.
Let's run the learning algorithm on these files:
bin/pcfg_evolve --conf ./example/hets-evolve.conf --seed 1571856417
Option `--seed ...` explicitly sets the seed of the pseudorandom number generator,
which is useful for reproducing results.
Some options loaded from the configuration file
were not present in the previous example:
- `obj=G_MX` specifies the objective function; here we are looking for the most likely grammar (G) given contacts (M) and sequences (X)
- `galib-conf=...` specifies the configuration file with options specific to the GAlib library
- `sharing-cutoff=...` specifies the sharing cutoff
- `grammar-flush-frequency=...` specifies how often the program saves the current best grammar to a file
`pcfg_evolve` should then produce the following output files in directory `./example`:
- `1571856417-galib.csv`
- `1571856417-50-0-lex.wcfg`
- `1571856417-50-0-struct.wcfg`
- `1571856417-final-0-lex.wcfg`
- `1571856417-final-0-struct.wcfg`
- `1571856417-galib.conf`
- `1571856417-pcfg_evolve.conf`
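When several runs write to the same directory, it can help to pair up the grammar files of one run. The sketch below is only an illustration. The pattern `<seed>-<stage>-<index>-<part>.wcfg` is inferred from the file names listed above and is not documented behaviour, and the meaning of the `stage` and `index` fields is an assumption.

```python
import re

NAMES = [
    "1571856417-galib.csv",
    "1571856417-50-0-lex.wcfg",
    "1571856417-50-0-struct.wcfg",
    "1571856417-final-0-lex.wcfg",
    "1571856417-final-0-struct.wcfg",
    "1571856417-galib.conf",
    "1571856417-pcfg_evolve.conf",
]

# Pattern inferred from the names above (assumption, not a documented format).
PATTERN = re.compile(
    r"^(?P<seed>\d+)-(?P<stage>\w+)-(?P<index>\d+)-(?P<part>lex|struct)\.wcfg$"
)


def grammar_files(names):
    """Group lex/struct grammar files by (stage, index)."""
    found = {}
    for name in names:
        match = PATTERN.match(name)
        if match:
            key = (match["stage"], match["index"])
            found.setdefault(key, {})[match["part"]] = name
    return found


files = grammar_files(NAMES)
print(files[("final", "0")])
```

Files that do not match the pattern, such as the `.csv` and `.conf` outputs, are skipped.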
## Common options
### `--clist arg`
......@@ -83,7 +152,7 @@ Default value effectively places no user-defined upper bound on number of thread
Switch to Viterbi parsing.
By default, probabilities are added, which corresponds to the Baum-Welch mode of operation and gives the overall probability of the sequence under the grammar. When this option is enabled, probabilities are combined during parsing with the `max` function, giving the probability of the most likely derivation. Notably, option `--tree` in `pcfg_scan` operates as if this option was set, among other things.
By default, probabilities are added, which corresponds to the Baum-Welch mode of operation and gives the overall probability of the sequence under the grammar. When this option is enabled, probabilities are combined during parsing with the `max` function, giving the probability of the most likely derivation. Notably, option `--tree` in `pcfg_scan` always operates as if this option was set.
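The difference between adding probabilities and taking their maximum can be seen on a toy grammar. The sketch below is a minimal CYK-style chart parser, not the implementation used in this repository. The grammar (`S -> S S` with probability 0.3, `S -> a` with probability 0.7) and the function name `parse_score` are made up for illustration.

```python
from collections import defaultdict

# Toy grammar (hypothetical): S -> 'a' with 0.7, S -> S S with 0.3
LEXICAL = {("S", "a"): 0.7}
BINARY = {("S", "S", "S"): 0.3}


def parse_score(seq, combine):
    """Fill a CYK chart, merging alternative derivations with `combine`."""
    n = len(seq)
    chart = defaultdict(float)  # (start, end, symbol) -> score
    for i, ch in enumerate(seq):
        for (lhs, term), p in LEXICAL.items():
            if term == ch:
                chart[i, i + 1, lhs] = combine(chart[i, i + 1, lhs], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for (lhs, left, right), p in BINARY.items():
                    score = p * chart[i, k, left] * chart[k, j, right]
                    chart[i, j, lhs] = combine(chart[i, j, lhs], score)
    return chart[0, n, "S"]


total = parse_score("aaa", lambda a, b: a + b)  # sum over all derivations
best = parse_score("aaa", max)                  # most likely derivation only
print(total, best)
```

The string `aaa` has two derivations under this grammar (left-branching and right-branching), each with probability 0.3 · 0.3 · 0.7³, so the summed score is twice the `max` score.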
## Options specific for `pcfg_scan`
......
clist=./example/hets.cmap
fasta=./example/hets.fasta
lex=./example/hets.lex.wcfg
struct=./example/hets.struct.wcfg
outdir=./example
obj=G_MX
galib-conf=./example/hets-galib.conf
sharing-cutoff=20.0 # 1.0 - default
grammar-flush-frequency=50 # 0 (only after finish) - default
# sample settings for GAlib applications
# GAlib expects parameters in name-value pairs. The name should be a single
# string (no whitespace allowed). The value should be of a type appropriate
# for the named parameter. Anything after a # character will be ignored.
# The file must end with a blank line. If you specify parameters that depend
# on other parameters, the last parameter will override the first or act on
# data modified by the first (the parameters are applied in the order they
# are listed).
minimaxi -1
number_of_generations 100
convergence_percentage 1.001
generations_to_convergence 10
crossover_probability 0.90
mutation_probability 0.01
population_size 100
replacement_percentage 0.50
number_of_best 1
score_frequency 1
flush_frequency 100
select_scores 31
record_diversity 1
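The comment block at the top of the GAlib settings describes the format: whitespace-separated name-value pairs, `#` comments, and later entries overriding earlier ones. A minimal reader following that description could look like the sketch below. `parse_galib_settings` is a hypothetical helper, not a GAlib function, and values stay as strings because the expected type depends on the parameter.

```python
def parse_galib_settings(text):
    """Read `name value` pairs; later entries override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split(None, 1)  # name has no whitespace
        if len(parts) != 2:
            raise ValueError(f"expected 'name value', got: {raw!r}")
        name, value = parts
        settings[name] = value.strip()
    return settings


SAMPLE = """\
# sample settings for GAlib applications
number_of_generations 100
crossover_probability 0.90
mutation_probability 0.01
population_size 100
"""

settings = parse_galib_settings(SAMPLE)
population = int(settings["population_size"])
mutation = float(settings["mutation_probability"])
print(population, mutation)
```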
clist=./example/hets.cmap
fasta=./example/hets.fasta
lex=./example/hets.lex.wcfg
struct=./example/hets.struct.wcfg
outdir=./example
null=p0
winmax=21
winmin=21
>jr1_Melbi2_547107_e_gw1.
8 # 16
10 # 14
17 # 21
>jr1_Aurpu_var_mel1_81883
8 # 16
10 # 14
17 # 21
>jr1_Aaoar1_429219_fgenes
8 # 16
10 # 14
17 # 21
>jr1_Antav1_424285_fgenes
8 # 16
10 # 14
17 # 21
>r2_682456863_gb_KFZ14600
8 # 16
10 # 14
17 # 21
>jr2_Amore1_33010_fgenesh
8 # 16
10 # 14
17 # 21
>jr2_Aspsy1_33517_fgenesh
8 # 16
10 # 14
17 # 21
>jr1_Thihy1_493338_fgenes
8 # 16
10 # 14
17 # 21
>jr2_Aciri1_iso_159232_fg
8 # 16
10 # 14
17 # 21
>jr2_Clagr3_6461_CLAGR_00
8 # 16
10 # 14
17 # 21
>jr1_Thiar1_783280_fgenes
8 # 16
10 # 14
17 # 21
>r2_596672318_ref_XP_0072
8 # 16
10 # 14
17 # 21
>jr2_Cadsp1_662565_estExt
8 # 16
10 # 14
17 # 21
>jr2_Coclu2_34207_gm1.295
8 # 16
10 # 14
17 # 21
>jr2_Cocsa1_192976_estExt
8 # 16
10 # 14
17 # 21
>jr2_Gloci1_1938808_MIX33
8 # 16
10 # 14
17 # 21
>jr1_Necha2_48298_e_gw1.2
8 # 16
10 # 14
17 # 21
>jr1_TriviGv29_8_2_28501_
8 # 16
10 # 14
17 # 21
>r1_671156951_ref_XP_0087
8 # 16
10 # 14
17 # 21
>jr2_Clafu1_191299_scf718
8 # 16
10 # 14
17 # 21
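The contact list above appears to use a FASTA-like layout: a `>` header naming the sequence, followed by one contact pair per line with the two positions separated by `#`. The sketch below reads that layout under exactly this assumption. `parse_cmap` is a hypothetical helper, and the sketch makes no claim about whether positions are 0-based or 1-based.

```python
def parse_cmap(text):
    """Return {sequence_id: [(pos_a, pos_b), ...]} from a contact-pair listing."""
    contacts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            contacts[current] = []
            continue
        if current is None:
            raise ValueError("contact pair found before any '>' header")
        left, sep, right = line.partition("#")
        if not sep:
            raise ValueError(f"expected 'i # j', got: {raw!r}")
        contacts[current].append((int(left), int(right)))
    return contacts


# First two records of ./example/hets.cmap
SAMPLE = """\
>jr1_Melbi2_547107_e_gw1.
8 # 16
10 # 14
17 # 21
>jr1_Aurpu_var_mel1_81883
8 # 16
10 # 14
17 # 21
"""

contacts = parse_cmap(SAMPLE)
print(contacts["jr1_Melbi2_547107_e_gw1."])
```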
>jr1_Melbi2_547107_e_gw1.
TGHTYKYLEASNEARMLAGDL
>jr1_Aurpu_var_mel1_81883
TNQDIGNVTIAELGYGAVGIS
>jr1_Aaoar1_429219_fgenes
AGTTIKYAEAMEDSRQLFGQI
>jr1_Antav1_424285_fgenes
TSHRIHDQTVTDNARVQVGHT
>r2_682456863_gb_KFZ14600
LSHTYDGVQVDVSGKALLGNS
>jr2_Amore1_33010_fgenesh
ASHTYDGVEVENNGKALIGNK
>jr2_Aspsy1_33517_fgenesh
PGNVYSGIHISGETRVRNGTN
>jr1_Thihy1_493338_fgenes
TGQKFGAMRTDNESIAMQGIV
>jr2_Aciri1_iso_159232_fg
KNRSFDNVKITGDARVRFDDT
>jr2_Clagr3_6461_CLAGR_00
SANTFDVLIAQDRARQMAGSI
>jr1_Thiar1_783280_fgenes
VTNVAENIKVGQEARAHVGNV
>r2_596672318_ref_XP_0072
PGHSYGVTIITGGTKLIQGDS
>jr2_Cadsp1_662565_estExt
ISQDISDVSADNRGFVIAGVA
>jr2_Coclu2_34207_gm1.295
APHVYEQIILEDNGNIQIGNK
>jr2_Cocsa1_192976_estExt
KNHSYDGNEANNETRAVYGNI
>jr2_Gloci1_1938808_MIX33
PGSLYEKNEASGDVTVHYGDA
>jr1_Necha2_48298_e_gw1.2
VRNYVREIQGEENAKVRLGND
>jr1_TriviGv29_8_2_28501_
GKNSARNVTTEDKVRFHVGNV
>r1_671156951_ref_XP_0087
DGHVFHNNKIGGRARVAQGDL
>jr2_Clafu1_191299_scf718
GEHTYDGMYTSGTARALYGNK
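Every sequence above has 21 residues, which matches `winmin=21` and `winmax=21` in the scanning configuration. The sketch below reads FASTA records and counts how many windows of a given size fit into each sequence. It assumes the window slides one residue at a time, which this commit does not state, and `parse_fasta` and `window_count` are hypothetical helper names.

```python
def parse_fasta(text):
    """Return {header: sequence}; sequence lines of one record are concatenated."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[name] += line
    return records


def window_count(length, size):
    """Number of windows of `size` in a sequence, sliding by one (assumption)."""
    return max(0, length - size + 1)


# First two records of ./example/hets.fasta
SAMPLE = """\
>jr1_Melbi2_547107_e_gw1.
TGHTYKYLEASNEARMLAGDL
>jr1_Aurpu_var_mel1_81883
TNQDIGNVTIAELGYGAVGIS
"""

records = parse_fasta(SAMPLE)
for name, seq in records.items():
    print(name, len(seq), window_count(len(seq), 21))
```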
grammar_type WCFG
start_symbol -1
0 0.000188846 A
0 0.130849 R
0 0.205679 N
0 0.0704233 D
0 0.00655031 C
0 0.0654449 Q
0 0.0509114 E
0 0.000337493 G
0 0.0487234 H
0 0.038263 I
0 0.00587046 L
0 0.126179 K
0 0.0175833 M
0 0.0132206 F
0 0.00528536 P
0 0.0608073 S
0 0.112964 T
0 0.00120025 W
0 0.0248503 Y
0 0.0146684 V
1 0.0125018 A
1 0.0121747 R
1 0.0157071 N
1 0.137128 D
1 0.000101067 C
1 0.000442223 Q
1 0.136029 E
1 0.603976 G
1 0.00528373 H
1 0.000185778 I
1 0.0028782 L
1 0.0400038 K
1 0.00187471 M
1 0.000208914 F
1 9.80061e-05 P
1 0.0250313 S
1 8.7697e-05 T
1 0.000179377 W
1 0.00427448 Y
1 0.00183305 V
2 0.197024 A
2 0.0134679 R
2 0.00012841 N
2 1.57631e-05 D
2 4.13874e-05 C
2 0.0645218 Q
2 0.00920857 E
2 0.00123783 G
2 0.0206294 H
2 0.125664 I
2 0.0568829 L
2 0.00336953 K
2 0.0239606 M
2 0.0407201 F
2 0.0083922 P
2 0.112916 S
2 0.0774552 T
2 0.000766533 W
2 0.0220751 Y
2 0.221522 V
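The lexical rules above seem to follow the layout `nonterminal weight terminal`, preceded by two header lines. Under that assumption, the sketch below parses the rules and adds up the weights for each nonterminal. It embeds the twenty rules of nonterminal `0` copied from the listing above, and their weights add up to a value close to 1. `parse_lex` is a hypothetical helper, not code from this repository.

```python
from collections import defaultdict

# Header plus the rules for nonterminal 0, copied from hets.lex.wcfg above.
LEX_SAMPLE = """\
grammar_type WCFG
start_symbol -1
0 0.000188846 A
0 0.130849 R
0 0.205679 N
0 0.0704233 D
0 0.00655031 C
0 0.0654449 Q
0 0.0509114 E
0 0.000337493 G
0 0.0487234 H
0 0.038263 I
0 0.00587046 L
0 0.126179 K
0 0.0175833 M
0 0.0132206 F
0 0.00528536 P
0 0.0608073 S
0 0.112964 T
0 0.00120025 W
0 0.0248503 Y
0 0.0146684 V
"""


def parse_lex(text):
    """Return {nonterminal: {terminal: weight}}; other lines are skipped."""
    rules = defaultdict(dict)
    for raw in text.splitlines():
        fields = raw.split()
        if len(fields) != 3:
            continue  # header such as `grammar_type WCFG`, or a blank line
        lhs, weight, terminal = fields
        rules[int(lhs)][terminal] = float(weight)
    return dict(rules)


lex = parse_lex(LEX_SAMPLE)
totals = {lhs: sum(emit.values()) for lhs, emit in lex.items()}
for lhs, total in totals.items():
    print(f"nonterminal {lhs}: {len(lex[lhs])} rules, total weight {total:.6f}")
```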
grammar_type WCFG
start_symbol 6
3 0.0122553 0 0
3 0.161678 0 1
3 0.000161446 0 2
3 8.81021e-05 0 3
3 5.22001e-05 0 3 0
3 2.98799e-05 0 3 1
3 8.69527e-05 0 3 2
3 5.92391e-05 0 4
3 3.17688e-05 0 4 0
3 4.24844e-05 0 4 1
3 5.55965e-05 0 4 2
3 0.206586 0 5
3 5.36384e-05 0 5 0
3 5.86876e-05 0 5 1
3 6.83004e-05 0 5 2
3 2.51488e-05 0 6
3 7.92518e-05 0 6 0
3 3.9029e-05 0 6 1
3 9.27906e-05 0 6 2
3 0.0102947 1 0
3 0.0547063 1 1
3 0.000104229 1 2
3 3.91791e-05 1 3
3 0.000159919 1 3 0
3 6.41021e-05 1 3 1
3 8.50809e-06 1 3 2
3 7.82215e-05 1 4
3 6.63982e-05 1 4 0
3 0.000171717 1 4 1
3 6.63271e-05 1 4 2
3 0.00405488 1 5
3 0.000264838 1 5 0
3 0.000139908 1 5 1
3 9.53751e-05 1 5 2
3 5.23184e-05 1 6
3 9.31212e-05 1 6 0
3 7.60322e-05 1 6 1
3 3.85582e-06 1 6 2
3 0.00105761 2 0
3 0.471872 2 1
3 0.00357874 2 2
3 0.00017424 2 3
3 5.19367e-05 2 3 0
3 2.14221e-05 2 3 1
3 8.86922e-05 2 3 2
3 9.48978e-05 2 4
3 4.2888e-05 2 4 0
3 2.19993e-05 2 4 1
3 7.09823e-05 2 4 2
3 0.0680698 2 5
3 0.000415331 2 5 0
3 7.24478e-05 2 5 1
3 8.50486e-05 2 5 2
3 0.000123846 2 6
3 5.29662e-05 2 6 0
3 0.000124928 2 6 1
3 9.14911e-05 2 6 2
3 3.51281e-05 3 0
3 3.59058e-05 3 1
3 1.91865e-05 3 2
3 4.45254e-05 3 3
3 9.20728e-05 3 4
3 1.77571e-05 3 5
3 2.59357e-05 3 6
3 9.19769e-05 4 0
3 9.73515e-05 4 1
3 5.39217e-05 4 2
3 8.84006e-05 4 3
3 3.96079e-05 4 4
3 9.66695e-05 4 5
3 1.64051e-05 4 6
3 5.2202e-05 5 0
3 2.04238e-05 5 1
3 0.000323671 5 2
3 2.57338e-05 5 3
3 5.96313e-06 5 4
3 8.65642e-05 5 5
3 0.000127382 5 6
3 9.02373e-05 6 0
3 4.12456e-05 6 1
3 7.20411e-05 6 2
3 4.70181e-05 6 3
3 3.59921e-05 6 4
3 5.87697e-05 6 5
3 6.23437e-05 6 6
4 7.80384e-05 0 0
4 9.28619e-05 0 1
4 5.30791e-05 0 2
4 0.0114091 0 3
4 6.7926e-05 0 3 0
4 8.4787e-05 0 3 1
4 6.91042e-05 0 3 2
4 0.000130315 0 4
4 6.3045e-05 0 4 0
4 1.31105e-05 0 4 1
4 8.07702e-05 0 4 2
4 0.000101509 0 5
4 7.2073e-05 0 5 0
4 2.91655e-05 0 5 1
4 7.10942e-05 0 5 2
4 8.64632e-05 0 6
4 0.000104096 0 6 0
4 9.62928e-05 0 6 1
4 3.487e-05 0 6 2
4 8.24413e-05 1 0
4 1.91009e-05 1 1
4 0.000169333 1 2
4 0.0116987 1 3
4 3.14578e-05 1 3 0
4 5.5318e-05 1 3 1
4 0.000111517 1 3 2
4 0.000197289 1 4
4 0.00022405 1 4 0
4 3.62308e-05 1 4 1
4 0.000100918 1 4 2
4 0.000108718 1 5
4 0.00011994 1 5 0
4 0.000115299 1 5 1
4 7.68645e-05 1 5 2
4 4.84887e-05 1 6
4 0.000172986 1 6 0
4 7.8351e-05 1 6 1
4 5.86827e-05 1 6 2
4 6.294e-05 2 0
4 5.26977e-05 2 1
4 6.3008e-05 2 2
4 4.68102e-05 2 3
4 3.05455e-05 2 3 0
4 2.8196e-05 2 3 1
4 7.86908e-05 2 3 2
4 8.2974e-05 2 4
4 2.25501e-05 2 4 0
4 2.58136e-05 2 4 1
4 2.7664e-05 2 4 2
4 0.000111506 2 5
4 7.00548e-05 2 5 0
4 9.99615e-05 2 5 1
4 7.06356e-05 2 5 2
4 0.00011982 2 6
4 5.0658e-05 2 6 0
4 3.08841e-05 2 6 1
4 7.76744e-05 2 6 2
4 0.823365 3 0
4 0.0482371 3 1
4 0.0989807 3 2
4 7.0079e-05 3 3
4 6.92351e-05 3 4
4 7.08008e-05 3 5
4 6.08972e-05 3 6
4 5.36346e-05 4 0
4 7.22912e-05 4 1
4 0.000112523 4 2
4 0.000126831 4 3
4 0.000124586 4 4
4 3.53947e-05 4 5
4 9.35069e-05 4 6
4 1.77422e-05 5 0
4 9.71272e-05 5 1
4 3.92449e-05 5 2
4 0.00010292 5 3
4 7.06221e-05 5 4
4 1.89563e-05 5 5
4 0.000287363 5 6
4 2.79776e-05 6 0
4 0.000139138 6 1
4 7.56101e-05 6 2
4 2.61395e-05 6 3
4 5.70226e-05 6 4
4 6.0733e-05 6 5
4 0.000110274 6 6
5 4.34696e-05 0 0
5 0.000151267 0 1
5 3.92401e-05 0 2
5 5.58792e-05 0 3
5 0.000115817 0 3 0
5 3.39137e-05 0 3 1
5 2.99034e-05 0 3 2
5 0.000173964 0 4
5 0.00966037 0 4 0
5 0.0300377 0 4 1
5 0.0857262 0 4 2
5 7.97819e-05 0 5
5 2.40013e-05 0 5 0
5 6.48752e-05 0 5 1
5 0.000110961 0 5 2
5 0.000108121 0 6
5 8.70324e-05 0 6 0
5 4.15074e-05 0 6 1
5 5.40443e-05 0 6 2
5 0.000127968 1 0
5 5.67897e-05 1 1
5 0.000203433 1 2
5 0.000111176 1 3
5 0.000140535 1 3 0
5 0.000206718 1 3 1
5 0.000149241 1 3 2
5 7.98089e-05 1 4
5 3.60087e-05 1 4 0
5 8.9186e-05 1 4 1
5 0.00992148 1 4 2
5 6.97748e-05 1 5
5 0.000113216 1 5 0
5 2.54049e-05 1 5 1
5 9.17698e-05 1 5 2
5 6.73961e-05 1 6
5 7.79372e-05 1 6 0
5 2.44943e-05 1 6 1
5 5.96097e-05 1 6 2
5 0.000104952 2 0
5 0.000108951 2 1
5 7.86383e-05 2 2
5 3.03459e-05 2 3
5 9.33481e-05 2 3 0
5 0.000114764 2 3 1
5 6.66646e-05 2 3 2
5 4.07964e-05 2 4
5 0.000112624 2 4 0
5 0.0556841 2 4 1
5 0.802058 2 4 2
5 2.3042e-05 2 5
5 2.83316e-05 2 5 0
5 7.47208e-05 2 5 1
5 1.38779e-05 2 5 2
5 7.31807e-05 2 6
5 2.91983e-05 2 6 0
5 0.000124791 2 6 1
5 0.000129268 2 6 2
5 0.000105717 3 0
5 6.21899e-05 3 1
5 0.000117539 3 2
5 7.31988e-05 3 3
5 0.000127644 3 4
5 0.000142091 3 5
5 0.000200298 3 6
5 4.84658e-05 4 0
5 3.33244e-05 4 1
5 0.000191316 4 2
5 6.84701e-05 4 3
5 0.00013268 4 4
5 3.90502e-05 4 5
5 5.04247e-05 4 6
5 0.000117066 5 0
5 3.65463e-05 5 1
5 1.79054e-05 5 2
5 2.93175e-05 5 3
5 2.28523e-05 5 4
5 3.25032e-05 5 5
5 0.000413369 5 6
5 9.46601e-05 6 0
5 0.000112047 6 1
5 0.000135342 6 2
5 8.69346e-05 6 3
5 9.23421e-05 6 4
5 7.28879e-05 6 5
5 6.4444e-05 6 6
6 0.000127351 0 0
6 5.46911e-05 0 1
6 4.22435e-05 0 2
6 0.000618208 0 3
6 5.61748e-05 0 3 0
6 4.33904e-05 0 3 1
6 1.72206e-05 0 3 2
6 6.57217e-05 0 4
6 7.21105e-05 0 4 0
6 8.2506e-05 0 4 1
6 9.88608e-05 0 4 2
6 0.0120734 0 5
6 3.15156e-05 0 5 0
6 2.74337e-05 0 5 1
6 5.79607e-06 0 5 2
6 0.375148 0 6
6 4.26031e-05 0 6 0
6 5.78375e-05 0 6 1
6 4.82785e-05 0 6 2
6 0.000108974 1 0
6 0.000100012 1 1
6 7.43562e-05 1 2
6 4.66826e-05 1 3
6 8.61587e-05 1 3 0
6 1.13795e-05 1 3 1
6 5.59386e-05 1 3 2
6 1.18898e-05 1 4
6 0.00130284 1 4 0
6 2.3932e-05 1 4 1
6 0.000141945 1 4 2
6 0.00562221 1 5
6 2.88747e-05 1 5 0
6 7.05784e-05 1 5 1
6 5.00873e-05 1 5 2
6 0.0796979 1 6
6 5.46616e-05 1 6 0
6 2.55967e-05 1 6 1
6 1.48919e-05 1 6 2
6 2.24526e-05 2 0
6 0.000172897 2 1
6 5.15636e-05 2 2
6 0.000433582 2 3
6 1.52433e-05 2 3 0
6 6.42116e-05 2 3 1
6 2.14124e-05 2 3 2
6 4.6077e-05 2 4
6 5.94257e-05 2 4 0
6 6.87972e-05 2 4 1
6 0.000163927 2 4 2
6 0.000127193 2 5
6 7.38973e-05 2 5 0
6 1.74394e-05 2 5 1
6 2.85956e-05 2 5 2
6 0.242313 2 6
6 3.05019e-05 2 6 0
6 3.66039e-05 2 6 1
6 2.93773e-05 2 6 2
6 0.000112407 3 0
6 2.80014e-05 3 1
6 6.04029e-05 3 2
6 5.35223e-05 3 3
6 3.97026e-05 3 4
6 0.0438648 3 5
6 7.66026e-05 3 6
6 1.2009e-05 4 0
6 6.52413e-05 4 1
6 5.39088e-05 4 2
6 3.84007e-05 4 3
6 4.44039e-05 4 4
6 0.0908073 4 5
6 0.000216024 4 6
6 5.31405e-05 5 0
6 5.60397e-05 5 1
6 3.57416e-05 5 2
6 3.8755e-05 5 3
6 2.52895e-05 5 4
6 2.88508e-05 5 5
6 3.47809e-05 5 6
6 1.11772e-05 6 0
6 1.68531e-05 6 1
6 2.50705e-05 6 2
6 3.31356e-05 6 3
6 7.47578e-05 6 4
6 0.144055 6 5
6 1.99967e-05 6 6
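The structural rules above have the form `nonterminal weight symbol symbol [symbol]`, with either two or three symbols on the right-hand side. This commit does not explain what the three-symbol rules mean, so the sketch below treats the right-hand side as an opaque tuple. It parses a handful of lines copied from the listing and reports the heaviest rule per left-hand side within that sample. `Rule` and `parse_struct` are hypothetical names.

```python
from collections import namedtuple

Rule = namedtuple("Rule", "lhs weight rhs")

# Header plus a few rules copied from hets.struct.wcfg above.
STRUCT_SAMPLE = """\
grammar_type WCFG
start_symbol 6
3 0.161678 0 1
3 0.471872 2 1
4 0.823365 3 0
5 0.0857262 0 4 2
5 0.802058 2 4 2
6 0.375148 0 6
6 0.242313 2 6
"""


def parse_struct(text):
    """Return (header dict, list of Rule) from a structural grammar listing."""
    header, rules = {}, []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0] in ("grammar_type", "start_symbol"):
            header[fields[0]] = fields[1]
            continue
        lhs, weight, *rhs = fields
        rules.append(Rule(int(lhs), float(weight), tuple(int(s) for s in rhs)))
    return header, rules


header, rules = parse_struct(STRUCT_SAMPLE)

# Heaviest rule per left-hand side, within this sample only.
best = {}
for rule in rules:
    if rule.lhs not in best or rule.weight > best[rule.lhs].weight:
        best[rule.lhs] = rule

for lhs in sorted(best):
    print(lhs, best[lhs].rhs, best[lhs].weight)
```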