Commit f69291a9 authored by Witold Dyrka's avatar Witold Dyrka
Browse files

version 3.1.0

parent 41f05f09
Loading
Loading
Loading
Loading
+114 −65
Original line number Diff line number Diff line
# Install
# PCFG-CM
##### current version: 3.1.0

The PCFG-CM is a set of programs for modeling protein sequences using probabilistic context-free grammars. PCFG-CM can estimate a probabilistic context-free grammar for a set of sample sequences taking advantage from contact constraints if available. PCFG-CM can be used to scan protein sequences for presence of fragments consistent with a given grammar, and to generate their syntactic trees.

The PCFG-CM package is developed and maintained in [KotulskaLab](http://www.kotulska-lab.pwr.wroc.pl) at [Politechnika Wroclawska](http://pwr.edu.pl) with financial support from [National Science Centre](https://www.ncn.gov.pl?language=en) grant no. 2015/17/D/ST6/04054 (*Formal linguistics for proteomics - modeling, analysis and hypotheses testing*).

## License
The PCFG-CM package is distributed under [the GNU Public License v3.0](/LICENSE).

## Install

To compile, first make sure that all __dependencies__ are installed.
If dependencies are in place, just run
@@ -7,7 +17,7 @@ If dependencies are in place, just run

Executables will appear in directory `./bin`.

## Dependencies
### Dependencies

There is a dependency on BOOST C++ library, specifically `libboost-program-options`.
If you only need to run the program, on Debian or Ubuntu Linux derivatives you can install these with
@@ -18,9 +28,9 @@ To compile this repository, you will need package `libboost-program-options-dev`

    sudo apt install libboost-program-options-dev

# Usage instructions
## Usage instructions

## Help
### Help

Enter in terminal

@@ -32,7 +42,7 @@ or

to print full list of program options.

## Scanning example
### Scanning example

Directory `./example` contains files:

@@ -44,7 +54,7 @@ Directory `./example` contains files:
- `hets-evolve.conf` (configuration file used in the next example)
- `hets-galib.conf` (configuration file also used in the next example)

Let's run the scanner on these files:
Let's run the scanner using these files:

    bin/pcfg_scan --conf ./example/hets-scan.conf --out trees.txt --tree

@@ -58,6 +68,7 @@ There are also other options, which are loaded from configuration file.
All options except `--conf` can be set in either way.

- `clist=..` specifies file containing contact pairs
- `cmode=..` specifies interpretation of contact constraints
- `fasta=...` specifies file containing input protein sequences
- `lex=...` specifies file containing lexical rules of grammar
- `struct=...` specifies file containing structural part of grammar
@@ -68,7 +79,7 @@ All options except `--conf` can be set in either way.

Scanner should then produce output file `trees.txt` in directory `./example`.

## Learning example
### Learning example

Please read the paragraph about scanning first.
In this example we use the same files as in previous example,
@@ -99,9 +110,9 @@ Scanner should then produce output files in directory `./example`:
- `1571856417-galib.conf`
- `1571856417-pcfg_evolve.conf`

## Common options
### Common options

### `--clist arg`
#### `--clist arg`

File path to list of contact pairs. Optional.

@@ -115,93 +126,110 @@ followed by 0 or more lines of the form:

    POSITION_1 # POSITION_2

### `--conf arg`
#### `--cmode arg (=at_least)`

Select the contact constraints mode. Default: at_least

The options define that contact rules should be used:
- at least for contact pairs (at_least);
- exactly for contact pairs (exactly);
- at most for contact pairs (at_most).

#### `--conf arg`

File path to program configuration. Optional.

File should consist of `key=value` entries, one in each line. For syntax details consult documentation of `boost::program_options` library.
All options that can be specified via file can be specified via terminal and vice versa. Settings from terminal takes precedence over settings from file.

### `--fasta arg`
#### `--fasta arg`

File path to sequences in fasta format. Required.

### `--help`
#### `--help`

Show help message.

### `--lex arg`
#### `--lex arg`

File path to lexical part of grammar. Required.

### `--outdir arg (=.)`
#### `--outdir arg (=.)`

Path to directory where output files should be placed. Optional, default: current directory.

### `--struct arg`
#### `--struct arg`

File path to structural part of grammar. Required.

### `--threads (=2147483647)`
#### `--threads (=<max available>)`

Upper bound on number of worker threads. Optional, default: INT_MAX integer constant.
Upper bound on number of worker threads. Optional, default: max available.

Default value effectively places no user-defined upper bound on number of threads, so machine-specific limit shall be used. In general case, the lesser of these shall be used as the number of threads but not less than one thread.
Default value is the machine-specific limit. The user can limit the number of threads in use.

### `--viterbi`
#### `--viterbi`

Switch to Viterbi parsing.

By default, probabilities are added, which corresponds to Baum-Welch mode of operation, wihch gives overall probability of the sequence in grammar. When this option is enabled, during parsing probabilities are combined with `max` function, giving probability of the most likely derivation. Notably, option `--tree` in `pcfg_scan` always operates as if this option was set.
The default is the Baum-Welch mode, which calculates overall probability of the sequence in grammar over all possible derivations.
When the Viterbi switch is enabled, the probability of the most likely derivation is returned.
Notably, option `--tree` of `pcfg_scan` implies that Viterbi mode is on (though it may change in future releases).

## Options specific for `pcfg_scan`
### Options specific for `pcfg_scan`

### `--group-by arg (=seq)`
#### `--group-by arg (=seq)`

Print the best score for each group. Optional, default: seq.

Default option (seq) causes scanner to search for the best window in a sequence.
Default option (seq) causes scanner to search for the highest scoring parsing window in a sequence.
More generally, results are grouped by having the same:
- sequence (seq);
- sequence and starting position of window (pos);
- sequence, starting position of window and length of window (win).
- sequence and starting position of parsing window (pos);
- sequence, starting position of window, and length of window (win).
For each group, the highest score is printed.

### `--tree`
#### `--tree`

Print the most likely parse tree.

This option implies `--viterbi`.

### `--obj arg (=X_G)`
#### `--obj arg (=X_G)`

Objective function. Optional, default: X_G.
Objective function or the actual score to be calculated. Default: X_G

Possible objectives are: M_G, M_XG, XM_G, X_G and X_MG.
It is possible to calculate log (base of 10) probability of:
- M|G: contact list given grammar (M_G);
- M|X,G: contact list given sequence and grammar (M_XG);
- X|G: sequence given grammar (X_G);
- X|M,G: sequence given contact list and grammar (X_MG);
- X,M|G: sequence and contact list given grammar (XM_G).

### `--null arg`
#### `--null arg`

Null model. Optional.

Possible null models are: p0, p1.
Score on null model is subtracted from overall score.
Theese score differences are used when choosing the best match in group (see: `--group-by`).
Possible null models are: 
- terminal symbols equally probable (eq);
- terminal symbols (amino acid) probabilities set according to frequencies in Swiss-Prot (p0);
- as above but accounting for unoptimized lexical rule probabilities when using `--preserve-lex` for learning (p1).

### `--winmax arg (=2147483647)`
Score on null model is subtracted from overall score. These score differences are used when choosing the best match in group (see: `--group-by`).

Upper bound on window size. Optional, default: INT_MAX integer constant.
#### `--winmax arg (=0)`

An actual upper bound for a sequence is `min(winmax,sequence_length)`.
Upper bound on window size. Optional, default: full length of each sequence.

Note that the effective window size is upper-bound by each sequence length.

### `--winmin arg (=2147483647)`
#### `--winmin arg (=0)`

Lower bound on window size. Optional, default: INT_MAX integer constant.
Lower bound on window size. Optional, default: full length of each sequence if winmax=0 otherwise 2.

An actual lower bound for a sequence is `max(min(winmin,winmax,sequence_length),2)`.
It means that unless `winmin` is at least two, no windows are processed at all.
Note that the effective window size is lower-bound by 2.

### `--out arg`
#### `--out arg`

Output file path. Required.

@@ -211,41 +239,50 @@ Each line of output file is of the form:

where `TREE` appears only if `--tree` is set.

## Options specific for `pcfg_evolve`
### Options specific for `pcfg_evolve`

### `--galib-conf arg`
#### `--galib-conf arg`

GALIB configuration. Optional.

### `--grammar-flush-frequency arg (=0)`
#### `--grammar-flush-frequency arg (=0)`

Frequency of flushing learned grammars to files. Optional, default: 0.

Default value of 0 means no flushing. Positive values specify how many iterations happen between consecutive flushes.

### `--jobid arg`
#### `--jobid arg`

Job identifier (string). Optional.

Job identifier is used in default names of output files. Also, hash of job identifier is combined with random seed via XOR.
Job identifier is used in naming output files. Also, a hash of job identifier is combined with random seed via XOR.

### `--obj arg (=X_G)`
#### `--obj arg (=X_G)`

Objective function. Optional, default: G_MX.
Objective function or the actual score to be calculated. Default: G_MX.

Possible objectives are: G_M, G_MX, G_X, M_G, M_XG, XM_G, X_G, X_MG and Z.
Possible objectives for maximization are probability of:
- G|M: grammar given contact list (G_M);
- G|M,X: grammar given sequence and contact list (G_MX);
- G|X: grammar given sequence (G_X);
- M|G: contact list given grammar (M_G, equivalent to G_M);
- M|X,G: contact list given sequence and grammar (M_XG);
- X|G: sequence given grammar (X_G, equivalent to G_X);
- X|M,G: sequence given contact list and grammar (X_MG);
- X,M|G: sequence and contact list given grammar (XM_G, equivalent to G_MX);
- Z: experimental (may be removed in future releases).

### `--preserve-lex`
#### `--preserve-lex`

Preserve lexical part of grammar, do not evolve it.
Preserve lexical part of grammar, do not optimize it.

### `--seed arg (=0)`
#### `--seed arg (=0)`

Seed for random generator. Optional, default: 0.

Default value of 0 means that current time is set as seed. For all other values, seed is taken literally from option, leading to deterministic, reproducible outputs.
Default value of 0 means that current time is used as seed. For all other values, seed is taken literally from the option, leading to deterministic and reproducible outputs.

### `--sharing-cutoff arg (=1)`
#### `--sharing-cutoff arg (=1)`

Sharing cutoff. Optional, default: 1.

@@ -257,29 +294,41 @@ There is only one makefile, `./Makefile`.

_If dependencies between any files are changed, you need to state these in `./Makefile`_.

### Makefile commands
#### Makefile commands

- Command `make` or `make all` compiles all sources, producing executables in directory `bin`.
- Command `make clean` removes all compilation artifacts (object files) from `bin` directory.
- Command `make distclean` executes `make clean` but also removes executables from `bin`.
- Command `make check` performs all tests. Builds executables first, if needed.

## Tests
### Tests

### How to: run tests
#### How to: run tests

1. Execute command `make check` for build & test, or `scripts/run-tests.sh` for testing alone.

### How to: create trivial failing test
#### How to: create trivial failing test

1. Go to directory `cd tests`.
1. Pick a name (e.g. `foo`) for a new subdirectory and go there: `mkdir foo; cd foo`.
1. Create executable script (e.g. `run.sh`) which calls `false`: `echo -e "#!/bin/sh\nfalse" > run.sh; chmod a+x run.sh`.
1. Append new tests to the list like this: `test_cases+="foo/foo.sh "`. NB: that white space after path is there for purpose (separates paths, thus making the list understandable to `for` loop).
2. Pick a name (e.g. `foo`) for a new subdirectory and go there: `mkdir foo; cd foo`.
3. Create executable script (e.g. `run.sh`) which calls `false`: `echo -e "#!/bin/sh\nfalse" > run.sh; chmod a+x run.sh`.
4. Append new tests to the list like this: `test_cases+="foo/foo.sh "`. NB: that white space after path is there for purpose (separates paths, thus making the list understandable to `for` loop).

### How does tests work?
#### How does tests work?

1. Directory `bin` (from repository root) is available on $PATH during tests (easy access to binaries).
1. Main script `scripts/run-tests.sh` executes test scripts listed in `test_cases` variable. To disable some tests temporarily, just comment out the corresponding lines in the main script.
1. Each executable file (e.g. `foo/bar/run.sh`) is executed in its local dir: `cd foo/bar; ./run.sh`.
1. Test passes if and only if its executable exits successfully.
2. Main script `scripts/run-tests.sh` executes test scripts listed in `test_cases` variable. To disable some tests temporarily, just comment out the corresponding lines in the main script.
3. Each executable file (e.g. `foo/bar/run.sh`) is executed in its local dir: `cd foo/bar; ./run.sh`.
4. Test passes if and only if its executable exits successfully.

## Citing

If you publish results obtained using PCFG-CM, please cite:
- W. Dyrka, M. Pyzik, F. Coste, H. Talibart. Estimating probabilistic context-free grammars for proteins using contact map constraints. PeerJ 7:e6559, 2019.

You may also consider citing the orginal framework for evolutionary PCFG learning:
- W. Dyrka, J.-C. Nebel. A stochastic context free grammar based framework for analysis of protein sequences. BMC Bioinformatics 10:323, 2009.

## Contact

Please contact **witold** dot **dyrka** at **pwr** dot **edu** dot **pl**.
+1 −0
Original line number Diff line number Diff line
clist=./example/hets.cmap
cmode=at_least
fasta=./example/hets.fasta
lex=./example/hets.lex.wcfg
struct=./example/hets.struct.wcfg
+1 −0
Original line number Diff line number Diff line
clist=./example/hets.cmap
cmode=at_least
fasta=./example/hets.fasta
lex=./example/hets.lex.wcfg
struct=./example/hets.struct.wcfg
+14 −3
Original line number Diff line number Diff line
@@ -3,6 +3,7 @@
#ifndef PGE_COMMON_OPTIONS_H_
#define PGE_COMMON_OPTIONS_H_

#include <omp.h>
#include <boost/program_options.hpp>

#include <memory>
@@ -37,8 +38,12 @@ inline std::string AsString(const Container &vals) {
/// Base class for parsing, validation and storage of options.
class CommonOptions {
 public:

  /// True iff --help flag was given.
  bool help;
  /// Select mode for parsing with contact constraints
  std::set<std::string> cmode_values{"at_least", "at_most", "exactly"};
  std::string cmode;
  /// Switch between Viterbi and Baum-Welch parsing.
  bool viterbi;
  /// Output directory.
@@ -96,6 +101,7 @@ class CommonOptions {
  virtual void PrintOptions(std::ostream &outstream) const {
    if (!contacts.empty()) {
      outstream << "clist=" << contacts << '\n';
      outstream << "cmode=" << cmode << '\n';
    }
    if (!conf.empty()) {
      outstream << "conf=" << conf << '\n';
@@ -117,6 +123,8 @@ class CommonOptions {
    using boost::program_options::value;
    RegisterOption("clist", value(&contacts),
                   "list of contact pairs (file, optional)");
    RegisterOption("cmode", value(&cmode)->default_value("at_least"),
                   ("contact constraints mode " + AsString(cmode_values)).c_str());
    RegisterOption("conf", value(&conf),
                   "program configuration (file, optional)");
    RegisterOption("fasta", value(&sequences),
@@ -128,8 +136,11 @@ class CommonOptions {
                   "output directory");
    RegisterOption("struct", value(&structural_part),
                   "structural part of grammar (file)");
    RegisterOption("threads", value(&threads)->default_value(INT_MAX),
                   "number of worker threads in use");
    int num_threads;
#pragma omp parallel
    { num_threads = omp_get_num_threads(); }
    RegisterOption("threads", value(&threads)->default_value(num_threads),
                   "number of worker threads in use (default: all)");
    RegisterOption("viterbi", bool_switch(&viterbi),
                   "switch from Baum-Welch (sum) to Viterbi (max) parsing");
  }
@@ -145,7 +156,7 @@ class CommonOptions {
  /// Implementation of Validate().
  ///
  /// Correct override should call base method first and fail if base does.
  virtual bool ValidateOptions() { return true; }
  virtual bool ValidateOptions() { return cmode_values.count(cmode); }

 private:
  /// Details about parsing and usage of options.
+25 −0
Original line number Diff line number Diff line
@@ -40,6 +40,31 @@ BipartiteGrammar<char> EstimateNull0() {
  return {null_mapping, null_rs};
}

BipartiteGrammar<char> EstimateNullEqual(Grammar<char> lexical) {

  static std::unordered_map<char, int> ownAlphabet;

  int counter = 0;
  for (const Rule<char> &rule : lexical.rules)
    if (ownAlphabet.count(rule.rhs[0])==0)
      ownAlphabet.emplace(rule.rhs[0], counter++);

  Eigen::VectorXf b = Eigen::VectorXf::Constant(counter, 1.0/counter);

  Grammar<int> null_rs;
  null_rs.add({1, {0, 1}, 1.0});
  null_rs.start = 1;

  Grammar<char> null_mapping;
  for (auto it = ownAlphabet.begin(); it != ownAlphabet.end(); ++it) {
    null_mapping.add({0, {it->first}, b(it->second) / b.sum()});
  }
  null_mapping.start = -1;

  return {null_mapping, null_rs};
}


BipartiteGrammar<char> EstimateNull1(Grammar<char> lexical) {
  Eigen::VectorXf b =
      Eigen::VectorXf::Zero(static_cast<int64_t>(alphabet.size() + 1));
Loading