
Explore Different Sentence Models and achieve F1 > 80 #34

@juanmirocks

Description

Main Steps

  • Use combined sentences with current links

  • Run D1 alone with pre-selected features from D0

    • All relations: P=61, R=16, F1=25
    • --evaluate_only_on_edges_plausible_relations: P=21, R=21, F1=21
  • Run D1 alone with ALL features:

    • All relations: P=54, R=0.02, F1=0.04
    • --evaluate_only_on_edges_plausible_relations: P=46, R=0.07, F1=12
  • Do feature selection for D1 --> ~206 features

    • All relations: P=94, R=14, F1=25 --> 49 tp
    • --evaluate_only_on_edges_plausible_relations: P=90, R=47, F1=61 --> 38 tp
  • Merge D0 and D1

    • --> is F1 > 80 ??
      • --> NOPE :( -- First result: f_measure=0.7536231884057971; recall improved by only a few negligible decimal points, while precision dropped from 91 to 80...
  • Investigate 550 relations in test_corpus_stats (vs 1345 ?)

  • Investigate the lower recall for the baseline?

  • Investigate further links for sentence combination

  • Investigate multiple dependency paths (I must get the shortest one!) -- see the sketch after this list

  • Investigate UNKNOWN normalization

  • Investigate Shrikant's D1 features

  • Do feature selection for all models up to D6

  • Now do progressive combined sequences of all models: [D0, D1, D2, D3, D4, D5, D6]

    • --> at which point does F1 drop?
  • (By Juanmi) Implement F-Beta scoring

  • Experiment with optimal Beta values and the optimal number m of models -- see the sketch after this list

  • Put everything together
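
For the shortest-dependency-path item above, a minimal sketch (not the project's actual code) of one way to do it, assuming spaCy with the en_core_web_sm model and networkx; the sentence and token indices are chosen purely for illustration:

```python
# Sketch: pick the SHORTEST dependency path between two entity tokens
# by building an undirected graph over the dependency arcs.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def shortest_dep_path(doc, source_i, target_i):
    """Return the token indices on the shortest path through the dependency graph."""
    edges = [(token.i, child.i) for token in doc for child in token.children]
    graph = nx.Graph(edges)  # undirected, so arc direction doesn't matter
    return nx.shortest_path(graph, source=source_i, target=target_i)

doc = nlp("The kinase BRI1 is localized to the plasma membrane.")
path = shortest_dep_path(doc, 2, 8)  # BRI1 -> membrane (hypothetical entity pair)
print([doc[i].text for i in path])
```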

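For the progressive-combination and F-Beta items, a hedged sketch of the experiment loop. The F-beta formula is the standard one (beta > 1 weights recall higher); `evaluate` is a hypothetical stand-in, not a real nalaf API, for whatever call returns (precision, recall) for a combined set of sentence models. The sanity check reuses the D1 selected-features numbers reported above (P=90, R=47 -> F1 ~= 61):

```python
# F-beta: (1 + b^2) * P * R / (b^2 * P + R)
def f_beta(precision, recall, beta):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Sanity check against the D1 results above (P=90, R=47):
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.90, 0.47, beta), 3))  # beta=1.0 gives ~0.617

def evaluate(models):
    # HYPOTHETICAL placeholder so the sketch runs end-to-end; replace with
    # the real combined-model evaluation returning (precision, recall).
    import random
    random.seed(len(models))
    return random.uniform(0.5, 0.95), random.uniform(0.1, 0.6)

# Progressive combination: score each prefix [D0], [D0, D1], ..., [D0..D6]
# and watch the beta=1 column for where F1 starts to drop.
MODELS = ["D0", "D1", "D2", "D3", "D4", "D5", "D6"]
for k in range(1, len(MODELS) + 1):
    prefix = MODELS[:k]
    precision, recall = evaluate(prefix)
    print(prefix, {b: round(f_beta(precision, recall, b), 3) for b in (0.5, 1.0, 2.0)})
```
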
Other

Must Haves

  • Investigate possible very slight random changes in performance
  • (By Juanmi) Do scaling independently for training & testing (?) in SklSVM._preprocess(X); definitely keep the maxima of the training data for scaling new, unseen instances (only a few arrive at a time, so their proportions would otherwise be totally different) -- see the sketch after this list
  • (By Juanmi) Fix: evaluate in macro -- I'm actually evaluating in micro, and the documentation in nalaf is likely wrong (see the example after this list)
  • (By Juanmi) Fix lemmatization in spaCy
  • Recheck why I cannot reproduce the best feature-selection results anymore
  • Check a marker/enzyme/etc. feature derived from the protein text
  • Check a loc-relation feature derived from the loc text
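
For the scaling item, a minimal sketch of the intended fix, assuming scikit-learn (SklSVM._preprocess(X) is the project's method; MinMaxScaler and the toy arrays here only illustrate the principle): fit the scaler on training data only, then reuse the stored training minima/maxima for unseen instances.

```python
# Sketch: learn scaling parameters from TRAINING data only, then reuse them
# for test / unseen instances (never refit on the few new instances).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[7.5, 25.0]])  # a few unseen instances

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # stores the training min/max
X_test_scaled = scaler.transform(X_test)        # applies the SAME min/max
print(X_test_scaled)  # [[0.75 0.75]]
```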

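On the micro-vs-macro point, a quick scikit-learn illustration of why the two disagree: micro pools all decisions into one confusion matrix, while macro averages the per-class F1 values with equal weight, so a missed minority-class instance hurts macro more.

```python
# Micro vs. macro F1 on a deliberately imbalanced toy example.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]  # one minority-class instance missed

print(f1_score(y_true, y_pred, average="micro"))  # ~0.833 (pooled)
print(f1_score(y_true, y_pred, average="macro"))  # ~0.778 (per-class mean)
```
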
Nice to Haves, in rough order of priority

  • Overall experiment more with hyperparameter search (full pipeline)
  • Why do I get "1" as the head of "BRI 1" in: [Plant, steroid, hormones, ,, brassinosteroids, (, BRs, ), ,, are, perceived, by, the, plasma, membrane, -, localized, leucine, -, rich, -, repeat, -, receptor, kinase, BRI, 1, ., Based, on, sequence, similarity, ,, we, have, identified, three, members, of, the, BRI, 1, family, ,, named, BRL, 1, ,, BRL, 2, and, BRL, 3, .]?
  • Overall experiment more with feature selection (full pipeline)
  • Get features like is_enzyme from Swiss-Prot
  • Review Shrikant's D0 features
  • Do Lasso / L1 feature selection -- see the sketch after this list
  • Do randomized Lasso
  • Do RandomForest-based feature selection
  • Remove highly correlated features explicitly
  • Check: chi2 vs mutual info -- in quick experiments it didn't show an improvement, plus I'm no longer using k-best for feature selection
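
For the feature-selection items above, a sketch assuming scikit-learn of L1-based and RandomForest-based selection plus explicit removal of highly correlated features. RandomizedLasso was removed from recent scikit-learn releases, so it is left out; the toy dataset and the |r| > 0.95 threshold are illustrative only.

```python
# Sketch of three feature-selection variants on a toy dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# 1) Lasso-style L1 selection: zeroed coefficients drop their features.
l1_selector = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))
X_l1 = l1_selector.fit_transform(X, y)

# 2) RandomForest-based selection: keep features above mean importance.
rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_rf = rf_selector.fit_transform(X, y)

# 3) Explicitly drop one of each pair of highly correlated features (|r| > 0.95).
corr = np.abs(np.corrcoef(X, rowvar=False))
upper = np.triu(corr, k=1)  # upper triangle, diagonal excluded
to_drop = [j for j in range(X.shape[1]) if (upper[:, j] > 0.95).any()]
X_decorrelated = np.delete(X, to_drop, axis=1)

print(X_l1.shape, X_rf.shape, X_decorrelated.shape)
```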
