
Explore Different Sentence Models and achieve F1 > 80 #34

@juanmirocks

Description

Main Steps

  • Use combined sentences with current links

  • Run D1 alone with pre-selected features from D0

    • All relations: P=61, R=16, F1=25
    • --evaluate_only_on_edges_plausible_relations: P=21, R=21, F1=21
  • Run D1 alone with ALL features:

    • All relations: P=54, R=0.02, F1=0.04
    • --evaluate_only_on_edges_plausible_relations: P=46, R=0.07, F1=12
  • Do feature selection for D1 --> ~206 features

    • All relations: P=94, R=14, F1=25 --> 49 tp
    • --evaluate_only_on_edges_plausible_relations: P=90, R=47, F1=61 --> 38 tp
  • Merge D0 and D1

    • --> is F1 > 80 ??
      • --> NOPE :( -- First result: f_measure=0.7536231884057971; recall improved by only a few negligible decimal points, while precision dropped from 91 to 80...
  • Investigate 550 relations in test_corpus_stats (vs 1345 ?)

  • Investigate the lower recall for the baseline?

  • Investigate further links for sentence combination

  • Investigate multiple dependency paths (I must get the shortest one!) -- see the sketch after this list

  • Investigate UNKNOWN normalization

  • Investigate Shrikant's D1 features

  • Do feature selection for all models up to D6

  • Now do progressive combined sequences of all models: [D0, D1, D2, D3, D4, D5, D6]

    • --> at which point does F1 drop?
  • (By Juanmi) Implement F-Beta scoring

  • Experiment with optimal Beta values and the optimal number m of models -- see the sketch after this list

  • Put everything together
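
For the shortest-dependency-path item above, a minimal sketch (not the project's actual code) of one way to do it, assuming spaCy with the en_core_web_sm model and networkx; the sentence and token indices are chosen purely for illustration:

```python
# Sketch: pick the SHORTEST dependency path between two entity tokens
# by building an undirected graph over the dependency arcs.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def shortest_dep_path(doc, source_i, target_i):
    """Return the token indices on the shortest path through the dependency graph."""
    edges = [(token.i, child.i) for token in doc for child in token.children]
    graph = nx.Graph(edges)  # undirected, so arc direction doesn't matter
    return nx.shortest_path(graph, source=source_i, target=target_i)

doc = nlp("The kinase BRI1 is localized to the plasma membrane.")
path = shortest_dep_path(doc, 2, 8)  # BRI1 -> membrane (hypothetical entity pair)
print([doc[i].text for i in path])
```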

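For the progressive-combination and F-Beta items, a hedged sketch of the experiment loop. The F-beta formula is the standard one (beta > 1 weights recall higher); `evaluate` is a hypothetical stand-in, not a real nalaf API, for whatever call returns (precision, recall) for a combined set of sentence models. The sanity check reuses the D1 selected-features numbers reported above (P=90, R=47 -> F1 ~= 61):

```python
# F-beta: (1 + b^2) * P * R / (b^2 * P + R)
def f_beta(precision, recall, beta):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Sanity check against the D1 results above (P=90, R=47):
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.90, 0.47, beta), 3))  # beta=1.0 gives ~0.617

def evaluate(models):
    # HYPOTHETICAL placeholder so the sketch runs end-to-end; replace with
    # the real combined-model evaluation returning (precision, recall).
    import random
    random.seed(len(models))
    return random.uniform(0.5, 0.95), random.uniform(0.1, 0.6)

# Progressive combination: score each prefix [D0], [D0, D1], ..., [D0..D6]
# and watch the beta=1 column for where F1 starts to drop.
MODELS = ["D0", "D1", "D2", "D3", "D4", "D5", "D6"]
for k in range(1, len(MODELS) + 1):
    prefix = MODELS[:k]
    precision, recall = evaluate(prefix)
    print(prefix, {b: round(f_beta(precision, recall, b), 3) for b in (0.5, 1.0, 2.0)})
```
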
Other

Must Haves

  • Investigate possible very slight random changes in performance
  • (By Juanmi) Do scaling independently for training & testing (?) in SklSVM._preprocess(X); definitely keep the maxima of the training data for scaling new, unseen instances (only a few arrive at a time, so their proportions would otherwise be totally different) -- see the sketch after this list
  • (By Juanmi) Fix: evaluate in macro -- I'm actually evaluating in micro, and the documentation in nalaf is likely wrong (see the example after this list)
  • (By Juanmi) Fix lemmatization in spaCy
  • Recheck why I cannot reproduce the best feature-selection results anymore
  • Check a marker/enzyme/etc. feature derived from the protein text
  • Check a loc-relation feature derived from the loc text
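
For the scaling item, a minimal sketch of the intended fix, assuming scikit-learn (SklSVM._preprocess(X) is the project's method; MinMaxScaler and the toy arrays here only illustrate the principle): fit the scaler on training data only, then reuse the stored training minima/maxima for unseen instances.

```python
# Sketch: learn scaling parameters from TRAINING data only, then reuse them
# for test / unseen instances (never refit on the few new instances).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[7.5, 25.0]])  # a few unseen instances

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # stores the training min/max
X_test_scaled = scaler.transform(X_test)        # applies the SAME min/max
print(X_test_scaled)  # [[0.75 0.75]]
```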

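On the micro-vs-macro point, a quick scikit-learn illustration of why the two disagree: micro pools all decisions into one confusion matrix, while macro averages the per-class F1 values with equal weight, so a missed minority-class instance hurts macro more.

```python
# Micro vs. macro F1 on a deliberately imbalanced toy example.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]  # one minority-class instance missed

print(f1_score(y_true, y_pred, average="micro"))  # ~0.833 (pooled)
print(f1_score(y_true, y_pred, average="macro"))  # ~0.778 (per-class mean)
```
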
Nice to Haves, in rough order of priority

  • Overall experiment more with hyperparameter search (full pipeline)
  • Why do I get "1" as the head of "BRI 1" in: [Plant, steroid, hormones, ,, brassinosteroids, (, BRs, ), ,, are, perceived, by, the, plasma, membrane, -, localized, leucine, -, rich, -, repeat, -, receptor, kinase, BRI, 1, ., Based, on, sequence, similarity, ,, we, have, identified, three, members, of, the, BRI, 1, family, ,, named, BRL, 1, ,, BRL, 2, and, BRL, 3, .]?
  • Overall experiment more with feature selection (full pipeline)
  • Get features like is_enzyme from Swiss-Prot
  • Review Shrikant's D0 features
  • Do Lasso / L1 feature selection -- see the sketch after this list
  • Do randomized Lasso
  • Do RandomForest-based feature selection
  • Remove highly correlated features explicitly
  • Check: chi2 vs mutual info -- in quick experiments it didn't show an improvement, plus I'm no longer using k-best for feature selection
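
For the feature-selection items above, a sketch assuming scikit-learn of L1-based and RandomForest-based selection plus explicit removal of highly correlated features. RandomizedLasso was removed from recent scikit-learn releases, so it is left out; the toy dataset and the |r| > 0.95 threshold are illustrative only.

```python
# Sketch of three feature-selection variants on a toy dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# 1) Lasso-style L1 selection: zeroed coefficients drop their features.
l1_selector = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))
X_l1 = l1_selector.fit_transform(X, y)

# 2) RandomForest-based selection: keep features above mean importance.
rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_rf = rf_selector.fit_transform(X, y)

# 3) Explicitly drop one of each pair of highly correlated features (|r| > 0.95).
corr = np.abs(np.corrcoef(X, rowvar=False))
upper = np.triu(corr, k=1)  # upper triangle, diagonal excluded
to_drop = [j for j in range(X.shape[1]) if (upper[:, j] > 0.95).any()]
X_decorrelated = np.delete(X, to_drop, axis=1)

print(X_l1.shape, X_rf.shape, X_decorrelated.shape)
```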
