
[EXPERIMENT] - Keyword-based Diversity Improvement #84

@patrickfleith

Description


Lack of word diversity

  • When generating many samples, we observe the same words and sentence structures being repeated across outputs.

Describe the solution you'd like

  • We should explore extracting and counting keywords across the generated samples, then feeding the most frequent ones back into the prompt so the LLM stops overusing them, for example by injecting something like "Avoid using the following common words {most common words}". We could also deduplicate the keywords, or try something like PageRank to surface truly original words and compare that against plain deduplication.
  • Similarly, we could analyze the structure of the generated user queries to encourage diversity or to rewrite them. Here we could use n-grams or sentence openings that repeat often.
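
To make the first idea concrete, here is a minimal sketch of the keyword feedback loop, assuming plain regex tokenization and a tiny hand-written stopword list. The names `STOPWORDS`, `most_common_keywords`, and `add_avoid_instruction` are hypothetical and not part of any existing code in this project; a real experiment would likely swap in a proper keyword extractor.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for illustration only).
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "is", "are", "for",
    "on", "with", "how", "what", "can", "i", "my", "do", "you",
}


def most_common_keywords(samples, top_k=5):
    """Return the top_k keywords ranked by how many samples contain them."""
    counts = Counter()
    for text in samples:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Count each keyword once per sample so one long sample cannot dominate.
        counts.update({t for t in tokens if t not in STOPWORDS and len(t) > 2})
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def add_avoid_instruction(base_prompt, keywords):
    """Append an 'avoid these words' instruction to a generation prompt."""
    if not keywords:
        return base_prompt
    return (
        f"{base_prompt}\n\n"
        f"Avoid using the following common words: {', '.join(keywords)}."
    )
```

In a notebook, one could call `most_common_keywords` on the samples generated so far and pass the result to `add_avoid_instruction` before each new generation batch, then compare diversity with and without the extra instruction.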

What to do

Before implementing anything, it would be great to experiment with a few of these approaches in some simple notebooks.
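
As a possible starting point for such a notebook, the structural idea (repeated openings and n-grams) could be sketched as follows. This is only an illustration under simple assumptions: regex tokenization, openings taken from the start of each sample rather than each sentence, and hypothetical function names `repeated_openers` and `repeated_ngrams`.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a simplifying assumption, not a real NLP tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def repeated_openers(samples, n_words=2, min_count=2):
    """Return openings (first n_words tokens of a sample) seen in at least min_count samples."""
    openers = Counter()
    for text in samples:
        tokens = tokenize(text)
        if len(tokens) >= n_words:
            openers[" ".join(tokens[:n_words])] += 1
    return {opener: c for opener, c in openers.items() if c >= min_count}


def repeated_ngrams(samples, n=3, min_count=2):
    """Return word n-grams that occur in at least min_count different samples."""
    ngrams = Counter()
    for text in samples:
        tokens = tokenize(text)
        # Use a set so an n-gram is counted once per sample.
        seen = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        ngrams.update(seen)
    return {gram: c for gram, c in ngrams.items() if c >= min_count}
```

The returned openings and n-grams could then be listed in the prompt as patterns to avoid, or used to flag samples that should be rewritten.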
