Using claude-code, a handy feature is that when it hits its context limit, it creates a summary of the conversation and replaces the history with it. This gives you an effectively unlimited context window, which would be really awesome for limited-memory setups (I use a GeForce RTX 4060 Ti with only 8 GB of VRAM with ollama).
I think a Python wrapper around ollama could do it too, so it doesn't necessarily need to be solved here, but it could help any model/service (as it does with claude-code for longer-running sessions).
Some options I think would be cool:
- provider/model used for the summary
- context size that triggers compaction, expressed as a percentage of the total context size (like 80%; see the sketch after this list)
- a good default summary prompt, but allow the user to modify it
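To make the threshold option concrete, here is a minimal sketch of a trigger check, assuming a crude 4-characters-per-token estimate (a real implementation would use the model's tokenizer to count tokens):

```python
# Sketch only: decide when the history should be compacted. The default
# threshold of 0.8 and the 4-chars-per-token heuristic are assumptions.
def should_compact(messages: list[dict], num_ctx: int, threshold: float = 0.8) -> bool:
    est_tokens = sum(len(m["content"]) // 4 for m in messages)
    return est_tokens >= threshold * num_ctx
```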
Here is an example compaction prompt, wrapped here as a small helper that builds it:
```python
import json

def build_summary_prompt(messages_to_summarize: list[dict], summary_tokens_max: int) -> str:
    return f"""
Create a comprehensive but concise summary of this conversation for a chat assistant.
This summary will replace the full conversation history to optimize memory usage.

Requirements:
1. Preserve all critical context and decisions
2. Maintain user preferences and constraints
3. Include any technical details or requirements
4. Preserve any code snippets or configurations discussed
5. Note the conversation flow and key topics

Keep the summary under {summary_tokens_max} tokens but be thorough.

Conversation History:
{json.dumps(messages_to_summarize, indent=2)}

Summary:
"""
```
Is this something I should PR for? Would this be better as an extension?