Implement RL #3
Merged
Changes from all commits (18 commits):
- 114503f — pokerme7777: hydra with rl including DQN training code
- 1861a99 — pokerme7777: updating config
- 9660f61 — pokerme7777: draft of rl feature is finished.
- 34aab52 — pokerme7777: inference code checked
- b1d2384 — ControlNet: fix gradio dependency issue
- 67a4994 — ControlNet: reformat
- 7f5b662 — ControlNet: separate the llm embedding func, making evaluation in lib, reformat
- eab4182 — ControlNet: improve the checkpoint structure and fix loading buffer
- f6e7b2a — ControlNet: fix minor issues
- 108627d — pokerme7777: adding worker bash
- c14ab0b — pokerme7777: adding pandas and modify some bash script
- d0845e2 — pokerme7777: update image_patch llm_query
- 0a7fdfd — pokerme7777: upload gradio demo video
- 40e756e — pokerme7777: update readme
- 510d942 — pokerme7777: updating readme
- 476cd7a — pokerme7777: update readme
- 50bec81 — pokerme7777: update readme
- 3e86a2e — ControlNet: fix llm query in image patch
New config file (`@@ -0,0 +1,15 @@`):

```yaml
model_name: dataset_nameaokvqa_learn_starts300
llm_embedding_dim: 1536  # small, 3072 for large
mlp_hidden_dim: 512
critic_layer_num: 4
critic_lr: 0.0001
train_log_interval: 1
batch_size: 1
update_times: 2
save_interval: 1
learn_starts: 2
dqn_explore_epsilon: 0.2
dqn_explore_epsilon_decay_rate: 0.02
dqn_explore_epsilon_decay_interval: 100
buffer_size: 100000
training_epoch: 1
```
New config file (`@@ -0,0 +1,15 @@`):

```yaml
model_name: dqn_model_example
llm_embedding_dim: 1536  # small, 3072 for large
mlp_hidden_dim: 512
critic_layer_num: 4
critic_lr: 0.0001
train_log_interval: 100
batch_size: 128
update_times: 4
save_interval: 100
learn_starts: 1000
dqn_explore_epsilon: 0.2
dqn_explore_epsilon_decay_rate: 0.02
dqn_explore_epsilon_decay_interval: 100
buffer_size: 100000
training_epoch: 4
```
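The `dqn_explore_epsilon*` fields above define a linear exploration-decay schedule. A minimal sketch of how these values combine into an exploration threshold (the helper name is illustrative, not from the repository):

```python
def explore_threshold(epsilon: float, decay_rate: float,
                      decay_interval: int, obs_no: int) -> float:
    """Linearly decayed exploration threshold, mirroring the
    dqn_explore_epsilon_* fields in the config above: the threshold
    drops by decay_rate for every decay_interval observations."""
    return epsilon - decay_rate * (obs_no / decay_interval)

# With the example config (epsilon=0.2, decay_rate=0.02, interval=100),
# the threshold starts at 0.2 and reaches 0.1 after 500 observations.
start = explore_threshold(0.2, 0.02, 100, 0)
later = explore_threshold(0.2, 0.02, 100, 500)
```

Note that with this schedule the threshold eventually goes negative, which effectively disables epsilon-greedy exploration.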
Modified controller module (`@@ -1,16 +1,161 @@`), reconstructed from the diff; the abstract `__call__` signature was loosened from `(self, instructions, probs)` to `(self, *args, **kwargs)` so subclasses can take extra arguments:

```python
import abc
import pickle
import random
from collections import deque

import numpy as np
import torch

from hydra_vl4ai.util.console import logger
from .llm import llm_embedding
from .rl_dqn import DQN_EmbeddingViaLLM, ReplayBuffer
from .smb.state_memory_bank import StateMemoryBank
from ..util.config import Config
from ..util.misc import get_hydra_root_folder


class Controller(abc.ABC):

    @abc.abstractmethod
    def __call__(self, *args, **kwargs) -> str:
        pass


class ControllerLLM(Controller):
    """Baseline controller that does not use RL: it directly returns
    the instruction with the highest LLM score."""

    def __call__(self, instructions: list[str], probs: np.ndarray) -> str:
        return instructions[np.argmax(probs)]


class ControllerDQN(Controller):

    def __init__(self,
        embedding_prompt_base: str,
        task_description_for_instruction: str,
        instruction_example: str,
        training: bool = False
    ):
        super().__init__()
        self.instruction_example = instruction_example
        self.embedding_prompt_base = embedding_prompt_base
        self.model_name = Config.dqn_config["model_name"]
        self.model_save_path = get_hydra_root_folder().parent / "ckpt" / self.model_name
        self.task_description_for_instruction = task_description_for_instruction
        self.training = training

        self.rl_agent_model = DQN_EmbeddingViaLLM(
            device=torch.device("cuda:0"),
            llm_embedding_dim_concat=Config.dqn_config["llm_embedding_dim"],
            mlp_hidden_dim=Config.dqn_config["mlp_hidden_dim"],
            action_dim=Config.base_config["num_actions"] + 1,  # +1 for the REJECT slot
            critic_layer_num=Config.dqn_config["critic_layer_num"],
            critic_lr=float(Config.dqn_config["critic_lr"])
        )

        # load model
        self.model_full_path = self.model_save_path / "critic.pt"
        self.buffer_path = self.model_save_path / "buffer.pickle"
        self.model_save_path.mkdir(parents=True, exist_ok=True)

        if self.model_full_path.exists():
            self.rl_agent_model.load_model(str(self.model_full_path))
            logger.info(f"Load Model Done from file: {str(self.model_full_path)}")
        elif not self.training:  # for inference, a missing model is a hard error
            raise RuntimeError(f"Model is not found: {self.model_full_path}")

        if self.training:
            self.rl_agent_model.train_mode()
            self.train_log_interval = Config.dqn_config["train_log_interval"]
            self.reward_window = deque(maxlen=self.train_log_interval)
            self.obs_no = 0
            self.batch_size = Config.dqn_config["batch_size"]
            self.update_times = Config.dqn_config["update_times"]
            self.save_interval = Config.dqn_config["save_interval"]
            self.save_model_obs_num = 0  # accumulate
            self.best_cum_reward = -100  # TODO: MODIFY
            self.best_score = 0
            self.learn_starts = Config.dqn_config["learn_starts"]
            self.dqn_explore_epsilon = Config.dqn_config["dqn_explore_epsilon"]
            self.dqn_explore_epsilon_decay_rate = Config.dqn_config["dqn_explore_epsilon_decay_rate"]
            self.dqn_explore_epsilon_decay_interval = Config.dqn_config["dqn_explore_epsilon_decay_interval"]
            self.dqn_explore_threshold = self.dqn_explore_epsilon - self.dqn_explore_epsilon_decay_rate \
                * (self.obs_no / self.dqn_explore_epsilon_decay_interval)

            # load replay buffer if a previous one was saved
            if self.buffer_path.exists():
                with open(self.buffer_path, "rb") as f:
                    self.replay_buffer = pickle.load(f)
            else:
                self.replay_buffer = ReplayBuffer(capacity=Config.dqn_config["buffer_size"])
        else:
            self.rl_agent_model.eval_mode()

    def save(self):
        self.rl_agent_model.save_model(self.model_full_path)
        with open(self.buffer_path, "wb") as f:
            pickle.dump(self.replay_buffer, f)

    def load(self):
        self.rl_agent_model.load_model(self.model_full_path)
        if self.training and self.buffer_path.exists():
            with open(self.buffer_path, "rb") as f:
                self.replay_buffer = pickle.load(f)

    async def __call__(self, query: str, current_step_index: int, instructions: list[str], probs: np.ndarray,
        state_memory_bank: StateMemoryBank
    ) -> tuple[str, np.ndarray, int]:
        prompt = self.build_prompt(query, current_step_index, instructions, probs, state_memory_bank)

        # get embedding from llm
        response_emb = await llm_embedding(Config.base_config["embedding_model"], prompt)

        affordance_value_array = self.rl_agent_model.get_action(obs=response_emb)

        selected_idx = np.argmax(affordance_value_array)

        # random exploration in the beginning phase, then epsilon-greedy
        if self.training:
            if self.obs_no <= self.learn_starts or np.random.random() <= self.dqn_explore_threshold:
                selected_idx = random.choice(range(len(affordance_value_array)))

        if selected_idx != len(instructions):
            selected_instruction = instructions[selected_idx]
        else:
            selected_instruction = "REJECT"  # the extra trailing action slot
        return selected_instruction, response_emb, selected_idx

    def build_prompt(self, query: str, current_step_index: int, instructions: list[str], probs: np.ndarray,
        state_memory_bank: StateMemoryBank
    ):
        """Build the prompt from the template."""
        # per-query fields
        prompt = self.embedding_prompt_base.replace('[INSERT_QUERY_HERE]', query)
        prompt = prompt.replace('[INSERT_CURRENT_STEP_NO]', str(current_step_index))

        # dataset-level query type and examples
        prompt = prompt.replace('[INSERT_QUERY_TYPE_HERE]', self.task_description_for_instruction)
        prompt = prompt.replace('[EXAMPLE_HERE]', self.instruction_example)

        # previous instructions
        prompt = prompt.replace('[NEED_TO_PROVIDE_PREVIOUS_INSTRUCTION]',
                                state_memory_bank.instructions_prompt)

        # previously executed code and its results
        prompt = prompt.replace('[MORE_CODE_WAITING]', state_memory_bank.codes_prompt)
        prompt = prompt.replace('[CURRENTLY_RESULT_WAITING]',
                                state_memory_bank.feedbacks_prompt)

        # variable details
        prompt = prompt.replace('[VARIABLE_AND_DETAILS]', state_memory_bank.variables_prompt)

        # current instruction options and their probabilities
        prompt = prompt.replace('[CURRENT_OPTION]', str(instructions))
        prompt = prompt.replace('[CURRENT_OPTION_PROBABILITY]', str(probs))

        return prompt
```
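The `ReplayBuffer` imported from `.rl_dqn` is not part of this diff. A minimal sketch of what such a capacity-bounded buffer typically looks like (the class name and interface here are assumptions; the actual implementation in `.rl_dqn` may differ):

```python
import random
from collections import deque


class ReplayBufferSketch:
    """Illustrative capacity-bounded replay buffer: once full, the
    oldest transitions are evicted, and training samples uniformly."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # deque drops old items at maxlen

    def push(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size: int):
        # uniform sampling without replacement
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because the controller pickles the buffer to `buffer.pickle`, any such class must be picklable, which a plain deque of tuples is.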
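The DQN controller sizes its action space as `num_actions + 1`, reserving the extra trailing index as a REJECT slot. A standalone sketch of that selection rule, assuming a plain NumPy value array (the helper name is illustrative):

```python
import numpy as np


def select_instruction(instructions: list[str], affordance_values: np.ndarray):
    """Pick the instruction with the highest estimated value; if the
    extra trailing slot wins, reject all candidate instructions."""
    idx = int(np.argmax(affordance_values))
    if idx != len(instructions):
        return instructions[idx], idx
    return "REJECT", idx


# two instructions + one trailing reject slot
instrs = ["crop the image", "query the LLM"]
choice, idx = select_instruction(instrs, np.array([0.1, 0.7, 0.2]))
rejected, ridx = select_instruction(instrs, np.array([0.1, 0.2, 0.9]))
```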