Add SciBase rule report validators#421
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces recursive rule registration and adds several database validation rules and reporting scripts for SciBase (ebooks, papers, patents, and union metadata). The code reviewer identified critical thread-safety issues in the local and Spark executors caused by modifying class-level configurations globally, recommending refactoring the rule eval methods to standard instance methods. Additionally, the reviewer pointed out a temporary file leak in report_utils.py, missing database connection retry logic in union_unique_meta_data.py, and potential XML vulnerabilities in meta_patent_parsed_info.py from using standard xml.etree.ElementTree.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if getattr(model_cls, "__module__", "").startswith("dingo.model.rule.scibase."): | ||
| if "dynamic_config" not in model.__dict__: | ||
| model.dynamic_config = model.dynamic_config.model_copy(deep=True) | ||
| if model.dynamic_config.parameters is None: | ||
| model.dynamic_config.parameters = {} | ||
| model.dynamic_config.parameters.setdefault( | ||
| "_dingo_dataset_sql_config", | ||
| self.input_args.dataset.sql_config.model_dump(), | ||
| ) | ||
| model.dynamic_config.parameters.setdefault( | ||
| "_dingo_dataset_s3_config", | ||
| self.input_args.dataset.s3_config.model_dump(), | ||
| ) | ||
| model.dynamic_config.parameters.setdefault("_dingo_dataset_source", self.input_args.dataset.source) | ||
| model.dynamic_config.parameters.setdefault("_dingo_dataset_format", self.input_args.dataset.format) | ||
| model.dynamic_config.parameters.setdefault("_dingo_input_path", self.input_args.input_path) | ||
| setattr(model_cls, "dynamic_config", model.dynamic_config) | ||
| model = model_cls |
There was a problem hiding this comment.
Modifying the class-level dynamic_config attribute globally via setattr(model_cls, "dynamic_config", model.dynamic_config) and then assigning the class back to model (model = model_cls) is extremely dangerous in concurrent or multi-threaded environments. This completely defeats the purpose of instantiating the class to avoid multi-thread config overwrite (as explicitly noted in the comment on line 177).\n\nIf multiple threads or concurrent tasks evaluate the same SciBase rules with different configurations, they will overwrite each other's dynamic_config class attribute, leading to severe race conditions and incorrect evaluations.\n\nRecommended Fix:\nInstead of using @classmethod for eval in the SciBase rules and hacking the executor to modify the class globally, define eval as a standard instance method (i.e., def eval(self, input_data: Data)) in all SciBase rule classes. This allows Model.set_config_rule(model, e_c_i.config) to set the configuration on the instance model safely and thread-safely, without any class-level modifications.
| if getattr(model_cls, "__module__", "").startswith("dingo.model.rule.scibase."): | ||
| model = model_cls() | ||
| Model.set_config_rule(model, e_c_i.config) | ||
| if "dynamic_config" not in model.__dict__: | ||
| model.dynamic_config = model.dynamic_config.model_copy(deep=True) | ||
| if model.dynamic_config.parameters is None: | ||
| model.dynamic_config.parameters = {} | ||
| model.dynamic_config.parameters.setdefault( | ||
| "_dingo_dataset_sql_config", | ||
| self.input_args.dataset.sql_config.model_dump(), | ||
| ) | ||
| model.dynamic_config.parameters.setdefault( | ||
| "_dingo_dataset_s3_config", | ||
| self.input_args.dataset.s3_config.model_dump(), | ||
| ) | ||
| model.dynamic_config.parameters.setdefault("_dingo_dataset_source", self.input_args.dataset.source) | ||
| model.dynamic_config.parameters.setdefault("_dingo_dataset_format", self.input_args.dataset.format) | ||
| model.dynamic_config.parameters.setdefault("_dingo_input_path", self.input_args.input_path) | ||
| setattr(model_cls, "dynamic_config", model.dynamic_config) | ||
| model = model_cls |
There was a problem hiding this comment.
Modifying the class-level dynamic_config globally via setattr(model_cls, "dynamic_config", model.dynamic_config) is not thread-safe and will cause race conditions when running concurrent Spark tasks with different configurations.\n\nRecommended Fix:\nDefine eval as an instance method in the SciBase rules and keep the configuration isolated at the instance level, avoiding class-level modifications.
| @classmethod | ||
| def eval(cls, input_data: Data) -> EvalDetail: | ||
| del input_data | ||
| params = cls.dynamic_config.parameters or {} |
There was a problem hiding this comment.
Change the eval method from a @classmethod to a standard instance method. This ensures that the rule's configuration (dynamic_config) is read from the instance (self.dynamic_config) rather than the class (cls.dynamic_config), making the rule completely thread-safe and allowing us to remove the class-modifying hacks in local.py and spark.py.
| @classmethod | |
| def eval(cls, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = cls.dynamic_config.parameters or {} | |
| def eval(self, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = self.dynamic_config.parameters or {} |
| @classmethod | ||
| def eval(cls, input_data: Data) -> EvalDetail: | ||
| del input_data | ||
| params = cls.dynamic_config.parameters or {} |
There was a problem hiding this comment.
Change the eval method from a @classmethod to a standard instance method to ensure thread safety and instance-level configuration isolation.
| @classmethod | |
| def eval(cls, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = cls.dynamic_config.parameters or {} | |
| def eval(self, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = self.dynamic_config.parameters or {} |
| @classmethod | ||
| def eval(cls, input_data: Data) -> EvalDetail: | ||
| del input_data | ||
| params = cls.dynamic_config.parameters or {} |
There was a problem hiding this comment.
Change the eval method from a @classmethod to a standard instance method to ensure thread safety and instance-level configuration isolation.
| @classmethod | |
| def eval(cls, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = cls.dynamic_config.parameters or {} | |
| def eval(self, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = self.dynamic_config.parameters or {} |
| @classmethod | ||
| def eval(cls, input_data: Data) -> EvalDetail: | ||
| del input_data | ||
| params = cls.dynamic_config.parameters or {} |
There was a problem hiding this comment.
Change the eval method from a @classmethod to a standard instance method to ensure thread safety and instance-level configuration isolation.
| @classmethod | |
| def eval(cls, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = cls.dynamic_config.parameters or {} | |
| def eval(self, input_data: Data) -> EvalDetail: | |
| del input_data | |
| params = self.dynamic_config.parameters or {} |
| temp = tempfile.NamedTemporaryFile( | ||
| "w", | ||
| encoding="utf-8", | ||
| suffix=".json", | ||
| prefix="dingo_scibase_", | ||
| delete=False, | ||
| ) | ||
| with temp: | ||
| json.dump(payload, temp, ensure_ascii=False, indent=2) | ||
| return Path(temp.name) |
There was a problem hiding this comment.
The temporary file created by tempfile.NamedTemporaryFile with delete=False is never deleted/cleaned up after the validation finishes. This results in a temporary file leak every time a SciBase rule is evaluated, which can clutter the system's temp directory and eventually exhaust disk space or inodes in a long-running production environment.\n\nRecommended Fix:\nIntroduce a context manager in report_utils.py that automatically deletes the temporary file when exiting the block, and use it in the eval methods of the rules.
| def connect_starrocks(config_path: Path): | ||
| if pymysql is None: | ||
| raise RuntimeError("pymysql is required. Install pymysql before running DB validation.") | ||
| cfg = load_config(config_path) | ||
| mysql_cfg = cfg["mysql"] | ||
| return pymysql.connect( | ||
| host=mysql_cfg["host"], | ||
| port=int(mysql_cfg["port"]), | ||
| user=mysql_cfg["user"], | ||
| password=mysql_cfg["password"], | ||
| charset=mysql_cfg.get("charset", "utf8mb4"), | ||
| connect_timeout=int(mysql_cfg.get("connect_timeout", 30)), | ||
| read_timeout=int(mysql_cfg.get("read_timeout", 180)), | ||
| ) |
There was a problem hiding this comment.
Unlike meta_ebook_unique.py, meta_paper_unique.py, and meta_patent_parsed_info.py, the connect_starrocks function in union_unique_meta_data.py completely lacks retry logic for handling transient network or database connection errors. This makes the union metadata validation fragile to temporary blips in a big data environment.\n\nPlease refactor connect_starrocks in union_unique_meta_data.py to include the same robust retry logic used in the other validators.
| def parse_xml(raw: Any) -> ET.Element: | ||
| if isinstance(raw, (bytes, bytearray)): | ||
| raw = raw.decode("utf-8", errors="replace") | ||
| text = str(raw or "").strip() | ||
| if not text: | ||
| raise ValueError("empty XML content") | ||
| text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1).lstrip() | ||
| return ET.fromstring(text) |
There was a problem hiding this comment.
Using Python's standard xml.etree.ElementTree to parse untrusted XML data can expose the application to XML vulnerabilities such as XML External Entity (XXE) injection or entity expansion (billion laughs) attacks.\n\nIf the patent XML data in the database can come from untrusted or external sources, it is highly recommended to use defusedxml.ElementTree instead of xml.etree.ElementTree to safely parse the XML.
No description provided.