Skip to content

Add SciBase rule report validators#421

Open
huaiyude1020 wants to merge 1 commit into
MigoXLab:mainfrom
huaiyude1020:codex-scibase-rule-reports
Open

Add SciBase rule report validators#421
huaiyude1020 wants to merge 1 commit into
MigoXLab:mainfrom
huaiyude1020:codex-scibase-rule-reports

Conversation

@huaiyude1020

Copy link
Copy Markdown

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces recursive rule registration and adds several database validation rules and reporting scripts for SciBase (ebooks, papers, patents, and union metadata). The code reviewer identified critical thread-safety issues in the local and Spark executors caused by modifying class-level configurations globally, recommending refactoring the rule eval methods to standard instance methods. Additionally, the reviewer pointed out a temporary file leak in report_utils.py, missing database connection retry logic in union_unique_meta_data.py, and potential XML vulnerabilities in meta_patent_parsed_info.py from using standard xml.etree.ElementTree.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread dingo/exec/local.py
Comment on lines +179 to +196
if getattr(model_cls, "__module__", "").startswith("dingo.model.rule.scibase."):
if "dynamic_config" not in model.__dict__:
model.dynamic_config = model.dynamic_config.model_copy(deep=True)
if model.dynamic_config.parameters is None:
model.dynamic_config.parameters = {}
model.dynamic_config.parameters.setdefault(
"_dingo_dataset_sql_config",
self.input_args.dataset.sql_config.model_dump(),
)
model.dynamic_config.parameters.setdefault(
"_dingo_dataset_s3_config",
self.input_args.dataset.s3_config.model_dump(),
)
model.dynamic_config.parameters.setdefault("_dingo_dataset_source", self.input_args.dataset.source)
model.dynamic_config.parameters.setdefault("_dingo_dataset_format", self.input_args.dataset.format)
model.dynamic_config.parameters.setdefault("_dingo_input_path", self.input_args.input_path)
setattr(model_cls, "dynamic_config", model.dynamic_config)
model = model_cls

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Modifying the class-level dynamic_config attribute globally via setattr(model_cls, "dynamic_config", model.dynamic_config) and then assigning the class back to model (model = model_cls) is extremely dangerous in concurrent or multi-threaded environments. This completely defeats the purpose of instantiating the class to avoid multi-thread config overwrite (as explicitly noted in the comment on line 177).\n\nIf multiple threads or concurrent tasks evaluate the same SciBase rules with different configurations, they will overwrite each other's dynamic_config class attribute, leading to severe race conditions and incorrect evaluations.\n\nRecommended Fix:\nInstead of using @classmethod for eval in the SciBase rules and hacking the executor to modify the class globally, define eval as a standard instance method (i.e., def eval(self, input_data: Data)) in all SciBase rule classes. This allows Model.set_config_rule(model, e_c_i.config) to set the configuration on the instance model safely and thread-safely, without any class-level modifications.

Comment thread dingo/exec/spark.py
Comment on lines +240 to +259
if getattr(model_cls, "__module__", "").startswith("dingo.model.rule.scibase."):
model = model_cls()
Model.set_config_rule(model, e_c_i.config)
if "dynamic_config" not in model.__dict__:
model.dynamic_config = model.dynamic_config.model_copy(deep=True)
if model.dynamic_config.parameters is None:
model.dynamic_config.parameters = {}
model.dynamic_config.parameters.setdefault(
"_dingo_dataset_sql_config",
self.input_args.dataset.sql_config.model_dump(),
)
model.dynamic_config.parameters.setdefault(
"_dingo_dataset_s3_config",
self.input_args.dataset.s3_config.model_dump(),
)
model.dynamic_config.parameters.setdefault("_dingo_dataset_source", self.input_args.dataset.source)
model.dynamic_config.parameters.setdefault("_dingo_dataset_format", self.input_args.dataset.format)
model.dynamic_config.parameters.setdefault("_dingo_input_path", self.input_args.input_path)
setattr(model_cls, "dynamic_config", model.dynamic_config)
model = model_cls

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Modifying the class-level dynamic_config globally via setattr(model_cls, "dynamic_config", model.dynamic_config) is not thread-safe and will cause race conditions when running concurrent Spark tasks with different configurations.\n\nRecommended Fix:\nDefine eval as an instance method in the SciBase rules and keep the configuration isolated at the instance level, avoiding class-level modifications.

Comment on lines +1474 to +1477
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Change the eval method from a @classmethod to a standard instance method. This ensures that the rule's configuration (dynamic_config) is read from the instance (self.dynamic_config) rather than the class (cls.dynamic_config), making the rule completely thread-safe and allowing us to remove the class-modifying hacks in local.py and spark.py.

Suggested change
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}
def eval(self, input_data: Data) -> EvalDetail:
del input_data
params = self.dynamic_config.parameters or {}

Comment on lines +2229 to +2232
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Change the eval method from a @classmethod to a standard instance method to ensure thread safety and instance-level configuration isolation.

Suggested change
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}
def eval(self, input_data: Data) -> EvalDetail:
del input_data
params = self.dynamic_config.parameters or {}

Comment on lines +1671 to +1674
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Change the eval method from a @classmethod to a standard instance method to ensure thread safety and instance-level configuration isolation.

Suggested change
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}
def eval(self, input_data: Data) -> EvalDetail:
del input_data
params = self.dynamic_config.parameters or {}

Comment on lines +2492 to +2495
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Change the eval method from a @classmethod to a standard instance method to ensure thread safety and instance-level configuration isolation.

Suggested change
@classmethod
def eval(cls, input_data: Data) -> EvalDetail:
del input_data
params = cls.dynamic_config.parameters or {}
def eval(self, input_data: Data) -> EvalDetail:
del input_data
params = self.dynamic_config.parameters or {}

Comment on lines +140 to +149
temp = tempfile.NamedTemporaryFile(
"w",
encoding="utf-8",
suffix=".json",
prefix="dingo_scibase_",
delete=False,
)
with temp:
json.dump(payload, temp, ensure_ascii=False, indent=2)
return Path(temp.name)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The temporary file created by tempfile.NamedTemporaryFile with delete=False is never deleted/cleaned up after the validation finishes. This results in a temporary file leak every time a SciBase rule is evaluated, which can clutter the system's temp directory and eventually exhaust disk space or inodes in a long-running production environment.\n\nRecommended Fix:\nIntroduce a context manager in report_utils.py that automatically deletes the temporary file when exiting the block, and use it in the eval methods of the rules.

Comment on lines +487 to +500
def connect_starrocks(config_path: Path):
if pymysql is None:
raise RuntimeError("pymysql is required. Install pymysql before running DB validation.")
cfg = load_config(config_path)
mysql_cfg = cfg["mysql"]
return pymysql.connect(
host=mysql_cfg["host"],
port=int(mysql_cfg["port"]),
user=mysql_cfg["user"],
password=mysql_cfg["password"],
charset=mysql_cfg.get("charset", "utf8mb4"),
connect_timeout=int(mysql_cfg.get("connect_timeout", 30)),
read_timeout=int(mysql_cfg.get("read_timeout", 180)),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Unlike meta_ebook_unique.py, meta_paper_unique.py, and meta_patent_parsed_info.py, the connect_starrocks function in union_unique_meta_data.py completely lacks retry logic for handling transient network or database connection errors. This makes the union metadata validation fragile to temporary blips in a big data environment.\n\nPlease refactor connect_starrocks in union_unique_meta_data.py to include the same robust retry logic used in the other validators.

Comment on lines +1009 to +1016
def parse_xml(raw: Any) -> ET.Element:
if isinstance(raw, (bytes, bytearray)):
raw = raw.decode("utf-8", errors="replace")
text = str(raw or "").strip()
if not text:
raise ValueError("empty XML content")
text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1).lstrip()
return ET.fromstring(text)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Using Python's standard xml.etree.ElementTree to parse untrusted XML data can expose the application to XML vulnerabilities such as XML External Entity (XXE) injection or entity expansion (billion laughs) attacks.\n\nIf the patent XML data in the database can come from untrusted or external sources, it is highly recommended to use defusedxml.ElementTree instead of xml.etree.ElementTree to safely parse the XML.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants