Note that this work is still in a branch: https://github.com/VEuPathDB/expression-shepherd/tree/org-name-standardisation
The initial AI-assisted cleaning of institution/organisation name, country and institution-type fields revealed several misclassifications and edge cases, particularly involving organisations with similar names in multiple countries or with overseas campuses. Email addresses were not used in this first pass, which likely contributed to several incorrect inferences about user location and affiliation. These cases should be documented to inform future improvements to prompting and post-processing logic.
Potential rule improvements:
- For academic users:
  - Defer to the organisation-based country unless strong signals suggest otherwise.
  - Explicitly prompt for the desired behaviour regarding university campuses abroad.
  - Incorporate email domain heuristics (.ac.uk, .edu.cn, gmail.com, etc.) in decision-making.
  - For students, prefer the organisation's country unless a foreign campus or email domain suggests otherwise.
  - For PIs, consider resolving via known lab webpages.
- For industry users:
  - Defer to the user-specified country; optionally validate against known subsidiaries or global office locations.
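The rules above could be prototyped as a small post-processing step. The following is only a sketch under stated assumptions, not the code in the branch: the names `SUFFIX_TO_COUNTRY`, `GENERIC_DOMAINS`, `UserRecord`, `country_hint_from_email` and `resolve_country` are hypothetical, the suffix table is a tiny illustrative sample, and the `needs_review` flag is one possible way to surface conflicts rather than silently reclassifying them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical sample table: email domain suffix -> ISO 3166-1 alpha-2 code.
SUFFIX_TO_COUNTRY = {
    ".ac.uk": "GB",
    ".edu.cn": "CN",
    ".lk": "LK",
    ".co": "CO",
}

# Generic webmail domains are assumed to carry no location signal.
GENERIC_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com"}


def country_hint_from_email(email: str) -> Optional[str]:
    """Return a country code suggested by the email domain, or None."""
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in GENERIC_DOMAINS:
        return None
    # Check longer suffixes first so the most specific entry wins.
    for suffix in sorted(SUFFIX_TO_COUNTRY, key=len, reverse=True):
        if domain.endswith(suffix):
            return SUFFIX_TO_COUNTRY[suffix]
    return None


@dataclass
class UserRecord:
    user_country: Optional[str]  # country entered by the user
    org_country: Optional[str]   # country inferred for the organisation
    org_type: str                # "academic" or "industry"
    email: str


def resolve_country(rec: UserRecord) -> Tuple[Optional[str], bool]:
    """Return (country, needs_review) for one user record."""
    if rec.org_type == "industry":
        # Industry: defer to the user-specified country.
        return rec.user_country, False

    hint = country_hint_from_email(rec.email)
    conflict = (
        rec.user_country is not None
        and rec.org_country is not None
        and rec.user_country != rec.org_country
    )
    if conflict and hint == rec.user_country:
        # Strong signal: the email domain agrees with the user's country.
        return rec.user_country, False
    if conflict:
        # No supporting signal: defer to the organisation, but flag it.
        return rec.org_country, True
    return rec.org_country or rec.user_country, False
```

Under these assumptions, an academic record with user country LK, organisation country GB and a `.lk` address keeps LK, while the same record with a gmail.com address falls back to GB and is flagged for manual review. Signals such as undergraduate status or known overseas campuses are not modelled here and would need their own inputs.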
Key issues observed:
- Org ambiguity within same country code:
  - UTB users (country=CO) were flagged due to ambiguity with a Spanish institution of the same acronym, even though their .co email addresses support Colombia.
- University campuses abroad:
  - Edge Hill University students in Sri Lanka (LK) were reclassified as GB. Their names, email domains (gmail, .lk), and undergraduate status support retaining LK.
  - Similar cases: Univ Strathclyde, Duke, etc., for students based in China.
- Multinational companies:
- Academic edge cases:
  - Univ Melbourne listing countries like AX, CN, US: likely legitimate cases of international or remote students.
  - Imperial College London students listing CN remain unclear: potentially remote study, international placements, or misclassification.
- Name collisions: