
Org-name standardisation misclassifications and opportunities for refinement #7

@bobular

Note that this work is still on a branch: https://github.com/VEuPathDB/expression-shepherd/tree/org-name-standardisation

The initial AI-assisted cleaning of institution/organisation name, country and institution-type fields revealed several misclassifications and edge cases, particularly involving organisations with similar names in multiple countries or with overseas campuses. Email addresses were not used in this first pass, which likely contributed to several incorrect inferences about user location and affiliation. These cases should be documented to inform future improvements to prompting and post-processing logic.

Key issues observed:

  • Org ambiguity within same country code:

    • Users from UTB (country=CO) were flagged because a Spanish institution shares the same acronym, even though their .co email domains support Colombia.
  • University campuses abroad:

    • Edge Hill University students in Sri Lanka (LK) were reclassified as GB. Their names, email domains (gmail.com, .lk), and undergraduate status support retaining LK.
    • Similar misclassifications occurred for students based in China who listed Univ Strathclyde, Duke, etc.
  • Multinational companies:

    • The AI inferred the country of the company headquarters (e.g., Novonesis, BASF, BioNTech), overriding user-provided countries such as JP, US, PT.
  • Academic edge cases:

    • Students from Univ Melbourne listing countries like AX, CN, US — likely legitimate cases of international or remote students.
    • Imperial College London students listing CN remain unclear — potentially remote study, international placements, or misclassification.
  • Name collisions:

    • Institutions like “Northeastern University” appear in both US and CN, leading to ambiguity in country assignment.
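
To help document these cases, the disagreements between the user-provided and AI-inferred country could be grouped per organisation. The following is only a minimal sketch: the record keys `org`, `user_country` and `ai_country` are hypothetical and do not reflect the actual field names used in the branch.

```python
from collections import defaultdict


def find_country_conflicts(records):
    """Group records whose AI-inferred country differs from the user-provided one.

    Each record is a dict with the hypothetical keys 'org', 'user_country'
    and 'ai_country'. Returns {org: sorted list of (user_country, ai_country)}.
    """
    conflicts = defaultdict(set)
    for rec in records:
        if rec["user_country"] != rec["ai_country"]:
            conflicts[rec["org"]].add((rec["user_country"], rec["ai_country"]))
    return {org: sorted(pairs) for org, pairs in conflicts.items()}


# Illustrative records only, modelled on the cases described above.
example_records = [
    {"org": "Edge Hill University", "user_country": "LK", "ai_country": "GB"},
    {"org": "BASF", "user_country": "US", "ai_country": "DE"},
    {"org": "BASF", "user_country": "PT", "ai_country": "DE"},
    {"org": "Imperial College London", "user_country": "GB", "ai_country": "GB"},
]
conflicts = find_country_conflicts(example_records)
```

Organisations with many distinct pairs (for example multinational companies) would then stand out as candidates for a dedicated rule or a manual review.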

Potential rule improvements:

  • For academic users:

    • Defer to organisation-based country unless strong signals suggest otherwise.
    • Specifically prompt desired behaviour regarding university campuses abroad.
    • Incorporate email domain heuristics (.ac.uk, .edu.cn, gmail.com, etc.) in decision-making.
    • For students, prefer the org’s country unless a foreign campus or email domain suggests otherwise.
    • For PIs, consider resolving via known lab webpages.
  • For industry users:

    • Defer to the user-specified country; optionally validate against known subsidiaries or global office locations.
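
One possible reading of these rules as post-processing logic is sketched below. It is an illustration under assumptions, not the implementation in the branch: the function names, the `user_type` values, the list of generic email providers and the example addresses are all hypothetical. It also treats any two-letter top-level domain as a country hint, which is too coarse for ccTLDs that are widely used generically (e.g. .io, .ai, and to some extent .co).

```python
# Assumption: mail providers whose domain says nothing about location.
GENERIC_PROVIDERS = {"gmail.com", "outlook.com", "hotmail.com", "yahoo.com"}


def email_country_hint(email):
    """Return a two-letter country guess from the email domain, or None."""
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in GENERIC_PROVIDERS:
        return None
    tld = domain.rsplit(".", 1)[-1]
    if tld == "uk":
        return "GB"  # the .uk ccTLD maps to ISO code GB
    if len(tld) == 2 and tld.isalpha():
        return tld.upper()  # assumes the ccTLD equals the ISO code
    return None  # .com, .edu, .org, ... carry no country signal here


def resolve_country(user_type, user_country, org_country, email):
    """Pick a country following the rules listed above.

    user_type is 'academic' or 'industry' (hypothetical labels).
    """
    if user_type == "industry":
        # Defer to the user-specified country when one was given.
        return user_country or org_country
    # Academic users: default to the organisation's country, but keep the
    # user-provided country when the email domain corroborates it.
    hint = email_country_hint(email)
    if hint is not None and hint == user_country:
        return user_country
    return org_country
```

With this sketch, an Edge Hill student with a .lk address keeps LK, while one with a gmail.com address falls back to GB, so gmail-only cases would still need another signal (such as an explicit prompt instruction about overseas campuses) or manual review.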


Labels: enhancement
