ETL Pipeline: From heterogeneous bibliographic data to a unified schema #14
Open
marioloskovic55-jpg wants to merge 2 commits into
added 2 commits
June 5, 2026 16:29
Function of the project
This pull request adds a complete Extract, Transform, Load (ETL) pipeline to bibliometrix-python, so that data is clean and standardized at the source before analysis begins.
The work addresses an existing limitation of bibliometrix-python: it works only with data downloaded from Web of Science (WoS) and runs into problems with data from PubMed, Scopus, or Dimensions because the column names differ.
The pipeline acts as a universal translator: whatever the source, the output follows the same standard schema, so every analytical function can read it.
Problems found in the current implementation
- There was no function in bibliometrix-python to load and standardize the data (convert2df exists, but only in the R version).
- The transformation logic had no clear structure, making it difficult to maintain or extend.
- Data stored in the wrong format caused crashes (for example, for authors AU and cited references CR, lists were stored as strings).
- Many functions crashed when handling missing values (NaN, None).
- The code assumed the data came from Web of Science, so data from other sources caused crashes.
- No step validated the data: it was passed directly to the analytical functions without any format check.
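Two of the problems above (lists stored as strings, and crashes on NaN or None) can be illustrated with a small sketch. The helper name to_list and the ";" separator are assumptions made for this example; the actual transformer in this pull request may handle these cases differently.

```python
import math


def to_list(value, sep=";"):
    """Turn a delimited string into a list; return [] for missing values.

    Hypothetical helper: the name and the default separator are assumptions.
    """
    # Treat None and float NaN as "missing" instead of crashing on them.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # Already a list or tuple: just clean up the items.
    if isinstance(value, (list, tuple)):
        return [str(item).strip() for item in value if str(item).strip()]
    # Otherwise split the string and drop empty fragments.
    return [part.strip() for part in str(value).split(sep) if part.strip()]


print(to_list("Smith J; Doe A"))  # a string holding two authors
print(to_list(float("nan")))      # missing value
print(to_list(None))              # missing value
```

Returning an empty list for missing values means downstream code can always iterate over AU or CR without a separate None check.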
Building
Five modules have been created within the www/services/etl folder, each with a single, specific purpose.
1. Mapping (www/services/etl/mappings/)
Different sources use different names for the same information: for example, the title of an article is labeled "Title" in Scopus and Dimensions but "TI" in Web of Science.
I created one dictionary file per source, each providing a translation table:
- scopus_mapping.py: maps Scopus column names to WoS standard tags
- dimensions_mapping.py: maps Dimensions column names to WoS standard tags
- pubmed_mapping.py: maps PubMed field tags to WoS standard tags
- openalex_mapping.py: maps OpenAlex field names to WoS standard tags
These dictionaries are plain lookup tables used by the transformer.
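As a rough illustration of how such a lookup table might be applied, the sketch below renames the fields of a single record. Only the "Title" to "TI" pair comes from the description above; the other entries, the names SCOPUS_TO_WOS and rename_fields, and the use of plain dicts instead of a DataFrame are assumptions made to keep the example dependency-free.

```python
# Hypothetical excerpt of a Scopus mapping. Only "Title" -> "TI" is taken
# from the description above; the other two pairs are assumed.
SCOPUS_TO_WOS = {
    "Title": "TI",
    "Authors": "AU",  # assumed
    "Year": "PY",     # assumed
}


def rename_fields(record, mapping):
    """Return a copy of record with keys translated through mapping.

    Keys that are not in the mapping are kept unchanged.
    """
    return {mapping.get(key, key): value for key, value in record.items()}


raw = {
    "Title": "A study of citations",
    "Authors": "Smith J; Doe A",
    "Year": 2020,
    "EID": "x1",  # not in the mapping, so it passes through as-is
}
standard = rename_fields(raw, SCOPUS_TO_WOS)
print(standard)
```

Keeping unmapped fields rather than dropping them is a design choice of this sketch, not necessarily what the transformer does.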
2. Transformer (www/services/etl/transformer.py)
This module has 5 steps:
3. Validator (www/services/etl/validator.py)
The validator checks the data before passing it to the analytical functions. Specifically, it performs three checks:
The validator prints a clear [OK] or [FAIL] for each check.
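The [OK] / [FAIL] reporting could look roughly like the sketch below. The three real checks are not listed in this description, so the two checks used here are placeholders, and run_checks is a hypothetical name rather than the validator's actual API.

```python
def run_checks(records, checks):
    """Run each named check, print [OK] or [FAIL], return True if all pass."""
    results = []
    for name, check in checks:
        passed = bool(check(records))
        print(f"[{'OK' if passed else 'FAIL'}] {name}")
        results.append(passed)
    return all(results)


# Placeholder checks, NOT the validator's real ones.
checks = [
    ("field TI present in every record",
     lambda recs: all("TI" in r for r in recs)),
    ("AU stored as a list",
     lambda recs: all(isinstance(r.get("AU"), list) for r in recs)),
]

records = [
    {"TI": "A study", "AU": ["Smith J"]},
    {"TI": "Another study", "AU": "Doe A"},  # AU is a string, so check 2 fails
]
ok = run_checks(records, checks)
```

Running every check before returning, instead of stopping at the first failure, lets the report show all problems in one pass.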
4. API Retriever (www/services/etl/api_retriever.py)
This module retrieves data over the internet, with no manual download required.
It draws on two free public databases:
5. Standardizer (www/services/etl/standardizer.py)
This is the equivalent of the convert2df function in R and is the entry point of the pipeline.
It is the only function you need to call.
It works in two modes:
Internally, convert2df() coordinates all the other modules in order: Extract -> Transform -> Validate. The user does not need to know anything about the internal modules.
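The Extract -> Transform -> Validate ordering can be sketched as a single entry-point function. Everything here is a simplified stand-in: convert2df_sketch, extract, transform, and validate are hypothetical names with assumed signatures, and the real convert2df() in standardizer.py may take different arguments and return a different type.

```python
def extract(source):
    """Hypothetical extract step: the source is already a list of raw records."""
    return list(source)


def transform(records, mapping):
    """Hypothetical transform step: rename fields to the standard tags."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in records]


def validate(records, required=("TI",)):
    """Hypothetical validate step: every record has every required field."""
    return all(field in r for r in records for field in required)


def convert2df_sketch(source, mapping):
    """Single entry point that runs the three stages in order."""
    raw = extract(source)
    standardized = transform(raw, mapping)
    if not validate(standardized):
        raise ValueError("standardized records failed validation")
    return standardized


result = convert2df_sketch([{"Title": "A study"}], {"Title": "TI"})
print(result)
```

The caller only touches the entry point; the three stage functions stay internal, which mirrors the intent described above.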
Modifications to Existing Files
www/services/__init__.py
Originally, the file imported all modules using wildcards. This caused problems because many additional packages (such as selenium, pyvis, etc.) had to be installed even when only the ETL pipeline was needed. The imports were therefore replaced with a comment, so that modules can still be imported when necessary.
Validation results
The pipeline passed validation in tests on data from 3 different sources:
All three sources passed full validation without any crashes.
Demo notebook
The demo_etl_pipeline.ipynb file walks step by step from the raw, pre-transformation data through the pipeline, displaying the standardized output after each stage.