Final project submission alessandro - sami#11
aleximmediata wants to merge 1 commit into
From Heterogeneous Bibliographic Data to a Unified Schema
A Python ETL Pipeline for Bibliometric Analysis with Bibliometrix-Python
Data Science Course – A.Y. 2025/2026
Prof. Vincenzo Moscato
1. Introduction
In the field of bibliometric analysis, Bibliometrix is one of the most widely used tools for science mapping and the quantitative analysis of scientific literature. The library is designed to import data from heterogeneous sources and convert them into a common schema based on Web of Science Field Tags, making it possible to compare disparate datasets and perform consistent analyses across authors, journals, references, keywords, and citations.
The present work aimed to build a Python ETL pipeline capable of automatically retrieving bibliographic records from OpenAlex and PubMed, transforming them into a standardized format compatible with Bibliometrix-Python, and integrating them into a Shiny interface for interactive use. OpenAlex supports record retrieval via the search parameter, while PubMed exposes the NCBI E-utilities, specifically ESearch for obtaining identifiers and EFetch for retrieving complete records.
The underlying goal was not merely to download bibliographic data, but to construct a complete process: extraction, transformation, validation, merging of distinct collections, and final use within the analytical functions. During development, it became apparent that some functions could be made compatible with both sources, while others remained operational on only one of them, primarily due to structural differences between formats and time constraints.
2. Project Objectives
The primary objective was to move from heterogeneous bibliographic data to a unified schema, enabling bibliometric analyses to be performed without rewriting source-specific code for each data provider. In practice, the project sought to replicate in Python the logic handled in Bibliometrix by the convert2df() function, namely the conversion of datasets from different sources into a data frame with standardized bibliographic tags.
The key goals of the work were:
The project therefore required not only a data acquisition component, but also substantial structural normalization work, as the datasets obtained from different APIs were not fully aligned with the schema used by the initial sample dataset.
3. System Architecture
The system is organized into separate modules, each with a well-defined responsibility. The separation between extraction, transformation, and presentation made it possible to modify each phase independently, reducing coupling between components.
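Description of main modules:
app.py: Shiny interface through which the user starts retrieval, merges collections, and downloads the result.
api_retriever.py: central dispatcher that performs the API calls to OpenAlex and PubMed.
standardizer.py: transformation and validation of raw records into the common schema.
get_*.py: analytical functions that consume the standardized dataset.
The main data flow follows a linear sequence: the user initiates data retrieval from the interface, api_retriever.py performs the API call for the selected source, standardizer.py transforms the raw result into the common schema, and the standardized dataset is made available to the analytical functions.
4. Extraction Phase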
The extraction phase was implemented in the api_retriever.py module, which acts as the central dispatcher for supported sources. The public function retrieve() routes the request to either OpenAlex or PubMed, while source-specific functions handle the technical details of the individual endpoints.
4.1 OpenAlex
For OpenAlex, the pipeline uses the search parameter to query the /works endpoint. This parameter searches across titles, abstracts, and full text, allowing the most relevant works to be retrieved for a given query. A distinctive feature of OpenAlex is its representation of abstracts as an abstract_inverted_index, an inverted index of words rather than continuous text, which required the implementation of a dedicated reconstruction function.
4.2 PubMed
For PubMed, retrieval is performed through the NCBI E-utilities. ESearch converts a text query into a list of identifiers (PMIDs), while EFetch returns the complete records in XML or structured text format. PubMed is part of the NCBI Entrez system and provides public programmatic access to its data via REST API.
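The two-step ESearch and EFetch flow described above can be sketched as plain URL construction, without performing any network call. This is a minimal illustration and not the code of api_retriever.py: the helper names esearch_url, efetch_url, and pmids_from_esearch are hypothetical, and the default retmax of 20 is an arbitrary assumption.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query: str, retmax: int = 20) -> str:
    """Build the ESearch URL that turns a text query into a list of PMIDs (JSON)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def pmids_from_esearch(payload: dict) -> list[str]:
    """Extract the PMID list from a decoded ESearch JSON response."""
    return payload.get("esearchresult", {}).get("idlist", [])


def efetch_url(pmids: list[str]) -> str:
    """Build the EFetch URL that returns the complete records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Usage: the identifiers returned by the first call feed the second one.
sample_response = {"esearchresult": {"idlist": ["111", "222"]}}
print(esearch_url("bibliometric analysis"))
print(efetch_url(pmids_from_esearch(sample_response)))
```

Keeping URL construction separate from the HTTP call makes the request logic checkable offline, which fits the separation of concerns the pipeline aims for.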
4.3 Separation of Extraction and Transformation
This separation made the system modular and allowed each phase to be modified independently without disrupting the overall pipeline. In particular, it was possible to test the standardization component independently from extraction, by loading local files or test datasets.
5. Schema Standardization
The transformation and validation phase was centralized in standardizer.py, which represents the critical point of compatibility between heterogeneous sources and the analytical functions. The guiding principle was to transform records from OpenAlex and PubMed into a single structure based on the most relevant bibliographic field tags, a format aligned with the standard used by Bibliometrix.
5.1 Source Mappings
The module defines distinct mappings for Scopus, Dimensions, PubMed, OpenAlex, and Web of Science. Each mapping associates the raw columns of the source with the corresponding WoS tags: for example, Title → TI, Authors → AU, Abstract → AB, Author Keywords → DE, Cited references → CR, Times cited → TC.
The dispatch(source) function selects the correct mapping based on the specified source and constitutes the first step of standardization. This approach is essential for reducing structural differences between sources that, despite sharing the same bibliographic purpose, adopt entirely different conventions.
5.2 Column Renaming and Deduplication
After mapping, rename_columns() renames the raw columns to WoS tags and handles cases where multiple source fields converge on the same tag. This issue arose most prominently with PubMed, where different fields may map to the same identifier: for example, TA and JT both mapping to SO, or LID and AID both mapping to DI.
Without this deduplication step, Pandas could produce non-scalar objects or raise errors when downstream pipeline components expected single values per column.
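The convergence problem can be illustrated on a single record represented as a plain dictionary, which stands in here for a DataFrame row. This is a sketch under assumptions, not the body of rename_columns(): the mapping contains only the four PubMed fields named above, rename_record is a hypothetical helper, and the rule "first non-empty value wins" is one possible resolution policy.

```python
# Illustrative subset of a PubMed mapping: two pairs of source fields
# converge on the same WoS tag.
PUBMED_MAPPING = {"TA": "SO", "JT": "SO", "LID": "DI", "AID": "DI"}


def rename_record(raw: dict, mapping: dict) -> dict:
    """Rename source fields to WoS tags, keeping one scalar value per tag."""
    renamed = {}
    for field, value in raw.items():
        tag = mapping.get(field, field)  # unmapped fields keep their name
        # Keep the first non-empty value seen for each tag.
        if tag not in renamed or renamed[tag] in ("", None):
            renamed[tag] = value
    return renamed


raw_record = {
    "TA": "",
    "JT": "Journal of Informetrics",
    "LID": "10.1000/xyz",
    "AID": "10.1000/other",
}
print(rename_record(raw_record, PUBMED_MAPPING))
```

In a DataFrame the same situation appears as two columns with the same name after renaming, so that selecting the column returns a frame instead of a series; coalescing the duplicates into a single column avoids that.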
5.3 Type Enforcement
The enforce_types() function applies a uniform type contract to the final dataset. Multi-value columns such as AU, AF, C1, CR, DE, and ID are converted to list[str]; text fields such as TI, SO, AB, DT, LA, and RP are normalized to strings; PY is reduced to a four-digit year; and TC is cast to integer.
This step carries significant practical value: many analytical functions fail not because data is absent, but because the type does not match the expected format. Enforcing a uniform type contract substantially increased the reliability of the pipeline.
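The type contract can be sketched on a single record as follows. The column groups come from the description above; everything else is an assumption made for illustration: the helper names, the use of ";" as the separator for multi-value strings, the representation of PY as a four-character string, and the fallback of 0 for an unparseable citation count.

```python
import re

LIST_COLS = ("AU", "AF", "C1", "CR", "DE", "ID")
TEXT_COLS = ("TI", "SO", "AB", "DT", "LA", "RP")


def to_list(value) -> list[str]:
    """Coerce a scalar, a ';'-separated string, or a sequence to list[str]."""
    if value is None:
        return []
    if isinstance(value, str):
        parts = value.split(";")  # assumed separator
    elif isinstance(value, (list, tuple, set)):
        parts = list(value)
    else:
        parts = [value]
    return [str(p).strip() for p in parts if str(p).strip()]


def to_year(value) -> str:
    """Extract the first standalone four-digit group, or '' if none is found."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", str(value))
    return match.group(1) if match else ""


def to_int(value) -> int:
    """Cast to int, falling back to 0 for missing or malformed values."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return 0


def enforce_types(record: dict) -> dict:
    """Return a copy of the record that satisfies the type contract."""
    out = dict(record)
    for col in LIST_COLS:
        if col in out:
            out[col] = to_list(out[col])
    for col in TEXT_COLS:
        if col in out:
            out[col] = "" if out[col] is None else str(out[col]).strip()
    if "PY" in out:
        out["PY"] = to_year(out["PY"])
    if "TC" in out:
        out["TC"] = to_int(out["TC"])
    return out


print(enforce_types({"AU": "Rossi M; Bianchi L", "PY": "2021-05-03", "TC": "12"}))
```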
5.4 Null Value Handling
The handle_nulls() function replaces missing values with coherent defaults: empty lists for multi-value columns, empty strings for text fields, and 0 for integer columns. Additionally, if a mandatory column is not present in the dataset, it is added automatically. Formal schema validity was treated as a prerequisite for analytical use.
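A minimal sketch of these defaults on a single record is given below. The defaults themselves follow the description above; treating the three column groups as the full set of mandatory columns, and treating float NaN as missing, are assumptions of the sketch.

```python
import math

LIST_COLS = ("AU", "AF", "C1", "CR", "DE", "ID")
TEXT_COLS = ("TI", "SO", "AB", "DT", "LA", "RP")
INT_COLS = ("TC",)


def is_missing(value) -> bool:
    """True for None and for float NaN, the usual missing-value markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def handle_nulls(record: dict) -> dict:
    """Fill missing values and add absent columns with type-appropriate defaults."""
    out = dict(record)
    for col in LIST_COLS:
        if is_missing(out.get(col)):
            out[col] = []  # a fresh list per record, never a shared object
    for col in TEXT_COLS:
        if is_missing(out.get(col)):
            out[col] = ""
    for col in INT_COLS:
        if is_missing(out.get(col)):
            out[col] = 0
    return out


filled = handle_nulls({"TI": None, "TC": float("nan"), "AU": ["Rossi M"]})
print(filled["TI"] == "", filled["TC"], filled["AU"], filled["CR"])
```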
A relevant aspect of standardization concerns the computation of the SR (Short Reference) field. The module first attempts to use an upstream utility make_short_reference, and falls back to a locally constructed reference derived from author, year, and journal.
This field is required by citation analysis functions and by any function that needs a concise synthetic key to uniquely identify records.
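The try-upstream-then-fall-back logic can be sketched as follows. Since the import path and signature of make_short_reference are not stated here, the sketch receives the upstream utility as an optional callable argument. The helper names and the output format "FIRST AUTHOR, YEAR, JOURNAL" in upper case are assumptions.

```python
def local_short_reference(record: dict) -> str:
    """Build a short reference from first author, year, and journal."""
    authors = record.get("AU") or []
    first_author = authors[0] if authors else ""
    parts = [first_author, str(record.get("PY") or ""), record.get("SO") or ""]
    # Skip empty components so that no dangling separators are produced.
    return ", ".join(p.strip().upper() for p in parts if p.strip())


def compute_sr(record: dict, upstream=None) -> str:
    """Prefer the upstream utility when it works, otherwise fall back locally."""
    if upstream is not None:
        try:
            reference = upstream(record)
            if reference:
                return reference
        except Exception:
            pass  # any upstream failure triggers the local fallback
    return local_short_reference(record)


record = {"AU": ["Aria M"], "PY": "2017", "SO": "Journal of Informetrics"}
print(compute_sr(record))
```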
5.6 Final Validation
Before the dataset is returned, validate() checks that all mandatory columns are present, that no null values exist in key columns, and that list-type fields contain actual Python list objects.
This final check made it possible to intercept the majority of errors before they propagated to the analytical functions.
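The three checks can be sketched on a list of records as shown below. The specific contents of MANDATORY and KEY_COLS are illustrative assumptions, as is the choice to collect error messages in a list instead of raising on the first failure.

```python
MANDATORY = ("AU", "TI", "SO", "PY", "DI", "TC", "SR")  # assumed set
KEY_COLS = ("TI", "PY")                                 # assumed set
LIST_COLS = ("AU", "AF", "C1", "CR", "DE", "ID")


def validate(records: list[dict]) -> list[str]:
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for i, record in enumerate(records):
        for col in MANDATORY:
            if col not in record:
                errors.append(f"record {i}: mandatory column {col} is missing")
        for col in KEY_COLS:
            if col in record and record[col] is None:
                errors.append(f"record {i}: key column {col} is null")
        for col in LIST_COLS:
            if col in record and not isinstance(record[col], list):
                errors.append(f"record {i}: column {col} is not a list")
    return errors


bad = {"AU": "Rossi M", "TI": None, "SO": "s", "PY": "2020", "DI": "", "TC": 0}
for message in validate([bad]):
    print(message)
```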
5.7 Collection Merging and DOI-Based Deduplication
In addition to single-dataset standardization, the function merge_collections(df1, df2, dedup_field="UT") was implemented to merge two already-standardized collections. The merge is performed after validating both inputs and includes DOI-based deduplication where available.
A critical issue that emerged during development concerns the normalization of the DI (DOI) field: OpenAlex returns DOIs as full URLs (e.g., https://doi.org/10.xxxx/yyy), while PubMed provides them as plain strings (e.g., 10.xxxx/yyy). Without a normalization step, the same article would be treated as two distinct records in a direct comparison.
This made it possible to build a unified final dataset from different sources while maintaining control over record quality and duplicate elimination.
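The normalization and the DOI-based deduplication can be sketched as follows, using lists of dictionaries in place of the DataFrames that merge_collections() receives. The names normalize_doi and merge_records are hypothetical. Lower-casing relies on DOIs being case-insensitive; stripping a "doi:" prefix in addition to the URL prefix is an extra assumption.

```python
import re

# Matches a leading resolver URL or "doi:" label in front of the bare DOI.
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value) -> str:
    """Reduce a DOI given as URL or plain string to a lower-case bare DOI."""
    if not value:
        return ""
    return _DOI_PREFIX.sub("", str(value).strip()).lower()


def merge_records(first: list[dict], second: list[dict]) -> list[dict]:
    """Concatenate two collections, dropping later records with a known DOI."""
    merged, seen = [], set()
    for record in [*first, *second]:
        doi = normalize_doi(record.get("DI"))
        if doi and doi in seen:
            continue  # duplicate of a record already kept
        if doi:
            seen.add(doi)
        merged.append({**record, "DI": doi})
    return merged


openalex = [{"TI": "Paper A", "DI": "https://doi.org/10.1000/ABC"}]
pubmed = [{"TI": "Paper A", "DI": "10.1000/abc"}, {"TI": "Paper B", "DI": None}]
print(merge_records(openalex, pubmed))
```

Records without a DOI are always kept, which is the source of the residual duplicates discussed under the limitations.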
6. Application Integration
The pipeline was connected to the app.py Shiny interface to make the workflow usable interactively. In the API tab, the user selects the source, enters a query, and initiates retrieval; the system then performs extraction and standardization in sequence, updating the reactive data objects of the application.
The same logic was adopted for the merge functionality, where the application supports both local file upload and retrieval of a second collection via API. The merge is therefore not an isolated operation, but an integral part of the main workflow.
After the second source is standardized, the result is merged with the current dataset, DOIs are normalized, duplicates are removed, and the final dataset is saved to the global reactive object and made available for CSV download.
7. Analytical Functions
After standardization, the dataset is consumed by a set of analytical functions distributed across separate modules, including get_annualproduction.py, get_frequentwords.py, get_wordcloud.py, get_relevantauthors.py, get_bradfordlaw.py, get_averagecitations.py, and others. These modules represent the actual analysis phase, in which standardized data is transformed into tables, indicators, and charts.
The main issue encountered was that not all functions could operate on the unified schema in the same way they had on the original sample dataset. It was therefore necessary to modify the data access logic, introducing dynamic type checks on the input variable.
This modification allowed many functions to accept both a direct DataFrame and a reactive wrapper object from the application. It was a significant fix, as the transition from a test dataset to real API data had exposed several points where the code was still tightly coupled to the original structure.
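A dynamic check of this kind can be sketched as follows. The sketch assumes that the reactive wrapper is callable and returns the underlying dataset when called, as reactive values and calcs in Shiny for Python do, while a plain dataset is not callable. A list of dictionaries stands in for the DataFrame, and the names resolve_data and annual_production are hypothetical.

```python
from collections import Counter


def resolve_data(data):
    """Return the dataset itself, unwrapping a callable reactive-style wrapper."""
    return data() if callable(data) else data


def annual_production(data) -> dict:
    """Count records per publication year, accepting either input form."""
    records = resolve_data(data)
    counts = Counter(record["PY"] for record in records if record.get("PY"))
    return dict(sorted(counts.items()))


records = [{"PY": "2020"}, {"PY": "2021"}, {"PY": "2020"}, {"PY": ""}]

# The same function works on the direct dataset and on a wrapper around it.
print(annual_production(records))
print(annual_production(lambda: records))
```

Placing the check in one small helper keeps each analytical function free of application-specific branching.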
In practical terms, some functions were made compatible with both sources, others operate correctly on only one of them, and a few remain specific to the original format. This situation is consistent with the scope of the project, which did not aim to rewrite all analytical functions from scratch, but rather to bring as many as possible into a unified, reusable schema.
8. Compatibility with PubMed and OpenAlex
One of the most delicate aspects of the project was achieving alignment between PubMed and OpenAlex. Even though both sources were standardized toward the WoS schema, in practice not all information is available in the same way, and some fields are populated differently between the two.
For this reason, the work followed a progressive compatibility approach:
PubMed and OpenAlex are not equivalent datasets: they have different information models, varying degrees of completeness, and different conventions for authors, abstracts, references, and identifiers. Specifically:
9. Debugging and Schema Adaptation
A considerable portion of the work was dedicated to debugging. The transition from a sample schema to a standardized API-based schema produced several tracebacks, which proved useful in identifying points where the code was still relying on overly rigid assumptions about data structure. This led to modifications in the data access logic and the introduction of more flexible checks on variable content.
In a project of this nature, debugging is not a marginal phase but the point at which it becomes clear whether standardization is truly sufficient to support downstream analyses. Each traceback provided direct feedback on the robustness of the schema and guided the incremental refinement of the pipeline.
10. Limitations and Future Work
10.1 Current Limitations
Despite the results achieved, the pipeline presents several limitations that warrant explicit documentation.
Unavailability of cited references. The CR field, which is fundamental for co-citation analysis and bibliographic network construction, is not accessible through PubMed's public API and is frequently incomplete in OpenAlex. This significantly constrains the applicability of analytical functions that depend on this field.
Uneven field coverage. Some columns of the WoS schema cannot be populated uniformly from both sources. For example, author affiliations (C1) have different syntactic structures in OpenAlex and PubMed, and the parsers currently implemented do not guarantee complete normalization in all cases.
Partially compatible analytical functions. Not all analytical functions have been adapted to work interchangeably with both sources. Some remain operational only with the original sample dataset or with a single API, which reduces the generalizability of the system.
Incomplete API error handling. Error handling for API calls is functional but not exhaustive: rate limiting, timeouts, and malformed responses are caught, but diagnostic messages are not always sufficiently detailed for the end user.
Deduplication limited to DOI. The deduplication strategy is based exclusively on the DI (DOI) field. Records without a DOI are not deduplicated, which can result in residual duplicates when merging collections from different sources.
10.2 Possible Future Developments
Based on the limitations identified above, several improvement paths can be outlined for future versions of the system.
Fuzzy deduplication. Introduction of a text-similarity-based deduplication mechanism for the TI (title) field for records lacking a DOI, using libraries such as rapidfuzz or sentence-transformers.
Support for additional sources. Extension of the pipeline to further bibliographic sources such as Semantic Scholar, Crossref, or Europe PMC, each with its own mapping toward the WoS schema.
Completion of analytical functions. Adaptation of currently partial functions, particularly those requiring cited references, toward alternative solutions that leverage the citation data available in OpenAlex.
API response caching. Implementation of a local cache layer to avoid repeated calls for the same query, reducing loading times and the risk of exceeding API rate limits.
Automated testing. Construction of a unit test suite for the api_retriever.py and standardizer.py modules, to ensure schema stability across varying sources and API versions.
11. Overall Results
The final result is a complete ETL pipeline for bibliographic data, capable of retrieving records from OpenAlex and PubMed, standardizing them into a common schema, and making them available within an interactive analytical environment. The integration into the application made the workflow more accessible, supporting both direct data retrieval and the merging of distinct collections with downloadable output.
From a methodological standpoint, the project highlights that the central challenge is not merely acquiring bibliographic data, but building a process capable of harmonizing heterogeneous sources and adapting analyses to a reliable and consistent schema. The experience demonstrated that moving from a sample dataset to real-world API data requires systematic work on normalization, type enforcement, null handling, deduplication, and alignment of internal functions.
12. Conclusion
The present project has produced a solid foundation for bibliometric analysis in Python, building a bridge between heterogeneous sources and analytical functions constructed on a unified schema. The most demanding phase was the adaptation of the data schema and the debugging of the functions, particularly during the transition from a test dataset to records retrieved from PubMed and OpenAlex.
Although some functions remain fully compatible with only one of the two sources, the work accomplished has achieved its primary objective: creating a reproducible, extensible pipeline that is sufficiently robust to support a significant set of bibliometric analyses in a real-world context.
Summary of Key Code Snippets