Final project submission alessandro - sami by aleximmediata · Pull Request #11 · PRAISELab-PicusLab/bibliometrix-python

aleximmediata · 2026-06-03T12:46:26Z

From Heterogeneous Bibliographic Data to a Unified Schema

A Python ETL Pipeline for Bibliometric Analysis with Bibliometrix-Python

Data Science Course – A.Y. 2025/2026
Prof. Vincenzo Moscato

1. Introduction

In the field of bibliometric analysis, Bibliometrix is one of the most widely used tools for science mapping and the quantitative analysis of scientific literature. The library is designed to import data from heterogeneous sources and convert them into a common schema based on Web of Science Field Tags, making it possible to compare disparate datasets and perform consistent analyses across authors, journals, references, keywords, and citations.

The present work aimed to build a Python ETL pipeline capable of automatically retrieving bibliographic records from OpenAlex and PubMed, transforming them into a standardized format compatible with Bibliometrix-Python, and integrating them into a Shiny interface for interactive use. OpenAlex supports record retrieval via the search parameter, while PubMed exposes the NCBI E-utilities — specifically ESearch for obtaining identifiers and EFetch for retrieving complete records.

The underlying goal was not merely to download bibliographic data, but to construct a complete process: extraction, transformation, validation, merging of distinct collections, and final use within the analytical functions. During development, it became apparent that some functions could be made compatible with both sources, while others remained operational on only one of them, primarily due to structural differences between formats and time constraints.

2. Project Objectives

The primary objective was to move from heterogeneous bibliographic data to a unified schema, enabling bibliometric analyses to be performed without rewriting source-specific code for each data provider. In practice, the project sought to replicate in Python the logic handled in Bibliometrix by the convert2df() function — namely, the conversion of datasets from different sources into a data frame with standardized bibliographic tags.

The key goals of the work were:

automatic data retrieval via API;
standardization of record format according to the WoS schema;
integration of data into an interactive application;
support for merging multiple collections with DOI-based deduplication;
adaptation of analytical functions to the new unified schema.

The project therefore required not only a data acquisition component, but also substantial structural normalization work, as the datasets obtained from different APIs were not fully aligned with the schema used by the initial sample dataset.

3. System Architecture

The system is organized into separate modules, each with a well-defined responsibility. The separation between extraction, transformation, and presentation made it possible to modify each phase independently, reducing coupling between components.

┌─────────────────────────────────────────────────────────────┐
│                        app.py (Shiny)                       │
│         User interface – reactive data management          │
└───────────────┬─────────────────────────┬───────────────────┘
                │                         │
                ▼                         ▼
┌──────────────────────┐     ┌────────────────────────────┐
│   api_retriever.py   │     │      standardizer.py       │
│  Data extraction     │────▶│  Transformation and        │
│  OpenAlex / PubMed   │     │  WoS schema validation     │
└──────────────────────┘     └────────────┬───────────────┘
                                          │
                    ┌─────────────────────┼──────────────────┐
                    ▼                     ▼                  ▼
        ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
        │ get_annualprod.. │  │ get_frequentwords│  │ get_bradfordlaw  │
        │ get_relevantauth │  │ get_wordcloud    │  │ get_avgcitations │
        │ ...              │  │ ...              │  │ ...              │
        └──────────────────┘  └──────────────────┘  └──────────────────┘
                    │                     │                  │
                    └─────────────────────▼──────────────────┘
                                ┌──────────────────┐
                                │  Output / Charts  │
                                │  Tables / CSV     │
                                └──────────────────┘

Description of main modules:

Module	Responsibility
`app.py`	Shiny interface, reactive dataset management, tab routing
`api_retriever.py`	API request dispatcher for OpenAlex and PubMed
`standardizer.py`	Column mapping and renaming, type enforcement, null handling, SR computation
`get_*.py`	Analytical functions: annual production, authors, Bradford Law, word cloud, etc.

The main data flow follows a linear sequence: the user initiates data retrieval from the interface, api_retriever.py performs the API call for the selected source, standardizer.py transforms the raw result into the common schema, and the standardized dataset is made available to the analytical functions.

4. Extraction Phase

The extraction phase was implemented in the api_retriever.py module, which acts as the central dispatcher for supported sources. The public function retrieve() routes the request to either OpenAlex or PubMed, while source-specific functions handle the technical details of the individual endpoints.

4.1 OpenAlex

For OpenAlex, the pipeline uses the search parameter to query the /works endpoint. This parameter searches across titles, abstracts, and full text, allowing the most relevant works to be retrieved for a given query. A distinctive feature of OpenAlex is its representation of abstracts as an abstract_inverted_index — an inverted index of words rather than continuous text — which required the implementation of a dedicated reconstruction function.
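
The reconstruction step described above can be sketched as follows. This is a minimal illustration, not the project's actual function: the name reconstruct_abstract is hypothetical, and the only assumption is that the index maps each word to the list of positions at which it occurs.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex-style inverted index.

    The index maps each word to the list of positions where it occurs.
    (Hypothetical helper name; the project's own function may differ.)
    """
    if not inverted_index:
        return ""
    # Flatten the index into (position, word) pairs.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    # Sorting by position restores the original word order.
    positioned.sort()
    return " ".join(word for _, word in positioned)


example = {
    "Bibliometric": [0],
    "analysis": [1, 4],
    "supports": [2],
    "citation": [3],
}
print(reconstruct_abstract(example))
```

A word that occurs several times simply contributes several positions, so no special handling of repeated words is needed.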

4.2 PubMed

For PubMed, retrieval is performed through the NCBI E-utilities. ESearch converts a text query into a list of identifiers (PMIDs), while EFetch returns the complete records for those identifiers in XML or in a structured text format such as MEDLINE. PubMed is part of the NCBI Entrez system, which provides public programmatic access to its data over HTTP.
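
The two-step ESearch/EFetch flow can be illustrated by building the request URLs. The sketch below only constructs URLs and performs no network call; the helper names build_esearch_url and build_efetch_url are hypothetical and are not taken from api_retriever.py, while the endpoint names and the db, term, retmax, retmode, and id parameters belong to the public E-utilities interface.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(query, max_results=100):
    """Step 1: URL that turns a text query into a list of PMIDs."""
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(pmids):
    """Step 2: URL that fetches the full records for the given PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": "xml",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search_url = build_esearch_url("machine learning", max_results=500)
fetch_url = build_efetch_url([123, 456])
print(search_url)
print(fetch_url)
```

In a real retriever the identifiers passed to the second step would come from the parsed ESearch response rather than from a literal list.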

4.3 Separation of Extraction and Transformation

raw_df = retrieve(query="machine learning", source="OPENALEX", max_results=500)
clean_df = standardize(raw_df, source="OPENALEX")

This separation made the system modular and allowed each phase to be modified independently without disrupting the overall pipeline. In particular, the standardization component could be tested independently of extraction by loading local files or test datasets.

5. Schema Standardization

The transformation and validation phase was centralized in standardizer.py, which represents the critical point of compatibility between heterogeneous sources and the analytical functions. The guiding principle was to transform records from OpenAlex and PubMed into a single structure based on the most relevant bibliographic field tags — a format aligned with the standard used by Bibliometrix.

5.1 Source Mappings

The module defines distinct mappings for Scopus, Dimensions, PubMed, OpenAlex, and Web of Science. Each mapping associates the raw columns of the source with the corresponding WoS tags: for example, Title → TI, Authors → AU, Abstract → AB, Author Keywords → DE, Cited references → CR, Times cited → TC.

SOURCE_MAPPINGS = {
    "SCOPUS":          SCOPUS_MAPPING,
    "DIMENSIONS":      DIMENSIONS_MAPPING,
    "PUBMED":          PUBMED_MAPPING,
    "OPENALEX":        OPENALEX_MAPPING,
    "WEB_OF_SCIENCE":  WOS_MAPPING,
    "WOS":             WOS_MAPPING,
}

The dispatch(source) function selects the correct mapping based on the specified source and constitutes the first step of standardization. This approach is essential for reducing structural differences between sources that, despite sharing the same bibliographic purpose, adopt entirely different conventions.
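
A minimal sketch of this lookup is shown below. The two mappings are reduced placeholders written for illustration (the raw column names on the left are assumptions, not the project's actual mapping tables), and the key normalization and error message are likewise assumptions about how dispatch might behave.

```python
# Placeholder mappings (raw column -> WoS tag); the real tables are larger.
PUBMED_MAPPING = {"TI": "TI", "AU": "AU", "TA": "SO", "JT": "SO",
                  "LID": "DI", "AID": "DI"}
OPENALEX_MAPPING = {"title": "TI", "publication_year": "PY",
                    "doi": "DI", "cited_by_count": "TC"}

SOURCE_MAPPINGS = {
    "PUBMED": PUBMED_MAPPING,
    "OPENALEX": OPENALEX_MAPPING,
}


def dispatch(source):
    """Return the column mapping registered for the given source name."""
    # Normalize the key so that " openalex " and "OpenAlex" both resolve.
    key = str(source).strip().upper().replace(" ", "_")
    try:
        return SOURCE_MAPPINGS[key]
    except KeyError:
        supported = ", ".join(sorted(SOURCE_MAPPINGS))
        raise ValueError(
            f"Unsupported source {source!r}; expected one of: {supported}"
        ) from None


mapping = dispatch(" openalex ")
print(mapping["doi"])
```

Failing loudly on an unknown source keeps a typo in the source name from silently producing an unmapped dataset.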

5.2 Column Renaming and Deduplication

After mapping, rename_columns() renames the raw columns to WoS tags and handles cases where multiple source fields converge on the same tag. This issue arose most prominently with PubMed, where different fields may map to the same identifier — for example, TA and JT both mapping to SO, or LID and AID both mapping to DI.

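# Collapse columns that ended up with the same WoS tag after renaming.
if df.columns.duplicated().any():
    deduped = {}
    for col in df.columns.unique():
        group = df.loc[:, df.columns == col]
        if group.shape[1] > 1:
            # Keep the first usable value in each row; NaN (v != v) is skipped
            # so that it cannot mask a real value in a later column.
            deduped[col] = group.apply(
                lambda row: next(
                    (v for v in row
                     if v is not None
                     and not (isinstance(v, float) and v != v)
                     and str(v).strip() != ""),
                    "",
                ),
                axis=1,
            )
        else:
            # (Assumed completion) a unique label keeps its single column.
            deduped[col] = group.iloc[:, 0]
    # (Assumed completion) rebuild the frame with one column per tag.
    df = pd.DataFrame(deduped)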

Without this deduplication step, selecting a duplicated label such as df["SO"] returns a DataFrame rather than a Series, so downstream components that expect a single value per column either raise errors or receive non-scalar objects.

5.3 Type Enforcement

The enforce_types() function applies a uniform type contract to the final dataset. Multi-value columns such as AU, AF, C1, CR, DE, ID are converted to list[str]; text fields such as TI, SO, AB, DT, LA, RP are normalized to strings; PY is reduced to a four-digit year; and TC is cast to integer.

for col in LIST_COLUMNS:
    if col in df.columns:
        delim = delimiters.get(col, ";")
        df[col] = df[col].apply(lambda v, d=delim: _split_to_list(v, d))

for col in STR_COLUMNS:
    if col not in df.columns:
        continue
    if col == "PY":
        df[col] = df[col].apply(_extract_year)
    else:
        df[col] = df[col].apply(_to_safe_str)

for col in INT_COLUMNS:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)

This step carries significant practical value: many analytical functions fail not because data is absent, but because the type does not match the expected format. Enforcing a uniform type contract substantially increased the reliability of the pipeline.

5.4 Null Value Handling

The handle_nulls() function replaces missing values with coherent defaults: empty lists for multi-value columns, empty strings for text fields, and 0 for integer columns. Additionally, if a mandatory column is not present in the dataset, it is added automatically. Formal schema validity was treated as a prerequisite for analytical use.
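
The default-filling rule can be sketched as follows. The column groups are shortened for illustration and the function body is an assumption about how handle_nulls might be written, not a copy of the module.

```python
import pandas as pd

# Shortened, illustrative column groups.
LIST_COLUMNS = ["AU", "DE"]
STR_COLUMNS = ["TI", "SO"]
INT_COLUMNS = ["TC"]


def handle_nulls(df):
    """Fill missing values with type-appropriate defaults; add absent columns."""
    df = df.copy()
    for col in LIST_COLUMNS:
        if col not in df.columns:
            # One fresh empty list per row (never a single shared list).
            df[col] = pd.Series([[] for _ in range(len(df))],
                                index=df.index, dtype=object)
        else:
            df[col] = df[col].apply(lambda v: v if isinstance(v, list) else [])
    for col in STR_COLUMNS:
        if col not in df.columns:
            df[col] = ""
        else:
            df[col] = df[col].fillna("").astype(str)
    for col in INT_COLUMNS:
        if col not in df.columns:
            df[col] = 0
        else:
            df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
    return df


raw = pd.DataFrame({
    "TI": ["A title", None],
    "AU": [["Rossi M"], None],
    "TC": [3, None],
})
clean = handle_nulls(raw)
print(clean)
```

In this example the DE and SO columns are absent from the input and are created with empty defaults, which mirrors the automatic addition of mandatory columns described above.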

5.5 Derived Field SR

A relevant aspect of standardization concerns the computation of the SR (Short Reference) field. The module first attempts to use an upstream utility make_short_reference, and falls back to a locally constructed reference derived from author, year, and journal.

if _HAS_UPSTREAM_SR:
    df["SR"] = df.apply(_upstream_sr, axis=1)
else:
    df["SR"] = df.apply(_local_short_reference, axis=1)

This field is required by citation analysis functions and by any function that needs a concise synthetic key to uniquely identify records.
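
A local fallback of this kind could look like the sketch below. The "first author, year, source" layout in upper case follows the convention commonly seen in Bibliometrix short references, but the exact format, the ANONYMOUS placeholder, and the function name are assumptions rather than the module's actual behavior.

```python
def local_short_reference(record):
    """Build a short reference from first author, year, and source title.

    `record` is any mapping with AU (list of authors), PY, and SO keys,
    for example a plain dict or a DataFrame row.
    """
    authors = record.get("AU")
    # Placeholder used when no author is available (an assumption).
    first_author = authors[0] if authors else "ANONYMOUS"
    year = str(record.get("PY") or "").strip()
    source = str(record.get("SO") or "").strip()
    # Skip empty components so the result has no dangling separators.
    parts = [first_author, year, source]
    return ", ".join(p.upper() for p in parts if p)


row = {"AU": ["Aria M", "Cuccurullo C"], "PY": "2017",
       "SO": "Journal of Informetrics"}
print(local_short_reference(row))
```

Two different papers by the same first author in the same journal and year would receive the same key under this scheme, so a function that relies on SR as a strict identifier would need an additional disambiguating suffix.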

5.6 Final Validation

Before the dataset is returned, validate() checks that all mandatory columns are present, that no null values exist in key columns, and that list-type fields contain actual Python list objects.

missing_cols = [c for c in MANDATORY_COLUMNS if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing mandatory columns: {missing_cols}")

This final check made it possible to intercept the majority of errors before they propagated to the analytical functions.
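
The excerpt above covers only the first of the three checks. A sketch of all three together is given below; the column lists are shortened, and the names KEY_COLUMNS and LIST_COLUMNS as used here, together with the error messages, are assumptions for illustration.

```python
import pandas as pd

# Shortened, illustrative column groups.
MANDATORY_COLUMNS = ["AU", "TI", "PY", "SO"]
KEY_COLUMNS = ["TI", "PY"]
LIST_COLUMNS = ["AU"]


def validate(df):
    """Raise ValueError if the frame violates the schema; return it otherwise."""
    # 1. Every mandatory column must be present.
    missing_cols = [c for c in MANDATORY_COLUMNS if c not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing mandatory columns: {missing_cols}")
    # 2. Key columns must not contain nulls.
    null_cols = [c for c in KEY_COLUMNS if df[c].isna().any()]
    if null_cols:
        raise ValueError(f"Null values in key columns: {null_cols}")
    # 3. List-type columns must hold real Python lists in every row.
    bad_lists = [
        c for c in LIST_COLUMNS
        if c in df.columns
        and not df[c].map(lambda v: isinstance(v, list)).all()
    ]
    if bad_lists:
        raise ValueError(f"Non-list values in list columns: {bad_lists}")
    return df


good = pd.DataFrame({"AU": [["Rossi M"]], "TI": ["A title"],
                     "PY": ["2021"], "SO": ["Scientometrics"]})
validate(good)  # passes silently

bad = good.assign(AU=["Rossi M; Bianchi L"])  # string instead of list
try:
    validate(bad)
except ValueError as exc:
    print(exc)
```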

5.7 Collection Merging and DOI-Based Deduplication

In addition to single-dataset standardization, the function merge_collections(df1, df2, dedup_field="UT") was implemented to merge two already-standardized collections. The merge is performed after validating both inputs. Although the signature defaults to dedup_field="UT", the deduplication described in this section is performed on the DI (DOI) field, and only for records where a DOI is available.

A critical issue that emerged during development concerns the normalization of the DI (DOI) field: OpenAlex returns DOIs as full URLs (e.g., https://doi.org/10.xxxx/yyy), while PubMed provides them as plain strings (e.g., 10.xxxx/yyy). Without a normalization step, the same article would be treated as two distinct records in a direct comparison.

def normalize_doi(doi):
    if pd.isna(doi):
        return ""
    s = str(doi).strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/"):
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    if s in ("nan", "none", "n/a", ""):
        return ""
    return s

merged["DI"]  = merged["DI"].apply(normalize_doi)
with_doi      = merged[merged["DI"] != ""].copy()
without_doi   = merged[merged["DI"] == ""].copy()
with_doi      = with_doi.drop_duplicates(subset=["DI"], keep="first")
merged        = pd.concat([with_doi, without_doi], ignore_index=True)

This made it possible to build a unified final dataset from different sources while maintaining control over record quality and duplicate elimination.
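
A small worked example makes the effect of normalization visible. The wrapper merge_by_doi is a hypothetical name used only for this sketch, the sample records are invented, and 10.1000/abc is a placeholder DOI; normalize_doi is repeated so the block runs on its own.

```python
import pandas as pd


def normalize_doi(doi):
    """Lower-case the DOI and strip a leading resolver URL, if any."""
    if pd.isna(doi):
        return ""
    s = str(doi).strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/"):
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    return "" if s in ("nan", "none", "n/a", "") else s


def merge_by_doi(df1, df2):
    """Concatenate two collections and drop rows that repeat a DOI."""
    merged = pd.concat([df1, df2], ignore_index=True)
    merged["DI"] = merged["DI"].apply(normalize_doi)
    has_doi = merged["DI"] != ""
    # Only rows with a DOI can be compared; the rest are kept untouched.
    with_doi = merged[has_doi].drop_duplicates(subset=["DI"], keep="first")
    return pd.concat([with_doi, merged[~has_doi]], ignore_index=True)


openalex = pd.DataFrame({
    "TI": ["Paper A", "Paper B"],
    "DI": ["https://doi.org/10.1000/ABC", None],
})
pubmed = pd.DataFrame({
    "TI": ["Paper A (PubMed)", "Paper C"],
    "DI": ["10.1000/abc", ""],
})
result = merge_by_doi(openalex, pubmed)
print(result)
```

Here the PubMed copy of Paper A is dropped because both DOIs normalize to the same string, while Paper B and Paper C survive because they have no DOI to compare. Rows with a DOI are placed before rows without one, so the original row order is not preserved.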

6. Application Integration

The pipeline was connected to the app.py Shiny interface to make the workflow usable interactively. In the API tab, the user selects the source, enters a query, and initiates retrieval; the system then performs extraction and standardization in sequence, updating the reactive data objects of the application.

raw_df   = retrieve(query, source, max_res)
clean_df = standardize(raw_df, source=source)
df.set(clean_df)
api_df.set(clean_df)

The same logic was adopted for the merge functionality, where the application supports both local file upload and retrieval of a second collection via API. The merge is therefore not an isolated operation, but an integral part of the main workflow.

if source_type == "file":
    raw2 = pd.read_csv(path)
    db2  = input.merge_file_db()
else:
    raw2 = retrieve(
        query=query,
        source=input.merge_api_source(),
        max_results=input.merge_api_max()
    )
    db2 = input.merge_api_source()

After the second source is standardized, the result is merged with the current dataset, DOIs are normalized, duplicates are removed, and the final dataset is saved to the global reactive object and made available for CSV download.

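@render.download(filename="merged_collection.csv")
def download_merged_csv():
    merged = merge_result.get()
    # to_csv() returns a str when no path is given and ignores `encoding`,
    # so the BOM is added by encoding the string explicitly.
    yield merged.to_csv(index=False).encode("utf-8-sig")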

7. Analytical Functions

After standardization, the dataset is consumed by a set of analytical functions distributed across separate modules, including get_annualproduction.py, get_frequentwords.py, get_wordcloud.py, get_relevantauthors.py, get_bradfordlaw.py, get_averagecitations.py, and others. These modules represent the actual analysis phase, in which standardized data is transformed into tables, indicators, and charts.

The main issue encountered was that not all functions could operate on the unified schema in the same way they had on the original sample dataset. It was therefore necessary to modify the data access logic, introducing dynamic type checks on the input variable.

M = df if isinstance(df, pd.DataFrame) else df.get()

This modification allowed many functions to accept both a direct DataFrame and a reactive wrapper object from the application. It was a significant fix, as the transition from a test dataset to real API data had exposed several points where the code was still tightly coupled to the original structure.
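
The same idea can be wrapped in a small helper so that each analytical function does not repeat the check. In the sketch below, FakeReactiveValue is a stand-in written for this example (it only mimics the .get() method of the application's reactive object), and resolve_dataframe is a hypothetical name.

```python
import pandas as pd


class FakeReactiveValue:
    """Stand-in for a reactive wrapper: it only exposes .get()."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


def resolve_dataframe(df):
    """Accept a DataFrame or an object whose .get() returns one."""
    if isinstance(df, pd.DataFrame):
        return df
    getter = getattr(df, "get", None)
    if callable(getter):
        value = getter()
        if isinstance(value, pd.DataFrame):
            return value
    raise TypeError(
        f"Expected a DataFrame or a reactive wrapper, got {type(df).__name__}"
    )


frame = pd.DataFrame({"PY": ["2020", "2021"]})
M_direct = resolve_dataframe(frame)
M_wrapped = resolve_dataframe(FakeReactiveValue(frame))
print(M_direct is M_wrapped)
```

Compared with the one-line conditional, the helper also rejects inputs that are neither a DataFrame nor a wrapper, which tends to produce a clearer error than a failure deeper inside the analysis.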

In practical terms, some functions were made compatible with both sources, others operate correctly on only one of them, and a few remain specific to the original format. This situation is consistent with the scope of the project, which did not aim to rewrite all analytical functions from scratch, but rather to bring as many as possible into a unified, reusable schema.

8. Compatibility with PubMed and OpenAlex

One of the most delicate aspects of the project was achieving alignment between PubMed and OpenAlex. Even though both sources were standardized toward the WoS schema, in practice not all information is available in the same way, and some fields are populated differently between the two.

For this reason, the work followed a progressive compatibility approach:

some functions were adapted to work with both sources;
others remained operational on only one;
some remain potentially extensible but were not completed due to time constraints.

PubMed and OpenAlex are not equivalent datasets: they have different information models, varying degrees of completeness, and different conventions for authors, abstracts, references, and identifiers. Specifically:

the CR (cited references) field is frequently absent or incomplete in OpenAlex and is not accessible via PubMed's public API, significantly limiting co-citation analyses;
the C1 (author affiliations) field has different syntactic structures in the two sources, requiring source-specific parsers that do not guarantee full normalization in all cases;
the AF (full author names) field is more consistently populated by OpenAlex than by PubMed.

9. Debugging and Schema Adaptation

A considerable portion of the work was dedicated to debugging. The transition from a sample schema to a standardized API-based schema produced several tracebacks, which proved useful in identifying points where the code was still relying on overly rigid assumptions about data structure. This led to modifications in the data access logic and the introduction of more flexible checks on variable content.

M = df if isinstance(df, pd.DataFrame) else df.get()

In a project of this nature, debugging is not a marginal phase but the point at which it becomes clear whether standardization is truly sufficient to support downstream analyses. Each traceback provided direct feedback on the robustness of the schema and guided the incremental refinement of the pipeline.

10. Limitations and Future Work

10.1 Current Limitations

Despite the results achieved, the pipeline presents several limitations that warrant explicit documentation.

Unavailability of cited references. The CR field, which is fundamental for co-citation analysis and bibliographic network construction, is not accessible through PubMed's public API and is frequently incomplete in OpenAlex. This significantly constrains the applicability of analytical functions that depend on this field.

Uneven field coverage. Some columns of the WoS schema cannot be populated uniformly from both sources. For example, author affiliations (C1) have different syntactic structures in OpenAlex and PubMed, and the parsers currently implemented do not guarantee complete normalization in all cases.

Partially compatible analytical functions. Not all analytical functions have been adapted to work interchangeably with both sources. Some remain operational only with the original sample dataset or with a single API, which reduces the generalizability of the system.

Incomplete API error handling. Error handling for API calls is functional but not exhaustive: rate limiting, timeouts, and malformed responses are caught, but diagnostic messages are not always sufficiently detailed for the end user.

Deduplication limited to DOI. The deduplication strategy is based exclusively on the DI (DOI) field. Records without a DOI are not deduplicated, which can result in residual duplicates when merging collections from different sources.

10.2 Possible Future Developments

Based on the limitations identified above, several improvement paths can be outlined for future versions of the system.

Fuzzy deduplication. Introduction of a text-similarity-based deduplication mechanism for the TI (title) field for records lacking a DOI, using libraries such as rapidfuzz or sentence-transformers.
Support for additional sources. Extension of the pipeline to further bibliographic sources such as Semantic Scholar, Crossref, or Europe PMC, each with its own mapping toward the WoS schema.
Completion of analytical functions. Adaptation of currently partial functions, particularly those requiring cited references, toward alternative solutions that leverage the citation data available in OpenAlex.
API response caching. Implementation of a local cache layer to avoid repeated calls for the same query, reducing loading times and the risk of exceeding API rate limits.
Automated testing. Construction of a unit test suite for the api_retriever.py and standardizer.py modules, to ensure schema stability across varying sources and API versions.
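
As an indication of how the fuzzy deduplication item might work, the sketch below compares normalized titles with difflib.SequenceMatcher from the standard library. This is a deliberate substitute for rapidfuzz or sentence-transformers, chosen so that the example has no dependencies; the 0.9 threshold and the function names are assumptions, and nothing here is part of the current pipeline.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", str(title).lower())
    return " ".join(cleaned.split())


def titles_match(a, b, threshold=0.9):
    """True when two titles are at least `threshold` similar."""
    na, nb = normalize_title(a), normalize_title(b)
    if not na or not nb:
        return False  # never merge records on an empty title
    return SequenceMatcher(None, na, nb).ratio() >= threshold


def dedupe_titles(titles, threshold=0.9):
    """Keep the first title of each group of near-identical titles."""
    kept = []
    for title in titles:
        if not any(titles_match(title, k, threshold) for k in kept):
            kept.append(title)
    return kept


titles = [
    "Deep Learning for Citation Analysis",
    "Deep learning for citation analysis.",
    "A Survey of Bibliometric Tools",
]
unique_titles = dedupe_titles(titles)
print(unique_titles)
```

The pairwise comparison grows quadratically with the number of records, so for large collections it would make sense to restrict comparisons to records that share a publication year or another blocking key.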

11. Overall Results

The final result is a complete ETL pipeline for bibliographic data, capable of retrieving records from OpenAlex and PubMed, standardizing them into a common schema, and making them available within an interactive analytical environment. The integration into the application made the workflow more accessible, supporting both direct data retrieval and the merging of distinct collections with downloadable output.

From a methodological standpoint, the project highlights that the central challenge is not merely acquiring bibliographic data, but building a process capable of harmonizing heterogeneous sources and adapting analyses to a reliable and consistent schema. The experience demonstrated that moving from a sample dataset to real-world API data requires systematic work on normalization, type enforcement, null handling, deduplication, and alignment of internal functions.

12. Conclusion

The present project has produced a solid foundation for bibliometric analysis in Python, building a bridge between heterogeneous sources and analytical functions constructed on a unified schema. The most demanding phase was the adaptation of the data schema and the debugging of the functions, particularly during the transition from a test dataset to records retrieved from PubMed and OpenAlex.

Although some functions remain fully compatible with only one of the two sources, the work accomplished has achieved its primary objective: creating a reproducible, extensible pipeline that is sufficiently robust to support a significant set of bibliometric analyses in a real-world context.

Summary of Key Code Snippets

# Extraction and standardization
raw_df   = retrieve(query, source, max_res)
clean_df = standardize(raw_df, source=source)

# Flexible dataset access in analytical functions
M = df if isinstance(df, pd.DataFrame) else df.get()

# DOI normalization for cross-source deduplication
merged["DI"] = merged["DI"].apply(normalize_doi)

# SR field computation with local fallback
if _HAS_UPSTREAM_SR:
    df["SR"] = df.apply(_upstream_sr, axis=1)
else:
    df["SR"] = df.apply(_local_short_reference, axis=1)

Final project submission

76a0b32

aleximmediata wants to merge 1 commit into PRAISELab-PicusLab:main from aleximmediata:main
