7 changes: 6 additions & 1 deletion .gitignore
@@ -1,4 +1,9 @@
 __pycache__/
 bibliovenv/
 Bibenv/
-.idea/
+.idea/
+.venv/
+.DS_Store
+**/.DS_Store
+.claude/
+*.pyc
64 changes: 64 additions & 0 deletions PR_BODY.md
@@ -0,0 +1,64 @@
# Source-agnostic ETL pipeline for Bibliometrix-Python (Extract → Transform → Validate + live API)

## Summary

This PR makes Bibliometrix-Python **source-agnostic**, replicating the conceptual robustness of R bibliometrix's `convert2df()`. It introduces a centralized ETL pipeline that turns heterogeneous bibliographic exports (Web of Science, Scopus, Dimensions, PubMed, Lens, Cochrane) — plus **live API queries** (OpenAlex, PubMed) — into a single, strictly-typed Web of Science schema that the dashboard and analytical functions can consume without crashing.

**Implementation level: Advanced** (API retrieval with pagination, rate-limiting and retries, reusing the same transformation pipeline as the file-based path).

## Problems in the current implementation that this PR addresses

- **No single entry point** like `convert2df()` → added `BibliometrixETL.run()` / `run_api()`.
- **Scattered transformation logic** → centralized in one `transform()` method, no monolith (Extract / Transform / Validate are separate, independently testable methods).
- **Weak type enforcement** → explicit type contracts (PY as 4-digit string, TC as int, multi-value fields as `list[str]`).
- **Poor null handling** → NaN/None systematically replaced with `""` (scalars) or `[]` (multi-value).
- **Implicit WoS dependency / incomplete column mapping** → declarative per-source mapping dictionaries.
- **Non-standard reference/citation parsing** → source-specific delimiters (e.g. newline for WoS `CR`).
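
The null-handling and delimiter rules above can be illustrated with a minimal sketch. The `CR_DELIMITERS` table and the `split_references` helper are hypothetical names introduced here, not the PR's code; only the newline rule for WoS comes from the text, and the `;` entry for Scopus is an assumption.

```python
import math

# Hypothetical delimiter table. Only the WoS newline rule is stated in this PR;
# the Scopus entry is an assumption for illustration.
CR_DELIMITERS = {"wos": "\n", "scopus": ";"}


def split_references(raw, source):
    """Split a raw CR cell into a list[str]; None/NaN become an empty list."""
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return []
    delimiter = CR_DELIMITERS.get(source, ";")
    # Drop empty fragments and surrounding whitespace.
    return [part.strip() for part in str(raw).split(delimiter) if part.strip()]
```

With this shape, a missing reference cell never reaches downstream code as `None` or `NaN`; it is always an (possibly empty) list of strings.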

## Architecture

**1. Dispatcher** — `extract()` in `www/services/etl_pipeline.py` routes each `(source, file_type)` pair to the right parser (reusing the existing `www/services/parsers.py`), raising clear `ValueError`/`FileNotFoundError`/`ImportError` instead of failing silently.
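
As a rough illustration of the dispatcher idea, the sketch below routes a `(source, file_type)` pair through a lookup table and raises explicit errors. The parser stubs and the `PARSERS` table are hypothetical stand-ins; the real `extract()` and the parsers in `www/services/parsers.py` may be structured differently.

```python
from pathlib import Path


def _parse_scopus_csv(path):
    """Stand-in for a real parser; returns a placeholder record."""
    return [{"source": "scopus", "path": str(path)}]


def _parse_wos_txt(path):
    """Stand-in for a real parser; returns a placeholder record."""
    return [{"source": "wos", "path": str(path)}]


# (source, file_type) -> parser callable. Keys and parsers are hypothetical.
PARSERS = {
    ("scopus", "csv"): _parse_scopus_csv,
    ("wos", "txt"): _parse_wos_txt,
}


def extract(path, source, file_type):
    """Route to the matching parser, failing loudly on bad input."""
    key = (source.lower(), file_type.lower())
    parser = PARSERS.get(key)
    if parser is None:
        supported = ", ".join(f"{s}/{t}" for s, t in sorted(PARSERS))
        raise ValueError(f"Unsupported combination {key}; supported: {supported}")
    if not Path(path).is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return parser(path)
```

A table lookup keeps the routing declarative: supporting another export format means adding one entry rather than another `if` branch.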

**2. Mapping dictionaries** — `www/services/column_mappings.py` holds one declarative `{source_column: WoS_tag}` table per database. Adding a new source = appending one sub-dictionary, no other module changes.
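
A minimal sketch of what a declarative mapping table might look like follows. The column names listed are illustrative assumptions and are not copied from `column_mappings.py`; `rename_record` is a hypothetical helper operating on plain dicts rather than a DataFrame.

```python
# Hypothetical excerpt: one {source_column: WoS_tag} sub-dictionary per database.
COLUMN_MAPPINGS = {
    "scopus": {
        "Authors": "AU",
        "Title": "TI",
        "Year": "PY",
        "Source title": "SO",
        "Cited by": "TC",
    },
    "dimensions": {
        "Title": "TI",
        "PubYear": "PY",
        "Times cited": "TC",
    },
}


def rename_record(record, source):
    """Rename known source columns to WoS tags; unknown columns pass through."""
    mapping = COLUMN_MAPPINGS[source]
    return {mapping.get(column, column): value for column, value in record.items()}
```

Under this layout, registering a new source is a data change only: one more key in `COLUMN_MAPPINGS`, with no edits to the renaming logic.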

**3. Type contracts** — `transform()` enforces the schema in 7 documented phases: pre-processing (e.g. Dimensions affiliation extraction, pagination split into BP/EP), **SR computation reusing the existing `format_functions.format_sr_column`** (per the brief: SR is not rewritten from scratch), column rename, duplicate-column resolution, mandatory-column presence, type coercion, null cleaning.
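
The type-coercion and null-cleaning phases could look roughly like the sketch below. The helper names are hypothetical, and mapping a missing or unparseable `TC` to `0` is an assumption made here so that the column stays an integer; the PR text only states that `PY` is a 4-digit string and `TC` an int.

```python
import math
import re


def _is_null(value):
    """True for None and float NaN, the two null forms mentioned in the PR."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coerce_py(value):
    """Return the publication year as a 4-digit string, or '' if unavailable."""
    if _is_null(value):
        return ""
    match = re.search(r"\d{4}", str(value))
    return match.group(0) if match else ""


def coerce_tc(value):
    """Return the citation count as int; nulls and junk become 0 (assumption)."""
    if _is_null(value):
        return 0
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return 0
```

Going through `str()` and a regular expression lets `coerce_py` accept `2021`, `2021.0`, and `"2021 Jan"` alike, which is the kind of variation that appears across sources.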

**4. Validation** — `www/services/validator.py` programmatically verifies: all mandatory columns present, no NaN/None remaining, multi-value columns are `list[str]`.
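
The three checks can be sketched as a function that returns a list of problems instead of raising on the first one. The `MANDATORY` and `MULTI_VALUE` tuples are hypothetical; the actual lists live in `validator.py` and may differ.

```python
import math

# Hypothetical column lists for illustration only.
MANDATORY = ("AU", "TI", "PY", "SO", "TC")
MULTI_VALUE = ("AU", "CR")


def validate(records):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for i, rec in enumerate(records):
        # 1. Mandatory columns must be present.
        for col in MANDATORY:
            if col not in rec:
                errors.append(f"row {i}: missing column {col}")
        # 2. No None/NaN may remain anywhere.
        for col, val in rec.items():
            if val is None or (isinstance(val, float) and math.isnan(val)):
                errors.append(f"row {i}: null in {col}")
        # 3. Multi-value columns must be list[str].
        for col in MULTI_VALUE:
            if col in rec:
                val = rec[col]
                if not (isinstance(val, list) and all(isinstance(x, str) for x in val)):
                    errors.append(f"row {i}: {col} is not list[str]")
    return errors
```

Collecting every violation in one pass makes the report more useful when a new source mapping is being debugged.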

**5. Live API (Advanced)** — `www/services/api_retriever.py`:
- **OpenAlex**: paginated `/works`, exponential backoff on 429/5xx (1-2-4-8-16s, cap 30s), per-page retry budget, abstract reconstruction from the inverted index; already-fetched rows are never dropped on error.
- **PubMed**: ESearch + EFetch; the MEDLINE response is written to a temporary file created race-free with `tempfile.mkstemp` and deleted in a `finally` block, then parsed by **reusing** the existing `parse_pubmed_data` (no duplicated logic).
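
The retry behaviour described for OpenAlex can be sketched without any network access by injecting the page-fetching callable. All names below (`backoff_delay`, `fetch_with_retry`, `collect_pages`) and the retry budget of 5 are hypothetical; only the 1-2-4-8-16 s schedule, the 30 s cap, the 429/5xx rule, and the "keep already-fetched rows" policy come from the description above.

```python
import time


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay before retry number `attempt` (0-based): 1, 2, 4, 8, 16, then capped."""
    return min(cap, base * 2 ** attempt)


def fetch_with_retry(fetch_page, max_retries=5, sleep=time.sleep):
    """Call fetch_page() -> (status, rows), retrying on 429 and 5xx."""
    for attempt in range(max_retries + 1):
        status, rows = fetch_page()
        if status == 200:
            return rows
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_retries:
            raise RuntimeError(f"giving up after status {status}")
        sleep(backoff_delay(attempt))


def collect_pages(page_fetchers, sleep=time.sleep):
    """Fetch pages in order; on a fatal error, keep the rows gathered so far."""
    rows = []
    for fetch_page in page_fetchers:
        try:
            rows.extend(fetch_with_retry(fetch_page, sleep=sleep))
        except RuntimeError:
            break  # already-fetched rows are returned, not dropped
    return rows
```

Passing `sleep` as a parameter keeps the schedule testable: a test can record the requested delays instead of actually waiting.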

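OpenAlex returns abstracts as an inverted index (each word mapped to the list of positions where it occurs) rather than as plain text. A minimal sketch of the reconstruction step, with `rebuild_abstract` as a hypothetical helper name:

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not inverted_index:
        return ""  # missing abstract -> empty string, matching the null policy
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))
```

For example, `{"maps": [1, 4], "Bibliometrics": [0], "science": [2], "and": [3], "policy": [5]}` is rebuilt as `"Bibliometrics maps science and maps policy"`.
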
## Files

**New ETL modules** (~1,631 lines):
- `www/services/etl_pipeline.py` (768) — orchestrator
- `www/services/api_retriever.py` (376) — OpenAlex + PubMed clients
- `www/services/column_mappings.py` (176) — per-source mapping tables
- `www/services/validator.py` (135) — schema validator
- `test_etl_pipeline.py` (176) — end-to-end execution evidence harness

**Debugging / patches applied to existing analytical & service functions** (to make them work with non-WoS data instead of assuming WoS-only formats):
- `functions/get_annualproduction.py` — robust PY handling across sources
- `functions/get_worldmapcollaboration.py`
- `www/services/format_functions.py`, `histnetwork.py`, `biblionetwork.py`, `parsers.py`, `utils.py`
- `requirements.txt` — made `pywin32` Windows-only (`sys_platform == "win32"`) so `pip install` no longer fails on macOS/Linux (on Python 3.9–3.12; the pinned `scipy`/`numpy` versions have no prebuilt wheels for Python 3.13 yet); pinned `kaleido==0.2.1` (the version compatible with the existing plotly `to_image`/`write_image` calls).

## Execution evidence

End-to-end harness (`python test_etl_pipeline.py`) over four real source files:

| Source | Rows | PY filled | Assigned functions |
|---|---|---|---|
| Scopus (CSV) | 1000 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |
| Dimensions (XLSX) | 500 | 100% | annual_production ✅ · co_citation N/A* · clustering_coupling N/A* |
| PubMed (TXT) | 10000 | 100% | annual_production ✅ · co_citation N/A* · clustering_coupling N/A* |
| Web of Science (TXT) | 500 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |

**Result: `PASS=8 N/A=4 FAIL=0`**: every assigned function runs on each source whose export contains the data it requires; the four N/A cases are explained in the note below.

\* Co-citation and bibliographic coupling are computed *from cited references* (CR). Dimensions and PubMed exports do not include a reference list, so these networks cannot be built — marked N/A rather than FAIL, consistent with the brief ("assuming the raw data contains the necessary underlying information").
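
The N/A rule can be expressed as a small status function. This is a hypothetical sketch of how a harness might classify results, not the logic in `test_etl_pipeline.py`; `run_check` and its `needs_cr` flag are names introduced here.

```python
def run_check(func, records, needs_cr=False):
    """Return 'PASS', 'FAIL', or 'N/A' for one analytical function."""
    # Reference-based analyses cannot run when no record carries any CR entry.
    if needs_cr and not any(rec.get("CR") for rec in records):
        return "N/A"
    try:
        func(records)
        return "PASS"
    except Exception:
        return "FAIL"
```

Separating "data not available" from "function crashed" keeps the summary honest: an N/A does not hide a failure, and a failure is never excused as missing data.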

## Dashboard demonstration

The Shiny dashboard (`shiny run app.py`) starts cleanly and serves HTTP 200, and the standardized DataFrame produced by the ETL allows non-WoS data (e.g. Scopus CSV) to be loaded and analyzed through the UI.
32 changes: 21 additions & 11 deletions README.md
@@ -37,11 +37,11 @@ The web application enables scholars to easily access bibliometric analysis feat

 - **Import and convert** data from multiple bibliographic databases:
 - Web of Science (plaintext, BibTeX, EndNote) - ✅ Fully supported
-- Scopus (CSV, BibTeX) - 🚧 In progress
-- PubMed (plaintext export) - 🚧 In progress
-- Dimensions (Excel, CSV) - 🚧 In progress
-- Lens.org (CSV) - 🚧 In progress
-- Cochrane CDSR (plaintext) - 🚧 In progress
+- Scopus (CSV, BibTeX) - ✅ Supported
+- PubMed (plaintext export) - ✅ Supported
+- Dimensions (Excel, CSV) - ✅ Supported
+- Lens.org (CSV) - ✅ Supported
+- Cochrane CDSR (plaintext) - ✅ Supported

 - **Filter data** by various criteria including publication years, languages, document types, citation counts, and Bradford's Law zones

@@ -120,7 +120,7 @@ Aria, M. & Cuccurullo, C. (2017) **bibliometrix: An R-tool for comprehensive sci

 ### Prerequisites

-- Python 3.9 or higher
+- Python 3.9–3.12 (the pinned `scipy`/`numpy` versions do not ship prebuilt wheels for Python 3.13 yet)
 - pip package manager

 ### Install from source
@@ -138,6 +138,16 @@ Install dependencies:
 pip install -r requirements.txt
 ```

+> **NLTK data.** The text-mining tabs (Most Frequent Words, Word Cloud, Tree Map,
+> Word Frequency, Trend Topics, Thematic Map/Evolution) need the NLTK `stopwords`
+> and `wordnet` corpora. The app downloads them automatically on first launch, so
+> no manual step is normally required. On an **offline** machine, fetch them once
+> while connected:
+>
+> ```bash
+> python -m nltk.downloader stopwords wordnet omw-1.4 punkt
+> ```

 ### Run the application

 ```bash
@@ -193,11 +203,11 @@ bibliometrix-python/
 bibliometrix-python supports importing bibliographic data from major scientific databases:

 - **Web of Science**: plaintext (.txt), BibTeX (.bib), EndNote (.ciw) - ✅ Fully supported
-- **Scopus**: CSV (.csv), BibTeX (.bib) - 🚧 In progress
-- **PubMed**: plaintext export - 🚧 In progress
-- **Dimensions**: Excel (.xlsx), CSV (.csv) - 🚧 In progress
-- **Lens.org**: CSV (.csv) - 🚧 In progress
-- **Cochrane**: plaintext (.txt) - 🚧 In progress
+- **Scopus**: CSV (.csv), BibTeX (.bib) - ✅ Supported
+- **PubMed**: plaintext export - ✅ Supported
+- **Dimensions**: Excel (.xlsx), CSV (.csv) - ✅ Supported
+- **Lens.org**: CSV (.csv) - ✅ Supported
+- **Cochrane**: plaintext (.txt) - ✅ Supported

 ### Comprehensive Bibliometric Analysis
