PRAISELab-PicusLab · mattiadenicola02 · May 30, 2026 · May 31, 2026
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,9 @@
 __pycache__/
 bibliovenv/
 Bibenv/
-.idea/
+.idea/
+.venv/
+.DS_Store
+**/.DS_Store
+.claude/
+*.pyc
diff --git a/PR_BODY.md b/PR_BODY.md
@@ -0,0 +1,64 @@
+# Source-agnostic ETL pipeline for Bibliometrix-Python (Extract → Transform → Validate + live API)
+
+## Summary
+
+This PR makes Bibliometrix-Python **source-agnostic**, replicating the conceptual robustness of R bibliometrix's `convert2df()`. It introduces a centralized ETL pipeline that turns heterogeneous bibliographic exports (Web of Science, Scopus, Dimensions, PubMed, Lens, Cochrane) — plus **live API queries** (OpenAlex, PubMed) — into a single, strictly-typed Web of Science schema that the dashboard and analytical functions can consume without crashing.
+
+**Implementation level: Advanced** (API retrieval with pagination, rate-limiting and retries, reusing the same transformation pipeline as the file-based path).
+
+## Problems in the current implementation that this PR addresses
+
+- **No single entry point** like `convert2df()` → added `BibliometrixETL.run()` / `run_api()`.
+- **Scattered transformation logic** → centralized in one `transform()` method; Extract, Transform and Validate stay separate, independently testable methods rather than one monolith.
+- **Weak type enforcement** → explicit type contracts (PY as 4-digit string, TC as int, multi-value fields as `list[str]`).
+- **Poor null handling** → NaN/None systematically replaced with `""` (scalars) or `[]` (multi-value).
+- **Implicit WoS dependency / incomplete column mapping** → declarative per-source mapping dictionaries.
+- **Non-standard reference/citation parsing** → source-specific delimiters (e.g. newline for WoS `CR`).
+
+## Architecture
+
+**1. Dispatcher** — `extract()` in `www/services/etl_pipeline.py` routes each `(source, file_type)` pair to the right parser (reusing the existing `www/services/parsers.py`), raising clear `ValueError`/`FileNotFoundError`/`ImportError` instead of failing silently.
+
+**2. Mapping dictionaries** — `www/services/column_mappings.py` holds one declarative `{source_column: WoS_tag}` table per database. Adding a new source = appending one sub-dictionary, no other module changes.
+
+**3. Type contracts** — `transform()` enforces the schema in 7 documented phases: pre-processing (e.g. Dimensions affiliation extraction, pagination split into BP/EP), **SR computation reusing the existing `format_functions.format_sr_column`** (per the brief: SR is not rewritten from scratch), column rename, duplicate-column resolution, mandatory-column presence, type coercion, null cleaning.
+
+**4. Validation** — `www/services/validator.py` programmatically verifies: all mandatory columns present, no NaN/None remaining, multi-value columns are `list[str]`.
+
+**5. Live API (Advanced)** — `www/services/api_retriever.py`:
+- **OpenAlex**: paginated `/works`, exponential backoff on 429/5xx (1-2-4-8-16s, cap 30s), per-page retry budget, abstract reconstruction from the inverted index; already-fetched rows are never dropped on error.
+- **PubMed**: ESearch + EFetch; the MEDLINE response is written to a temporary file created race-free with `tempfile.mkstemp` and deleted in a `finally` block, then parsed by **reusing** the existing `parse_pubmed_data` (no duplicated logic).
+
+## Files
+
+**New ETL modules** (~1,631 lines):
+- `www/services/etl_pipeline.py` (768) — orchestrator
+- `www/services/api_retriever.py` (376) — OpenAlex + PubMed clients
+- `www/services/column_mappings.py` (176) — per-source mapping tables
+- `www/services/validator.py` (135) — schema validator
+- `test_etl_pipeline.py` (176) — end-to-end execution evidence harness
+
+**Debugging / patches applied to existing analytical & service functions** (to make them work with non-WoS data instead of assuming WoS-only formats):
+- `functions/get_annualproduction.py` — robust PY handling across sources
+- `functions/get_worldmapcollaboration.py`
+- `www/services/format_functions.py`, `histnetwork.py`, `biblionetwork.py`, `parsers.py`, `utils.py`
+- `requirements.txt`: `pywin32` is now Windows-only (`sys_platform == "win32"`), so `pip install` no longer fails on macOS/Linux. This holds for Python 3.9–3.12; the pinned `scipy`/`numpy` versions have no prebuilt wheels for Python 3.13 yet. `kaleido` is pinned to `0.2.1`, the version compatible with the existing plotly `to_image`/`write_image` calls.
+
+## Execution evidence
+
+End-to-end harness (`python test_etl_pipeline.py`) over four real source files:
+
+| Source | Rows | PY filled | Assigned functions |
+|---|---|---|---|
+| Scopus (CSV) | 1000 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |
+| Dimensions (XLSX) | 500 | 100% | annual_production ✅ · co_citation N/A* · clustering_coupling N/A* |
+| PubMed (TXT) | 10000 | 100% | annual_production ✅ · co_citation N/A* · clustering_coupling N/A* |
+| Web of Science (TXT) | 500 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |
+
+**Result: `PASS=8  N/A=4  FAIL=0`.** Every assigned function runs on each source whose export contains the data it needs; the four N/A cells are explained in the note below.
+
+\* Co-citation and bibliographic coupling are computed *from cited references* (CR). Dimensions and PubMed exports do not include a reference list, so these networks cannot be built — marked N/A rather than FAIL, consistent with the brief ("assuming the raw data contains the necessary underlying information").
+
+## Dashboard demonstration
+
+The Shiny dashboard (`shiny run app.py`) starts cleanly and serves HTTP 200, and the standardized DataFrame produced by the ETL allows non-WoS data (e.g. Scopus CSV) to be loaded and analyzed through the UI.
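Reviewer note on Architecture items 2 to 4 of `PR_BODY.md` (mapping dictionaries, type contracts, validation). The modules themselves are not part of this excerpt, so what follows is only a minimal standard-library sketch of how a declarative `{source_column: WoS_tag}` table, type coercion, null cleaning and a validator could fit together. Every name in it (`COLUMN_MAPPINGS`, `MULTI_VALUE`, `MANDATORY`, `transform`, `validate`), the Scopus column headers, the `;` author delimiter and the choice of `0` for a missing `TC` are assumptions made for illustration. It works on plain dicts, whereas the PR describes a DataFrame.

```python
import math

# Hypothetical mapping table: one {source_column: WoS_tag} dict per source.
COLUMN_MAPPINGS = {
    "scopus": {"Authors": "AU", "Title": "TI", "Year": "PY", "Cited by": "TC"},
}
MULTI_VALUE = {"AU": ";"}             # WoS tag -> assumed delimiter
MANDATORY = ("AU", "TI", "PY", "TC")  # assumed subset of the schema


def _is_null(value):
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def transform(records, source):
    """Rename columns to WoS tags, coerce types, and replace nulls."""
    mapping = COLUMN_MAPPINGS[source]  # KeyError signals an unknown source
    rows = []
    for rec in records:
        renamed = {mapping[k]: v for k, v in rec.items() if k in mapping}
        row = {}
        for tag in MANDATORY:
            value = renamed.get(tag)
            if tag in MULTI_VALUE:     # multi-value field -> list[str]
                parts = [] if _is_null(value) else str(value).split(MULTI_VALUE[tag])
                row[tag] = [p.strip() for p in parts if p.strip()]
            elif tag == "TC":          # citation count -> int (0 if missing, assumed)
                row[tag] = 0 if _is_null(value) else int(float(value))
            elif tag == "PY":          # year -> 4-digit string
                row[tag] = "" if _is_null(value) else f"{int(float(value)):04d}"
            else:                      # other scalars -> str, "" if missing
                row[tag] = "" if _is_null(value) else str(value)
        rows.append(row)
    return rows


def validate(rows):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for i, row in enumerate(rows):
        for tag in MANDATORY:
            if tag not in row:
                errors.append(f"row {i}: missing column {tag}")
            elif _is_null(row[tag]):
                errors.append(f"row {i}: null value in {tag}")
            elif tag in MULTI_VALUE and not (
                isinstance(row[tag], list)
                and all(isinstance(x, str) for x in row[tag])
            ):
                errors.append(f"row {i}: {tag} is not list[str]")
    return errors


raw = [
    {"Authors": "Rossi, M.; Bianchi, L.", "Title": "A study", "Year": 2021.0, "Cited by": "12"},
    {"Authors": None, "Title": float("nan"), "Year": "2019", "Cited by": None},
]
rows = transform(raw, "scopus")
print(rows)
print(validate(rows))
```

Keeping the mapping as data rather than code is what would let a new source be added by appending one sub-dictionary, as the PR body claims.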
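Reviewer note on Architecture item 5 of `PR_BODY.md` (OpenAlex retrieval). The description mentions exponential backoff on 429/5xx with a 1-2-4-8-16s schedule capped at 30s, a per-page retry budget, and keeping already-fetched rows on error. `api_retriever.py` is not shown here, so the following is a hedged sketch of that policy only. The function names (`backoff_delays`, `fetch_with_retry`, `collect_pages`) and the `(status, payload)` return convention are hypothetical, and the HTTP call is abstracted behind a callable so the example runs without network access.

```python
import time

# Status codes treated as retryable: 429 plus any 5xx.
RETRYABLE = {429} | set(range(500, 600))


def backoff_delays(retries=5, base=1.0, cap=30.0):
    """Delay before each retry: base * 2**attempt, never above cap."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]


def fetch_with_retry(fetch_page, retries=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fetch_page() until it returns a non-retryable status.

    fetch_page is a hypothetical zero-argument callable returning
    (status_code, payload). After the retry budget is spent, one final
    attempt is made and its result is returned as-is.
    """
    for delay in backoff_delays(retries, base, cap):
        status, payload = fetch_page()
        if status not in RETRYABLE:
            return status, payload
        sleep(delay)
    return fetch_page()


def collect_pages(fetch_page_n, max_pages, **retry_kwargs):
    """Fetch pages 0..max_pages-1, keeping rows already fetched on failure."""
    rows = []
    for n in range(max_pages):
        status, payload = fetch_with_retry(lambda n=n: fetch_page_n(n), **retry_kwargs)
        if status != 200:
            break  # stop paging, but return what was collected so far
        rows.extend(payload)
    return rows


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Injecting `sleep` as a parameter is a design choice for testability: a test can pass a recorder in place of `time.sleep` and check the schedule without waiting.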
diff --git a/README.md b/README.md
@@ -37,11 +37,11 @@ The web application enables scholars to easily access bibliometric analysis feat
 
 - **Import and convert** data from multiple bibliographic databases:
   - Web of Science (plaintext, BibTeX, EndNote) - ✅ Fully supported
-  - Scopus (CSV, BibTeX) - 🚧 In progress
-  - PubMed (plaintext export) - 🚧 In progress
-  - Dimensions (Excel, CSV) - 🚧 In progress
-  - Lens.org (CSV) - 🚧 In progress
-  - Cochrane CDSR (plaintext) - 🚧 In progress
+  - Scopus (CSV, BibTeX) - ✅ Supported
+  - PubMed (plaintext export) - ✅ Supported
+  - Dimensions (Excel, CSV) - ✅ Supported
+  - Lens.org (CSV) - ✅ Supported
+  - Cochrane CDSR (plaintext) - ✅ Supported
 
 - **Filter data** by various criteria including publication years, languages, document types, citation counts, and Bradford's Law zones
 
@@ -120,7 +120,7 @@ Aria, M. & Cuccurullo, C. (2017) **bibliometrix: An R-tool for comprehensive sci
 
 ### Prerequisites
 
-- Python 3.9 or higher
+- Python 3.9–3.12 (the pinned `scipy`/`numpy` versions do not ship prebuilt wheels for Python 3.13 yet)
 - pip package manager
 
 ### Install from source
@@ -138,6 +138,16 @@ Install dependencies:
 pip install -r requirements.txt
 ```
 
+> **NLTK data.** The text-mining tabs (Most Frequent Words, Word Cloud, Tree Map,
+> Word Frequency, Trend Topics, Thematic Map/Evolution) need the NLTK `stopwords`
+> and `wordnet` corpora. The app downloads them automatically on first launch, so
+> no manual step is normally required. On an **offline** machine, fetch them once
+> while connected:
+>
+> ```bash
+> python -m nltk.downloader stopwords wordnet omw-1.4 punkt
+> ```
+
 ### Run the application
 
 ```bash
@@ -193,11 +203,11 @@ bibliometrix-python/
 bibliometrix-python supports importing bibliographic data from major scientific databases:
 
 - **Web of Science**: plaintext (.txt), BibTeX (.bib), EndNote (.ciw) - ✅ Fully supported
-- **Scopus**: CSV (.csv), BibTeX (.bib) - 🚧 In progress
-- **PubMed**: plaintext export - 🚧 In progress
-- **Dimensions**: Excel (.xlsx), CSV (.csv) - 🚧 In progress
-- **Lens.org**: CSV (.csv) - 🚧 In progress
-- **Cochrane**: plaintext (.txt) - 🚧 In progress
+- **Scopus**: CSV (.csv), BibTeX (.bib) - ✅ Supported
+- **PubMed**: plaintext export - ✅ Supported
+- **Dimensions**: Excel (.xlsx), CSV (.csv) - ✅ Supported
+- **Lens.org**: CSV (.csv) - ✅ Supported
+- **Cochrane**: plaintext (.txt) - ✅ Supported
 
 ### Comprehensive Bibliometric Analysis