Complete ETL Pipeline (Advanced Level) #12

Open
nihal16000 wants to merge 1 commit into PRAISELab-PicusLab:main from nihal16000:main

nihal16000 commented Jun 3, 2026

  1. Extract Phase (api_retriever.py)

Integrated the live OpenAlex API, passing a mailto parameter so requests go through the polite pool.
Added cursor-based pagination so large result sets are retrieved page by page instead of in one oversized response.
Added requests retry logic with exponential backoff to handle transient HTTP 429 and 500 responses.
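
As a rough illustration of the two mechanisms above (not the code in api_retriever.py), the sketch below separates cursor pagination from backoff. The names `call_with_backoff`, `iter_works`, `send`, and `fetch_page` are hypothetical; the transport is injected as a callable so the logic can be read and exercised without network access. The `cursor=*` starting value, `per-page`, `mailto`, and `meta.next_cursor` follow the public OpenAlex API conventions.

```python
import time

RETRYABLE_STATUSES = {429, 500}  # the two statuses named in this PR


def call_with_backoff(send, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() again while it returns a retryable status.

    `send` is a hypothetical zero-argument callable returning
    (status_code, payload). Delays double each time: 1s, 2s, 4s, ...
    """
    status, payload = send()
    for attempt in range(retries):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(base_delay * 2 ** attempt)
        status, payload = send()
    return status, payload


def iter_works(fetch_page, mailto, per_page=200):
    """Yield works one at a time, following OpenAlex cursor pagination.

    `fetch_page` is a hypothetical callable that takes the query params
    and returns the decoded JSON body of one page.
    """
    cursor = "*"  # "*" asks OpenAlex for the first cursor page
    while cursor:
        params = {"cursor": cursor, "per-page": per_page, "mailto": mailto}
        payload = fetch_page(params)
        yield from payload.get("results", [])
        # next_cursor is null (None) once the last page has been served
        cursor = payload.get("meta", {}).get("next_cursor")
```

In practice `fetch_page` would presumably wrap an HTTP GET against https://api.openalex.org/works inside `call_with_backoff`. Because `iter_works` is a generator, only one page of results is held at a time.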

  2. Transform Phase (standardizer.py)

Mapped the deeply nested, OpenAlex-specific JSON payloads onto the strict, flat Web of Science (WoS) schema required by the native Bibliometrix functions.
Reconstructed readable abstract strings from OpenAlex's inverted abstract index (a word-to-positions mapping) so they can be used for text mining.
Computed the Short Reference (SR) column, which serves as the primary key, for each record.
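
For reviewers unfamiliar with the inverted index format, here is a minimal sketch of the reconstruction idea, not the code in standardizer.py. `rebuild_abstract` and `short_reference` are hypothetical names. The SR layout shown ("FIRST AUTHOR, YEAR, SOURCE", upper-cased) is an assumption about the usual Bibliometrix convention and may differ from what this PR produces.

```python
def rebuild_abstract(inverted_index):
    """Turn {"word": [positions, ...]} back into a plain string."""
    if not inverted_index:
        return ""  # OpenAlex returns null when no abstract is available
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit words in ascending position order
    return " ".join(by_position[pos] for pos in sorted(by_position))


def short_reference(first_author, year, source):
    """Assumed SR layout: 'FIRST AUTHOR, YEAR, SOURCE' in upper case."""
    return f"{first_author}, {year}, {source}".upper()
```

For example, `rebuild_abstract({"to": [0, 4], "be": [1, 5], "or": [2], "not": [3]})` gives `"to be or not to be"`.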

  3. Validation Phase (validator.py)

Added a strict validation layer built on pandera.
Enforced DataFrame type contracts so that downstream graphing functions do not crash on malformed data types (e.g., verifying that multi-value fields are true Python lists rather than strings).

Bug Fixes to Native Code:

Annual Scientific Production Crash: Identified a silent TypeError in the native graphing functions caused by publication years stored as strings. Patched this in the standardizer by casting the PY column with pd.to_numeric before loading it into the reactive state, which unblocks the graphical analysis tabs.
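
The cast described above might look like the following sketch. `coerce_publication_year` is a hypothetical name, and two details are assumptions rather than facts from this PR: unparseable years are coerced to missing values with `errors="coerce"`, and the result is stored as pandas' nullable `Int64` dtype so years stay integers even when some are missing.

```python
import pandas as pd


def coerce_publication_year(df):
    """Return a copy of df whose PY column is numeric instead of string."""
    out = df.copy()
    # Unparseable values become NaN, then <NA> in the nullable integer dtype
    out["PY"] = pd.to_numeric(out["PY"], errors="coerce").astype("Int64")
    return out


raw = pd.DataFrame({"PY": ["2019", "2021", "n/a"]})
clean = coerce_publication_year(raw)
```

With numeric years, grouping and counting publications per year no longer mixes strings and integers.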

UI Integration:

Overhauled the app.py frontend to include a fully reactive "API" tab.
Bound the standardized DataFrame directly to the Shiny reactive state (df.set), allowing users to search, preview, and instantly generate analytical visuals without ever leaving the interface.

This pipeline lives in an isolated services/ directory and does not overwrite or remove any existing native functions in lib/ or functions/.

GROUP MEMBERS

Name - Nihal Nawaz Kaleem Nawaz
Student ID (Matricola) - D03000283

Name - Hunain Raza
Student ID (Matricola) - D03000256

Name - Parth Kumar Rai
Student ID (Matricola) - D03000255
