API data ingestion pipeline and robustness improvements #10
Open
antonio-cln wants to merge 51 commits into
NEW:
- api_etl.py: ETL pipeline to fetch and transform metadata from OpenAlex documents
MODIFIED:
- www/services/__init__.py: loc 18
- www/services/metatagextraction.py: loc 17-18, 45-46
- functions/get_database.py: loc 37-38
- app.py: loc 66, 654-655, 716-728, 739-740, 770-781
NEW - Data fusion between one or more input sources (single files, API queries)
NEW - Scopus .csv parsing
NEW - Scopus .bib
ROLLBACK - Multi-source dataset: the app logic seems too hardcoded, with if-branches that implement a different logic for each database. Dealing with it would probably require a near-complete code refactor.
Some extra rollback required
Fixed some functions
NEW - metatagextraction.SR employed to generate SR field
…PubMed that have no cited references, but the exception is handled and gives an error in the notification.
NEW - Function wrapping to prevent crashes
Guarding checks
NEW - Defined Data Validation and Mapping Dictionary modules
…e errors in co-citation field selection.
…copus in the data import section. Updated user interface to reflect new functionality.
…ex, PubMed, and Scopus. Implemented DOI handling to prevent duplicates in the dataset.
…odules
- Updated docstrings in `scopus_mapping_dict` to clarify its purpose and functionality.
- Added type hint for the `is_df_valid` function to specify it accepts a pandas DataFrame.
- Expanded docstring for `is_df_valid` to describe its validation process and return values.
- Improved overall readability and maintainability of the code.
Summary
This PR introduces new functionality and refactors part of the existing code in order to standardize the data ingestion process, add document fetching through API requests, and add guardrails to the analytical functions. The guardrails keep the app stable on the user side and explain why certain actions cannot be performed.
New modules
www/services/api_etl.py
This is one of the core components of the API ingestion pipeline. This module deals with two main points:
- Fetching the documents. This is dealt with in each `search_<source>_keywords()` function. Since each database source allows interaction in a different way, several functions had to be implemented.
- Transforming the metadata. This is dealt with in each `<source>_mapping_dict()`. Just like in the case of `search_<source>_keywords()`, since each database provides data in a different format, different functions had to be implemented.
www/services/data_validation.py
This is one of the core components of the API ingestion pipeline. This module manages validation of a dataframe, verifying data integrity before it enters any of the analytical workflows. It makes sure that the extracted tags conform to the schema provided by Web of Science. In particular, it verifies that:
- "DB", "UT", "DI", "PMID", "TI", "SO", "JI", "DT", "LA", "RP", "AB", "VL", "IS", "BP", "EP", "SR" are `string`
- "AU", "AF", "C1", "CR", "DE", "ID" are `list`
- "PY", "TC" are `integer`
Furthermore, since Pandas usually stores `string` values as `object`, an explicit conversion to `string` has been applied to guarantee full conformity with the requested format.
Changes
Dispatcher
A dispatcher pattern has been added in `app.py`. The user chooses which database (OpenAlex, PubMed, Scopus) to fetch documents from and, based on that choice, the corresponding pipeline introduced by `api_etl.py` and `data_validation.py` is used for the API ingestion.
Single/Multiple Uploaded File processing
A more resilient approach based on a `try/except` block has been implemented in `format_functions.py` when formatting the columns, to prevent corrupted records from crashing the file processing. Information about the corrupted file involved is printed to the terminal.
For each `format_<tag>_column.py` function, a more resilient approach has been implemented through `is_valid_field()` and `clean_txt_string()` to keep unwanted values, `NaN` or `None` from entering the workflow. This guarantees that the produced dataframe columns conform to the required types: `string`, `integer` or `list`.
API
The aforementioned API pipeline is accessible through the web interface provided by `app.py`, thanks to an additional entry in the dropdown menu that lets the user choose how to provide the data. This option lets the user choose between OpenAlex, PubMed and Scopus.
Furthermore, an API Key field is present so that the user can provide their own key to perform a query. This is mandatory for certain databases such as Scopus, while OpenAlex and PubMed can be queried without a key, with some restrictions.
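The source selection and the API key rule described above can be sketched as a small dispatcher. This is only an illustration: the three fetchers follow the `search_<source>_keywords()` naming from this PR, but their signatures and stub bodies, the `DISPATCH` table and `fetch_documents()` are assumptions, not the code in `app.py`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Fetcher = Callable[[str, Optional[str]], List[Record]]


# Hypothetical stand-ins: the real fetchers call the remote APIs and
# may take different arguments.
def search_openalex_keywords(query: str, api_key: Optional[str] = None) -> List[Record]:
    return [{"DB": "OPENALEX", "TI": f"stub result for {query}"}]


def search_pubmed_keywords(query: str, api_key: Optional[str] = None) -> List[Record]:
    return [{"DB": "PUBMED", "TI": f"stub result for {query}"}]


def search_scopus_keywords(query: str, api_key: Optional[str] = None) -> List[Record]:
    return [{"DB": "SCOPUS", "TI": f"stub result for {query}"}]


# source name -> (fetcher, whether an API key is mandatory)
DISPATCH: Dict[str, Tuple[Fetcher, bool]] = {
    "OpenAlex": (search_openalex_keywords, False),
    "PubMed": (search_pubmed_keywords, False),
    "Scopus": (search_scopus_keywords, True),
}


def fetch_documents(source: str, query: str, api_key: Optional[str] = None) -> List[Record]:
    """Route the request to the pipeline of the selected source."""
    if source not in DISPATCH:
        raise ValueError(f"Unsupported source: {source!r}")
    fetcher, key_required = DISPATCH[source]
    if key_required and not api_key:
        # Fail early with a message the UI can show to the user.
        raise ValueError(f"{source} requires an API key")
    return fetcher(query, api_key)
```

Keeping the key requirement next to the fetcher in one table means a new source only needs one extra entry, rather than another branch in the UI code.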
Several API requests can be executed sequentially and any successfully completed requests are then merged together. This allows the user to fetch documents from several database sources and use the analytics functions on a broader set of data.
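A minimal sketch of how the results of several requests could be merged, assuming each successful request yields a validated pandas dataframe with a "DI" (DOI) column. The function name `merge_results()` and the choice to keep the first occurrence of each DOI are assumptions made for illustration.

```python
import pandas as pd


def merge_results(frames):
    """Concatenate per-request dataframes and drop rows with a repeated DOI."""
    # Failed requests are assumed to be represented by None or an empty frame.
    frames = [f for f in frames if f is not None and not f.empty]
    if not frames:
        return pd.DataFrame()

    merged = pd.concat(frames, ignore_index=True)

    # Normalise the DOI so that case or whitespace differences do not hide duplicates.
    doi = merged["DI"].fillna("").astype(str).str.strip().str.lower()
    has_doi = doi != ""

    # Rows without a DOI are always kept; rows with a DOI keep the first occurrence.
    is_duplicate = has_doi & doi.duplicated(keep="first")
    return merged[~is_duplicate].reset_index(drop=True)
```

Records without a DOI are never treated as duplicates of each other here, since an empty identifier says nothing about whether two documents are the same.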
Analytical Function Guardrails
Most of the existing analytical functions work perfectly with the columns provided by Web of Science. They have been mostly tuned to the data Web of Science provides and, since other databases don't provide all of that data, most of the functions would crash on datasets from those sources.
Analytical function crashes have been dealt with both internally and externally.
External guardrails have been implemented to prevent a function from running at all if the columns required to generate the plot are either missing (which shouldn't be the case since the dataframe has already been validated; it is just a double-check) or completely empty.
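A minimal sketch of such an external guard. The helper name `is_column_empty()` comes from this PR, but its signature, the notion of "blank" used here and the `guard_message()` wrapper are assumptions.

```python
import pandas as pd


def is_column_empty(df: pd.DataFrame, column: str) -> bool:
    """Return True if the column is missing or no row holds a usable value."""
    if column not in df.columns:
        return True

    def is_blank(value) -> bool:
        if value is None:
            return True
        if isinstance(value, (list, tuple, set)):
            return len(value) == 0
        if isinstance(value, str):
            return value.strip() == ""
        # Remaining scalars: only NaN/NA count as blank.
        return bool(pd.isna(value))

    return all(is_blank(v) for v in df[column])


def guard_message(df: pd.DataFrame, column: str):
    """Hypothetical wrapper: a user-facing message if the plot must be skipped, else None."""
    if is_column_empty(df, column):
        return f"Column '{column}' is missing or empty in the dataset."
    return None
```

A caller would check `guard_message(df, "SO")` before invoking the plotting function and show the message in a notification instead of running it.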
According to the requirements, a column can be empty, with all rows represented by either an empty `string`, "", or an empty `list`, []. These external guardrails prevent a function from running if the data it requires does not conform, and inform the end user that the specific column required for that function is missing from the dataset.
Two different approaches have been used since there are two macro categories of analytical functions:
For the first scenario, the following guarding check has been implemented thanks to an auxiliary validation function, `is_column_empty()`. The following example is taken from the Most Relevant Sources section.
For the second scenario, on top of the `is_column_empty()` function, a dynamic update of the dropdown menu items has been implemented to prevent the user from selecting non-valid columns. The following example is taken from the Three-Field Plot section.
Internal modifications to most of the functions have been implemented in order to avoid crashes due to valid calculations that lead to values that generate issues (mostly zero values). In this scenario, a placeholder plot is generated to inform the end user that the calculated values are not valid for generating a plot. The following example is taken from `get_authors_local_impact()` and deals with g_index, h_index and m_index.
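A minimal sketch of this kind of internal guard, not the code of `get_authors_local_impact()` itself. The helper names, the `years_active` parameter and the use of `None` as the signal for a placeholder plot are assumptions. The h-index and g-index follow their usual definitions, and the m-index is taken as the h-index divided by the number of active years.

```python
from typing import Dict, List, Optional


def h_index(citations: List[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def g_index(citations: List[int]) -> int:
    """Largest g such that the top g papers total at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g


def local_impact(citations: List[int], years_active: int) -> Optional[Dict[str, float]]:
    """Return the indices, or None when there is nothing meaningful to plot."""
    h = h_index(citations)
    g = g_index(citations)
    # Guard the division: zero active years would otherwise raise ZeroDivisionError.
    m = h / years_active if years_active > 0 else 0.0
    if h == 0 and g == 0:
        return None  # the caller draws a placeholder plot instead
    return {"h_index": h, "g_index": g, "m_index": m}
```

The calculation itself is valid when every paper has zero citations. It is only the resulting all-zero values that cannot be plotted meaningfully, which is why the guard sits after the computation instead of before it.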