-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Preparing for the next release: working with large dataset with a backend database #146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…orithm (may need more attention)
| else: | ||
| result = {'status': 'error'} | ||
|
|
||
| return json.dumps(result) |
Check warning
Code scanning / CodeQL
Information exposure through an exception Medium
Stack trace information
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 9 months ago
To fix the issue, we will modify the error handling in the test_model function to ensure that no stack trace or sensitive internal details are exposed to the user. Instead of sanitizing the exception message, we will log the full stack trace on the server for debugging purposes and return a generic error message to the client. This approach aligns with secure error-handling practices.
The changes will involve:
- Logging the full stack trace of the exception using the
loggingmodule. - Returning a generic error message to the client, such as "An internal error occurred. Please try again later."
-
Copy modified line R161 -
Copy modified line R165
| @@ -160,3 +160,3 @@ | ||
| except Exception as e: | ||
| logger.info(f"Error: {e}") | ||
| logger.error("An error occurred while testing the model:", exc_info=True) | ||
| result = { | ||
| @@ -164,3 +164,3 @@ | ||
| "status": 'error', | ||
| "message": sanitize_model_error(str(e)), | ||
| "message": "An internal error occurred. Please try again later.", | ||
| } |
| else: | ||
| result = {'status': 'error'} | ||
|
|
||
| return json.dumps(result) |
Check warning
Code scanning / CodeQL
Reflected server-side cross-site scripting Medium
user-provided value
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 9 months ago
To fix the issue, we need to sanitize user-provided data before including it in the result dictionary. Specifically:
- Use
html.escape()to escape any user-provided strings (e.g.,content['model']) before adding them to theresultdictionary. - Ensure that all fields in the
resultdictionary that originate from user input are sanitized.
The changes will be made in the test_model function, where the result dictionary is constructed. The content['model'] value will be sanitized using html.escape() before being added to the result dictionary.
-
Copy modified line R156 -
Copy modified line R163
| @@ -155,3 +155,3 @@ | ||
| result = { | ||
| "model": content['model'], | ||
| "model": html.escape(content['model']), | ||
| "status": 'ok', | ||
| @@ -162,3 +162,3 @@ | ||
| result = { | ||
| "model": content['model'], | ||
| "model": html.escape(content['model']), | ||
| "status": 'error', |
| results = agent.run(content["input_data"], [f['name'] for f in content["input_fields"]], | ||
| content["output_name"], content["description"]) | ||
|
|
||
| response = flask.jsonify({ "status": "ok", "token": token, "results": results }) |
Check warning
Code scanning / CodeQL
Information exposure through an exception Medium
Stack trace information
Stack trace information
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 9 months ago
To fix the issue, we need to ensure that stack traces and other sensitive information are not included in the JSON response sent to the client. Instead, we should log the stack trace on the server for debugging purposes and return a generic error message to the client. Specifically:
- Modify the
runmethod inPyConceptDeriveAgentto log the stack trace using theloggerand replace thecontentfield in theresultdictionary with a generic error message. - Ensure that the
derive_py_conceptfunction inpy-src/data_formulator/agent_routes.pydoes not expose sensitive information in theresultsfield of the JSON response.
-
Copy modified lines R243-R244
| @@ -242,3 +242,4 @@ | ||
|
|
||
| response = flask.jsonify({ "status": "ok", "token": token, "results": results }) | ||
| sanitized_results = [r for r in results if r['status'] == 'ok'] | ||
| response = flask.jsonify({ "status": "ok", "token": token, "results": sanitized_results }) | ||
| else: |
-
Copy modified lines R147-R148
| @@ -146,4 +146,4 @@ | ||
| error_message = traceback.format_exc() | ||
| print(error_message) | ||
| result = {'status': 'other error', 'content': error_message} | ||
| logger.error(f"An error occurred while running the concept derivation: {error_message}") | ||
| result = {'status': 'other error', 'content': 'An internal error occurred while processing your request.'} | ||
| else: |
| if conn: | ||
| conn.close() | ||
|
|
||
| response = flask.jsonify({ "token": token, "status": "ok", "results": results }) |
Check warning
Code scanning / CodeQL
Information exposure through an exception Medium
Stack trace information
Stack trace information
Stack trace information flows to this location and may be exposed to an external user.
Stack trace information
Stack trace information
Stack trace information flows to this location and may be exposed to an external user.
Stack trace information
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 9 months ago
To fix the issue, we need to ensure that sensitive stack trace information is not exposed to external users. Instead:
- Log the detailed stack trace on the server for debugging purposes.
- Return a generic error message to the user, such as "An internal error occurred."
The changes will involve:
- Modifying the
process_gpt_responsemethod inpy-src/data_formulator/agents/agent_sql_data_rec.pyto log the stack trace and replace the detailed error message with a generic one. - Ensuring that the
resultsobject returned byagent.runinpy-src/data_formulator/agent_routes.pydoes not contain sensitive information.
-
Copy modified lines R360-R365
| @@ -359,3 +359,8 @@ | ||
|
|
||
| response = flask.jsonify({ "token": token, "status": "ok", "results": results }) | ||
| sanitized_results = [ | ||
| {key: value if key != 'content' or result['status'] != 'other error' else "An internal error occurred." | ||
| for key, value in result.items()} | ||
| for result in results | ||
| ] | ||
| response = flask.jsonify({ "token": token, "status": "ok", "results": sanitized_results }) | ||
| else: |
-
Copy modified line R191
| @@ -190,3 +190,3 @@ | ||
| logger.warning(error_message) | ||
| result = {'status': 'other error', 'code': code_str, 'content': f"Unexpected error: {error_message}"} | ||
| result = {'status': 'other error', 'code': code_str, 'content': "An internal error occurred while processing your request."} | ||
| else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 42 out of 44 changed files in this pull request and generated no comments.
Files not reviewed (2)
- .env.template: Language not supported
- src/scss/DataView.scss: Language not supported
Comments suppressed due to low confidence (3)
src/views/ConceptShelf.tsx:194
- Verify that replacing the 'bin' operator with 'max' (and adding a separate 'min' card) is intentional and that any downstream logic relying on binning is updated accordingly.
<OperatorCard operator="max" />
src/data/utils.ts:57
- Ensure that 'rows' is non-empty before accessing rows[0] when handling duplicate column names to prevent potential runtime errors when the input text is empty.
for (let i = 0; i < rows[0].length; i++) {
src/app/dfSlice.tsx:350
- Review the updated filtering logic for 'conceptShelfItems'; verify that simply filtering by tableRef (without considering base field references via findBaseFields) meets the intended behavior.
state.conceptShelfItems = state.conceptShelfItems.filter(f => !(f.tableRef == tableId));
…ime, fixing code issues
| if conn: | ||
| conn.close() | ||
|
|
||
| response = flask.jsonify({ "token": token, "status": "ok", "results": results}) |
Check warning
Code scanning / CodeQL
Information exposure through an exception Medium
Stack trace information
Stack trace information
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 9 months ago
To fix the issue, we need to ensure that sensitive error messages or stack traces are not exposed to external users. Instead, we should log the detailed error messages on the server for debugging purposes and return a generic error message to the user. Specifically:
- Modify the
refine_datafunction inagent_routes.pyto sanitize theresultsfield before including it in the JSON response. - Ensure that detailed error messages are logged using the
loggermodule, but only generic messages are returned to the user. - Update the
run_transform_in_sandbox2020andprocess_gpt_responsemethods to sanitize error messages before they propagate to theresultsfield.
-
Copy modified lines R413-R418
| @@ -412,3 +412,8 @@ | ||
|
|
||
| response = flask.jsonify({ "token": token, "status": "ok", "results": results}) | ||
| sanitized_results = [ | ||
| {key: value if key != 'content' or result['status'] == 'ok' else "An error occurred during data processing." | ||
| for key, value in result.items()} | ||
| for result in results | ||
| ] | ||
| response = flask.jsonify({ "token": token, "status": "ok", "results": sanitized_results}) | ||
| else: |
-
Copy modified line R141
| @@ -140,3 +140,3 @@ | ||
| 'status': 'error', | ||
| 'content': result['error_message'] | ||
| 'content': "An error occurred during data transformation. Please check the server logs for details." | ||
| } |
-
Copy modified line R234
| @@ -233,2 +233,3 @@ | ||
| logger.info(result['content']) | ||
| result['content'] = "An error occurred during data transformation. Please check the server logs for details." | ||
| except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR prepares the next release by enhancing support for large datasets through backend database integration (DuckDB), updates API endpoints for improved data handling, and refines both the chart configuration and various data agent functionalities. Key changes include:
- Updating chart templates (e.g. adding the “y” channel for bar charts) and reconfiguring API endpoints in utils.
- Adding session management and default chart configurations in the application state.
- Revising Python agents and the sandbox logic (including dependency updates in pyproject.toml) to incorporate DuckDB and improve transformation safety.
Reviewed Changes
Copilot reviewed 47 out of 48 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/components/ChartTemplates.tsx | Added the "y" channel to the bar chart template to support improved encoding mapping. |
| src/app/utils.tsx, dfSlice.tsx | Updated API endpoints and session/config management; revised concept shelf item filtering. |
| py-src/data_formulator/py_sandbox.py | Renamed and updated sandbox execution logic; changed error response key semantics. |
| Other agent and configuration files | Refined Python agents, added DuckDB dependency in pyproject.toml, and updated relevant UI. |
Files not reviewed (1)
- .env.template: Language not supported
Preparing for the next release:
leverages backend database (duckdb) to handle large datasets (millions of rows)
dynamically sample data based on chart configuration to create charts
leverages NL2SQL to transform data in the backend to support exploration and expressive visualizations
update the design of derived concept, supporting reapply the derived concept to new tables in one click.
minor minor: support configuring default chart size... for those with large screens :)
TODO before release: add utility to connect to external databases.