Preparing for the next release: working with large dataset with a backend database #146

Chenglong-MS · 2025-04-17T00:48:59Z

Preparing for the next release:

Leverage a backend database (DuckDB) to handle large datasets (millions of rows).
Dynamically sample data based on the chart configuration to create charts.
Leverage NL2SQL to transform data in the backend, supporting exploration and expressive visualizations.
Update the design of derived concepts so that a derived concept can be reapplied to new tables in one click.
Very minor: support configuring the default chart size, for those with large screens :)
TODO before release: add a utility to connect to external databases.

py-src/data_formulator/app.py

py-src/data_formulator/py_sandbox.py

py-src/data_formulator/agent_routes.py

+    else:
+        result = {'status': 'error'}
+
+    return json.dumps(result)


To fix the issue, we will modify the error handling in the test_model function to ensure that no stack trace or sensitive internal details are exposed to the user. Instead of sanitizing the exception message, we will log the full stack trace on the server for debugging purposes and return a generic error message to the client. This approach aligns with secure error-handling practices.

The changes will involve:

Logging the full stack trace of the exception using the logging module.

Returning a generic error message to the client, such as "An internal error occurred. Please try again later."
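
The two steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: `safe_test_model` and the `check_model` callable are hypothetical names, and only the standard `logging` and `json` modules are used.

```python
import json
import logging

logger = logging.getLogger(__name__)

GENERIC_ERROR = "An internal error occurred. Please try again later."


def safe_test_model(check_model, model_name):
    """Run a model check and return a JSON string that never leaks exception details.

    `check_model` is a hypothetical callable standing in for the real model test.
    """
    try:
        check_model(model_name)
        result = {"model": model_name, "status": "ok", "message": ""}
    except Exception:
        # exc_info=True writes the full traceback to the server log only.
        logger.error("An error occurred while testing the model:", exc_info=True)
        # The client sees a fixed message, never str(e) or the traceback.
        result = {"model": model_name, "status": "error", "message": GENERIC_ERROR}
    return json.dumps(result)
```

The exception text stays in the server log, so debugging information is preserved while the response body carries nothing derived from the exception.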

py-src/data_formulator/agent_routes.py

+    else:
+        result = {'status': 'error'}
+
+    return json.dumps(result)


To fix the issue, we need to sanitize user-provided data before including it in the result dictionary. Specifically:

Use html.escape() to escape any user-provided strings (e.g., content['model']) before adding them to the result dictionary.

Ensure that all fields in the result dictionary that originate from user input are sanitized.

The changes will be made in the test_model function, where the result dictionary is constructed. The content['model'] value will be sanitized using html.escape() before being added to the result dictionary.
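
As a rough sketch of the escaping step, assuming the request payload is a plain dictionary: `build_model_result` is a hypothetical helper, not a function from the repository, and `html.escape` is the standard-library call named above.

```python
import html


def build_model_result(content, status, message=""):
    """Build a result dict whose user-supplied 'model' value is HTML-escaped.

    `content` mimics the parsed request payload; this helper is hypothetical.
    """
    return {
        # html.escape replaces &, <, > and (by default) quotes with HTML entities.
        "model": html.escape(str(content.get("model", ""))),
        "status": status,
        "message": message,
    }


result = build_model_result({"model": '<script>alert("x")</script>'}, "ok")
print(result["model"])
```

Escaping at the point where the dictionary is built means both the `ok` and `error` branches return the same sanitized value.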

py-src/data_formulator/agent_routes.py

+        results = agent.run(content["input_data"], [f['name'] for f in content["input_fields"]], 
+                                       content["output_name"], content["description"])
+
+        response = flask.jsonify({ "status": "ok", "token": token, "results": results })


To fix the issue, we need to ensure that stack traces and other sensitive information are not included in the JSON response sent to the client. Instead, we should log the stack trace on the server for debugging purposes and return a generic error message to the client. Specifically:

Modify the run method in PyConceptDeriveAgent to log the stack trace using the logger and replace the content field in the result dictionary with a generic error message.

Ensure that the derive_py_concept function in py-src/data_formulator/agent_routes.py does not expose sensitive information in the results field of the JSON response.

py-src/data_formulator/agent_routes.py

+        if conn:
+            conn.close()
+
+        response = flask.jsonify({ "token": token, "status": "ok", "results": results })


To fix the issue, we need to ensure that sensitive stack trace information is not exposed to external users. Instead:

Log the detailed stack trace on the server for debugging purposes.

Return a generic error message to the user, such as "An internal error occurred."

The changes will involve:

Modifying the process_gpt_response method in py-src/data_formulator/agents/agent_sql_data_rec.py to log the stack trace and replace the detailed error message with a generic one.

Ensuring that the results object returned by agent.run in py-src/data_formulator/agent_routes.py does not contain sensitive information.
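
One way to keep sensitive details out of the `results` field is to filter it just before building the response. The sketch below assumes each result is a dictionary with `status` and `content` keys, as in the snippets above; `sanitize_results` and its `hidden_statuses` parameter are hypothetical names chosen for illustration.

```python
GENERIC_ERROR = "An internal error occurred."


def sanitize_results(results, hidden_statuses=("other error",)):
    """Return copies of `results` with 'content' masked for the given statuses.

    The input list and its dictionaries are left unmodified.
    """
    sanitized = []
    for result in results:
        clean = dict(result)  # shallow copy so the caller's data is untouched
        if clean.get("status") in hidden_statuses and "content" in clean:
            clean["content"] = GENERIC_ERROR
        sanitized.append(clean)
    return sanitized
```

The sanitized list can then be passed to `flask.jsonify` in place of the raw `results`, while the original entries remain available for server-side logging.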

py-src/data_formulator/tables_routes.py

Copilot

Copilot reviewed 42 out of 44 changed files in this pull request and generated no comments.

Files not reviewed (2)

.env.template: Language not supported
src/scss/DataView.scss: Language not supported

Comments suppressed due to low confidence (3)

src/views/ConceptShelf.tsx:194

Verify that replacing the 'bin' operator with 'max' (and adding a separate 'min' card) is intentional and that any downstream logic relying on binning is updated accordingly.

<OperatorCard operator="max" />

src/data/utils.ts:57

Ensure that 'rows' is non-empty before accessing rows[0] when handling duplicate column names to prevent potential runtime errors when the input text is empty.

for (let i = 0; i < rows[0].length; i++) {

src/app/dfSlice.tsx:350

Review the updated filtering logic for 'conceptShelfItems'; verify that simply filtering by tableRef (without considering base field references via findBaseFields) meets the intended behavior.

state.conceptShelfItems = state.conceptShelfItems.filter(f => !(f.tableRef == tableId));

py-src/data_formulator/agent_routes.py

+        if conn:
+            conn.close()
+
+        response = flask.jsonify({ "token": token, "status": "ok", "results": results})


To fix the issue, we need to ensure that sensitive error messages or stack traces are not exposed to external users. Instead, we should log the detailed error messages on the server for debugging purposes and return a generic error message to the user. Specifically:

Modify the refine_data function in agent_routes.py to sanitize the results field before including it in the JSON response.

Ensure that detailed error messages are logged using the logger module, but only generic messages are returned to the user.

Update the run_transform_in_sandbox2020 and process_gpt_response methods to sanitize error messages before they propagate to the results field.

Copilot

Pull Request Overview

This PR prepares the next release by enhancing support for large datasets through backend database integration (DuckDB), updates API endpoints for improved data handling, and refines both the chart configuration and various data agent functionalities. Key changes include:

Updating chart templates (e.g. adding the “y” channel for bar charts) and reconfiguring API endpoints in utils.
Adding session management and default chart configurations in the application state.
Revising Python agents and the sandbox logic (including dependency updates in pyproject.toml) to incorporate DuckDB and improve transformation safety.

Reviewed Changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated no comments.

File	Description
src/components/ChartTemplates.tsx	Added the "y" channel to the bar chart template to support improved encoding mapping.
src/app/utils.tsx, dfSlice.tsx	Updated API endpoints and session/config management; revised concept shelf item filtering.
py-src/data_formulator/py_sandbox.py	Renamed and updated sandbox execution logic; changed error response key semantics.
Other agent and configuration files	Refined Python agents, added DuckDB dependency in pyproject.toml, and updated relevant UI.

Files not reviewed (1)

.env.template: Language not supported

Chenglong-MS added 16 commits March 25, 2025 11:28

fix csv loading issue, and improve table header width calculation algorithm (may need more attention)

7be3639

Merge branch 'main' into dev

66dddda

Merge branch 'main' into dev

9dc0321

Merge branch 'main' into dev

4d0f9a6

Merge branch 'main' into dev

1b61337

include session id so we can process data in the python server

4285943

fix session id

4b2d5f6

a bit messy, but I'll clean up later

8df59e0

really messy for now

3a087d5

still messy, but able to work with remote data

20aab7f

updating db support, if you are reading here, hello

b078aa0

supporting saving intermediate tables as virtual tables

c31ad44

supports backend caching of the db file and managing the file in frontend

d554110

halfway through working out new concept derivation function

3b0df51

exciting! reworked concept derive features to prepare working with db tables

f7a1b70

almost there!

f75b957

github-advanced-security bot found potential problems Apr 17, 2025

fix routes

e1137de

github-advanced-security bot found potential problems Apr 17, 2025

some error message cleanup

9ab0c84

Chenglong-MS requested a review from Copilot April 17, 2025 17:27

Copilot AI reviewed Apr 17, 2025

much cleaner sandboxing machnism, and by default increases response time, fixing code issues

a8f6fa0

github-advanced-security bot found potential problems Apr 17, 2025

fix a minor error

c7ae48c

Chenglong-MS mentioned this pull request Apr 18, 2025

Loading larger .csv files broken #87

Open

Chenglong-MS added 4 commits April 17, 2025 17:35

fix bug in sql rec mode

d9ac1bf

fix various db manager issues, need to add a function to connect external DBs

56c3ce7

supporting connection to mysql and pgsql

44b6373

ready for version 0.2

2731397

Chenglong-MS added 4 commits April 23, 2025 16:36

update version

3c7b14e

add readme

5513912

revise readme

ac25429

minor

8b2bd62

Chenglong-MS requested a review from Copilot April 24, 2025 00:16

Copilot AI reviewed Apr 24, 2025

Chenglong-MS merged commit 2fb5016 into main Apr 24, 2025
6 checks passed

@@ -160,3 +160,3 @@
                     except Exception as e:
-                        logger.info(f"Error: {e}")
+                        logger.error("An error occurred while testing the model:", exc_info=True)
                         result = {
@@ -164,3 +164,3 @@
                             "status": 'error',
-                            "message": sanitize_model_error(str(e)),
+                            "message": "An internal error occurred. Please try again later.",
                         }

@@ -155,3 +155,3 @@
                             result = {
-                                "model": content['model'],
+                                "model": html.escape(content['model']),
                                 "status": 'ok',
@@ -162,3 +162,3 @@
                         result = {
-                            "model": content['model'],
+                            "model": html.escape(content['model']),
                             "status": 'error',

@@ -242,3 +242,4 @@
-                    response = flask.jsonify({ "status": "ok", "token": token, "results": results })
+                    sanitized_results = [r for r in results if r['status'] == 'ok']
+                    response = flask.jsonify({ "status": "ok", "token": token, "results": sanitized_results })
                 else:

@@ -146,4 +146,4 @@
                                 error_message = traceback.format_exc()
-                                print(error_message)
-                                result = {'status': 'other error', 'content': error_message}
+                                logger.error(f"An error occurred while running the concept derivation: {error_message}")
+                                result = {'status': 'other error', 'content': 'An internal error occurred while processing your request.'}
                         else:

@@ -359,3 +359,8 @@
-                    response = flask.jsonify({ "token": token, "status": "ok", "results": results })
+                    sanitized_results = [
+                        {key: value if key != 'content' or result['status'] != 'other error' else "An internal error occurred."
+                         for key, value in result.items()}
+                        for result in results
+                    ]
+                    response = flask.jsonify({ "token": token, "status": "ok", "results": sanitized_results })
                 else:

@@ -190,3 +190,3 @@
                                 logger.warning(error_message)
-                                result = {'status': 'other error', 'code': code_str, 'content': f"Unexpected error: {error_message}"}
+                                result = {'status': 'other error', 'code': code_str, 'content': "An internal error occurred while processing your request."}
                         else:

@@ -140,3 +140,3 @@
                         'status': 'error',
-                        'content': result['error_message']
+                        'content': "An error occurred during data transformation. Please check the server logs for details."
                     }

@@ -233,2 +233,3 @@
                                     logger.info(result['content'])
+                                    result['content'] = "An error occurred during data transformation. Please check the server logs for details."
                             except Exception as e:
