Skip to content

Conversation

@Chenglong-MS
Copy link
Collaborator

@Chenglong-MS Chenglong-MS commented Apr 17, 2025

Preparing for the next release:

  • leverages backend database (duckdb) to handle large datasets (millions of rows)

  • dynamically sample data based on chart configuration to create charts

  • leverages NL2SQL to transform data in the backend to support exploration and expressive visualizations

  • update the design of derived concept, supporting reapply the derived concept to new tables in one click.

  • minor minor: support configuring default chart size... for those with large screens :)

  • TODO before release: add utility to connect to external databases.

image image

else:
result = {'status': 'error'}

return json.dumps(result)

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI 9 months ago

To fix the issue, we will modify the error handling in the test_model function to ensure that no stack trace or sensitive internal details are exposed to the user. Instead of sanitizing the exception message, we will log the full stack trace on the server for debugging purposes and return a generic error message to the client. This approach aligns with secure error-handling practices.

The changes will involve:

  1. Logging the full stack trace of the exception using the logging module.
  2. Returning a generic error message to the client, such as "An internal error occurred. Please try again later."
Suggested changeset 1
py-src/data_formulator/agent_routes.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -160,3 +160,3 @@
         except Exception as e:
-            logger.info(f"Error: {e}")
+            logger.error("An error occurred while testing the model:", exc_info=True)
             result = {
@@ -164,3 +164,3 @@
                 "status": 'error',
-                "message": sanitize_model_error(str(e)),
+                "message": "An internal error occurred. Please try again later.",
             }
EOF
@@ -160,3 +160,3 @@
except Exception as e:
logger.info(f"Error: {e}")
logger.error("An error occurred while testing the model:", exc_info=True)
result = {
@@ -164,3 +164,3 @@
"status": 'error',
"message": sanitize_model_error(str(e)),
"message": "An internal error occurred. Please try again later.",
}
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
else:
result = {'status': 'error'}

return json.dumps(result)

Check warning

Code scanning / CodeQL

Reflected server-side cross-site scripting Medium

Cross-site scripting vulnerability due to a
user-provided value
.

Copilot Autofix

AI 9 months ago

To fix the issue, we need to sanitize user-provided data before including it in the result dictionary. Specifically:

  1. Use html.escape() to escape any user-provided strings (e.g., content['model']) before adding them to the result dictionary.
  2. Ensure that all fields in the result dictionary that originate from user input are sanitized.

The changes will be made in the test_model function, where the result dictionary is constructed. The content['model'] value will be sanitized using html.escape() before being added to the result dictionary.


Suggested changeset 1
py-src/data_formulator/agent_routes.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -155,3 +155,3 @@
                 result = {
-                    "model": content['model'],
+                    "model": html.escape(content['model']),
                     "status": 'ok',
@@ -162,3 +162,3 @@
             result = {
-                "model": content['model'],
+                "model": html.escape(content['model']),
                 "status": 'error',
EOF
@@ -155,3 +155,3 @@
result = {
"model": content['model'],
"model": html.escape(content['model']),
"status": 'ok',
@@ -162,3 +162,3 @@
result = {
"model": content['model'],
"model": html.escape(content['model']),
"status": 'error',
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
results = agent.run(content["input_data"], [f['name'] for f in content["input_fields"]],
content["output_name"], content["description"])

response = flask.jsonify({ "status": "ok", "token": token, "results": results })

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.
Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI 9 months ago

To fix the issue, we need to ensure that stack traces and other sensitive information are not included in the JSON response sent to the client. Instead, we should log the stack trace on the server for debugging purposes and return a generic error message to the client. Specifically:

  1. Modify the run method in PyConceptDeriveAgent to log the stack trace using the logger and replace the content field in the result dictionary with a generic error message.
  2. Ensure that the derive_py_concept function in py-src/data_formulator/agent_routes.py does not expose sensitive information in the results field of the JSON response.

Suggested changeset 2
py-src/data_formulator/agent_routes.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -242,3 +242,4 @@
         
-        response = flask.jsonify({ "status": "ok", "token": token, "results": results })
+        sanitized_results = [r for r in results if r['status'] == 'ok']
+        response = flask.jsonify({ "status": "ok", "token": token, "results": sanitized_results })
     else:
EOF
@@ -242,3 +242,4 @@

response = flask.jsonify({ "status": "ok", "token": token, "results": results })
sanitized_results = [r for r in results if r['status'] == 'ok']
response = flask.jsonify({ "status": "ok", "token": token, "results": sanitized_results })
else:
py-src/data_formulator/agents/agent_py_concept_derive.py
Outside changed files

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agents/agent_py_concept_derive.py b/py-src/data_formulator/agents/agent_py_concept_derive.py
--- a/py-src/data_formulator/agents/agent_py_concept_derive.py
+++ b/py-src/data_formulator/agents/agent_py_concept_derive.py
@@ -146,4 +146,4 @@
                     error_message = traceback.format_exc()
-                    print(error_message)
-                    result = {'status': 'other error', 'content': error_message}
+                    logger.error(f"An error occurred while running the concept derivation: {error_message}")
+                    result = {'status': 'other error', 'content': 'An internal error occurred while processing your request.'}
             else:
EOF
@@ -146,4 +146,4 @@
error_message = traceback.format_exc()
print(error_message)
result = {'status': 'other error', 'content': error_message}
logger.error(f"An error occurred while running the concept derivation: {error_message}")
result = {'status': 'other error', 'content': 'An internal error occurred while processing your request.'}
else:
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
if conn:
conn.close()

response = flask.jsonify({ "token": token, "status": "ok", "results": results })

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.
Stack trace information
flows to this location and may be exposed to an external user.
Stack trace information flows to this location and may be exposed to an external user.
Stack trace information
flows to this location and may be exposed to an external user.
Stack trace information
flows to this location and may be exposed to an external user.
Stack trace information flows to this location and may be exposed to an external user.
Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI 9 months ago

To fix the issue, we need to ensure that sensitive stack trace information is not exposed to external users. Instead:

  1. Log the detailed stack trace on the server for debugging purposes.
  2. Return a generic error message to the user, such as "An internal error occurred."

The changes will involve:

  • Modifying the process_gpt_response method in py-src/data_formulator/agents/agent_sql_data_rec.py to log the stack trace and replace the detailed error message with a generic one.
  • Ensuring that the results object returned by agent.run in py-src/data_formulator/agent_routes.py does not contain sensitive information.

Suggested changeset 2
py-src/data_formulator/agent_routes.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -359,3 +359,8 @@
         
-        response = flask.jsonify({ "token": token, "status": "ok", "results": results })
+        sanitized_results = [
+            {key: value if key != 'content' or result['status'] != 'other error' else "An internal error occurred." 
+             for key, value in result.items()} 
+            for result in results
+        ]
+        response = flask.jsonify({ "token": token, "status": "ok", "results": sanitized_results })
     else:
EOF
@@ -359,3 +359,8 @@

response = flask.jsonify({ "token": token, "status": "ok", "results": results })
sanitized_results = [
{key: value if key != 'content' or result['status'] != 'other error' else "An internal error occurred."
for key, value in result.items()}
for result in results
]
response = flask.jsonify({ "token": token, "status": "ok", "results": sanitized_results })
else:
py-src/data_formulator/agents/agent_sql_data_rec.py
Outside changed files

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agents/agent_sql_data_rec.py b/py-src/data_formulator/agents/agent_sql_data_rec.py
--- a/py-src/data_formulator/agents/agent_sql_data_rec.py
+++ b/py-src/data_formulator/agents/agent_sql_data_rec.py
@@ -190,3 +190,3 @@
                     logger.warning(error_message)
-                    result = {'status': 'other error', 'code': code_str, 'content': f"Unexpected error: {error_message}"}
+                    result = {'status': 'other error', 'code': code_str, 'content': "An internal error occurred while processing your request."}
             else:
EOF
@@ -190,3 +190,3 @@
logger.warning(error_message)
result = {'status': 'other error', 'code': code_str, 'content': f"Unexpected error: {error_message}"}
result = {'status': 'other error', 'code': code_str, 'content': "An internal error occurred while processing your request."}
else:
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
@Chenglong-MS Chenglong-MS requested a review from Copilot April 17, 2025 17:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 42 out of 44 changed files in this pull request and generated no comments.

Files not reviewed (2)
  • .env.template: Language not supported
  • src/scss/DataView.scss: Language not supported
Comments suppressed due to low confidence (3)

src/views/ConceptShelf.tsx:194

  • Verify that replacing the 'bin' operator with 'max' (and adding a separate 'min' card) is intentional and that any downstream logic relying on binning is updated accordingly.
<OperatorCard operator="max" />

src/data/utils.ts:57

  • Ensure that 'rows' is non-empty before accessing rows[0] when handling duplicate column names to prevent potential runtime errors when the input text is empty.
for (let i = 0; i < rows[0].length; i++) {

src/app/dfSlice.tsx:350

  • Review the updated filtering logic for 'conceptShelfItems'; verify that simply filtering by tableRef (without considering base field references via findBaseFields) meets the intended behavior.
state.conceptShelfItems = state.conceptShelfItems.filter(f => !(f.tableRef == tableId));

if conn:
conn.close()

response = flask.jsonify({ "token": token, "status": "ok", "results": results})

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.
Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI 9 months ago

To fix the issue, we need to ensure that sensitive error messages or stack traces are not exposed to external users. Instead, we should log the detailed error messages on the server for debugging purposes and return a generic error message to the user. Specifically:

  1. Modify the refine_data function in agent_routes.py to sanitize the results field before including it in the JSON response.
  2. Ensure that detailed error messages are logged using the logger module, but only generic messages are returned to the user.
  3. Update the run_transform_in_sandbox2020 and process_gpt_response methods to sanitize error messages before they propagate to the results field.

Suggested changeset 3
py-src/data_formulator/agent_routes.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -412,3 +412,8 @@
 
-        response = flask.jsonify({ "token": token, "status": "ok", "results": results})
+        sanitized_results = [
+            {key: value if key != 'content' or result['status'] == 'ok' else "An error occurred during data processing."
+             for key, value in result.items()}
+            for result in results
+        ]
+        response = flask.jsonify({ "token": token, "status": "ok", "results": sanitized_results})
     else:
EOF
@@ -412,3 +412,8 @@

response = flask.jsonify({ "token": token, "status": "ok", "results": results})
sanitized_results = [
{key: value if key != 'content' or result['status'] == 'ok' else "An error occurred during data processing."
for key, value in result.items()}
for result in results
]
response = flask.jsonify({ "token": token, "status": "ok", "results": sanitized_results})
else:
py-src/data_formulator/py_sandbox.py
Outside changed files

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/py_sandbox.py b/py-src/data_formulator/py_sandbox.py
--- a/py-src/data_formulator/py_sandbox.py
+++ b/py-src/data_formulator/py_sandbox.py
@@ -140,3 +140,3 @@
             'status': 'error',
-            'content': result['error_message']
+            'content': "An error occurred during data transformation. Please check the server logs for details."
         }
EOF
@@ -140,3 +140,3 @@
'status': 'error',
'content': result['error_message']
'content': "An error occurred during data transformation. Please check the server logs for details."
}
py-src/data_formulator/agents/agent_py_data_transform.py
Outside changed files

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agents/agent_py_data_transform.py b/py-src/data_formulator/agents/agent_py_data_transform.py
--- a/py-src/data_formulator/agents/agent_py_data_transform.py
+++ b/py-src/data_formulator/agents/agent_py_data_transform.py
@@ -233,2 +233,3 @@
                         logger.info(result['content'])
+                        result['content'] = "An error occurred during data transformation. Please check the server logs for details."
                 except Exception as e:
EOF
@@ -233,2 +233,3 @@
logger.info(result['content'])
result['content'] = "An error occurred during data transformation. Please check the server logs for details."
except Exception as e:
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
@Chenglong-MS Chenglong-MS requested a review from Copilot April 24, 2025 00:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR prepares the next release by enhancing support for large datasets through backend database integration (DuckDB), updates API endpoints for improved data handling, and refines both the chart configuration and various data agent functionalities. Key changes include:

  • Updating chart templates (e.g. adding the “y” channel for bar charts) and reconfiguring API endpoints in utils.
  • Adding session management and default chart configurations in the application state.
  • Revising Python agents and the sandbox logic (including dependency updates in pyproject.toml) to incorporate DuckDB and improve transformation safety.

Reviewed Changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated no comments.

File Description
src/components/ChartTemplates.tsx Added the "y" channel to the bar chart template to support improved encoding mapping.
src/app/utils.tsx, dfSlice.tsx Updated API endpoints and session/config management; revised concept shelf item filtering.
py-src/data_formulator/py_sandbox.py Renamed and updated sandbox execution logic; changed error response key semantics.
Other agent and configuration files Refined Python agents, added DuckDB dependency in pyproject.toml, and updated relevant UI.
Files not reviewed (1)
  • .env.template: Language not supported

@Chenglong-MS Chenglong-MS merged commit 2fb5016 into main Apr 24, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants