
Improve response parsing: no ptypes, faster datetimes#513

Open
karawoo wants to merge 5 commits into main from kara-parsing

Conversation


@karawoo (Collaborator) commented Mar 5, 2026

Intent

Fixes #483

Approach

Removes the connectapi_ptypes dictionary and vctrs-based type coercion system (ensure_columns, ensure_column, vec_cast). Instead, each getter function declares its own datetime_cols and applies lightweight post-parse coercion via coerce_datetime(), which handles RFC 3339 strings, epoch seconds, POSIXct pass-through, and all-NA columns.
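The coercion described above can be sketched roughly as follows. This is an illustrative shape only, not the actual implementation; everything except the coerce_datetime() name is an assumption:

```r
# Sketch of the per-getter datetime coercion (illustrative, not the real code).
coerce_datetime <- function(df, datetime_cols) {
  for (col in intersect(datetime_cols, names(df))) {
    x <- df[[col]]
    if (inherits(x, "POSIXct")) {
      next                                     # already parsed: pass through
    } else if (all(is.na(x))) {
      # all-NA column: retype without attempting to parse
      df[[col]] <- .POSIXct(rep(NA_real_, length(x)), tz = "UTC")
    } else if (is.numeric(x)) {
      # epoch seconds
      df[[col]] <- as.POSIXct(x, origin = "1970-01-01", tz = "UTC")
    } else {
      # RFC 3339 strings
      df[[col]] <- parse_connect_rfc3339(x)
    }
  }
  df
}
```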

Parsing pipeline:

  • Connect$request() now uses jsonlite::fromJSON() instead of httr::content(as = "parsed"), giving us control over jsonlite's simplification behavior
  • parse_connect_rfc3339() uses a vectorized substr-based parser (faster than strptime on large vectors)
  • page_cursor() fetches subsequent pages with simplify=TRUE so jsonlite builds data frames in C, then combines with vctrs::vec_rbind()
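The substr-based parsing idea can be sketched like this (a simplified sketch, not the package's parse_connect_rfc3339(); fractional seconds and non-UTC offsets are omitted). Extracting fixed-width fields with substr() and assembling the result with ISOdatetime() avoids strptime()'s per-element format handling on large vectors:

```r
# Simplified vectorized RFC 3339 parser for UTC timestamps like
# "2026-03-05T20:13:00Z" (illustrative sketch only).
parse_rfc3339_utc <- function(x) {
  ISOdatetime(
    as.integer(substr(x, 1, 4)),    # year
    as.integer(substr(x, 6, 7)),    # month
    as.integer(substr(x, 9, 10)),   # day
    as.integer(substr(x, 12, 13)),  # hour
    as.integer(substr(x, 15, 16)),  # minute
    as.numeric(substr(x, 18, 19)),  # second
    tz = "UTC"
  )
}
```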

I started this work while trying to improve performance of get_usage_static() (see profiling on #501 (comment)). With this branch, get_usage_static() is ~20% faster than on main for 175k records.

Checklist

  • Does this change update NEWS.md (referencing the connected issue if necessary)?
  • Does this change need documentation? Have you run devtools::document()?

@karawoo karawoo requested a review from jonkeane March 5, 2026 20:13
@karawoo karawoo mentioned this pull request Mar 5, 2026
Comment on lines +860 to +862
# Keep only the columns relevant to job termination; the API response
# includes extra fields (e.g. payload, guid) on error that vary by outcome.
keep <- c("app_id", "app_guid", "job_key", "job_id", "result", "code", "error")
Collaborator


Is it a problem that we get variable data at this point?

Collaborator Author


I believe that the variable data is for error cases (i.e. trying to terminate a job that is not currently active, which returns a 409 but doesn't raise an R error) vs. successful requests. I guess it's debatable whether we should be raising an R error more eagerly, but I think the behavior of accommodating different field names is consistent with main.

The one difference here is that main will always return the columns of keep, whereas on this branch, if any columns from keep are missing across all responses, they'd be omitted. I'll update to make it more consistent with main.
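One way to guarantee all keep columns come back regardless of the response shape (an illustrative sketch, not the exact fix) is to add any missing ones as NA before subsetting:

```r
# Ensure every column in `keep` is present, matching main's behavior
# (illustrative sketch; `df` is the combined response data frame).
missing <- setdiff(keep, names(df))
df[missing] <- NA
df <- df[keep]
```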

Collaborator


I was actually thinking a bit in the opposite direction: if we sometimes get different field names, that's probably totally OK. This will matter more (or really, become possible) once these objects aren't data frames that get passed around; then the normalization of "here are the columns you're getting" can happen at the as.data.frame() point. We don't need to do this here, it just smelled a little funny to me.

Collaborator Author


Gotcha, yeah. If we were returning a list then I agree it'd make more sense not to worry about the columns -- the responses are what they are, and that may include different fields.



Development

Successfully merging this pull request may close these issues.

Remove or drastically overhaul type parsing even when producing data frames
