Improve response parsing: no ptypes, faster datetimes#513
Improve response parsing: no ptypes, faster datetimes#513
Conversation
| # Keep only the columns relevant to job termination; the API response | ||
| # includes extra fields (e.g. payload, guid) on error that vary by outcome. | ||
| keep <- c("app_id", "app_guid", "job_key", "job_id", "result", "code", "error") |
There was a problem hiding this comment.
Is it a problem that we get variable data at this point?
There was a problem hiding this comment.
I believe that the variable data is for error cases (i.e. trying to terminate a job that is not currently active, which returns a 409 but doesn't raise an R error) vs. successful requests. I guess it's debatable whether we should be raising an R error more eagerly, but I think the behavior of accommodating different field names is consistent with main.
The one difference here is that main will always return the columns of keep whereas on this branch if any columns from keep are missing across all responses, they'd be omitted. I'll update to make it more consistent with main.
There was a problem hiding this comment.
I was actually thinking a bit in the opposite direction: if we sometimes get different fieldnames that's probably totally ok. This is probably more important (or really, more possible) when we get to having these objects not be DFs that get passed around. Then the normalization of "here are the columns you're getting" can be done at the as.data.frame() point). We don't need to do this here, it just smelled a little funny to me
There was a problem hiding this comment.
Gotcha yeah, if we were returning a list then I agree it'd make more sense to not worry about the columns -- the responses are what they are, and that may include different fields.
Intent
Fixes #483
Approach
Removes the
connectapi_ptypesdictionary and vctrs-based type coercion system (ensure_columns,ensure_column,vec_cast). Instead, each getter function declares its owndatetime_colsand applies lightweight post-parse coercion viacoerce_datetime(), which handles RFC 3339 strings, epoch seconds, POSIXct pass-through, and all-NA columns.Parsing pipeline:
Connect$request()now usesjsonlite::fromJSON()instead ofhttr::content(as = "parsed"), giving us control over jsonlite's simplification behaviorparse_connect_rfc3339()uses a vectorized substr-based parser (faster than strptime on large vectors)page_cursor()fetches subsequent pages with simplify=TRUE so jsonlite builds data frames in C, then combines withvctrs::vec_rbind()I started this work while trying to improve performance of
get_usage_static()(see profiling on #501 (comment)). With this branch,get_usage_static()is ~20% faster than on main for 175k records.Checklist
NEWS.md(referencing the connected issue if necessary)?devtools::document()?