
Parallelize / cache collections/ logic? #2141

@C-Loftus

Description


Is your feature request related to a problem? Please describe.

Currently, when a user requests collections/, pygeoapi iterates over all configured collections serially, instantiating each class one by one. This becomes a bottleneck on servers with many collections.

Here is the function:

```python
@jsonldify
def describe_collections(api: API, request: APIRequest,
                         dataset=None) -> Tuple[dict, int, str]:
```

And here is one of the loops:

```python
for k, v in collections_dict.items():
    if v.get('visibility', 'default') == 'hidden':
        LOGGER.debug(f'Skipping hidden layer: {k}')
```

This is generally fine for local/offline data, but if a collection proxies a remote ESRI endpoint, or uses another provider that makes network requests to get metadata such as fields (e.g. xarray with S3), it can significantly slow down the entire collections/ response, particularly when there are several such collections.

Describe the solution you'd like

You could solve this in two ways:

  1. Parallelism/Concurrency:
  • Ideally this would be a good fit for async. However, given the existing challenges with async (Starlette server fails with top level async-await #1964), and since async wouldn't work with Flask regardless, a thread pool from the standard library seems like the easiest option. A thread pool adds some overhead at the start of the request but decreases latency overall.
  2. Caching
  • The collections endpoint could be wrapped with functools.cache, which would make it nearly instant without any overhead from parallelism. The downside is cache invalidation. One option would be to make caching collections opt-in via the pygeoapi config (i.e. not enabled by default), and allow a specific request header to bust the cache when needed.
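The two options above could be sketched roughly as follows. This is a minimal illustration, not pygeoapi's actual code: `describe_one`, `describe_all`, `describe_all_cached`, and `CONFIG` are hypothetical stand-ins for the per-collection work inside `describe_collections`.

```python
import functools
from concurrent.futures import ThreadPoolExecutor


def describe_one(name, config):
    # Hypothetical stand-in for the per-collection work (provider
    # instantiation, remote metadata fetch); may block on network I/O.
    return {'id': name, **config}


def describe_all(collections_dict, max_workers=8):
    # Option 1: fan the per-collection work out to a thread pool so
    # that slow remote providers overlap instead of running serially.
    visible = {k: v for k, v in collections_dict.items()
               if v.get('visibility', 'default') != 'hidden'}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda kv: describe_one(*kv), visible.items()))


@functools.lru_cache(maxsize=1)
def describe_all_cached(config_key):
    # Option 2: cache the assembled response. The caller passes a
    # hashable key and calls describe_all_cached.cache_clear() to bust
    # the cache (e.g. when a configured invalidation header is sent).
    return describe_all(CONFIG[config_key])


# Hypothetical configuration, mirroring the shape of collections_dict.
CONFIG = {'default': {'lakes': {'title': 'Lakes'},
                      'obs': {'title': 'Observations'}}}
```

The thread-pool version keeps the hidden-layer filtering on the request thread and only parallelizes the potentially network-bound work; the cached version trades freshness for speed, which is why it would need to be opt-in.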

Describe alternatives you've considered

A green-thread library such as https://www.gevent.org/ would be lighter weight than OS threads, but it would introduce another dependency.

Additional context

To my understanding, there is no way to limit what data gets returned in collections/. If that were possible, one could skip fetching the data that requires network calls (e.g. fields for EDR) and this would be less of an issue. However, I don't believe that is the case.

Labels

enhancement (New feature or request)