Is your feature request related to a problem? Please describe.
Currently, when a user requests the collections/ endpoint, pygeoapi iterates through all the collections serially, instantiating the provider class for each one. This becomes a bottleneck on servers with many collections. The relevant function is in pygeoapi/pygeoapi/api/__init__.py (lines 930 to 933 at fa3d129), and one of the loops is at lines 966 to 968.
This is generally fine for local / offline data, but if you need to proxy a remote ESRI endpoint or use another provider that makes a network request to fetch metadata such as fields (e.g. xarray with S3), it can significantly slow down the entire collections/ response, particularly if there are multiple such collections.
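The serial pattern described above looks roughly like this sketch; load_provider_metadata is a hypothetical stand-in for pygeoapi's actual provider instantiation, with a sleep simulating the network round trip:

```python
import time

def load_provider_metadata(name):
    """Stand-in for instantiating a provider and fetching its fields.
    For a remote provider (ESRI, xarray over S3) this blocks on I/O."""
    time.sleep(0.01)  # simulate a network round trip
    return {"id": name, "fields": []}

def describe_collections(collection_names):
    # Each collection is processed one by one, so total latency is the
    # sum of every provider's metadata fetch.
    return [load_provider_metadata(name) for name in collection_names]

collections = describe_collections(["lakes", "obs", "coads"])
```

With N slow collections the response time grows linearly in N, which is the bottleneck this issue describes.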
Describe the solution you'd like
You could solve this in two ways:
Parallelism/Concurrency:
Ideally this would be a good fit for async. However, there are some challenges with async (see "Starlette server fails with top level async-await" #1964), and I don't think async would work with Flask regardless, so a thread pool seems like the easiest option available in the standard library. A thread pool would add some overhead at the start but decrease overall latency.
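A standard-library thread pool version might look like the following sketch (the function names are illustrative, not pygeoapi's actual API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_provider_metadata(name):
    """Stand-in for instantiating a provider and fetching its fields."""
    time.sleep(0.01)  # simulate a network round trip
    return {"id": name, "fields": []}

def describe_collections(collection_names, max_workers=8):
    # Threads overlap the blocking network calls, so total latency is
    # roughly the slowest single fetch rather than the sum of all fetches.
    # pool.map preserves the input order of the collections.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_provider_metadata, collection_names))

collections = describe_collections(["lakes", "obs", "coads"])
```

Since the work is I/O-bound (waiting on remote metadata), the GIL is not a problem here, and ThreadPoolExecutor keeps the change dependency-free.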
Caching
You could wrap the collections endpoint with functools.cache and it would become nearly instant, without the overhead of parallelism. The downside, however, is dealing with cache invalidation. One option would be to make collection caching opt-in via the pygeoapi config (i.e. not enabled by default) and allow a specific request header to break the cache when needed.
Describe alternatives you've considered
You could use a green-thread library such as gevent (https://www.gevent.org/), which would be lighter weight than OS threads, but this would introduce another dependency.
Additional context
To my understanding, there is no way to limit what data gets returned in collections. If that were possible, one could skip fetching the data that requires network calls (e.g. fields for EDR) and this would be less of an issue. However, I don't believe that is the case.