
Parallelize / cache collections/ logic? #2141

@C-Loftus

Description


Is your feature request related to a problem? Please describe.

Currently, when a user requests collections/, pygeoapi iterates over all configured collections serially, instantiating each class one by one. This becomes a bottleneck on servers with many collections.

Here is the function:

```python
@jsonldify
def describe_collections(api: API, request: APIRequest,
                         dataset=None) -> Tuple[dict, int, str]:
```

And here is one of the loops:

```python
for k, v in collections_dict.items():
    if v.get('visibility', 'default') == 'hidden':
        LOGGER.debug(f'Skipping hidden layer: {k}')
```

This is generally fine for local/offline data, but if a collection proxies a remote ESRI endpoint, or uses another provider that makes network requests to get metadata such as fields (e.g. xarray with S3), it can significantly slow down the entire collections/ response, particularly when there are several such collections.

Describe the solution you'd like

You could solve this in two ways:

  1. Parallelism/Concurrency:
  • Ideally this would be a good fit for async. However, given the existing challenges with async (Starlette server fails with top level async-await #1964), and since async wouldn't work with Flask regardless, a thread pool from the standard library seems like the easiest option. A thread pool adds some overhead at the start of the request but decreases latency overall.
  2. Caching
  • The collections endpoint could be wrapped with functools.cache, which would make it nearly instant without any overhead from parallelism. The downside is cache invalidation. One option would be to make caching collections opt-in via the pygeoapi config (i.e. not enabled by default), and allow a specific request header to bust the cache when needed.
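The two options above could be sketched roughly as follows. This is a minimal illustration, not pygeoapi's actual code: `describe_one`, `describe_all`, `describe_all_cached`, and `CONFIG` are hypothetical stand-ins for the per-collection work inside `describe_collections`.

```python
import functools
from concurrent.futures import ThreadPoolExecutor


def describe_one(name, config):
    # Hypothetical stand-in for the per-collection work (provider
    # instantiation, remote metadata fetch); may block on network I/O.
    return {'id': name, **config}


def describe_all(collections_dict, max_workers=8):
    # Option 1: fan the per-collection work out to a thread pool so
    # that slow remote providers overlap instead of running serially.
    visible = {k: v for k, v in collections_dict.items()
               if v.get('visibility', 'default') != 'hidden'}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda kv: describe_one(*kv), visible.items()))


@functools.lru_cache(maxsize=1)
def describe_all_cached(config_key):
    # Option 2: cache the assembled response. The caller passes a
    # hashable key and calls describe_all_cached.cache_clear() to bust
    # the cache (e.g. when a configured invalidation header is sent).
    return describe_all(CONFIG[config_key])


# Hypothetical configuration, mirroring the shape of collections_dict.
CONFIG = {'default': {'lakes': {'title': 'Lakes'},
                      'obs': {'title': 'Observations'}}}
```

The thread-pool version keeps the hidden-layer filtering on the request thread and only parallelizes the potentially network-bound work; the cached version trades freshness for speed, which is why it would need to be opt-in.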

Describe alternatives you've considered

A green-thread library such as https://www.gevent.org/ would be lighter weight than OS threads, but it would introduce another dependency.

Additional context

To my understanding, there is no way to limit what data gets returned in collections/. If that were possible, one could skip fetching the data that requires network calls (e.g. fields for EDR) and this would be less of an issue. However, I don't believe that is the case.

Labels

enhancement (New feature or request)