Skip to content

[SOLR-18176] HttpShardHandler query throughput bottleneck from ZooKeeper#4237

Open
mlbiscoc wants to merge 2 commits intoapache:mainfrom
mlbiscoc:SOLR-18176-shardhandler-bottleneck
Open

[SOLR-18176] HttpShardHandler query throughput bottleneck from ZooKeeper#4237
mlbiscoc wants to merge 2 commits intoapache:mainfrom
mlbiscoc:SOLR-18176-shardhandler-bottleneck

Conversation

@mlbiscoc
Copy link
Copy Markdown
Contributor

https://issues.apache.org/jira/browse/SOLR-18176

HttpShardHandler was bottlenecking in throughput due to CloudReplicaSource recalling for ZooKeeper collection state with every distrib request due to the missing allowCache=true parameter. This resulted in large CPU utilization in ZooKeeper and the synchronized call blocking QTP threads waiting for zookeeper response.

See JIRA above for more detailed information.

@mlbiscoc mlbiscoc requested a review from dsmiley March 24, 2026 20:17
Copy link
Copy Markdown
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!!!
Remember ./gradlew writeChangelog

Thankfully, org.apache.solr.servlet.HttpSolrCall is also asking for the cached version (2 places).

Makes me wonder, what other callers of getCollection (thus allowCached) were making a deliberate decision vs accidental? IMO the cached version should be the default, if there should even be a default.

@mlbiscoc
Copy link
Copy Markdown
Contributor Author

IMO the cached version should be the default, if there should even be a default.

I agree. I thought about the same thing but I have no idea of the impact and potential consequences of making that change.

Copy link
Copy Markdown
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine without a test but it'd be nice. Not some narrow unit test but somehow deeper. Like... SolrZkClient has metrics... so I could imagine a test that reads a metric before, does the query, then reads the metric again to ensure it hasn't changed. Such a test would comprehensively ensure what we want without depending on any internals other than of course metric names but that's fair.

@dsmiley
Copy link
Copy Markdown
Contributor

dsmiley commented Mar 26, 2026

Thinking of the bigger picture of Solr's internal APIs, caching or not, I'm reminded of a year ago working closer with ClusterState. ClusterState has a bit of javadocs claiming that it is immutable. But it actually isn't -- it can fetch newer state. It's live nodes can change later. Some parts of Solr implicitly expect ClusterState to be an immutable snapshot of the world -- replica placement plugins and I think some general cluster processing. I'd rather see a simplification of ClusterState to actually be immutable, and instead the ClusterStateProvider be the primary API to get / cache state.json & livenodes & aliases & cluster properties etc. Aligned with this would be moving some of CloudSolrClient's caching into an improved ClusterStateProvider base class. I wonder what you think @HoustonPutman ?

@HoustonPutman
Copy link
Copy Markdown
Contributor

Thinking of the bigger picture of Solr's internal APIs, caching or not, I'm reminded of a year ago working closer with ClusterState. ClusterState has a bit of javadocs claiming that it is immutable. But it actually isn't -- it can fetch newer state. It's live nodes can change later. Some parts of Solr implicitly expect ClusterState to be an immutable snapshot of the world -- replica placement plugins and I think some general cluster processing. I'd rather see a simplification of ClusterState to actually be immutable, and instead the ClusterStateProvider be the primary API to get / cache state.json & livenodes & aliases & cluster properties etc. Aligned with this would be moving some of CloudSolrClient's caching into an improved ClusterStateProvider base class. I wonder what you think @HoustonPutman ?

Yeah, that sounds good to me. I don't really have an opinion on whether ClusterState should be immutable or not, but it should be one or the other (immutable or everything is updated) and not in some weird middle-ground.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants