
Fix: Manifest content hash computation times out (#6123) #7258

Merged
dsotirho-ucsc merged 5 commits into develop from issues/dsotirho-ucsc/6123-manifest-content-hash
Feb 17, 2026

Conversation

@dsotirho-ucsc
Contributor

@dsotirho-ucsc dsotirho-ucsc commented Jul 8, 2025

Linked issues: #6123

Checklist

Author

  • PR is assigned to the author
  • PR is a draft
  • Target branch is develop
  • Name of PR branch matches issues/<GitHub handle of author>/<issue#>-<slug>
  • PR is linked to all issues it (partially) resolves
  • PR description links to connected issues
  • PR title matches¹ that of a linked issue or comment in PR explains why they're different
  • PR title references all linked issues
  • For each linked issue, there is at least one commit whose title references that issue

¹ When the issue title describes a problem, the corresponding PR title is Fix: followed by the issue title

Author (partiality)

  • Added p tag to titles of partial commits
  • This PR is labeled partial or completely resolves all linked issues
  • This PR partially resolves each of the linked issues or does not have the partial label

Author (reindex)

  • Added r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
  • This PR is labeled reindex:dev or the changes introduced by it will not require reindexing of dev
  • This PR is labeled reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
  • This PR is labeled reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
  • This PR is labeled reindex:prod or the changes introduced by it will not require reindexing of prod
  • This PR is labeled reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod

Author (API changes)

  • This PR and its linked issues are labeled API or this PR does not modify a REST API
  • Added a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
  • Updated REST API version number in app.py or this PR does not modify a REST API

Author (upgrading deployments)

  • Ran make docker_images.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading deployments
  • Added u tag to commit title or this PR does not require upgrading deployments
  • This PR is labeled upgrade or does not require upgrading deployments
  • This PR is labeled deploy:shared or does not modify docker_images.json, and does not require deploying the shared component for any other reason
  • This PR is labeled deploy:gitlab or does not require deploying the gitlab component
  • This PR is labeled deploy:runner or does not require deploying the runner image

Author (hotfixes)

  • Added F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
  • Reverted the temporary hotfixes for any linked issues or none of the stable branches (anvilprod and prod) have temporary hotfixes for any of the issues linked to this PR

Author (before every review)

  • Rebased PR branch on develop, squashed fixups from prior reviews
  • Ran make requirements_update or this PR does not modify requirements*.txt, common.mk, Makefile, Dockerfile or environment.boot
  • Added R tag to commit title or this PR does not modify requirements*.txt
  • This PR is labeled reqs or does not modify requirements*.txt
  • make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome
  • PR is awaiting requested review from a peer
  • Status of PR is Review requested
  • PR is assigned to only the peer

Peer reviewer (after approval)

Note that when requesting changes, the PR must be assigned back to the author.

  • Actually approved the PR
  • PR is not a draft
  • PR is awaiting requested review from system administrator
  • Status of PR is Review requested
  • PR is assigned to only the system administrator

System administrator (after approval)

  • Actually approved the PR
  • Labeled linked issues as demo or no demo
  • Commented on linked issues about demo expectations or all linked issues are labeled no demo
  • Decided if PR can be labeled no sandbox
  • A comment to this PR details the completed security design review
  • PR title is appropriate as title of merge commit
  • N reviews label is accurate
  • Status of PR is Approved
  • PR is assigned to only the operator

Operator

  • Checked reindex:… labels and r commit title tag
  • Checked that demo expectations are clear or all linked issues are labeled no demo
  • Squashed PR branch and rebased onto develop
  • Sanity-checked history
  • Pushed PR branch to GitHub

Operator (deploy .shared and .gitlab components)

  • Ran _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Ran _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Checked the items in the next section or this PR is labeled deploy:gitlab
  • PR is assigned to only the system administrator or this PR is not labeled deploy:gitlab

System administrator (post-deploy of .gitlab component)

  • Background migrations for dev.gitlab are complete or this PR is not labeled deploy:gitlab
  • Background migrations for anvildev.gitlab are complete or this PR is not labeled deploy:gitlab
  • PR is assigned to only the operator

Operator (deploy runner image)

  • Ran _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Ran _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner

Operator (sandbox build)

  • Added sandbox label or PR is labeled no sandbox
  • Pushed PR branch to GitLab dev or PR is labeled no sandbox
  • Pushed PR branch to GitLab anvildev or PR is labeled no sandbox
  • Build passes in sandbox deployment or PR is labeled no sandbox
  • Build passes in anvilbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in sandbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in anvilbox deployment or PR is labeled no sandbox
  • Deleted unreferenced indices in sandbox or this PR does not remove catalogs or otherwise cause unreferenced indices in dev
  • Deleted unreferenced indices in anvilbox or this PR does not remove catalogs or otherwise cause unreferenced indices in anvildev
  • Started reindex in sandbox or this PR is not labeled reindex:dev
  • Started reindex in anvilbox or this PR is not labeled reindex:anvildev
  • Checked for failures in sandbox or this PR is not labeled reindex:dev
  • Checked for failures in anvilbox or this PR is not labeled reindex:anvildev

Operator (merge the branch)

  • All status checks passed and the PR is mergeable
  • The title of the merge commit starts with the title of this PR
  • Added PR # reference to merge commit title
  • Collected commit title tags in merge commit title but only included p if the PR is also labeled partial
  • Pushed merge commit to GitHub
  • Status of PR is Merged lower
  • Status of blocked issues is Triage or no issues are blocked on the linked issues

Operator (main build)

  • Pushed merge commit to GitLab dev
  • Pushed merge commit to GitLab anvildev
  • Build passes on GitLab dev
  • Reviewed build logs for anomalies on GitLab dev
  • Build passes on GitLab anvildev
  • Reviewed build logs for anomalies on GitLab anvildev
  • Ran _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Ran _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Deleted PR branch from GitHub
  • Deleted PR branch from GitLab dev
  • Deleted PR branch from GitLab anvildev
  • Status of linked issues is Lower Triage

Operator (reindex)

  • Deindexed all unreferenced catalogs in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed all unreferenced catalogs in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Deindexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Indexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Indexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Started reindex in dev or this PR does not require reindexing dev
  • Started reindex in anvildev or this PR does not require reindexing anvildev
  • Checked for, triaged and possibly requeued messages in both fail queues in dev or this PR does not require reindexing dev
  • Checked for, triaged and possibly requeued messages in both fail queues in anvildev or this PR does not require reindexing anvildev
  • Emptied fail queues in dev or this PR does not require reindexing dev
  • Emptied fail queues in anvildev or this PR does not require reindexing anvildev
  • Restarted the Data Browser pipeline for the ucsc/hca/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/lungmap/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/anvil/anvildev branch on GitLab in anvildev or this PR does not require reindexing anvildev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in anvildev or this PR does not require reindexing anvildev

Operator (mirroring)

  • Started mirroring in dev or this PR does not require mirroring dev
  • Started mirroring in anvildev or this PR does not require mirroring anvildev
  • Checked for, triaged and possibly requeued messages in mirror fail queue in dev or this PR does not require mirroring dev
  • Checked for, triaged and possibly requeued messages in mirror fail queue in anvildev or this PR does not require mirroring anvildev
  • Emptied mirror fail queue in dev or this PR does not require mirroring dev
  • Emptied mirror fail queue in anvildev or this PR does not require mirroring anvildev

Operator

  • Propagated the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
  • Propagated any specific instructions related to the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels
  • PR is assigned to no one

Shorthand for review comments

  • L line is too long
  • W line wrapping is wrong
  • Q bad quotes
  • F other formatting problem

@github-actions github-actions Bot added the orange label Jul 8, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch from b5212ca to 7ed3673 Compare July 8, 2025 00:47
@codecov

codecov Bot commented Jul 8, 2025

Codecov Report

❌ Patch coverage is 87.67123% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.83%. Comparing base (9578669) to head (3e29ec5).
⚠️ Report is 6 commits behind head on develop.

Files with missing lines Patch % Lines
test/integration_test.py 0.00% 6 Missing ⚠️
src/azul/azulclient.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7258      +/-   ##
===========================================
- Coverage    84.83%   84.83%   -0.01%     
===========================================
  Files          157      157              
  Lines        23161    23185      +24     
===========================================
+ Hits         19648    19668      +20     
- Misses        3513     3517       +4     


@coveralls

coveralls commented Jul 8, 2025

Coverage Status

coverage: 85.049% (-0.001%) from 85.05%
when pulling 3b68da4 on issues/dsotirho-ucsc/6123-manifest-content-hash
into 9578669 on develop.

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch from 7ed3673 to d7e9024 Compare July 8, 2025 19:56
@dsotirho-ucsc dsotirho-ucsc added the API API change affecting callers label Jul 8, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch 10 times, most recently from 2a9c76f to e4693d5 Compare July 10, 2025 17:27
@dsotirho-ucsc
Contributor Author

7258_IT_2025-07-10.txt

achave11-ucsc
achave11-ucsc previously approved these changes Jul 11, 2025
Member

@achave11-ucsc achave11-ucsc left a comment


LGTM ✅

@achave11-ucsc achave11-ucsc marked this pull request as ready for review July 11, 2025 23:48
Comment thread environment.py
Comment thread src/azul/service/manifest_service.py Outdated
"""
return self._manifest_hash('bundles')

def _manifest_hash(self, base: str) -> int:
Member


PL (modal parameter smell, literal string)

@hannes-ucsc hannes-ucsc removed their assignment Jul 14, 2025
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6123-manifest-content-hash branch 3 times, most recently from bd5d131 to 81a2b78 Compare July 21, 2025 23:59
Member

@hannes-ucsc hannes-ucsc left a comment


I meant to request changes in my previous review.

@dsotirho-ucsc
Contributor Author

Please prove that IT passes in personal deployment with bundle notification enabled and disabled.

Note that IT was run prior to the drop commit.

With bundle notifications disabled (default configuration):
7258_IT_2026-01-07-disabled.txt

With bundle notifications enabled (set in environment.py):
7258_IT_2026-01-07-enabled.txt

Member

@hannes-ucsc hannes-ucsc left a comment


The unit tests break on the drop commit. Please address that, and leave the drop commit in place.

Comment thread src/azul/service/manifest_service.py Fixed
@dsotirho-ucsc
Contributor Author

7258_IT_2026-01-14.txt

Comment thread test/indexer/test_index_controller.py Outdated
return entities


@skipIf(not config.enable_bundle_notifications,
Member


If a test requires bundle notifications to be enabled, it should patch that in instead of just being skipped.

@hannes-ucsc
Member

… in other words, please remove the global patch and individually patch the tests that depend on bundle notifications. Use a mixin instead of decorators. Any tests that patch in both cases (enabled and disabled), should not use the mixin. You can remove the drop commit but ensure that the branch history makes obvious (e.g., via fixups) what was changed in response to this latest review.
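
A minimal sketch of the mixin approach requested above, assuming a config object that exposes the setting as a property. The names `Config`, `BundleNotificationsTestMixin` and the two test classes are hypothetical. The project's own helpers (`addClassPatch`, `patch_config`) are not reproduced here; the standard library's `addClassCleanup` stands in for them.

```python
import unittest
from unittest.mock import PropertyMock, patch


class Config:
    """Hypothetical stand-in for the project's config object."""

    @property
    def enable_bundle_notifications(self) -> bool:
        return False  # assumed default: notifications disabled


config = Config()


class BundleNotificationsTestMixin:
    """Enables bundle notifications for all tests of the class it is mixed into."""

    @classmethod
    def setUpClass(cls) -> None:
        super().setUpClass()
        patcher = patch.object(type(config),
                               'enable_bundle_notifications',
                               new_callable=PropertyMock,
                               return_value=True)
        patcher.start()
        # Undo the patch once the last test of this class has run
        cls.addClassCleanup(patcher.stop)


class TestNeedsNotifications(BundleNotificationsTestMixin, unittest.TestCase):

    def test_enabled(self) -> None:
        self.assertTrue(config.enable_bundle_notifications)


class TestDefault(unittest.TestCase):

    def test_disabled(self) -> None:
        self.assertFalse(config.enable_bundle_notifications)
```

A test class that exercises both the enabled and the disabled case would skip the mixin and patch the property inside each test method instead, as the comment above suggests.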

@dsotirho-ucsc
Contributor Author

7258_IT_2026-01-22.txt

Comment thread test/azul_test_case.py Outdated

@classmethod
def _patch_enable_bundle_notifications(cls):
cls.addClassPatch(patch.object(target=type(config),
Member


Can you use patch_config here?

Comment thread test/azul_test_case.py Outdated
cls._patch_enable_bundle_notifications()

@classmethod
def _patch_enable_bundle_notifications(cls):
Member


Inline this method, please. I don't see any other callsites or other reasons to have it.

@dsotirho-ucsc
Contributor Author

7258_IT_2026-01-27.txt

Member

@hannes-ucsc hannes-ucsc left a comment


No fixups next time, please. Push commits individually.

Index: src/azul/service/manifest_service.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/service/manifest_service.py b/src/azul/service/manifest_service.py
--- a/src/azul/service/manifest_service.py	(revision 7d93c67e2b241feab2bf219b0dd247dc9be8cafe)
+++ b/src/azul/service/manifest_service.py	(date 1769619165737)
@@ -104,6 +104,7 @@
 )
 from azul.indexer.document import (
     DocumentType,
+    EntityType,
     FieldPath,
 )
 from azul.indexer.field import (
@@ -851,7 +852,7 @@
 
     @property
     @abstractmethod
-    def entity_type(self) -> str:
+    def entity_type(self) -> EntityType:
         """
         The type of the index entities this generator consumes. This controls
         which aggregate Elasticsearch index is queried to fetch the aggregate
@@ -993,11 +994,7 @@
         # The explicit filters are already normalized so we don't to do anything
         # special to desensitize the hash to insignificat differences
         filter_string = json.dumps(self.filters.explicit)
-        # If incremental index changes are disabled, we don't need to worry
-        # about individual bundles, only sources.
-        content_hash = str(
-            self.manifest_hash(by_bundle=config.enable_bundle_notifications)
-        )
+        content_hash = self._content_hash(by_bundle=config.enable_bundle_notifications)
         catalog = self.catalog
         format = self.format()
         manifest_hash_input = [
@@ -1047,9 +1044,7 @@
             file_name = atlas + '-manifest-' + self.s3_object_key_base(manifest_key)
         return file_name
 
-    def _create_request(self, entity_type: str | None = None) -> Search:
-        if entity_type is None:
-            entity_type = self.entity_type
+    def _create_request(self, entity_type: EntityType) -> Search:
         pipeline = self._create_pipeline()
         request = self.service.create_request(self.catalog, entity_type)
         request = pipeline.prepare_request(request)
@@ -1171,7 +1166,7 @@
         return self.mirror_service.mirror_uri(source, file_cls, file)
 
     @cache
-    def manifest_hash(self, *, by_bundle: bool) -> int:
+    def _content_hash(self, *, by_bundle: bool) -> str:
         """
         Return a hash of the input this generator builds the manifest from. The
         input is the set of ES documents from the files index. For two generator
@@ -1204,14 +1199,14 @@
         filters. This mode should *not* be used if the index is changing or is
         likely to change due to the incremental incorporation of bundles.
         """
-        log.debug('Computing content hash for manifest from %s using %r ...',
+        log.debug('Computing content hash from %s matching %r ...',
                   'bundles' if by_bundle else 'sources', self.filters)
         start_time = time.time()
         if by_bundle:
-            request = self._create_request()
+            entity_type = None
         else:
-            root_entity_type = self.metadata_plugin.root_entity_type
-            request = self._create_request(entity_type=root_entity_type)
+            entity_type = self.metadata_plugin.root_entity_type
+        request = self._create_request(entity_type)
         request.aggs.metric(
             'hash',
             'scripted_metric',
@@ -1244,8 +1239,8 @@
         request = request.extra(size=0)
         response = request.execute()
         assert len(response.hits) == 0
-        hash_value = response.aggregations.hash.value
-        log.info('Manifest content hash %i was computed in %.3fs using filters %r.',
+        hash_value = str(response.aggregations.hash.value)
+        log.info('Content hash %r was computed in %.3fs using filters %r.',
                  hash_value, time.time() - start_time, self.filters)
         return hash_value
 
@@ -1450,7 +1445,7 @@
         return 'curlrc'
 
     @property
-    def entity_type(self) -> str:
+    def entity_type(self) -> EntityType:
         return 'files'
 
     @cached_property
@@ -1704,7 +1699,7 @@
         return 'tsv'
 
     @property
-    def entity_type(self) -> str:
+    def entity_type(self) -> EntityType:
         return 'files'
 
     @cached_property
@@ -1819,7 +1814,7 @@
         return None
 
     def _all_docs_sorted(self) -> Iterable[JSON]:
-        request = self._create_request()
+        request = self._create_request(self.entity_type)
         request = request.params(preserve_order=True).sort('entity_id.keyword')
         for hit in request.scan():
             doc = self._hit_to_doc(hit)
@@ -1849,7 +1844,7 @@
                                 metaclass=ABCMeta):
 
     @property
-    def entity_type(self) -> str:
+    def entity_type(self) -> EntityType:
         # Orphans only have projects/datasets as hubs, so we need to retrieve
         # aggregates of those types in order to join against orphan replicas
         root_entity_type = self.metadata_plugin.root_entity_type
Index: src/azul/plugins/__init__.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/__init__.py b/src/azul/plugins/__init__.py
--- a/src/azul/plugins/__init__.py	(revision 7d93c67e2b241feab2bf219b0dd247dc9be8cafe)
+++ b/src/azul/plugins/__init__.py	(date 1769618757589)
@@ -496,7 +496,7 @@
         raise NotImplementedError
 
     @property
-    def root_entity_type(self) -> str:
+    def root_entity_type(self) -> EntityType:
         """
         The type of entity that sits at the root of the entity graph, and that
         all other entities are directly or indirectly associated with.
@@ -509,8 +509,10 @@
         """
         raise NotImplementedError
 
+    # REVIEW: Separate commit for the type hint changes
+
     @property
-    def hot_entity_types(self) -> Iterable[str]:
+    def hot_entity_types(self) -> Iterable[EntityType]:
         """
         The types of inner entities that do not explicitly track their hubs in
         replica documents in order to avoid a large list of hub references in

@cached_property
def manifest_content_hash(self) -> int:
log.debug('Computing content hash for manifest using filters %r ...', self.filters)
@cache

Check warning: Code scanning / CodeQL

Use of the return value of a procedure (Warning)

The result of cache is used even though it is always None.
@dsotirho-ucsc
Contributor Author

7258_IT_2026-01-29.txt

Member

@hannes-ucsc hannes-ucsc left a comment


Your commit titles tend to be a little too specific. You typically don't need to identify the artifacts that you are modifying. The title should document the intent, specify the refactoring, or call out the issue being fixed. Which artifacts are being modified is usually immediately apparent from the diff. Please change the titles to this:

Add FIXME (#7183)
Use type alias for entity type
[A] Workaround: Manifest content hash computation times out (#6123)
Fix method type hint
Refactor unit test

Comment thread src/azul/service/manifest_service.py Outdated
hash_value = response.aggregations.hash.value
log.info('Manifest content hash %i was computed in %.3fs using filters %r.',
hash_value = str(response.aggregations.hash.value)
log.info('Content hash %r was computed in %.3fs using filters %r.',
Member


'Computed content hash %r from %s matching %r'

Not sure about period at the end. Match existing conventions.

Member

@hannes-ucsc hannes-ucsc left a comment


Oh, and no FIXUPS please next time and push commits individually so that there is a status check for each one.

@dsotirho-ucsc
Contributor Author

7258_IT_2026-01-30.txt

Comment thread environment.py
# FIXME: Enable bundle notifications again #7183
# https://github.com/DataBiosphere/azul/issues/7183
#
'AZUL_ENABLE_BUNDLE_NOTIFICATIONS': '0'
Member


Let's change this to 1 for prod and anvilprod for the time being, and make this PR partial. We'll let it marinate in the lower deployments for three to four weeks. Until then, the ticket should remain in Triage.

@dsotirho-ucsc
Contributor Author

7258_IT_2026-02-06.txt

Comment thread deployments/anvilprod/environment.py Outdated
'AZUL_MIRRORING_CONCURRENCY': '128',

# FIXME: Remove once bundle notifications are enabled globally
# https://github.com/DataBiosphere/azul/issues/7183
Member

@hannes-ucsc hannes-ucsc Feb 9, 2026


Wrong text and referenced issue. See my previous review about when these will be removed.

@dsotirho-ucsc
Contributor Author

7258_IT_2026-02-13.txt

@hannes-ucsc
Member

Security Design Review: PR #7258

Title: Fix: Manifest content hash computation times out (#6123)

Summary of Changes

This PR introduces a workaround for manifest content hash computation timeouts by adding a new AZUL_ENABLE_BUNDLE_NOTIFICATIONS configuration toggle. When bundle notifications are disabled (the new default on lower deployments), the manifest content hash is computed from source IDs instead of bundle FQIDs, which is faster but only safe when the index is not being incrementally updated via bundle notifications. The bundle notification endpoint (/{catalog}/bundles) is also disabled when the toggle is off.
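
The trade-off between the two hash modes can be illustrated in plain Python. This is only a sketch: the actual computation runs as an Elasticsearch scripted metric aggregation, and the document shape, the `content_hash` function and the XOR combination below are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Mapping


def _digest(value: str) -> int:
    # Truncated SHA-256 as a stand-in for whatever hash the real script uses
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], 'big')


def content_hash(docs: Iterable[Mapping[str, list[str]]],
                 *,
                 by_bundle: bool) -> str:
    """
    Order-independent hash over the distinct bundle FQIDs (by_bundle=True)
    or source IDs (by_bundle=False) referenced by the given documents.
    """
    field = 'bundles' if by_bundle else 'sources'
    ids = {id_ for doc in docs for id_ in doc[field]}
    combined = 0
    for id_ in ids:
        # XOR is commutative, so document order does not affect the result
        combined ^= _digest(id_)
    return str(combined)


# Hypothetical documents: a second bundle arrives from the same source
before = [{'bundles': ['b1.v1'], 'sources': ['s1']}]
after = before + [{'bundles': ['b2.v1'], 'sources': ['s1']}]
```

With these inputs, the bundle-based hash changes when the second bundle is added, while the source-based hash stays the same. That is why the source-based mode is only appropriate when bundles are not being incorporated incrementally.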

Security-Relevant Changes

  1. Disabling the /{catalog}/bundles endpoint (index_controller.py:74)

    • When enable_bundle_notifications is False, the route is not registered via the enabled=False parameter to @app.route. This means Chalice will return a 403 (not found/forbidden) for requests to this path.
    • Defense in depth: An assert config.enable_bundle_notifications guard is added inside the index_bundle handler (index_controller.py:213), ensuring that even if the route were somehow reached, the handler would reject the request. This is good practice.
    • The endpoint already requires HMAC authentication (index_controller.py:204), so disabling it does not remove a previously unprotected route — it disables an already-protected one. No concern.
  2. New environment variable AZUL_ENABLE_BUNDLE_NOTIFICATIONS

    • Parsed via the existing _boolean() method which only accepts '0' and '1'. This is a well-established pattern in the codebase. No concern.
    • prod and anvilprod explicitly set it to '1' (enabled), preserving current behavior. Lower deployments default to '0' (disabled). This is documented with FIXMEs for future cleanup. No concern.
  3. Elasticsearch scripted metric aggregation (manifest_service.py:1213-1238)

    • The new by_bundle=False code path introduces a new Painless script that iterates over params._source.sources and hashes source.id. The scripts are string literals with no user-controlled interpolation — all values come from the ES document source. This is the same pattern as the pre-existing by_bundle=True path. No injection risk.
  4. Manifest cache correctness (not strictly security, but integrity-relevant)

    • When by_bundle=False, the content hash is based on source IDs only. As documented in the docstring, this mode must not be used when the index is changing incrementally. Since bundle notifications (the only incremental update path) are disabled in this mode, the invariant holds. A false positive (stale manifest served) could only occur if the index were modified through a path other than bundle notifications while this mode is active.
    • The manifest_key now includes the by_bundle value indirectly via the hash output, so manifests generated under different modes will have different keys. No cache poisoning across modes.
  5. AzulClientNotificationError now carries HTTP status codes (azulclient.py:209)

    • The error now includes the set of HTTP error codes in args[1]. This is used in the integration test to assert the expected status (400 vs 403). These are internal error details, not exposed to end users. No information leakage concern.
  6. patch_config type hint broadened (azul_test_case.py)

    • Now accepts bool | int | str instead of just str. This is test infrastructure only. No concern.
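
The strict parsing described in item 2 and the assertion guard from item 1 might look roughly like the following. `parse_boolean` and `handle_notification` are hypothetical names; the real `_boolean()` method and the handler body are not shown in this thread, so the error type, message and return value are assumptions.

```python
from typing import Mapping


def parse_boolean(env: Mapping[str, str], name: str, default: str = '0') -> bool:
    """Accept only '0' and '1'; reject anything else instead of guessing."""
    value = env.get(name, default)
    if value == '0':
        return False
    if value == '1':
        return True
    raise ValueError(f'{name} must be "0" or "1", got {value!r}')


def handle_notification(enabled: bool, notification: dict) -> str:
    # Secondary guard: even if the route were registered by mistake,
    # the handler refuses to process notifications while disabled.
    assert enabled, 'Bundle notifications are disabled'
    return 'queued'  # placeholder result, an assumption of this sketch
```

Rejecting values such as `'true'` outright means a mistyped deployment setting fails loudly at startup rather than silently falling back to one of the two modes.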

Threat Model Considerations

  • Unauthorized index modification: The bundle notification endpoint is the only external-facing mutation endpoint for the index. Disabling it reduces the attack surface. When disabled, the endpoint returns 403, and the handler has a secondary assertion guard.
  • Stale/incorrect manifest delivery: The source-based hash is less sensitive than the bundle-based hash, but since bundle notifications are the only incremental update mechanism and they are disabled in this mode, the risk of serving stale manifests is acceptably low given the design constraints.
  • No new external inputs: No new user-facing API parameters, query parameters, or request body fields are introduced. The only new input is a server-side environment variable.

Verdict

No security concerns identified. The changes reduce attack surface by disabling an endpoint, add defense-in-depth with an assertion guard, introduce no new external inputs, and the Elasticsearch scripts remain free of injection risk. The manifest cache integrity trade-off is well-documented and appropriate given the constraints.

@hannes-ucsc
Member

Security design review

  • Security design review completed; this PR does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with Terra
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below


Labels

  • 4+ reviews: [process] Lead requested changes four times or more
  • API: API change affecting callers
  • partial: [process] PR does not completely resolve associated ticket
  • sandbox: [process] Resolution is being verified in sandbox deployment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Manifest content hash computation times out

5 participants