@sarvekshayr
Contributor

What changes were proposed in this pull request?

Ozone currently has no way to clear missing containers from the system. Even if all the data is deleted from the OM, the block deletes will never leave SCM because it has no replicas to send them to.
As a short-term mitigation, this PR adds a CLI to SCM that supports "acking" (acknowledging) missing containers by ID once the admin confirms they are not a problem, so that they do not mask future issues. Acknowledged containers are removed from the ozone admin container report output and from the missing container count metric. The acknowledgement is persisted in the ContainerInfo in SCM and shown as a property in ozone admin container info. There are also CLI commands to raise a container as an issue again and to list the acked missing containers.
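
For readers who want the intended behaviour in code form, here is a minimal sketch of how an acknowledgement flag could hide a missing container from the report and let an unack bring it back. It is an illustration only: the class AckReportSketch, its nested Container type, and the hasReplicas field are hypothetical and are not Ozone's actual classes. Only the ackMissing name is taken from the CLI output in this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: these are not Ozone's actual classes. */
final class AckReportSketch {

  /** Hypothetical stand-in for the per-container state that SCM persists. */
  static final class Container {
    final long id;
    final boolean hasReplicas; // assumption: false means no replica is known
    boolean ackMissing;        // name taken from the CLI output in this PR

    Container(long id, boolean hasReplicas) {
      this.id = id;
      this.hasReplicas = hasReplicas;
    }
  }

  /** Returns the IDs that should still be counted and listed as MISSING. */
  static List<Long> reportableMissing(List<Container> containers) {
    List<Long> ids = new ArrayList<>();
    for (Container c : containers) {
      boolean missing = !c.hasReplicas;
      // An acknowledged container is skipped, so it no longer masks new issues.
      if (missing && !c.ackMissing) {
        ids.add(c.id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    List<Container> all = new ArrayList<>();
    Container c1 = new Container(1L, false);
    all.add(c1);
    System.out.println(reportableMissing(all)); // [1]
    c1.ackMissing = true;                       // corresponds to "ack 1"
    System.out.println(reportableMissing(all)); // []
    c1.ackMissing = false;                      // corresponds to "unack 1"
    System.out.println(reportableMissing(all)); // [1]
  }
}
```

The three printed lists mirror the three container reports in the test section below: listed, hidden after ack, listed again after unack.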

What is the link to the Apache JIRA?

HDDS-14103

How was this patch tested?

Container report shows 1 MISSING container.

bash-5.1$ ozone admin container report
Container Summary Report generated at 2026-02-05T04:28:00Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 1
QUASI_CLOSED: 0
CLOSED: 0
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
HEALTHY: 0
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 1
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0
UNHEALTHY_UNDER_REPLICATED: 0
UNHEALTHY_OVER_REPLICATED: 0
MISSING_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_OVER_REPLICATED: 0
QUASI_CLOSED_STUCK_MISSING: 0

First 100 MISSING containers:
#1

Acknowledge the MISSING container, since it is not an issue.

bash-5.1$ ozone admin container ack 1
Acknowledged container: 1

bash-5.1$ ozone admin container ack --list
1

bash-5.1$ ozone admin container list
[ {
  "state" : "CLOSING",
  "stateEnterTime" : "2026-02-05T04:22:50.850Z",
  "replicationConfig" : {
    "replicationFactor" : "ONE",
    "requiredNodes" : 1,
    "minimumNodes" : 1,
    "replicationType" : "RATIS"
  },
  "usedBytes" : 12,
  "numberOfKeys" : 1,
  "lastUsed" : "2026-02-05T04:30:04.195255259Z",
  "owner" : "omServiceIdDefault",
  "containerID" : 1,
  "deleteTransactionId" : 0,
  "sequenceId" : 2,
  "healthState" : "HEALTHY",
  "open" : true,
  "deleted" : false,
  "ackMissing" : true
} ]

The container report no longer lists container #1 as MISSING.

bash-5.1$ ozone admin container report
Container Summary Report generated at 2026-02-05T04:28:10Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 1
QUASI_CLOSED: 0
CLOSED: 0
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
HEALTHY: 0
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 0
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0
UNHEALTHY_UNDER_REPLICATED: 0
UNHEALTHY_OVER_REPLICATED: 0
MISSING_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_OVER_REPLICATED: 0
QUASI_CLOSED_STUCK_MISSING: 0

Unacknowledge the MISSING container, since it is problematic.

bash-5.1$ ozone admin container unack 1
Unacknowledged container: 1

The container report lists container #1 as MISSING again.

bash-5.1$ ozone admin container report
Container Summary Report generated at 2026-02-05T04:28:30Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 1
QUASI_CLOSED: 0
CLOSED: 0
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
HEALTHY: 0
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 1
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0
UNHEALTHY_UNDER_REPLICATED: 0
UNHEALTHY_OVER_REPLICATED: 0
MISSING_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_OVER_REPLICATED: 0
QUASI_CLOSED_STUCK_MISSING: 0

First 100 MISSING containers:
#1

@sodonnel
Contributor

sodonnel commented Feb 6, 2026

I am not sure about this idea. Surely, if the container is missing and all efforts have been made to ensure there are no copies that can be recovered, the correct thing to do is to remove the container from the system?

@errose28
Contributor

errose28 commented Feb 6, 2026

@sodonnel I agree that safely removing the container would be the best long-term solution. However, implementing that robustly is more complicated. Even if all the keys are deleted from OM, SCM won't have any DNs to send the block delete requests to, and no DN can report to SCM that its replica is empty and safe to delete. We therefore need a check for orphan containers between SCM and OM that handles the cleanup. I don't think we want to allow admins to manually remove containers from the system based on their own investigation.
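
As a rough picture of the SCM/OM check mentioned above, the sketch below treats it as a set difference: containers that SCM considers missing and that no key in OM references any longer would be candidates for cleanup. This is only an assumption about the shape of such a check. OrphanCheckSketch and both input sets are hypothetical, and nothing here reflects how Ozone collects this information.

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: a guess at the shape of an orphan-container check. */
final class OrphanCheckSketch {

  /**
   * @param missingInScm   hypothetical input: IDs SCM reports as missing
   * @param referencedByOm hypothetical input: IDs still referenced by keys in OM
   * @return IDs that are missing in SCM and unreferenced in OM, in ascending order
   */
  static Set<Long> orphanCandidates(Set<Long> missingInScm, Set<Long> referencedByOm) {
    Set<Long> result = new TreeSet<>(missingInScm);
    // Anything OM still references may hold live data, so it is not an orphan.
    result.removeAll(referencedByOm);
    return result;
  }
}
```

A real check would also have to handle keys that are deleted or created while it runs, which is part of why the robust solution is more involved than the ack flag.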

Contributor

@priyeshkaratha priyeshkaratha left a comment


Thanks @sarvekshayr for the patch. I left a few comments on the admin check and some other good-to-have changes. Please take a look at them.

public void execute(ScmClient scmClient) throws IOException {
  if (list) {
    // List acknowledged containers
    ContainerListResult result = scmClient.listContainer(1, Integer.MAX_VALUE);

Fetching all containers and filtering them on the client side can be inefficient if the cluster has a large number of containers. Consider adding a server-side filter to listContainer to fetch only the containers with ackMissing=true. This would require changes to the StorageContainerLocationProtocol.

If that is difficult, I am OK with the current change, since it is used only by the CLI tool.
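
To make the trade-off concrete, the sketch below compares the two approaches against an in-memory stand-in for SCM that counts how many records it returns. Everything in it is hypothetical: AckListSketch, FakeScm, listAll, and listAcked are illustration names only, and listAcked is not an existing method of StorageContainerLocationProtocol.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: compares client-side and server-side filtering. */
final class AckListSketch {

  /** Hypothetical minimal container record. */
  static final class Container {
    final long id;
    final boolean ackMissing;

    Container(long id, boolean ackMissing) {
      this.id = id;
      this.ackMissing = ackMissing;
    }
  }

  /** In-memory stand-in for SCM that tracks how many records leave the server. */
  static final class FakeScm {
    private final List<Container> containers;
    int recordsSent = 0;

    FakeScm(List<Container> containers) {
      this.containers = new ArrayList<>(containers);
    }

    /** Mirrors the current approach: return everything and let the client filter. */
    List<Container> listAll() {
      recordsSent += containers.size();
      return new ArrayList<>(containers);
    }

    /** Hypothetical server-side filter: only acknowledged containers are returned. */
    List<Container> listAcked() {
      List<Container> acked = new ArrayList<>();
      for (Container c : containers) {
        if (c.ackMissing) {
          acked.add(c);
        }
      }
      recordsSent += acked.size();
      return acked;
    }
  }

  /** Client-side filtering over the full listing. */
  static List<Long> ackedIdsClientSide(FakeScm scm) {
    List<Long> ids = new ArrayList<>();
    for (Container c : scm.listAll()) {
      if (c.ackMissing) {
        ids.add(c.id);
      }
    }
    return ids;
  }

  /** Same result, but the server has already done the filtering. */
  static List<Long> ackedIdsServerSide(FakeScm scm) {
    List<Long> ids = new ArrayList<>();
    for (Container c : scm.listAcked()) {
      ids.add(c.id);
    }
    return ids;
  }
}
```

Both methods return the same IDs. The difference is in recordsSent, which grows with the total container count in the client-side case and only with the number of acknowledged containers in the server-side case.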

Contributor Author


Do we plan to display the list of acknowledged missing containers in the Recon UI?
If yes, I can add the required server-side filter. If not, since this isn’t a frequently used command, we can retain the current implementation.

cc: @errose28 @devmadhuu
