HDDS-14103. Create an option in SCM to ack/ignore missing containers #9719
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerInfo.java
I am not sure about this idea. Surely, if the container is missing and all efforts have been made to ensure there are no copies that can be recovered, the correct thing to do is to remove the container from the system?
@sodonnel I agree that safely removing the container would be the best long term solution. However, implementing that robustly is more complicated. Even if all the keys are deleted from OM, SCM won't have any DNs to send the block delete requests to, and those DNs cannot tell SCM that their replicas are empty and safe to be deleted. We therefore need a check for orphan containers in between SCM and OM that handles the cleanup. I don't think we want to allow admins to manually remove containers from the system based on their own investigation.
priyeshkaratha
left a comment
Thanks @sarvekshayr for the patch. I left a few comments related to the admin check and other good to go changes. Please have a look at those.
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java
public void execute(ScmClient scmClient) throws IOException {
  if (list) {
    // List acknowledged containers
    ContainerListResult result = scmClient.listContainer(1, Integer.MAX_VALUE);
Fetching all containers and filtering them on the client side can be inefficient if the cluster has a large number of containers. Consider adding a server-side filter to listContainer to fetch only the containers with ackMissing=true. This would require changes to the StorageContainerLocationProtocol.
If that is difficult, I am OK with the current changes, since this is used only by the CLI tool.
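To make the trade-off in this comment concrete, here is a minimal sketch contrasting the two approaches. Everything in it (`AckFilterSketch`, `FakeScm`, `Container`, `listAll`, `list`) is a hypothetical stand-in written for illustration; it is not the real `ScmClient` or `StorageContainerLocationProtocol` API, and the actual change would likely add a filter field to the existing request rather than a predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration only; NOT the real Ozone classes.
final class AckFilterSketch {

  /** Minimal container record: an ID plus the acknowledged-missing flag. */
  static final class Container {
    final long id;
    final boolean ackMissing;

    Container(long id, boolean ackMissing) {
      this.id = id;
      this.ackMissing = ackMissing;
    }

    boolean isAckMissing() {
      return ackMissing;
    }
  }

  /** Toy "server" that owns the full container list. */
  static final class FakeScm {
    private final List<Container> containers = new ArrayList<>();
    int lastResponseSize; // number of records returned by the last call

    void add(Container c) {
      containers.add(c);
    }

    /** Current approach: return everything and let the caller filter. */
    List<Container> listAll() {
      lastResponseSize = containers.size();
      return new ArrayList<>(containers);
    }

    /** Suggested approach: apply the filter before building the response. */
    List<Container> list(Predicate<Container> filter) {
      List<Container> out =
          containers.stream().filter(filter).collect(Collectors.toList());
      lastResponseSize = out.size();
      return out;
    }
  }

  /** Builds 1000 containers, of which IDs 250, 500, 750 and 1000 are acked. */
  static FakeScm sampleScm() {
    FakeScm scm = new FakeScm();
    for (long id = 1; id <= 1000; id++) {
      scm.add(new Container(id, id % 250 == 0));
    }
    return scm;
  }

  public static void main(String[] args) {
    FakeScm scm = sampleScm();

    List<Container> clientSide = scm.listAll().stream()
        .filter(Container::isAckMissing)
        .collect(Collectors.toList());
    System.out.println("client-side filter: kept " + clientSide.size()
        + " of " + scm.lastResponseSize + " records returned");

    List<Container> serverSide = scm.list(Container::isAckMissing);
    System.out.println("server-side filter: kept " + serverSide.size()
        + " of " + scm.lastResponseSize + " records returned");
  }
}
```

Both paths produce the same result; the difference is only how many records cross the wire, which matters more for a frequently polled UI than for an occasional CLI call.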
Do we plan to display the list of acknowledged missing containers in the Recon UI?
If yes, I can add the required server-side filter. If not, since this isn’t a frequently used command, we can retain the current implementation.
cc: @errose28 @devmadhuu
What changes were proposed in this pull request?
Ozone currently has no way to clear missing containers from the system. Even if all the data is deleted from the OM, the block deletes will never leave SCM because it has no replicas to send them to.
As a short term mitigation, we added a CLI to SCM that supports “acking” missing containers by ID if the admin confirms they are not a problem, so they do not mask future issues. This would remove them from ozone admin container report output and the missing container count metric. This would need to be persisted in the ContainerInfo in SCM, and we show this property in ozone admin container info. There is also a CLI to raise containers as an issue again and to query the list of acked missing containers.
What is the link to the Apache JIRA
HDDS-14103
How was this patch tested?
The container report shows 1 MISSING container.
Acknowledge the missing container, since it is not an issue.
The container report no longer lists container #1 as MISSING.
Unacknowledge the missing container, since it is problematic.
The container report lists container #1 as MISSING again.
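The ack/unack cycle exercised by these steps can be sketched as a small in-memory model. This is only an illustration of the intended behavior under simplifying assumptions: `AckMissingSketch` and all of its members are hypothetical names, there is no persistence, and the real patch stores the flag in `ContainerInfo` inside SCM rather than in a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model for illustration; NOT the real SCM code.
final class AckMissingSketch {

  enum Health { HEALTHY, MISSING }

  /** Per-container state: health plus the acknowledged-missing flag. */
  private static final class State {
    final Health health;
    boolean ackMissing;

    State(Health health) {
      this.health = health;
    }
  }

  private final Map<Long, State> containers = new LinkedHashMap<>();

  void add(long id, Health health) {
    containers.put(id, new State(health));
  }

  /** Marks a container as acknowledged so the report stops counting it. */
  void ack(long id) {
    lookup(id).ackMissing = true;
  }

  /** Clears the acknowledgment so the container is reported again. */
  void unack(long id) {
    lookup(id).ackMissing = false;
  }

  /** IDs the report would show as MISSING: missing and not acknowledged. */
  List<Long> reportMissing() {
    List<Long> ids = new ArrayList<>();
    for (Map.Entry<Long, State> e : containers.entrySet()) {
      State s = e.getValue();
      if (s.health == Health.MISSING && !s.ackMissing) {
        ids.add(e.getKey());
      }
    }
    return ids;
  }

  /** IDs that an admin has acknowledged. */
  List<Long> listAcked() {
    List<Long> ids = new ArrayList<>();
    for (Map.Entry<Long, State> e : containers.entrySet()) {
      if (e.getValue().ackMissing) {
        ids.add(e.getKey());
      }
    }
    return ids;
  }

  private State lookup(long id) {
    State s = containers.get(id);
    if (s == null) {
      throw new IllegalArgumentException("Unknown container: " + id);
    }
    return s;
  }

  public static void main(String[] args) {
    AckMissingSketch scm = new AckMissingSketch();
    scm.add(1L, Health.MISSING);
    scm.add(2L, Health.HEALTHY);

    System.out.println("missing before ack:  " + scm.reportMissing());
    scm.ack(1L);
    System.out.println("missing after ack:   " + scm.reportMissing());
    System.out.println("acked containers:    " + scm.listAcked());
    scm.unack(1L);
    System.out.println("missing after unack: " + scm.reportMissing());
  }
}
```

The key design point is that acknowledging only hides the container from the report and the metric; the container record itself is never removed, which is why unacknowledging can bring it back.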