
Ensure that binary logs for PITR are in a shared directory #541

Merged
mattlord merged 34 commits into main from point-in-time-recovery on Mar 8, 2026

@mattlord (Collaborator) commented Mar 12, 2024

When executing the vtctldclient RestoreFromBackup --restore-to-pos <value> command, the vttablet process (running in the vttablet container of the vttablet pod) handles the RestoreFromBackup tabletmanager RPC. It first uses the configured backup engine (e.g. xtrabackup) to restore the full backup into the VTDATAROOT (specifically /vt/vtdataroot/vt_<tabletUID>/ for the MySQL data), which is shared by all containers within the pod. It orchestrates this in conjunction with the mysqlctld process running inside the mysqld container of the same pod. The result is a mysqld instance, running inside the mysqld container, that was started from the restored full backup.

Once the full backup is in place and mysqld is running, the vttablet process restores the binary logs from the backup, via the builtinbackupengine, into the OS temp directory /tmp so that they can be applied. However, /tmp is not a shared mount point within the pod. When mysqlbinlog subsequently tries to read those files from within the mysqld container, it cannot find them in that container's /tmp directory, and it fails with an error.
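To make the failure mode concrete, here is a minimal Go sketch of the visibility rule described above. It is a toy model, not operator or Vitess code: the Container type, the under and visibleAcross helpers, and the binlog file names are hypothetical. The only assumption carried over from the text is that /vt/vtdataroot is a volume mounted into both the vttablet and mysqld containers, while /tmp belongs to each container's own filesystem.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Container is a hypothetical, simplified model of a container in a pod:
// only the mount paths of pod-level volumes shared with other containers.
type Container struct {
	Name   string
	Shared []string
}

// under reports whether path is equal to, or nested beneath, dir.
func under(path, dir string) bool {
	rel, err := filepath.Rel(dir, path)
	return err == nil && rel != ".." && !strings.HasPrefix(rel, "../")
}

// visibleAcross reports whether a file written at path by writer can be read
// at the same path by reader. In this model that only holds when the path is
// under a volume mounted at the same location in both containers; anything
// else (such as /tmp) is private to each container.
func visibleAcross(path string, writer, reader Container) bool {
	for _, w := range writer.Shared {
		for _, r := range reader.Shared {
			if w == r && under(path, w) {
				return true
			}
		}
	}
	return false
}

func main() {
	vttablet := Container{Name: "vttablet", Shared: []string{"/vt/vtdataroot"}}
	mysqld := Container{Name: "mysqld", Shared: []string{"/vt/vtdataroot"}}

	for _, p := range []string{
		"/tmp/restore/binlog.000042",            // OS temp dir: private per container
		"/vt/vtdataroot/restore/binlog.000042", // shared pod volume
	} {
		fmt.Printf("%s visible to mysqld: %v\n", p, visibleAcross(p, vttablet, mysqld))
	}
}
```

Running it prints false for the /tmp path and true for the path under /vt/vtdataroot, which is the gap this PR addresses by defaulting the incremental restore path to the shared directory.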

vtctldclient

https://github.com/vitessio/vitess/blob/3ae5cf7e690e560dd5630119215bcc3f5ecf31c8/go/cmd/vtctldclient/command/backups.go#L227-L263

vtctld[server]

https://github.com/vitessio/vitess/blob/3ae5cf7e690e560dd5630119215bcc3f5ecf31c8/go/vt/vtctl/grpcvtctldserver/server.go#L3260-L3286

vttablet

https://github.com/vitessio/vitess/blob/3ae5cf7e690e560dd5630119215bcc3f5ecf31c8/go/vt/vttablet/tabletmanager/rpc_backup.go#L173-L193

https://github.com/vitessio/vitess/blob/3ae5cf7e690e560dd5630119215bcc3f5ecf31c8/go/vt/vttablet/tabletmanager/restore.go#L191-L273

mysqlctld (rather than mysqlctl; it runs in the mysqld container)

https://github.com/vitessio/vitess/blob/3ae5cf7e690e560dd5630119215bcc3f5ecf31c8/go/vt/mysqlctl/backup.go#L364-L487

https://github.com/vitessio/vitess/blob/3ae5cf7e690e560dd5630119215bcc3f5ecf31c8/go/vt/mysqlctl/builtinbackupengine.go#L995-L1060

vttablet builtinbackupengine

https://github.com/vitessio/vitess/blob/3ae5cf7e690e560dd5630119215bcc3f5ecf31c8/go/vt/mysqlctl/builtinbackupengine.go#L995-L1060

Related issues and PRs:

@mattlord mattlord force-pushed the point-in-time-recovery branch 5 times, most recently from 1824f91 to 0562d49 March 13, 2024 01:15
@mattlord mattlord requested review from GuptaManan100, frouioui and shlomi-noach and removed request for GuptaManan100 March 13, 2024 01:15
@mattlord mattlord changed the title "Ensure that binary logs for PITR are restored to a shared location" to "Ensure that binary logs for PITR are use a shared location" Mar 13, 2024
@mattlord mattlord changed the title "Ensure that binary logs for PITR are use a shared location" to "Ensure that binary logs for PITR are use a shared directory" Mar 13, 2024
@mattlord mattlord changed the title "Ensure that binary logs for PITR are use a shared directory" to "Ensure that binary logs for PITR are in a shared directory" Mar 13, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the point-in-time-recovery branch from 0562d49 to d259730 March 13, 2024 02:25
@shlomi-noach (Collaborator) left a comment

Nice! Should we also provide the flag in yaml files?

@mattlord (Collaborator, Author)

> Nice! Should we also provide the flag in yaml files?

Yeah, I think this does it: e2e5e8b

@shlomi-noach (Collaborator) commented Mar 17, 2024

> Yeah. I think this does it.

How is the value being set, and to what specific value?

@mattlord (Collaborator, Author)

> How is the value being set, and to what specific value?

The user would specify the flag and value in their cluster yaml definition using the extraFlags parameter, just as they do for mysqld flags, e.g. If they don't specify a value then we enforce the default within the operator.
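The "enforce the default unless the user set it" behavior can be sketched as follows. withDefaultFlags is a hypothetical helper, not the operator's actual code, and vtDataRootPath is assumed here to be /vt/vtdataroot; the flag name builtinbackup-incremental-restore-path is the one used in this PR.

```go
package main

import "fmt"

// vtDataRootPath is assumed to be the shared VTDATAROOT location.
const vtDataRootPath = "/vt/vtdataroot"

// withDefaultFlags (hypothetical) returns a copy of the user-supplied
// extraFlags in which each default is applied only when the user did not
// set that flag explicitly, so user values always win. Inputs are not mutated.
func withDefaultFlags(extraFlags, defaults map[string]string) map[string]string {
	merged := make(map[string]string, len(extraFlags)+len(defaults))
	for k, v := range extraFlags {
		merged[k] = v
	}
	for k, v := range defaults {
		if _, ok := merged[k]; !ok {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	const flag = "builtinbackup-incremental-restore-path"
	defaults := map[string]string{flag: vtDataRootPath}

	// No explicit value in extraFlags: the operator-side default is enforced.
	fmt.Println(withDefaultFlags(map[string]string{}, defaults)[flag]) // /vt/vtdataroot

	// Explicit (hypothetical) value from the cluster YAML: left untouched.
	user := map[string]string{flag: "/vt/vtdataroot/pitr"}
	fmt.Println(withDefaultFlags(user, defaults)[flag]) // /vt/vtdataroot/pitr
}
```

Copying into a fresh map keeps the user's spec unmodified, which matters if the same spec object is reused when reconciling other components.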

@shlomi-noach (Collaborator) left a comment

Seems to me like it's feature complete and can be taken out of Draft?

@mattlord mattlord marked this pull request as ready for review March 18, 2024 12:18
@mattlord mattlord requested a review from GuptaManan100 March 18, 2024 12:18
@shlomi-noach shlomi-noach requested a review from a team March 20, 2024 06:28
@mattlord (Collaborator, Author)

> How is the value being set, and to what specific value?

> The user would specify the flag and value in their cluster yaml definition using the extraFlags parameter, just as they do for mysqld flags, e.g. If they don't specify a value then we enforce the default within the operator.

The flag ended up being for vttablet and vtbackup, not mysqlctld (although vtbackup is a modified mysqlctld). I will leave the mysqlctld extra flags support in place, though, as it may prove useful.

Comment on lines +90 to +94
// Ensure that binary logs are restored to/from a location that all containers
// in the pod can access if no location was explicitly provided.
if _, ok := vttabletAllFlags["builtinbackup-incremental-restore-path"]; !ok {
	vttabletAllFlags["builtinbackup-incremental-restore-path"] = vtDataRootPath
}
Member

What would happen if the path specified in --builtinbackup-incremental-restore-path is not accessible to all containers in the pod?

Member

I am guessing it is also up to the user to set the same value on all the components (mysqlctl, vttablet, and vtbackup)?

@mattlord (Collaborator, Author) commented Mar 27, 2024

The same thing that happens to every user now: it doesn't work. PITR does not generally work in the operator today.

Comment thread pkg/operator/vttablet/mysqlctld.go Outdated
vtRootInitScript = `set -ex
mkdir -p /mnt/vt/bin
cp --no-clobber /vt/bin/mysqlctld /mnt/vt/bin/
cp --no-clobber $(command -v mysqlbinlog) /mnt/vt/bin/ || true
Member

I am assuming the directory /mnt/.../ is shared across all the containers in the pod? In that case, this line would resolve what you wrote in the PR's description:

> /tmp is not a shared mount point within the pod so when mysqlbinlog subsequently tries to read them from within the mysqld container it cannot find them in its container's /tmp directory and it fails with an error

@mattlord (Collaborator, Author) commented Mar 27, 2024

This is simply about copying the mysqlbinlog binary from the vitess/lite container image to the mysqlctld/vtbackup container (if it's not already there). It looks like we'll need to keep that binary in the lite image, because the MySQL images do not contain it and it's needed for PITR.

@mattlord mattlord force-pushed the point-in-time-recovery branch from c9793dd to e069303 April 10, 2024 18:32
This includes mysqld (of course) and mysqlbinlog, but it does NOT include xtrabackup.
@mattlord mattlord force-pushed the point-in-time-recovery branch from e069303 to 2aeb36b April 10, 2024 20:30
mattlord added 3 commits July 12, 2024 13:12
@mattlord mattlord force-pushed the point-in-time-recovery branch from ae78d72 to 5e7df2b July 12, 2024 18:55
@mattlord mattlord removed the request for review from GuptaManan100 March 4, 2026 14:50
mattlord added 4 commits March 6, 2026 14:54
@mattlord mattlord requested a review from frouioui March 6, 2026 23:14
Comment thread test/endtoend/utils.sh Outdated
echo "Backup failed"
for i in {1..600} ; do
out=$(kubectl get vtb -n example --no-headers | wc -l)
if echo "${out}" | grep -c "${finalBackupCount}" >/dev/null; then
Contributor

finalBackupCount does not seem to be initialized before this first wait. With an empty value, grep -c "" returns 1, so we break immediately here. Then after the increment below, the second loop only waits for count 1, which the full backup already satisfies. That means this test can pass without ever observing the incremental backup.

@mattlord (Collaborator, Author)

Good catch! Fixed.

Comment thread test/endtoend/utils.sh Outdated
Comment thread pkg/operator/vttablet/env_vars.go Outdated
@nickvanw (Contributor) left a comment

I traced the PITR/operator flow through the changed code and upstream Vitess and left a few inline comments. The shared restore-path change looks directionally right, but I found one runtime wiring issue and a couple of gaps in the new end-to-end coverage.

mattlord added 2 commits March 7, 2026 01:22
@mattlord mattlord force-pushed the point-in-time-recovery branch from 05d2aa7 to dc8fa0b March 7, 2026 03:26
@mattlord mattlord merged commit 547553e into main Mar 8, 2026
13 checks passed
@mattlord mattlord deleted the point-in-time-recovery branch March 8, 2026 01:58

4 participants