Contents

- Getting started as operator
- Operator jobs
- First order of business: add a calendar event for the next scheduled operator
- Check weekly for Amazon OpenSearch Service updates
- Review counts
- Testing a PR in the `sandbox`
- Reindexing
- Updating the AMI for GitLab instances
- Updating software packages via release version upgrade in AL2023 instances
- Upgrading GitLab & ClamAV
- Upgrade direct Python dependencies
- Increase GitLab data volume size
- Backup GitLab volumes
- Updating the Swagger UI
- Export AWS Inspector findings
- Adding snapshots to `dev`
- Adding snapshots to `prod`
- Removing catalogs from `prod` and setting a new default
- Promoting to `prod`
- Backporting from `prod` to `develop`
- Deploying the Data Browser
- Running a ZAP vulnerability scan
- Troubleshooting
- GitHub bot account
- Handing over operator duties
- Read the entire document
- It is strongly recommended that you install SmartGit
Ask the lead via Slack to:

- add you to the `Azul Operators` GitHub group on DataBiosphere
- give you Maintainer access to the GitLab `dev`, `anvildev`, `anvilprod` and `prod` instances
- assign to you the `Editor` role on the Google Cloud projects `platform-hca-prod` and `platform-hca-anvilprod`
- remove the `Editor` role in those projects from the previous operator
Ask Erich Weiler (weiler@soe.ucsc.edu) via email (cc Ben and Hannes) to:

- grant you developer access to the AWS accounts `platform-hca-prod` and `platform-anvil-prod`
- revoke that access from the previous operator (mention them by name)
Confirm access to GitLab:
Add your SSH key to your user account on GitLab under the "Settings/SSH Keys" panel
Confirm SSH access to the GitLab instance:

ssh -T git@ssh.gitlab.dev.singlecell.gi.ucsc.edu
Welcome to GitLab, @amarjandu!
Add the GitLab instances to the local working copy's `.git/config` file using:

[remote "gitlab.dcp2.dev"]
    url = git@ssh.gitlab.dev.singlecell.gi.ucsc.edu:ucsc/azul
    fetch = +refs/heads/*:refs/remotes/gitlab.dcp2.dev/*
[remote "gitlab.dcp2.prod"]
    url = git@ssh.gitlab.azul.data.humancellatlas.org:ucsc/azul.git
    fetch = +refs/heads/*:refs/remotes/gitlab.dcp2.prod/*
[remote "gitlab.anvil.dev"]
    url = git@ssh.gitlab.anvil.gi.ucsc.edu:ucsc/azul.git
    fetch = +refs/heads/*:refs/remotes/gitlab.anvil.dev/*

Confirm access to fetch branches:

git fetch -v gitlab.dcp2.dev
From ssh.gitlab.dev.singlecell.gi.ucsc.edu:ucsc/azul
 = [up to date] develop                    -> gitlab.dcp2.dev/develop
 = [up to date] issues/amar/2653-es-2-slow -> gitlab.dcp2.dev/issues/amar/2653-es-2-slow
Standardize remote repository names. If the name of the remote repository on GitHub is set to `origin`, rename the remote repository to `github`. Run:

git remote rename origin github
As soon as your shift begins, and before performing any other actions as an operator, create the following Google Calendar event in the Team Boardwalk calendar.
Create an all-day calendar event for the two weeks after your current stint,
using the title Azul Operator: <name> with the name of the operator who will
be serving next.
If you are aware of any schedule irregularities, such as one operator serving more than one consecutive stint, create events for those as well.
The operator checks weekly for notifications about service software updates to the Amazon OpenSearch Service domains of all Azul deployments. Note that service software updates are distinct from updates to the upstream version of Elasticsearch (or Amazon's OpenSearch fork) in use on an ES domain. While the latter are discretionary and applied via a change to the Terraform configuration, some of the former are mandatory.
Unless we intervene, AWS will automatically force the installation of any update
about which we receive a High severity notification, typically two weeks
after the notification was sent. Read Amazon notification severities for more
information. The operator must prevent the automatic installation of such
updates. It would be disastrous if an update were to be applied during a reindex
in prod. Instead, the operator must apply the update manually as part of an
operator ticket in GitHub, as soon as possible, and well before Amazon would
apply it automatically.
To check for and, if necessary, apply any pending service software updates, the operator performs the following steps daily.
1. In the Amazon OpenSearch Service console, select the Notifications pane and identify notifications with the subject Service Software Update.

2. Record the severity, date and ES domain name of these notifications. Collect this information for all ES domains in both the `prod` and `dev` AWS accounts. If there are no notifications, you are done.

3. Open a new ticket in GitHub and title it Apply Amazon OpenSearch (ES) Software Update (before {date}). Include (before {date}) in the title only if any notification is of High severity, representing a forced update. Replace {date} with the anticipated date of the forced installation. If there already is an open ticket for pending updates, reuse that ticket and adjust it accordingly. If the title contains a date, pin the ticket as High Priority in ZenHub.

4. The description of the ticket should include a checklist item for each ES domain recorded in step 2. The checklist should include items for notifying team members about any disruptions to their personal deployments, say, when the `sandbox` domain is being updated. Use this template for the checklist:

   - [ ] Update `azul-index-dev`
   - [ ] Update `azul-index-anvildev`
   - [ ] Update `azul-index-anvilprod`
   - [ ] Confirm with Azul devs that their personal deployments are idle
   - [ ] Update `azul-index-sandbox`
   - [ ] Update `azul-index-anvilbox`
   - [ ] Update `azul-index-hammerbox`
   - [ ] Update `azul-index-prod`
   - [ ] Confirm snapshots are disabled on all domains
     - `aws opensearch describe-domains --domain-names <NAME> | jq '.DomainStatusList[].SnapshotOptions'`
     - Value of `AutomatedSnapshotStartHour` should be `-1`
Note that, somewhat counterintuitively, main deployments are updated before their respective `sandbox`. If, during step 3, updates or domains were added to an existing ticket, the entire process may have to be restarted and certain checklist items may need to be reset.

To update an ES domain, select it in the Amazon OpenSearch Service console. Under General information, the Service software version should have an Update available hyperlink. Click on it and follow the subsequent instructions.

Once the upgrade process is completed for the `dev` or `prod` ES domain, perform a smoke test using the respective Data Browser instance.
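The snapshot check from the checklist template can be scripted against the JSON that `aws opensearch describe-domains` emits. The following is a minimal sketch, run here against a trimmed, fabricated sample response rather than live CLI output:

```python
import json

# Trimmed, fabricated sample of `aws opensearch describe-domains` output;
# the same check applies to the real response for each domain.
response = json.loads("""
{
    "DomainStatusList": [
        {
            "DomainName": "azul-index-dev",
            "SnapshotOptions": {"AutomatedSnapshotStartHour": -1}
        }
    ]
}
""")

for domain in response['DomainStatusList']:
    hour = domain['SnapshotOptions']['AutomatedSnapshotStartHour']
    # A value of -1 means automated snapshots are disabled, as required
    assert hour == -1, f"Snapshots not disabled on {domain['DomainName']!r}"
print('All domains have automated snapshots disabled')
```

Feeding the real CLI output through the same check avoids eyeballing the `SnapshotOptions` value for each domain separately.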
When verifying accuracy of the review count label, search for the string
hannes-ucsc requested on the PR page. Make sure to check for comments that
indicate if a review count was not bumped.
The operator sets the sandbox label on a PR before pushing the PR branch to
GitLab. If the resulting sandbox build passes, the PR is merged and the label
stays on. If the build fails, the label is removed. Only one un-merged PR should
have the label.
If the tests fail while running a sandbox PR, an operator should do minor failure triage.
- If the PR fails because of out-of-date requirements on a PR with the `[R]` tag, the operator should rerun `make requirements_update`, committing the changes separately with a title like `[R] Update requirements`. It is not necessary to re-request a review after doing so.
- For integration test failures, check if the PR has the `reindex` tag. If so, running an early reindex may resolve the failure.
- Determine if the failure could have been caused by the changes in the PR. If so, there is no need to open a new ticket. Bounce the PR back to the "In progress" column and notify the author of the failure. Ideally provide a link.
- All other build failures need to be tracked in tickets. If there is an existing ticket, comment on it with a link to the failed job and move the ticket to Triage. If there is no existing ticket resembling the failed build, create a new one, with a link to the failed build, a transcript of any relevant error messages and stack traces from the build output, and any relevant log entries from CloudWatch.
If a GitLab build fails on a main deployment, the operator must evaluate the impact of that failure. This evaluation should include visiting the Data Browser to verify it isn't broken.
To restore the deployment to a known working state, the operator should rerun the deploy job of the previous passing pipeline for that deployment. This can be done without pushing anything and only takes a couple of minutes. The branch for that deployment must then be reverted to the previously passing commit.
During reindexing, watch the ES domain for unassigned shards, using the AWS
console. The azul-prod CloudWatch dashboard has a graph for the shard count.
It is OK to have unassigned shards for a while but if the same unassigned shards
persist for over an hour, they are probably permanently unassigned. Follow the
procedure outlined in this AWS support article, using either Kibana or
Cerebro. Cerebro has a dedicated form field for the index setting referenced in
that article. In the past, unassigned shards have been caused by AWS attempting
to make snapshots of the indices that are currently being written to under high
load during reindexing. Make sure that GET _cat/snapshots/cs-automated
returns nothing. Make sure that the Start Hour under Snapshots on the
Cluster configuration tab of the ES domain page in the AWS console is shown as
0-1:00 UTC. If either of these checks fails, file a support ticket with AWS
urgently requesting snapshots to be disabled.
The operator must check the status of the queues after every reindex for
failures. Use python scripts/manage_queues.py to identify any failed
messages. If failed messages are found, use python scripts/manage_queues.py
to
- dump the failed notifications to JSON file(s), using `--delete` to simultaneously clear the `notifications_fail` queue
- force-feed the failed notifications back into the `notifications_retry` queue. We feed directly into the retry queue, not the primary queue, to save time if/when the messages fail again.
This may cause the previously failed messages to succeed. Repeat this procedure
until the set of failed notifications stabilizes, i.e., the
notifications_fail queue is empty or no previously failed notifications
succeeded.
Next, repeat the dump/delete/force-feed steps with the failed tallies, feeding
them into tallies_retry queue (again, NOT the primary queue) until the
set of failed tallies stabilizes.
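The dump/retry cycle described above amounts to a fixed-point loop. The sketch below models it with plain Python lists and a hypothetical `retry` callback standing in for `scripts/manage_queues.py`; it illustrates the control flow only, not the script's actual interface:

```python
def stabilize(fail_queue, retry, max_rounds=10):
    """Repeatedly re-feed failed messages until the failure set stabilizes,
    i.e. the fail queue is empty or no previously failed message succeeded."""
    previous = None
    for _ in range(max_rounds):
        failed = list(fail_queue)   # dump the failed messages ...
        fail_queue.clear()          # ... and clear the fail queue (--delete)
        if not failed:
            return set()            # fail queue is empty: done
        current = set(failed)
        if current == previous:
            return current          # failure set is stable: track in tickets
        previous = current
        retry(failed, fail_queue)   # feed directly into the retry queue

def retry(messages, fail_queue):
    # Toy behavior: 'm1' succeeds on retry, 'm2' never does
    for m in messages:
        if m != 'm1':
            fail_queue.append(m)    # still failing: lands back on the fail queue

remaining = stabilize(['m1', 'm2'], retry)
print(remaining)  # prints {'m2'}
```

The stable remainder (here `'m2'`) corresponds to the failures that must then be tracked in tickets as described below.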
If at this point the fail queues are not empty, all remaining failures must be tracked in tickets:
- document the failures within the PR that added the changes
- triage against expected failures from existing issues
- create new issues for unexpected failures
- link each failure you document to its respective issue
- ping people on the Slack channel `#dcp2` about those issues, and finally
- clear the fail queues so they are empty for the next reindexing
For an example of how to document failures within a PR click here.
From the GitLab web app, select the reindex or early_reindex job for
the pipeline that needs reindexing of a specific catalog. From there, you
should see an option for defining the key and value of additional variables to
parameterize the job with.
To specify a catalog to be reindexed, set Key to azul_current_catalog
and Value to the name of the catalog, for example, dcp3. To specify the
sources to be reindexed, set Key to azul_current_sources and
Value to a space-separated list of source globs, e.g.
*:hca_dev_* *:lungmap_dev_*. Check the inputs you just
made. Start the reindex job by clicking on Run job. Wait until the job
has completed.
Repeat these steps to reindex any additional catalogs.
As part of the upgrades issue, operators must check for updates to the AMI for the root volume of the EC2 instance running GitLab. We use an AMI hardened to the requirements of the CIS Hardened Image Level 1 on Amazon Linux 2023. The license to use the AMI for an EC2 instance is sold by CIS as a subscription on the AWS Marketplace:
https://aws.amazon.com/marketplace/pp/prodview-fqqp6ebucarnm
The license costs $0.024 per instance/hour. Every AWS account must subscribe separately.
There are ways to dynamically determine the latest AMI released by CIS under the subscription but in the spirit of reproducible builds, we would rather pin the AMI ID and adopt updates at our own discretion to avoid unexpected failures.
Note that the AMI versioning scheme (e.g., v01, v11) indicates the month
of release, and is not a monotonically increasing value.
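To see why the command in this section sorts on `CreationDate` rather than on the AMI name, consider two fabricated image records in which the name order and the release order disagree:

```python
# Fabricated image records illustrating why the vNN component of the AMI
# name is not a reliable sort key: v01 (January of a later year) can be
# newer than v11 (November of the preceding year).
images = [
    {'CreationDate': '2023-11-05T00:00:00.000Z',
     'ImageId': 'ami-older', 'Name': 'CIS hardened Level 1 v11'},
    {'CreationDate': '2024-01-15T00:00:00.000Z',
     'ImageId': 'ami-newer', 'Name': 'CIS hardened Level 1 v01'},
]

by_name = max(images, key=lambda i: i['Name'])          # lexical name order
by_date = max(images, key=lambda i: i['CreationDate'])  # ISO dates sort correctly

assert by_name['ImageId'] == 'ami-older'  # name-based sort picks the wrong AMI
assert by_date['ImageId'] == 'ami-newer'  # CreationDate yields the latest release
```

This is the reason the pipeline below puts `CreationDate` first on each output line before sorting.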
To obtain the latest compatible AMI ID, select the desired `….gitlab` component, e.g. by running `_select dev.gitlab`, and run
aws ec2 describe-images \
--owners aws-marketplace \
--filters="Name=name,Values=*prod-fvm47vekg24oc*" \
| jq -r '.Images[] | .CreationDate+"\t"+.ImageId+"\t"+.Name' \
| sort \
| tail -1
This prints the date, ID and name of the latest CIS-hardened AMI. Update the
ami_id variable in terraform/gitlab/gitlab.tf.json.template.py to refer
to the AMI ID. Update the image name in the comment right above the variable so
that we know which semantic product version the AMI represents. AMIs are
specific to a region so the variable holds a dictionary with one entry per
region. If there are ….gitlab components in more than one AWS region (which
is uncommon), you need to select at least one ….gitlab component in each of
these regions, rerun the command above for each such component, and add or
update the ami_id entry for the respective region. Instead of selecting a
….gitlab component, you can just specify the region of the component using
the --region option to aws ec2 describe-images.
Amazon Linux 2023 uses deterministic versioning, so package updates require
upgrading to a specific release version rather than using the default already
hard-coded in the AMI. Otherwise, packages would never be updated. To obtain the
Amazon Linux 2023 release version, SSH into the GitLab instance, say,
anvildev.gitlab, and run:
sudo dnf check-release-update
This prints the list of available AL2023 releases. Note that the most recent
release is listed last. Each entry shows the release version string and the
command to apply it. Copy the version string (e.g., 2023.10.20260302) from
the last entry in the command's output and update the AL2023_release
variable in terraform/gitlab/gitlab.tf.json.template.py.
Operators check for updates to the Docker images for GitLab and ClamAV as part
of the biweekly upgrade process, and whenever a GitLab security release
requires it. An email notification is sent to azul-group@ucsc.edu when a
GitLab security release is available. Discuss with the lead the Table of
Fixes referenced in the release blog post to determine the urgency of the
update. When updating the GitLab version, either as part of the regular update
or when necessary, check if there are applicable updates to the GitLab runner
image as well. Use the latest runner image whose major and minor version match
those of the GitLab image. When upgrading across multiple GitLab versions,
follow the prescribed GitLab upgrade path. You will likely only be able to
perform one step on that path per biweekly upgrade PR.
Before upgrading the GitLab version, create a backup of the GitLab volume. See Backup GitLab volumes for help.
In PyCharm, use the Package tool window to view the most recent versions of
the project's direct Python dependencies. This feature may only work properly
after running make envhook and correctly configuring the Python interpreter
for the project (at least once before).

Proceed by identifying the packages that are candidates for upgrades. Check the
dependencies listed in requirements.txt and requirements.dev.txt against
the Package tool window, which indicates when a newer version of a dependency
is available.
When updating:

- Update to the latest mature release (a release with a high patch number, or whose most recent patch release is at least a couple of months old) and go backward if problems occur.
- When non-trivial code base changes are necessary due to a package version upgrade, document each resulting problem with a dedicated FIXME that references the respective ticket.
- Reference the GitHub link in a comment beside the conflicting package.
- If updating a package requires only a trivial change or produces a dismissable warning (e.g., a deprecation warning) covered by a FIXME, commit that update on its own, to make it easy to identify the dependency forcing the change and the given resolution.
Note: a concise way to display all available versions of a given package is to
pretend to install a non-existent version from a terminal via pip. For example,
to see all available versions of flake8 one may run
pip install flake8==9.9.9, and the error output will list all available
versions of the dependency.
As always, each of the committed changes should be tested, and should independently pass all feature branch checks in GitHub, etc. Perform the following to smoke-test basic operations and functions:

- Recreate the project's virtualenv from scratch, run the `requirements` target, run the `envhook` target, and end with `requirements_update`.
- Run the `test` and `deploy` targets in a personal deployment (or via the sandbox) and then run the integration test.
When the CloudWatch alarm for high disk usage on the GitLab data volume goes off, you must attach a new, larger volume to the instance. Run the command below to create both a snapshot of the current data volume and a new data volume with the specified size restored from that snapshot.
Discuss the desired new size with the system administrator before running the command:
python scripts/create_gitlab_snapshot.py --new-size [new_size]
When this command finishes, it will leave the instance in a stopped state. Take note of the command logged by the script. You'll use it to delete the old data volume after confirming that GitLab is up and running with the new volume attached.
Next, deploy the gitlab TF component in order to attach the new data volume.
The only resource with changes in the resulting plan should be
aws_instance.gitlab. Once the gitlab TF component has been deployed,
start the GitLab instance again by running:
python scripts/create_gitlab_snapshot.py --start-only
Finally, SSH into the instance to complete the setup of new data volume. Use the
df command to confirm the size and mount point of the device, and
resize2fs to grow the size of the mounted file system so that it matches
that of the volume. Run:
df | grep /mnt/gitlab          # Verify the name of the mounted device (e.g. /dev/nvme1n1) and note the available size
sudo resize2fs <device_name>   # Match the device name emitted by the previous command
df                             # Verify the new available size is larger
The output of the last df command indicates whether these operations succeeded:
a larger available size compared to the first run means the resizing was
successful. You can now delete the old data volume by
running the deletion command you noted earlier.
Use the create_gitlab_snapshot.py script to back up the EBS data volume
attached to each of our GitLab instances. The script will stop the instance,
create a snapshot of the GitLab EBS volume, tag the snapshot and finally restart
the instance:
python scripts/create_gitlab_snapshot.py
For GitLab or ClamAV updates, use the --no-restart flag in order to leave
the instance stopped after the snapshot has been created. There is no point in
starting the instance only to have the update terminate it again.
Operators should regularly check for available updates to the Swagger UI. The
current version used by Azul is hardcoded in scripts/update_swagger.py. The
latest upstream version is available as the latest release tag.
Scheduled upgrade PRs should only include minor and hotfix updates to the
Swagger UI. If a new major version is available, open a new issue instead. To
perform the update, edit the tag variable in the update_swagger script
and run it. The script copies the files from the dist directory at the
specified version, which may differ from the dist directory on the master
branch that the link points to.
If, after running the script, there are nontrivial changes to the
swagger-initializer.js or oauth2-redirect.html files, cancel the update
and open a new issue instead. You may need to try updating to an older,
intermediate version. Forward trivial changes to those two files to their
respective Mustache template copies, and commit the changes to the script and
all modified files in the swagger/ directory. The commit message must
include the new tag, as well as a link to the upstream source in the commit
body, e.g.:
Update Swagger UI to v<release version> (#issue-number)

https://github.com/swagger-api/swagger-ui/tree/v<release version>/dist
1. Run `_select anvilprod`
2. Run `python scripts/export_inspector_findings.py` to generate a CSV file
3. Select File > Import to import the generated CSV, and on the Import file dialog use these options:
   - Import location: Insert new sheet(s)
   - Convert text to numbers, dates, and formulas: Checked
4. Rename the new tab using `YYYY-MM-DD` with the date of the upgrade issue, and move it to the front of the stack
5. Apply visual formatting (e.g. column width) to the sheet using a previous sheet as a guide
When adding a new snapshot to dev or anvildev, the operator should also
add the snapshot to sandbox or anvilbox, respectively.
The post_deploy_tdr.py script will fail if the computed common prefix
contains an unacceptable number of subgraphs. If the script reports that the
common prefix is too long, truncate it by 1 character. If it's too short, append
1 arbitrary hexadecimal character. Pass the updated prefix as a keyword argument
to the mksrc function for the affected source(s), including a partition
prefix length of 1. Then refresh the environment and re-attempt the deployment.
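The adjustment rule above can be summarized as a small helper. This is a hypothetical illustration of the rule only; in practice the resulting prefix is what you pass to `mksrc`, and the partition prefix length is handled separately:

```python
import string

def adjust_common_prefix(prefix: str, too_long: bool) -> str:
    """Adjust a common prefix rejected by post_deploy_tdr.py.

    Too long (matches an unacceptable number of subgraphs): truncate it
    by one character. Too short: append one arbitrary hex character.
    """
    if too_long:
        return prefix[:-1]
    else:
        suffix = '0'  # any hexadecimal character will do
        assert suffix in string.hexdigits
        return prefix + suffix

assert adjust_common_prefix('42a', too_long=True) == '42'
assert adjust_common_prefix('42a', too_long=False) == '42a0'
```

After each adjustment, refresh the environment and re-attempt the deployment, as described above.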
We decide on a case-by-case basis whether PRs which update or add new snapshots
to prod should be filed against the prod branch instead of develop.
When deciding whether to perform snapshot changes directly to prod or
include them in a routine promotion, the system admin considers the scope of
changes to be promoted. It would be a mistake to promote large changes in
combination with snapshots because that would make it difficult to diagnose
whether indexing failures are caused by the changes or the snapshots.
PRs which remove catalogs or set a new default for prod should be filed
against the prod branch instead of develop.
When setting a new default catalog in prod, the operator shall also delete
the old default catalog unless the ticket explicitly specifies not to delete the
old catalog.
Add a checklist item at the end of the PR checklist to file a back-merge PR from
prod to develop.
Add another checklist item instructing the operator to manually delete the old catalog.
Promotions to prod should happen weekly on Wednesdays, at 3pm. We promote
earlier in the week in order to triage any potential issues during reindexing.
We promote at 3pm to give a cushion of time in case anything goes wrong.
To do a promotion:
- Decide together with the lead up to which commit to promote. This commit will be the HEAD of the promotions branch.
- Create a new GitHub issue with the title `Promotion yyyy-mm-dd`.
- Make sure your `prod` branch is up to date with the remote.
- Create a branch at the commit chosen above. Name the branch correctly. See the promotion PR template for what the correct branch name is.
- File a PR on GitHub from the new promotion branch and connect it to the issue. The PR must target `prod`. Use the promotion PR template.
- Request a review from the primary reviewer.
- Once the PR is approved, announce in the #team-boardwalk Slack channel that you plan to promote to `prod`.
- Search for and follow any special `[u]` upgrading instructions that were added.
- When merging, follow the checklist, making sure to carry over any commit title tags (`[u r R]`, for example) into the default merge commit title, e.g., `[u r R] Merge branch 'promotions/2022-02-22' into prod`. Don't rebase the promotion branch and don't push the promotion branch to GitLab. Merge the promotion branch into `prod` and push the merge commit on the `prod` branch first to GitHub and then to the `prod` instance of GitLab.
There should only ever be one open backport PR against develop. If more
commits accumulate on prod, waiting to be backported, close the existing
backport PR first. The new PR will include the changes from the old one.
Make a branch from `prod` at the most recent commit being backported. Name the branch following this pattern:

backports/<7-digit SHA1 of most recent backported commit>

Open a PR from your branch, targeting `develop`. The PR title should be

Backport: <7-digit SHA1 of most recent backported commit> (#<Issue number(s)>, PR #<PR number>)

Repeat this pattern for each of the older backported commits, if there are any. An example commit title would be

Backport 32c55d7 (#3383, PR #3384) and d574f91 (#3327, PR #3328)

Be sure to use the PR template for backports by appending `&template=backport.md` to the URL in your browser's address bar.

Assign and request review from the primary reviewer. The PR should only be assigned to one person at a time, either the reviewer or the operator.

Perform the merge. The commit title should match the PR title:

git merge prod --no-ff

Push the merge commit to `develop`. It is normal for the branch history to look very ugly following the merge.
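For illustration, the commit title pattern above can be reproduced with a small hypothetical helper:

```python
def backport_pr_title(commits):
    """Assemble a backport commit title from (sha1, issues, pr_number)
    tuples, most recent commit first, per the pattern described above."""
    parts = [
        '{} (#{}, PR #{})'.format(sha, ', #'.join(map(str, issues)), pr)
        for sha, issues, pr in commits
    ]
    return 'Backport ' + ' and '.join(parts)

# Reproduces the example title from this section
title = backport_pr_title([('32c55d7', [3383], 3384),
                           ('d574f91', [3327], 3328)])
assert title == 'Backport 32c55d7 (#3383, PR #3384) and d574f91 (#3327, PR #3328)'
```

The helper name and signature are assumptions for this sketch; there is no such script in the repository.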
The Data Browser is deployed in two steps. The first step is building the
ucsc/data-browser project on GitLab. This is initiated by pushing a branch
whose name matches ucsc/*/* to one of our GitLab instances. The resulting
pipeline produces a tarball stored in the package registry on that GitLab
instance. The second step is running the deploy_browser job of the
ucsc/azul project pipeline on that same instance. This job creates or
updates the necessary cloud infrastructure (CloudFront, S3, ACM, Route 53),
downloads the tarball from the package registry and unpacks that tarball to the
S3 bucket backing the Data Browser's CloudFront distribution.
Typically, CC requests the deployment of a Data Browser instance on Slack,
specifying the commit (tag or sha1) they wish to be deployed. After the
system administrator approves that request, the operator pushes the specified tag
(if a tag was specified) to the GitLab instance for the Azul {deployment}
that backs the Data Browser instance to be deployed. Then the specified tag (or
commit, if no tag was specified) is merged into one of the
ucsc/{atlas}/{deployment} branches. That branch is then pushed to the
DataBiosphere/data-browser project on GitHub, and the ucsc/data-browser
project on GitLab (same instance as above). For the merge commit title,
SmartGit's default can be used, as long as the title reflects the commit (tag or
sha1) specified by CC.
The {atlas} placeholder can be hca, anvil or lungmap. Not all
combinations of {atlas} and {deployment} are valid. Valid combinations
are ucsc/anvil/anvildev, ucsc/anvil/anvilprod, ucsc/hca/dev,
ucsc/hca/prod, ucsc/lungmap/dev or ucsc/lungmap/prod, for example.
The ucsc/data-browser pipeline on GitLab blindly builds any branch, but
Azul's deploy_browser job is configured to only use the tarball from exactly
one branch (see deployments/*.browser/environment.py) and it will always use
the tarball from the most recent pipeline on that branch.
Follow these steps to set up the ZAP application for scanning the HCA and AnVIL systems. This set up only needs to be completed once, and for future scans you can simply jump to Launching ZAP.
- Download ZAP from https://www.zaproxy.org/ .
- Install & open ZAP.
- From the popup, select the No, I do not want to persist this session at this moment in time option and click Start.
- Confirm that ZAP is configured to run in standard mode by first selecting Edit from the app menu bar, then ZAP Mode, then selecting Standard Mode.
- To prevent ZAP scans from exceeding Azul's request rate limit and being temporarily blocked by the system, you will need to configure the maximum rate of requests that ZAP will send out. From the app menu bar, select Tools, then Options, then Network, then Rate Limit. Add and enable a three requests per second rule for the match string `anvilproject.org`, and another rule for the match string `humancellatlas.org`.
- With the Options window open, select Check for Updates from the list of options. Confirm that both Check for updates on startup and Check for updates to the add-ons you have installed are enabled.
- Click OK to close the Options window, and then proceed to exit the ZAP application.
All scans need to be run with authenticated requests. The process for running an authenticated scan is to first obtain an Azul authentication token, and then launch the ZAP application with the token set as an environment variable. ZAP will then use your token to add an authentication header to all requests made during the scan. See the ZAP documentation for more information.
Follow these steps to get an authorization token from Azul:
- Open the Swagger UI for the appropriate (HCA or AnVIL) Azul service.
- Click Authorize, select all scopes, click Authorize, then Close to complete the authorization.
- Using the Swagger UI, execute an endpoint such as `/index/catalogs`.
- Locate the example `curl` command that Swagger produces for you, and copy the token value from the `Authorization` header (e.g. `Bearer ya29.a0…`).
Using the token copied above, you can now set an environment variable and launch ZAP from the command line. Open a terminal window, and run:
export ZAP_AUTH_HEADER_VALUE="<TOKEN-VALUE-HERE>"
/Applications/ZAP.app/Contents/MacOS/ZAP.sh
After the ZAP application has opened, follow the steps below to create a new session and run a scan. After your scan has completed and you have generated a report, close the ZAP application, and then repeat the steps above to start each additional scan with a fresh authentication token.
With the ZAP application open, and prior to running any scan, start a new session and import azul-zap-scan.context from the azul-private repo. Failure to do so will pollute the scan results with known false positives and findings from the previous scan. A new session is created each time you launch ZAP. Alternatively, to manually open a new session, select File from the application menu bar, and then select New Session.
To import the context file select File from the application menu bar, followed
by Import Context… and then proceed to find the azul-zap-scan.context file
and click Open. Confirm the context is In Scope by double-clicking the newly
imported context and ensuring the checkmark is present in the In Scope
checkbox. After clicking OK, a red dot will be shown in the icon next to the
entry labeled Azul Context. Lastly, delete the Default Context by
right-clicking it and selecting the Delete option.
If you are prompted with options to persist the ZAP session, select the No, I
do not want to persist this session at this moment in time option and click
Start.
You may now continue with either a Data Portal / Browser scan or Azul Indexer / Service API scan.
- Using the Quick Start tab, click Automated Scan.
- Enter the desired URL (e.g. https://anvilproject.org/) in the URL to attack field.
- Enable the Use traditional spider option.
- Select If modern from the Use ajax spider option, and Firefox Headless from the With option.
- Click Attack to begin the scan.
- Wait until all the scans (Ajax spider, passive scans, etc.) have completed. In practice, this can take up to four hours depending on the target URL. Note that you will not receive a notification when the scans have completed. Instead, take note of the Current Status values in the ZAP window footer. Proceed when all scan counts show 0.
- Continue with the steps below to generate a report.
In order to run an API scan you must first import the OpenAPI definition:
- From the app menu bar, select Import, then Import an OpenAPI Definition.
- Enter the URL of the OpenAPI definition (e.g. https://service.explore.anvilproject.org/openapi.json) in the URL field.
- Click Import to start the import.
After the import of the OpenAPI definition completes, proceed to run an automated scan using the same steps as for a Data Portal / Browser scan. For the URL to attack, enter the base URL of the Azul indexer or service with no additional path components (e.g. https://service.explore.anvilproject.org/).
After a scan has completed, use the following steps to save a PDF export of the scan results.
- From the app menu bar, select Report, then Generate Report.
- Navigate to the Template tab of the Generate Report window, and select Traditional PDF Report from the Template option.
- Navigate to the Scope tab, and enter a value such as "AnVIL Data Portal" in the Report Title field.
- The Report Name field specifies the name of the file to be created. Enter a value such as "2025-01-01-anvil-data-portal.pdf" in this field.
- Click Generate Report to complete the export.
In some instances, deploying a Terraform component can take a long time. While
_login now makes sure that there are four hours left on the current
credentials, it can't do that if you don't call it before such an operation.
Note that _select also calls _login. The following is a list of
operations which you should expect to take an hour or longer:
- the first time deploying any component
- deploying a plan that creates or replaces an Elasticsearch domain
- deploying a plan that involves ACM certificates
- deploying a shared component after modifying azul_docker_images in environment.py, especially on a slow uplink
To make things worse, if the credentials expire while Terraform is updating
resources, it will not be able to write the partially updated state back to the
shared bucket. A subsequent retry will therefore likely report conflicts due to
already existing resources. The remedy is to import those existing resources
into the Terraform state using terraform import.
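The import follows the usual `terraform import ADDRESS ID` pattern. The resource address and ID below are hypothetical examples; take the real ones from the "already exists" errors reported by the failed retry. A `terraform` stub is included so this sketch can run without AWS access; drop it to execute the commands for real.

```shell
# Stub so the sketch runs without Terraform or AWS access; remove this
# line to run the commands for real.
terraform() { echo "terraform $*"; }

# Import each resource that the retry reported as already existing
# (hypothetical address and ID):
terraform import aws_s3_bucket.storage edu-ucsc-example-storage

# Re-run the plan; imported resources should no longer appear as creations:
terraform plan
```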
If an error occurs when pushing to the develop branch, ensure that the branch you would like to merge is rebased on develop and has completed its CI pipeline. If there is only one approval (from the primary reviewer), an operator may add the second approval to a PR that does not belong to them. If the PR has no approvals (for example, because it belongs to the primary reviewer), the operator may approve it and seek out another team member to perform the second needed review. When making such a pro-forma review, indicate this in the review summary (example).
This can happen when a PR is chained on another PR and the base PR is merged and its branch deleted. To solve this, first restore the base PR branch. The operator should have a copy of the branch locally that they can push. If not, then the PR's original author should.
Once the base branch is restored, the Reopen PR button should again be
clickable on the chained PR.
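The restore itself is an ordinary `git push` of the locally retained branch. A minimal, self-contained sketch (the branch name and repository paths are hypothetical; a throwaway local bare repository stands in for GitHub):

```shell
set -e
tmp=$(mktemp -d)
git init --bare -q "$tmp/origin.git"   # stands in for the GitHub remote
git init -q "$tmp/work"
cd "$tmp/work"
git -c user.email=op@example.com -c user.name=op \
    commit --allow-empty -m init -q
git remote add origin "$tmp/origin.git"

# The base PR branch (hypothetical name) exists on the remote ...
git checkout -q -b issues/op/1234-base-pr
git push -q origin issues/op/1234-base-pr

# ... until GitHub deletes it when the base PR is merged:
git push -q origin --delete issues/op/1234-base-pr

# The operator still has the branch locally, so pushing it restores it:
git push -q origin issues/op/1234-base-pr
git ls-remote --heads origin
```

In practice, the remote is the DataBiosphere repository and the branch is the one the merged base PR was opened from.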
This can happen on the rare occasion that the IT's random selection of bundles predominantly picks large bundles that need to be partitioned before they can be indexed. This process can divide bundles into partitions, and partitions into sub-partitions, since technically a bundle is a partition with an empty prefix.
In the AWS console, run the CloudWatch Insights query below with the indexer log groups selected to see how many divisions have occurred:
fields @timestamp, @log, @message
| filter @message like 'Dividing partition'
| parse 'Dividing partition * of bundle *, version *, with * entities into * sub-partitions.' as partition, bundle, version, entities, subpartitions
| display partition, bundle, version, entities, subpartitions
| stats count(@requestId) as total_count by bundle, partition
| sort total_count desc
| sort @timestamp desc
| limit 1000
Note that while bundles are being partitioned, errors about exceeded rate and quota limits are to be expected:
[ERROR] TransportError: TransportError(429, '429 Too Many Requests /azul_v2_prod_dcp17-it_cell_suspensions/_search')

[ERROR] Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/...: Quota exceeded: Your project:XXXXXXXXXXXX exceeded quota for tabledata.list bytes per second per project. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas
Follow these steps to retry the IT job:
Cancel the ongoing IT job (if in progress)
Comment on issue #4299 with a link to the failed job
Purge the queues:
python scripts/manage_queues.py purge_all
Rerun the IT job
Continuous integration environments (GitLab, Travis) may need a GitHub token to
access GitHub's API. To avoid using a personal access token tied to any
particular developer's account, we created a Google Group called
azul-group@ucsc.edu of which Hannes is the owner. We then used that group
email to register a bot account in GitHub. Apparently that's ok:
User accounts are intended for humans, but you can give one to a robot, such as a continuous integration bot, if necessary.
Only Hannes knows the GitHub password of the bot account but any member of the group can request the password to be reset. All members will receive the password reset email. Hannes knows the 2FA recovery codes.
- Old operator must finish any merges in progress. The sandbox should be empty. The new operator should inherit a clean slate. This should be done before the first working day of the new operator's shift.
- Old operator must re-assign all tickets in the approved column to the new operator.
- Old operator must re-assign expected indexing failure tickets to the new operator, along with the ticket that tracks operator duties.
- New operator must request the necessary permissions, as specified in Getting started as operator.