Disable nightly sync of Asset Manager assets from asset-slave-2 to S3 #7019
Conversation
The push-attachments-to-s3.sh script is run by the push_attachments_to_s3 cron job daily at 21:00 and is only enabled on asset-slave-2 in production. It uses `s3cmd sync` to sync /mnt/uploads (i.e. all assets) to the govuk-attachments-production S3 bucket.

In production the copy-attachments-to-slaves.sh script also writes to the same S3 bucket. It's run by the copy-attachments-to-slaves cron job, which runs every minute on asset-master-1. In production process_uploaded_attachments_to_s3 is set to true, so copy-attachments-to-slaves.sh uses `s3cmd put` to copy virus-scanned assets to the same govuk-attachments-production S3 bucket. However, this is only relevant to Whitehall assets, because Asset Manager virus scanning works differently. Thus it seems that currently Asset Manager assets are only copied to the govuk-attachments-production S3 bucket every night and not continuously like Whitehall assets.

It's not clear to me what purpose this S3 bucket is serving, given that there are also Duplicity jobs creating off-site backups to a different S3 bucket and a cron job rsyncing files from the asset master to each of the asset slaves. However, I suppose the Duplicity backups will be up to 1 day out-of-date and the asset slaves are not off-site, so perhaps it's filling that gap. It's worth noting there's an attachments-s3-env-sync.sh script on the asset master in staging & integration which looks like it was intended to sync from the govuk-attachments-production S3 bucket, but it does not appear to be called from anywhere.

Since this change [1] to Asset Manager, the files for new assets are deleted from the filesystem once they have been virus scanned and uploaded to S3, because Asset Manager now serves them from S3 via Nginx. Thus the Asset Manager app should no longer be permanently adding asset files to the filesystem and there's no need to have the asset-manager sub-directory under /mnt/uploads synced to S3 by the push-attachments-to-s3.sh script; hence the change in this commit to exclude that directory.

I would've preferred to change the main source directory for the `s3cmd sync` command from /mnt/uploads to /mnt/uploads/whitehall, i.e. so only the whitehall sub-directory is synced, but there appear to be other non-asset-related sub-directories: the lost+found sub-directory is already excluded, but publisher & support-api are not. So it seems simpler to exclude asset-manager for now.

We're about to delete the files for Asset Manager assets which have been uploaded to S3, i.e. the vast majority of them. We plan to use this Asset Manager Rake task [2] to delete the files via the Carrierwave uploader mounted on Asset#file. This will delete the underlying file from the uploads directory under the Rails root directory, which is sym-linked to /data/uploads/asset-manager. The latter is where the asset master's /mnt/uploads directory is mounted using NFS. If we were to leave this script unchanged, its `s3cmd sync` command would have deleted all the Asset Manager assets from the S3 bucket. By excluding the asset-manager sub-directory, we can leave a recent set of Asset Manager assets in the S3 bucket, acting as a kind of backup in case we run into any unforeseen problems when deleting the assets.

The script should continue to run and so the push_attachments_to_s3_xxx Icinga check should not report any alerts.

[1]: alphagov/asset-manager#373
[2]: https://github.com/alphagov/asset-manager/blob/d803db930614a6063c0fc16730f6ba3eaf08e6d9/lib/tasks/govuk_assets.rake#L5
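To make the moving parts easier to picture, here's a rough sketch of what the nightly job does with this change applied. The exact flags, wrapper logic and exclude patterns live in push-attachments-to-s3.sh and its Puppet template, so treat this as illustrative; in particular, the `--delete-removed` flag is my inference from the fact that an unchanged sync would have removed the deleted Asset Manager files from the bucket.

```sh
# Illustrative sketch only – the real invocation is in push-attachments-to-s3.sh,
# run nightly at 21:00 by the push_attachments_to_s3 cron job on asset-slave-2.
#
# --delete-removed is assumed: it mirrors local deletions into the bucket,
# which is why syncing without the new exclusion would have deleted the
# Asset Manager objects from S3 once the files are removed from NFS.
s3cmd sync --delete-removed \
  --exclude 'lost+found/*' \
  --exclude 'asset-manager/*' \
  /mnt/uploads/ s3://govuk-attachments-production/
```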
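For comparison, the continuous Whitehall path is essentially a per-file put rather than a sync. Again this is only a sketch: the example file path is invented, and the real copy-attachments-to-slaves.sh derives the destination key itself.

```sh
# Sketch of the per-minute copy-attachments-to-slaves.sh behaviour for a newly
# virus-scanned Whitehall asset (file path invented for illustration).
s3cmd put /mnt/uploads/whitehall/some/path/report.pdf \
  s3://govuk-attachments-production/whitehall/some/path/report.pdf
```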
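Once the Asset Manager files have been deleted from NFS, something like the following could confirm the excluded prefix is still intact in the bucket; the bucket name is taken from the description above, while the exact key layout under the asset-manager prefix is an assumption.

```sh
# After deleting the files from NFS, the asset-manager prefix should remain in
# the bucket because the nightly sync now skips it.
s3cmd ls s3://govuk-attachments-production/asset-manager/ | head
```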
For context: it differs from Duplicity in that Duplicity is an incremental backup rather than just a straight copy of what's currently in play. I hope that makes more sense, and I positively look forward to saying goodbye to all those bash scripts in the future.
Yes, it does make more sense - it was pretty much the conclusion I had come to, but it's good to have it confirmed. I think we're now protected against the scenario you describe (for Asset Manager assets at least) by having versioning enabled on the new-ish bucket. I'm confident we're going to be able to gradually remove/simplify much of the asset-related stuff in this repo.
@floehopper and I have just chatted about this. We've agreed that it makes more sense to exclude this directory from the sync after we've removed the asset-manager files from NFS and after this sync has run again. This will ensure that the asset-manager files we delete from NFS are also removed from the S3 bucket.
I'm approving the change but we should hold off merging as outlined in my comment on the PR.
@chrisroos: Your comment matches my understanding of what we just agreed. Thanks for reviewing.
All Asset Manager asset files were deleted from production yesterday afternoon - see alphagov/asset-manager#296 for details. I can see that the "Push attachments to S3" Nagios check on asset-slave-2 is not reporting any alerts. Thus I'm now happy to merge this PR.