Remove existing Asset Manager asset files from NFS #296

Closed · floehopper opened this issue Nov 9, 2017 · 12 comments
floehopper commented Nov 9, 2017

Note that we need to implement a solution for new assets before doing this.

Once we have enabled proxying to S3 via Nginx for staging & integration, we should be able to delete all Asset Manager asset files from NFS on all environments and rely entirely on the assets being in the relevant S3 bucket.

  • If the nightly off-site backup job has not been removed, we should be aware that any changes we make to the production NFS mount will be synced to staging & integration overnight.
  • We should be careful not to delete any files which are currently being processed by Sidekiq jobs (e.g. awaiting/undergoing virus scan or S3 upload). We can probably achieve this by iterating over the assets in the Asset Manager database and checking the state of each asset before actually deleting the underlying file from NFS (see the sketch after this list).
  • We should be careful not to delete any Whitehall assets from NFS, i.e. files under /mnt/uploads/whitehall as opposed to /mnt/uploads/asset-manager.
  • We should not remove the /mnt/uploads/asset-manager directory, because the Asset Manager app will still need to store files there until they have been virus scanned and uploaded to S3.
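
Something along these lines would probably work - a hypothetical sketch only, where the worker name and the file-path accessor (asset.file.path) are illustrative rather than actual Asset Manager code:

# Hypothetical sketch - queue a deletion job for each asset that has already
# reached the "uploaded" state, so files still awaiting virus scanning or S3
# upload are left alone.
namespace :govuk_assets do
  desc "Delete files from NFS for assets which have been uploaded to S3"
  task delete_file_from_nfs_for_assets_uploaded_to_s3: :environment do
    assets = Asset.where(state: "uploaded").only(:_id)
    total = assets.count
    assets.each_with_index do |asset, index|
      DeleteFileFromNfsWorker.perform_async(asset.id.to_s)
      done = index + 1
      puts "#{done} of #{total} (#{done * 100 / total}%) assets" if (done % 1000).zero?
    end
  end
end

# The worker re-checks the asset's state before touching the file, in case it
# changed between the job being queued and being processed.
class DeleteFileFromNfsWorker
  include Sidekiq::Worker

  def perform(asset_id)
    asset = Asset.find(asset_id)
    return unless asset.state == "uploaded"
    path = asset.file.path   # somewhere under /mnt/uploads/asset-manager/assets
    File.delete(path) if File.exist?(path)
  end
end
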
@floehopper floehopper added this to the Decommission NFS for mainstream assets milestone Nov 9, 2017
@floehopper floehopper changed the title Delete Asset Manager assets from NFS Delete Asset Manager asset files from NFS Dec 5, 2017
@floehopper floehopper changed the title Delete Asset Manager asset files from NFS Remove Asset Manager asset files from NFS Dec 5, 2017
@floehopper floehopper changed the title Remove Asset Manager asset files from NFS Remove existing Asset Manager asset files from NFS Dec 5, 2017
floehopper commented:

Note: We should stop the nightly Duplicity backup of the Asset Manager assets before doing this.

floehopper commented Jan 3, 2018

> Note: We should stop the nightly Duplicity backup of the Asset Manager assets before doing this.

I've rebased alphagov/govuk-puppet#6768 against master, force-pushed, and requested a review in preparation for stopping the nightly backup.

floehopper commented:

Here's a first attempt at a plan:

Preparatory steps

  1. Merge, deploy & apply Remove sync-asset-manager-from-master cron job from asset slaves. This needs to be applied in each environment before we run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in that environment, because the cron job runs every 10 mins and we want to avoid a sudden increase in load on the asset slaves and/or the NFS mount. Also (at least initially) we want to avoid deleting the Asset Manager assets from the asset slaves.
  2. Merge, deploy & apply Disable nightly sync of Asset Manager assets from asset-slave-2 to S3. This needs to be applied in production before 21:00 if we want to avoid Asset Manager assets being deleted from the govuk-attachments-production S3 bucket.
  3. Merge, deploy & apply Remove nightly off-site backup of Asset Manager assets from production NFS to S3. This needs to be applied in production before 04:13 if we want to avoid Asset Manager assets being deleted from the off-site backup, i.e. the govuk-offsite-backups-production S3 bucket.

Integration

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in integration shortly. Suggest that they acknowledge the alert with an expire time (TODO: how long?).
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in integration.

Staging

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in staging shortly. Suggest that they acknowledge the alert with an expire time (TODO: how long?).
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in staging.

Production

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in production shortly. Suggest that they acknowledge the alert with an expire time (TODO: how long?).
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in production.

Note: I think it's worth running the Rake task in each environment rather than just in production, so that the nightly environment sync doesn't have to delete all the Asset Manager asset files itself, which might make that job take a long time. However, an alternative approach would be to only run the Rake task in production and allow the environment syncing to delete the asset files in staging & integration.

floehopper commented Jan 8, 2018

I ran the Rake task on integration at 10:55 today (see output below). It took about 16 mins to queue ~600K jobs.

10:55:10 Started by user James Mead
10:55:10 [EnvInject] - Loading node environment variables.
10:55:10 Building in workspace /var/lib/jenkins/workspace/run-rake-task
10:55:10 [run-rake-task] $ /bin/sh -xe /tmp/hudson2895038369793271941.sh
10:55:10 + govuk_node_list -c backend --single-node
10:55:10 + ssh [email protected] cd /var/apps/asset-manager && govuk_setenv asset-manager bundle exec rake govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3
10:55:24 1000 of 590552 (0%) assets
10:55:25 2000 of 590552 (0%) assets
10:55:26 3000 of 590552 (1%) assets
10:55:27 4000 of 590552 (1%) assets
10:55:28 5000 of 590552 (1%) assets
10:55:28 6000 of 590552 (1%) assets
10:55:29 7000 of 590552 (1%) assets
10:55:30 8000 of 590552 (1%) assets
10:55:31 9000 of 590552 (2%) assets
10:55:32 10000 of 590552 (2%) assets

...

11:11:23 590000 of 590552 (100%) assets
11:11:24 590552 of 590552 (100%) assets
11:11:24
11:11:24 Finished!
11:11:24 Finished: SUCCESS

floehopper commented Jan 8, 2018

It took ~1h30m to process all the jobs and there were no errors/retries:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - sidekiq

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - sidekiq-detail

The CPU usage increased during this time, but not above the critical level:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - backend-resource-usage

The asset master/slave disk comparison alert went critical for some of the time, but all of the files were deleted within 20 mins and the slaves had caught up within 50 mins:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - asset-master-slave-comparison

floehopper commented:

I ran the following commands on asset-master-1 to work out how much disk space (in KB) is being used by Asset Manager assets:

$ du -d1 /mnt/uploads
2466360	/mnt/uploads/asset-manager
703121772	/mnt/uploads/whitehall
6872	/mnt/uploads/publisher
20321400	/mnt/uploads/support-api
16	/mnt/uploads/lost+found
725916424	/mnt/uploads

The figures we recorded on 05 Jan were:

43262172	/mnt/uploads/asset-manager
702965552	/mnt/uploads/whitehall

Thus we see a drop of 43,262,172 - 2,466,360 = 40,795,812K (i.e. ~38.9GB), leaving 2,466,360K (i.e. ~2.4GB) in use.
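
As a quick sanity check of that conversion in Ruby (du reports 1K blocks; 1GB = 1024 * 1024 KB):

(43_262_172 - 2_466_360) / 1024.0 / 1024.0  # => ~38.9 (GB freed)
2_466_360 / 1024.0 / 1024.0                 # => ~2.4  (GB still in use)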

The following suggests that most of the disk space is being used in the /mnt/uploads/asset-manager/assets directory.

$ du -d1 /mnt/uploads/asset-manager/
6840	/mnt/uploads/asset-manager/tmp
2459516	/mnt/uploads/asset-manager/assets
2466360	/mnt/uploads/asset-manager/

This makes sense, since /mnt/uploads/asset-manager/tmp is only used temporarily by CarrierWave before it moves files into the /mnt/uploads/asset-manager/assets directory. However, it seems slightly odd that so much disk space is still being used in /mnt/uploads/asset-manager/assets.

Running the following command from within the /mnt/uploads/asset-manager/assets directory demonstrates that there are no files (only directories) left under it.

$ find . -type f | wc -l
0

And the following command shows that there are ~600K empty directories under the /mnt/uploads/asset-manager/assets directory.

$ find . -type d -empty | wc -l
590556

I think it must be the directories themselves which are taking up the disk space: 590,556 directories at (typically) one 4K filesystem block each comes to ~2,362,224K, which is close to the 2,459,516K that du reports for the assets directory.

floehopper commented:

I've created a new issue to capture the idea of clearing up the empty directories, because I don't think this is urgent. Something like the sketch below would probably do it.
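
A minimal sketch of what that clean-up might look like (hypothetical - not code from that issue - assuming Ruby 2.4+ for Dir.empty?):

require "find"

root = "/mnt/uploads/asset-manager/assets"
dirs = []
Find.find(root) do |path|
  dirs << path if File.directory?(path) && path != root
end
# Longest paths first, so child directories are removed before their parents.
dirs.sort_by(&:length).reverse_each do |dir|
  Dir.rmdir(dir) if Dir.empty?(dir)
end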

floehopper commented Jan 8, 2018

@rubenarakelyan kicked off the Rake task in production at 15:22 today and 600845 jobs were queued (see output below). It took about 9 mins to queue them all:

15:22:46 Started by user Ruben Arakelyan
15:22:46 [EnvInject] - Loading node environment variables.
15:22:46 Building in workspace /var/lib/jenkins/workspace/run-rake-task
15:22:46 [run-rake-task] $ /bin/sh -xe /tmp/hudson5905563818501335203.sh
15:22:46 + govuk_node_list -c backend --single-node
15:22:46 + ssh [email protected] cd /var/apps/asset-manager && govuk_setenv asset-manager bundle exec rake govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3
15:22:49 I, [2018-01-08T15:22:49.344739 #29870]  INFO -- sentry: ** [Raven] Raven 2.7.1 ready to catch errors
15:22:49 MONGODB | Topology type 'unknown' initializing.
15:22:49 MONGODB | Server mongo-1.backend:27017 initializing.
15:22:49 MONGODB | Server description for mongo-1.backend:27017 changed from 'unknown' to 'secondary'.
15:22:49 MONGODB | Server mongo-3.backend:27017 initializing.
15:22:49 MONGODB | Server description for mongo-3.backend:27017 changed from 'unknown' to 'primary'.
15:22:49 MONGODB | Server mongo-2.backend:27017 initializing.
15:22:49 MONGODB | Server description for mongo-2.backend:27017 changed from 'unknown' to 'secondary'.
15:22:49 MONGODB | There was a change in the members of the 'unknown' topology.
15:22:49 MONGODB | Topology type 'unknown' changed to type 'replica set'.
15:22:49 MONGODB | There was a change in the members of the 'replica set' topology.
15:22:49 MONGODB | mongo-3.backend:27017 | govuk_assets_production.find | STARTED | {"find"=>"assets", "filter"=>{"state"=>"uploaded"}, "projection"=>{"_id"=>1}}
15:22:49 MONGODB | mongo-3.backend:27017 | govuk_assets_production.find | SUCCEEDED | 0.011259110000000001s
15:22:49 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:51 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 1.332767853s
15:22:51 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:52 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 1.328760374s
15:22:53 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:54 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 1.53358949s
15:22:55 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:55 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 0.26837663300000003s
15:22:56 1000 of 600845 (0%) assets
15:22:56 2000 of 600845 (0%) assets
15:22:57 3000 of 600845 (0%) assets
15:22:58 4000 of 600845 (1%) assets
15:22:59 5000 of 600845 (1%) assets
15:22:59 6000 of 600845 (1%) assets
15:23:00 7000 of 600845 (1%) assets
15:23:01 8000 of 600845 (1%) assets
15:23:02 9000 of 600845 (1%) assets
15:23:03 10000 of 600845 (2%) assets

...

15:31:05 600000 of 600845 (100%) assets
15:31:05 600845 of 600845 (100%) assets
15:31:05 
15:31:05 Finished!
15:31:06 Finished: SUCCESS

floehopper commented:

It took ~25m to process all the jobs and there were no errors/retries:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - sidekiq

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - sidekiq-detail

The CPU usage increased during this time, but not above the critical level:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - backend-resource-usage

The asset master/slave disk comparison alert went critical for some of the time, but all of the files were deleted within 6 mins. asset-slave-1 had caught up within ~26 mins, but asset-slave-2 took more like 54 mins for some reason.

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - asset-master-slave-comparison

floehopper commented Jan 8, 2018

After discussion with @chrisroos, I took a different approach to that outlined in this earlier comment:

Integration

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in integration shortly.
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in integration.
  3. As long as all the jobs are processed OK, proceed to production.

Staging

  1. No need to run the Rake task in staging - we can rely on the nightly environment sync jobs to take care of this for us.

Production

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in production shortly.
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in production.

Same day

  1. Merge, deploy & apply Remove nightly off-site backup of Asset Manager assets from production NFS to S3. This needs to be applied in production before 04:13 if we want to avoid Asset Manager assets being deleted from the off-site backup, i.e. the govuk-offsite-backups-production S3 bucket.

Next day

  1. Merge, deploy & apply Remove sync-asset-manager-from-master cron job from asset slaves.
  2. Merge, deploy & apply Disable nightly sync of Asset Manager assets from asset-slave-2 to S3.

I have completed all of the above steps except the two "Next day" ones which I plan to do tomorrow.

floehopper commented:

alphagov/govuk-puppet#7016 and alphagov/govuk-puppet#7019 have both been merged and so I'm happy to close this issue.
