Remove existing Asset Manager asset files from NFS #296

Closed · floehopper opened this issue Nov 9, 2017 · 12 comments
floehopper commented Nov 9, 2017

Note that we need to implement a solution for new assets before doing this.

Once we have enabled proxying to S3 via Nginx for staging & integration, we should be able to delete all Asset Manager asset files from NFS on all environments and rely entirely on the assets being in the relevant S3 bucket.

  • If the nightly off-site backup job has not been removed, we should be aware that any changes we make to the production NFS mount will be synced to staging & integration overnight.
  • We should be careful not to delete any files which are currently being processed by Sidekiq jobs (e.g. awaiting/undergoing virus scan or S3 upload). We can probably achieve this by iterating over the assets in the Asset Manager database and checking the state of each asset before actually deleting the underlying file from NFS (see the sketch after this list).
  • We should be careful not to delete any Whitehall assets from NFS, i.e. files under /mnt/uploads/whitehall as opposed to /mnt/uploads/asset-manager.
  • We should not remove the /mnt/uploads/asset-manager directory, because the Asset Manager app will still need to store files there until they have been virus scanned and uploaded to S3.
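
Something along these lines would probably work - a hypothetical sketch only, where the worker name and the file-path accessor (asset.file.path) are illustrative rather than actual Asset Manager code:

# Hypothetical sketch - queue a deletion job for each asset that has already
# reached the "uploaded" state, so files still awaiting virus scanning or S3
# upload are left alone.
namespace :govuk_assets do
  desc "Delete files from NFS for assets which have been uploaded to S3"
  task delete_file_from_nfs_for_assets_uploaded_to_s3: :environment do
    assets = Asset.where(state: "uploaded").only(:_id)
    total = assets.count
    assets.each_with_index do |asset, index|
      DeleteFileFromNfsWorker.perform_async(asset.id.to_s)
      done = index + 1
      puts "#{done} of #{total} (#{done * 100 / total}%) assets" if (done % 1000).zero?
    end
  end
end

# The worker re-checks the asset's state before touching the file, in case it
# changed between the job being queued and being processed.
class DeleteFileFromNfsWorker
  include Sidekiq::Worker

  def perform(asset_id)
    asset = Asset.find(asset_id)
    return unless asset.state == "uploaded"
    path = asset.file.path   # somewhere under /mnt/uploads/asset-manager/assets
    File.delete(path) if File.exist?(path)
  end
end
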
@floehopper floehopper added this to the Decommission NFS for mainstream assets milestone Nov 9, 2017
@floehopper floehopper changed the title Delete Asset Manager assets from NFS Delete Asset Manager asset files from NFS Dec 5, 2017
@floehopper floehopper changed the title Delete Asset Manager asset files from NFS Remove Asset Manager asset files from NFS Dec 5, 2017
@floehopper floehopper changed the title Remove Asset Manager asset files from NFS Remove existing Asset Manager asset files from NFS Dec 5, 2017
floehopper commented:

Note: We should stop the nightly Duplicity backup of the Asset Manager assets before doing this.

floehopper commented Jan 3, 2018

> Note: We should stop the nightly Duplicity backup of the Asset Manager assets before doing this.

I've rebased alphagov/govuk-puppet#6768 against master, force-pushed, and requested a review in preparation for stopping the nightly backup.

floehopper commented:

Here's a first attempt at a plan:

Preparatory steps

  1. Merge, deploy & apply Remove sync-asset-manager-from-master cron job from asset slaves. This needs to be applied in each environment before we run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in that environment, because the cron job runs every 10 mins and we want to avoid a sudden increase in load on the asset slaves and/or the NFS mount. Also (at least initially) we want to avoid deleting the Asset Manager assets from the asset slaves.
  2. Merge, deploy & apply Disable nightly sync of Asset Manager assets from asset-slave-2 to S3. This needs to be applied in production before 21:00 if we want to avoid Asset Manager assets being deleted from the govuk-attachments-production S3 bucket.
  3. Merge, deploy & apply Remove nightly off-site backup of Asset Manager assets from production NFS to S3. This needs to be applied in production before 04:13 if we want to avoid Asset Manager assets being deleted from the off-site backup, i.e. the govuk-offsite-backups-production S3 bucket.

Integration

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in integration shortly. Suggest that they acknowledge the alert with an expire time (TODO: how long?).
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in integration.

Staging

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in staging shortly. Suggest that they acknowledge the alert with an expire time (TODO: how long?).
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in staging.

Production

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in production shortly. Suggest that they acknowledge the alert with an expire time (TODO: how long?).
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in production.

Note: I think it's worth running the Rake task in each environment rather than just in production, so that the nightly environment sync doesn't have to delete all the Asset Manager asset files itself, which might make that job take a long time. However, an alternative approach would be to only run the Rake task in production and allow the environment syncing to delete the asset files in staging & integration.

floehopper commented Jan 8, 2018

I ran the Rake task on integration at 10:55 today (see output below). It took about 16 mins to queue ~600K jobs.

10:55:10 Started by user James Mead
10:55:10 [EnvInject] - Loading node environment variables.
10:55:10 Building in workspace /var/lib/jenkins/workspace/run-rake-task
10:55:10 [run-rake-task] $ /bin/sh -xe /tmp/hudson2895038369793271941.sh
10:55:10 + govuk_node_list -c backend --single-node
10:55:10 + ssh [email protected] cd /var/apps/asset-manager && govuk_setenv asset-manager bundle exec rake govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3
10:55:24 1000 of 590552 (0%) assets
10:55:25 2000 of 590552 (0%) assets
10:55:26 3000 of 590552 (1%) assets
10:55:27 4000 of 590552 (1%) assets
10:55:28 5000 of 590552 (1%) assets
10:55:28 6000 of 590552 (1%) assets
10:55:29 7000 of 590552 (1%) assets
10:55:30 8000 of 590552 (1%) assets
10:55:31 9000 of 590552 (2%) assets
10:55:32 10000 of 590552 (2%) assets

...

11:11:23 590000 of 590552 (100%) assets
11:11:24 590552 of 590552 (100%) assets
11:11:24
11:11:24 Finished!
11:11:24 Finished: SUCCESS

floehopper commented Jan 8, 2018

It took ~1h30m to process all the jobs and there were no errors/retries:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - sidekiq

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - sidekiq-detail

The CPU usage increased during this time, but not above the critical level:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - backend-resource-usage

The asset master/slave disk comparison alert went critical for some of the time, but all of the files were deleted within 20 mins and the slaves had caught up within 50 mins:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - integration - asset-master-slave-comparison

floehopper commented:

I ran the following commands on asset-master-1 to work out how much disk space (in KB) is being used by Asset Manager assets:

$ du -d1 /mnt/uploads
2466360	/mnt/uploads/asset-manager
703121772	/mnt/uploads/whitehall
6872	/mnt/uploads/publisher
20321400	/mnt/uploads/support-api
16	/mnt/uploads/lost+found
725916424	/mnt/uploads

The figures we recorded on 05 Jan were:

43262172	/mnt/uploads/asset-manager
702965552	/mnt/uploads/whitehall

Thus we see a drop of 43,262,172 - 2,466,360 = 40,795,812K (i.e. ~38.9GB), leaving 2,466,360K (i.e. ~2.4GB) in use.
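
As a quick sanity check of that conversion in Ruby (du reports 1K blocks; 1GB = 1024 * 1024 KB):

(43_262_172 - 2_466_360) / 1024.0 / 1024.0  # => ~38.9 (GB freed)
2_466_360 / 1024.0 / 1024.0                 # => ~2.4  (GB still in use)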

The following suggests that most of the disk space is being used in the /mnt/uploads/asset-manager/assets directory.

$ du -d1 /mnt/uploads/asset-manager/
6840	/mnt/uploads/asset-manager/tmp
2459516	/mnt/uploads/asset-manager/assets
2466360	/mnt/uploads/asset-manager/

This makes sense, since /mnt/uploads/asset-manager/tmp is only used temporarily by CarrierWave before it moves files into the /mnt/uploads/asset-manager/assets directory. However, it seems slightly odd that so much disk space is still being used in /mnt/uploads/asset-manager/assets.

Running the following command from within the /mnt/uploads/asset-manager/assets directory demonstrates that there are no files (only directories) left under it.

$ find . -type f | wc -l
0

And the following command shows that there are ~600K empty directories under the /mnt/uploads/asset-manager/assets directory.

$ find . -type d -empty | wc -l
590556

I think it must be the directories themselves which are taking up the disk space: 590,556 directories at (typically) one 4K filesystem block each comes to ~2,362,224K, which is close to the 2,459,516K that du reports for the assets directory.

floehopper commented:

I've created a new issue to capture the idea of clearing up the empty directories, because I don't think this is urgent. Something like the sketch below would probably do it.
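
A minimal sketch of what that clean-up might look like (hypothetical - not code from that issue - assuming Ruby 2.4+ for Dir.empty?):

require "find"

root = "/mnt/uploads/asset-manager/assets"
dirs = []
Find.find(root) do |path|
  dirs << path if File.directory?(path) && path != root
end
# Longest paths first, so child directories are removed before their parents.
dirs.sort_by(&:length).reverse_each do |dir|
  Dir.rmdir(dir) if Dir.empty?(dir)
end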

floehopper commented Jan 8, 2018

@rubenarakelyan kicked off the Rake task in production at 15:22 today and 600845 jobs were queued (see output below). It took about 9 mins to queue them all:

15:22:46 Started by user Ruben Arakelyan
15:22:46 [EnvInject] - Loading node environment variables.
15:22:46 Building in workspace /var/lib/jenkins/workspace/run-rake-task
15:22:46 [run-rake-task] $ /bin/sh -xe /tmp/hudson5905563818501335203.sh
15:22:46 + govuk_node_list -c backend --single-node
15:22:46 + ssh [email protected] cd /var/apps/asset-manager && govuk_setenv asset-manager bundle exec rake govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3
15:22:49 I, [2018-01-08T15:22:49.344739 #29870]  INFO -- sentry: ** [Raven] Raven 2.7.1 ready to catch errors
15:22:49 MONGODB | Topology type 'unknown' initializing.
15:22:49 MONGODB | Server mongo-1.backend:27017 initializing.
15:22:49 MONGODB | Server description for mongo-1.backend:27017 changed from 'unknown' to 'secondary'.
15:22:49 MONGODB | Server mongo-3.backend:27017 initializing.
15:22:49 MONGODB | Server description for mongo-3.backend:27017 changed from 'unknown' to 'primary'.
15:22:49 MONGODB | Server mongo-2.backend:27017 initializing.
15:22:49 MONGODB | Server description for mongo-2.backend:27017 changed from 'unknown' to 'secondary'.
15:22:49 MONGODB | There was a change in the members of the 'unknown' topology.
15:22:49 MONGODB | Topology type 'unknown' changed to type 'replica set'.
15:22:49 MONGODB | There was a change in the members of the 'replica set' topology.
15:22:49 MONGODB | mongo-3.backend:27017 | govuk_assets_production.find | STARTED | {"find"=>"assets", "filter"=>{"state"=>"uploaded"}, "projection"=>{"_id"=>1}}
15:22:49 MONGODB | mongo-3.backend:27017 | govuk_assets_production.find | SUCCEEDED | 0.011259110000000001s
15:22:49 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:51 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 1.332767853s
15:22:51 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:52 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 1.328760374s
15:22:53 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:54 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 1.53358949s
15:22:55 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | STARTED | {"getMore"=>4166065128594334594, "batchSize"=>0, "collection"=>"assets"}
15:22:55 MONGODB | mongo-3.backend:27017 | govuk_assets_production.getMore | SUCCEEDED | 0.26837663300000003s
15:22:56 1000 of 600845 (0%) assets
15:22:56 2000 of 600845 (0%) assets
15:22:57 3000 of 600845 (0%) assets
15:22:58 4000 of 600845 (1%) assets
15:22:59 5000 of 600845 (1%) assets
15:22:59 6000 of 600845 (1%) assets
15:23:00 7000 of 600845 (1%) assets
15:23:01 8000 of 600845 (1%) assets
15:23:02 9000 of 600845 (1%) assets
15:23:03 10000 of 600845 (2%) assets

...

15:31:05 600000 of 600845 (100%) assets
15:31:05 600845 of 600845 (100%) assets
15:31:05 
15:31:05 Finished!
15:31:06 Finished: SUCCESS

floehopper commented:

It took ~25m to process all the jobs and there were no errors/retries:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - sidekiq

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - sidekiq-detail

The CPU usage increased during this time, but not above the critical level:

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - backend-resource-usage

The asset master/slave disk comparison alert went critical for some of the time, but all of the files were deleted within 6 mins. asset-slave-1 had caught up within ~26 mins, but asset-slave-2 took more like 54 mins for some reason.

2018-01-08 govuk_assets-delete_file_from_nfs_for_assets_uploaded_to_s3 - production - asset-master-slave-comparison

floehopper commented Jan 8, 2018

After discussion with @chrisroos, I took a different approach to that outlined in this earlier comment:

Integration

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in integration shortly.
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in integration.
  3. As long as all the jobs are processed OK, proceed to production.

Staging

  1. No need to run the Rake task in staging - we can rely on the nightly environment sync jobs to take care of this for us.

Production

  1. Warn 2nd Line that the "Asset master and slave are using about the same amount of disk space" Icinga check is likely to go "critical" on the asset slaves in production shortly.
  2. Run the govuk_assets:delete_file_from_nfs_for_assets_uploaded_to_s3 Asset Manager Rake task in production.

Same day

  1. Merge, deploy & apply Remove nightly off-site backup of Asset Manager assets from production NFS to S3. This needs to be applied in production before 04:13 if we want to avoid Asset Manager assets being deleted from the off-site backup, i.e. the govuk-offsite-backups-production S3 bucket.

Next day

  1. Merge, deploy & apply Remove sync-asset-manager-from-master cron job from asset slaves.
  2. Merge, deploy & apply Disable nightly sync of Asset Manager assets from asset-slave-2 to S3.

I have completed all of the above steps except the two "Next day" ones which I plan to do tomorrow.

floehopper commented:

alphagov/govuk-puppet#7016 and alphagov/govuk-puppet#7019 have both been merged and so I'm happy to close this issue.
