This system automates the process of crawling websites, generating site evaluation reports using the Unlighthouse tool, and uploading CSV results to Google Sheets. It consists of two main components:
- unlighthouse-gTracker.sh: A shell script that reads URLs from a YAML configuration file, runs Unlighthouse scans for the URLs scheduled on a given day of the week, and logs the output.
- unlighthouse-gTracker.js: A Node.js script that processes the CSV results from Unlighthouse scans, uploads the data to Google Sheets, and manages the summary sheet. It handles Google Sheets authentication, retries failed scans, and monitors memory usage.
Together, these scripts allow for an automated, scheduled website scanning and reporting workflow.
- Extracts URLs from a YAML configuration file (unlighthouse-sites.yml) based on the current day of the week or a specified day.
- Runs Unlighthouse scans for each URL and logs the results.
- Closes Chrome Canary and Chrome Helper processes after each scan to prevent resource exhaustion.
- Forces garbage collection after each scan to manage memory efficiently.
- Logs details such as the Node.js version, start/end times, and URLs scanned.
- Fetches and parses CSV files generated by the Unlighthouse scans.
- Authenticates with Google Sheets API and uploads the scan data to a newly created or existing Google Sheet.
- Handles dynamic creation of new Google Sheets, ensuring unique sheet names.
- Appends metadata, such as the current date and URL, to a summary sheet.
- Implements retry logic with exponential backoff in case of scan failures.
- Monitors memory usage and ensures efficient garbage collection during processing.
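The retry behaviour listed above can be sketched roughly as follows. This is a minimal illustration of exponential backoff, not the code used in unlighthouse-gTracker.js: the names `withRetry` and `backoffDelay`, and the defaults of three attempts and a one-second base delay, are assumptions made for the example.

```javascript
// Hypothetical helper: retries an async task with exponential backoff.
// maxAttempts and baseDelayMs are illustrative defaults, not values from the script.
const { setTimeout: sleep } = require('node:timers/promises');

// Delay before the next attempt: baseDelayMs, then 2x, 4x, ...
function backoffDelay(attempt, baseDelayMs) {
  return baseDelayMs * 2 ** (attempt - 1);
}

async function withRetry(task, { maxAttempts = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // The task receives the attempt number so it can log or adapt.
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break; // no wait after the final failure
      const delay = backoffDelay(attempt, baseDelayMs);
      console.warn(`Attempt ${attempt} failed (${err.message}); retrying in ${delay} ms`);
      await sleep(delay);
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

A scan could then be wrapped as `withRetry(() => runScan(url))`, where `runScan` stands in for whatever function actually launches Unlighthouse.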
- Node.js: Ensure that Node.js is installed on your system. You can download it from the Node.js website (nodejs.org).
- Google Cloud Platform: Set up a project and enable the Google Sheets API and Google Drive API.
- OAuth 2.0 Credentials: Create OAuth 2.0 credentials and download the credentials.json file.
- Unlighthouse: Install Unlighthouse globally via npm:
npm install -g unlighthouse
- Dependencies: Install the required Node.js modules for unlighthouse-gTracker.js:
npm install axios googleapis js-yaml csv-parse yargs
The shell script (unlighthouse-gTracker.sh) relies on:
- yq: A command-line YAML processor. Install it using:
brew install yq
You can set up a cron job to run the unlighthouse-gTracker.sh script weekly (the example below runs at 02:00 every Monday):
0 2 * * 1 /path/to/unlighthouse-gTracker.sh >> /path/to/log/unlighthouse-gTracker.log 2>&1
The unlighthouse-gTracker.sh script runs the Unlighthouse scans for a specific day of the week and manages logging.
-d <day>: Specify the day of the week for which URLs should be processed (e.g., -d Monday). If no day is specified, it defaults to the current day.
./unlighthouse-gTracker.sh -d Monday
This will scan all the URLs scheduled for Monday, as defined in unlighthouse-sites.yml.
The unlighthouse-gTracker.js script is called by unlighthouse-gTracker.sh to process the results of each scan and upload them to Google Sheets.
--url <url>: Specify the URL to run the Unlighthouse scan for.
node unlighthouse-gTracker.js --url=https://example.com
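For illustration, the `--url` option could be read and validated along these lines. The dependency list names yargs, so the real script presumably uses that; this sketch instead uses the built-in `parseArgs` from `node:util` (available in Node.js 18.3 and later) to stay dependency-free. The function name `parseCliArgs` and the error messages are hypothetical.

```javascript
// Illustrative sketch only: uses node:util parseArgs rather than yargs.
const { parseArgs } = require('node:util');

// Hypothetical helper: extracts and validates --url from an argv array.
function parseCliArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: { url: { type: 'string' } },
  });
  if (!values.url) {
    throw new Error('Missing required option: --url <url>');
  }
  // The URL constructor throws a TypeError if the value is not a valid URL.
  const parsed = new URL(values.url);
  if (!['http:', 'https:'].includes(parsed.protocol)) {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return { url: parsed.href };
}
```

In a script this would typically be called as `parseCliArgs(process.argv.slice(2))`. Both `--url=https://example.com` and `--url https://example.com` are accepted.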
The script will:
- Run the Unlighthouse scan for the specified URL.
- Parse the resulting CSV file.
- Upload the parsed data to a Google Sheet.
- Append the data to a summary sheet, ensuring proper logging and memory management.
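The memory management mentioned in the last step might look something like the sketch below. It is an assumption about the general approach, not the script's actual code: `toMB`, `logMemory`, and `collectGarbage` are hypothetical names. `process.memoryUsage()` is a standard Node.js call, and `global.gc` is only defined when Node.js is started with the `--expose-gc` flag.

```javascript
// Illustrative only: report memory usage and trigger garbage collection
// when the runtime exposes it.

// Convert bytes to megabytes, rounded to two decimals.
function toMB(bytes) {
  return Math.round((bytes / 1024 / 1024) * 100) / 100;
}

// Log a labelled snapshot of the current process memory and return it.
function logMemory(label) {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const snapshot = {
    rss: toMB(rss),
    heapUsed: toMB(heapUsed),
    heapTotal: toMB(heapTotal),
  };
  console.log(
    `[memory] ${label}: rss=${snapshot.rss} MB, heapUsed=${snapshot.heapUsed} MB, heapTotal=${snapshot.heapTotal} MB`
  );
  return snapshot;
}

// Run a garbage collection pass if available; returns whether one was run.
function collectGarbage() {
  // global.gc exists only when node is launched with --expose-gc
  if (typeof global.gc === 'function') {
    global.gc();
    return true;
  }
  return false;
}
```

Calling `logMemory('before upload')`, then `collectGarbage()`, then `logMemory('after upload')` around a large step gives a rough view of how much memory that step retains.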
The unlighthouse-sites.yml file is used by unlighthouse-gTracker.sh to store site information, including the day of the week each site should be scanned. This file contains URLs, Google Sheets IDs, and other relevant metadata for each site.
An example YAML entry:
example-site:
  - url: https://example.com
    sheet_id: '1XyzABC123SheetID'
    start_date: 'Monday'
    max: 500
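To show how entries like this map to a day's scan list, here is a small sketch of the selection logic. The shell script does this with yq; the version below works on a plain object of the shape a YAML parser would be expected to return for the example entry. The second site, the function names `currentDayName` and `urlsForDay`, and the case-insensitive comparison are all assumptions made for the illustration.

```javascript
// Assumed parsed form of unlighthouse-sites.yml. The first entry mirrors the
// example above; 'other-site' and its values are hypothetical.
const sites = {
  'example-site': [
    { url: 'https://example.com', sheet_id: '1XyzABC123SheetID', start_date: 'Monday', max: 500 },
  ],
  'other-site': [
    { url: 'https://example.org', sheet_id: 'hypothetical-sheet-id', start_date: 'Tuesday', max: 100 },
  ],
};

// Full English weekday name for a date, e.g. "Monday".
function currentDayName(date = new Date()) {
  return date.toLocaleDateString('en-US', { weekday: 'long' });
}

// Return the URLs whose start_date matches the given day (defaults to today).
function urlsForDay(allSites, day = currentDayName()) {
  const wanted = day.toLowerCase();
  return Object.values(allSites)
    .flat()
    .filter((entry) => String(entry.start_date).toLowerCase() === wanted)
    .map((entry) => entry.url);
}

console.log(urlsForDay(sites, 'Monday')); // [ 'https://example.com' ]
```

Omitting the second argument selects the current day, which matches the default described for the `-d` option.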
- Add URLs: Update the unlighthouse-sites.yml file with URLs you want to scan and schedule them for specific days.
- Run the Shell Script: Execute unlighthouse-gTracker.sh (optionally through a cron job) to run the scheduled scans for the day.
- Process CSV Files: After the scans are complete, the unlighthouse-gTracker.js script will process the CSV files, upload them to Google Sheets, and update the summary.
Logs for the script are stored at /Users/mgifford/CA-Sitemap-Scans/unlighthouse-gTracker.log, containing details of the scan, including URLs processed, errors encountered, and memory usage.
This project is licensed under the GNU General Public License v3.0. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This README provides a clear overview of how the two scripts (unlighthouse-gTracker.sh and unlighthouse-gTracker.js) work together to automate the crawling, scanning, and reporting of websites into Google Sheets, along with instructions for setting up and running the scripts.