Unlighthouse Site Scanning and Google Sheets Reporting System

Overview

This system automates crawling websites, generating Lighthouse-based site reports with the Unlighthouse tool, and uploading the CSV results to Google Sheets. It consists of two main components:

  1. unlighthouse-gTracker.sh: A shell script that runs Unlighthouse scans for the URLs scheduled in a YAML configuration file for a given day of the week and logs the output.

  2. unlighthouse-gTracker.js: A Node.js script that processes the CSV results from Unlighthouse scans, uploads the data to Google Sheets, and maintains the summary sheet. It handles Google Sheets authentication, retries failed scans, and monitors memory usage during processing.

Together, these scripts allow for an automated, scheduled website scanning and reporting workflow.

Features

unlighthouse-gTracker.sh

  • Extracts URLs from a YAML configuration file (unlighthouse-sites.yml) based on the current day of the week or a specified day.
  • Runs Unlighthouse scans for each URL and logs the results.
  • Closes Chrome Canary and Chrome Helper processes after each scan to prevent resource exhaustion.
  • Forces garbage collection after each scan to manage memory efficiently (the cleanup step is sketched after this list).
  • Logs details such as the Node.js version, start/end times, and URLs scanned.
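
The process cleanup is handled inside the shell script itself, but the idea can be sketched in Node.js as follows. This is a minimal sketch: the process names, the use of pkill, and the --expose-gc flag are illustrative assumptions, not the script's exact commands.

// cleanup-sketch.js - illustrative only; the real cleanup lives in unlighthouse-gTracker.sh.
const { execSync } = require('child_process');

function cleanupAfterScan() {
  // Kill lingering browser processes so repeated scans do not exhaust resources.
  for (const name of ['Google Chrome Canary', 'Chrome Helper']) {
    try {
      execSync(`pkill -f "${name}"`);
    } catch (err) {
      // pkill exits non-zero when no matching process exists; ignore that case.
    }
  }
  // global.gc is only available when Node is started with --expose-gc.
  if (global.gc) {
    global.gc();
    const heapMb = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
    console.log(`Forced garbage collection, heap used: ${heapMb} MB`);
  }
}

cleanupAfterScan();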

unlighthouse-gTracker.js

  • Fetches and parses CSV files generated by the Unlighthouse scans.
  • Authenticates with Google Sheets API and uploads the scan data to a newly created or existing Google Sheet.
  • Handles dynamic creation of new Google Sheets, ensuring unique sheet names.
  • Appends metadata, such as the current date and URL, to a summary sheet.
  • Implements retry logic with exponential backoff in case of scan failures (sketched after this list).
  • Monitors memory usage and ensures efficient garbage collection during processing.
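
The retry behaviour can be pictured with the following minimal sketch. It assumes a generic scanFn callback that performs one scan attempt; the retry count and delay values are illustrative and may differ from what unlighthouse-gTracker.js actually uses.

// retry-sketch.js - exponential backoff around a single scan attempt.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithRetry(scanFn, maxRetries = 3, baseDelayMs = 5000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      // scanFn is whatever runs Unlighthouse for one URL (hypothetical here).
      return await scanFn();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1); // 5s, 10s, 20s, ...
      console.warn(`Scan failed (attempt ${attempt} of ${maxRetries}), retrying in ${delayMs / 1000}s`);
      await sleep(delayMs);
    }
  }
}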

Installation

Prerequisites

  1. Node.js: Ensure that Node.js is installed on your system. You can download it from https://nodejs.org/.
  2. Google Cloud Platform: Set up a project and enable the Google Sheets API and Google Drive API.
  3. OAuth 2.0 Credentials: Create OAuth 2.0 credentials and download the credentials.json file (see the authentication sketch after this list).
  4. Unlighthouse: Install Unlighthouse globally via npm:
    npm install -g unlighthouse
  5. Dependencies: Install the required Node.js modules for unlighthouse-gTracker.js:
    npm install axios googleapis js-yaml csv-parse yargs
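
Once credentials.json is in place, authenticating with the Sheets and Drive APIs usually follows the standard googleapis pattern below. This is a sketch only: the "installed" key layout and the token.json file come from Google's desktop-app OAuth flow and are assumptions about how the script is wired up.

// auth-sketch.js - build authenticated Sheets and Drive clients.
const fs = require('fs');
const { google } = require('googleapis');

// OAuth 2.0 client downloaded from Google Cloud Console; desktop-app
// credentials are typically nested under an "installed" key.
const credentials = JSON.parse(fs.readFileSync('credentials.json', 'utf8'));
const { client_id, client_secret, redirect_uris } = credentials.installed;
const auth = new google.auth.OAuth2(client_id, client_secret, redirect_uris[0]);

// token.json is assumed to hold a previously authorized access/refresh token.
auth.setCredentials(JSON.parse(fs.readFileSync('token.json', 'utf8')));

const sheets = google.sheets({ version: 'v4', auth });
const drive = google.drive({ version: 'v3', auth });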

Shell Script Dependencies

The shell script (unlighthouse-gTracker.sh) relies on:

  • yq: A command-line YAML processor. Install it using:
    brew install yq

Cron Job (Optional)

You can set up a cron job to run the unlighthouse-gTracker.sh script weekly:

0 2 * * 1 /path/to/unlighthouse-gTracker.sh >> /path/to/log/unlighthouse-gTracker.log 2>&1

Usage

unlighthouse-gTracker.sh

This script runs the Unlighthouse scans for a specific day of the week and manages logging.

Command-Line Arguments:

  • -d <day>: Specify the day of the week for which URLs should be processed (e.g., -d Monday). If no day is specified, it defaults to the current day.

Example:

./unlighthouse-gTracker.sh -d Monday

This will scan all the URLs scheduled for Monday, as defined in unlighthouse-sites.yml.

unlighthouse-gTracker.js

This script is called by unlighthouse-gTracker.sh to process the results of each scan and upload them to Google Sheets.

Command-Line Arguments:

  • --url <url>: Specify the URL to run the Unlighthouse scan for.

Example:

node unlighthouse-gTracker.js --url=https://example.com

The script will:

  1. Run the Unlighthouse scan for the specified URL.
  2. Parse the resulting CSV file.
  3. Upload the parsed data to a Google Sheet.
  4. Append the data to a summary sheet, ensuring proper logging and memory management (steps 2-4 are sketched below).
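
A minimal sketch of steps 2-4, assuming an authenticated sheets client (see the authentication sketch above). The CSV path, tab names, and spreadsheet ID are placeholders rather than the script's actual values.

// upload-sketch.js - parse an Unlighthouse CSV report and append it to a sheet.
const fs = require('fs');
const { parse } = require('csv-parse/sync');

async function uploadCsv(sheets, spreadsheetId, csvPath, scannedUrl) {
  // Step 2: parse the CSV report into an array of rows.
  const rows = parse(fs.readFileSync(csvPath, 'utf8'), { skip_empty_lines: true });

  // Step 3: append the rows to a tab in the target spreadsheet.
  await sheets.spreadsheets.values.append({
    spreadsheetId,
    range: 'Sheet1!A1',
    valueInputOption: 'RAW',
    requestBody: { values: rows },
  });

  // Step 4: add one summary row with the scan date and URL.
  await sheets.spreadsheets.values.append({
    spreadsheetId,
    range: 'Summary!A1',
    valueInputOption: 'RAW',
    requestBody: { values: [[new Date().toISOString().slice(0, 10), scannedUrl]] },
  });
}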

YAML Configuration (unlighthouse-sites.yml)

The unlighthouse-sites.yml file is used by unlighthouse-gTracker.sh to store site information, including the day of the week each site should be scanned. This file contains URLs, Google Sheets IDs, and other relevant metadata for each site.

An example YAML entry:

example-site:
  - url: https://example.com
    sheet_id: '1XyzABC123SheetID'
    start_date: 'Monday'
    max: 500
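
The shell script filters this file with yq, but the same day-based selection can be expressed in Node.js with js-yaml (already listed as a dependency). The sketch below assumes the field names shown in the example entry above.

// schedule-sketch.js - list the sites scheduled for a given day.
const fs = require('fs');
const yaml = require('js-yaml');

function sitesForDay(configPath, day) {
  const config = yaml.load(fs.readFileSync(configPath, 'utf8'));
  const due = [];
  for (const [name, entries] of Object.entries(config)) {
    for (const entry of entries) {
      if (entry.start_date === day) {
        due.push({ name, url: entry.url, sheetId: entry.sheet_id, max: entry.max });
      }
    }
  }
  return due;
}

console.log(sitesForDay('unlighthouse-sites.yml', 'Monday'));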

Workflow

  1. Add URLs: Update the unlighthouse-sites.yml file with URLs you want to scan and schedule them for specific days.
  2. Run the Shell Script: Execute unlighthouse-gTracker.sh (optionally through a cron job) to run the scheduled scans for the day.
  3. Process CSV Files: After the scans are complete, the unlighthouse-gTracker.js script will process the CSV files, upload them to Google Sheets, and update the summary.

Logs

Logs for the script are stored at /Users/mgifford/CA-Sitemap-Scans/unlighthouse-gTracker.log and record the URLs processed, errors encountered, and memory usage for each run.

License

This project is licensed under the GNU General Public License v3.0. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.


This README describes how the two scripts (unlighthouse-gTracker.sh and unlighthouse-gTracker.js) work together to automate the crawling, scanning, and reporting of websites into Google Sheets, along with instructions for setting up and running them.