The U.S. Army Corps of Engineers (USACE) evaluates permit applications for any work, including construction and dredging, in the Nation’s navigable waters. The Wetlands Impact Tracker compiles public notices of those permit applications from USACE, which are typically published as PDFs. The data extracted from these notices can help users better understand the impact of development projects on sensitive areas by surfacing and summarizing individual notices and aggregating data across them. We encourage you to use this tool to explore how development projects are affecting the communities you work with and live in.
All data powering the dashboard is stored in an AWS S3 bucket with public read access. The following CSVs are available (a minimal read example follows the list):
- aws_df.csv
- character_df.csv
- fulltext_df.csv
- geocoded_df.csv
- location_df.csv
- main_df.csv
- manager_df.csv
- mitigation_df.csv
- wetland_final_df.csv
- summary_df.csv
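Because the bucket allows public reads, any of these CSVs can be loaded directly with pandas. The sketch below is illustrative only; the bucket name and key prefix are placeholders you would need to replace with the actual public URL:

```python
import pandas as pd

# Placeholder URL -- substitute the project's public S3 bucket name and key.
MAIN_DF_URL = "https://<your-bucket-name>.s3.amazonaws.com/dashboard-data/main_df.csv"

# pandas can read a CSV directly from a public HTTPS URL.
main_df = pd.read_csv(MAIN_DF_URL)
print(main_df.head())
```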
- Install Python: If you don't have Python installed, download and install it from the official Python website.
- Clone or Download the Repository:
- If Git is installed, clone the repository using the following command in Git Bash on Windows or terminal in other systems:
```bash
cd the-path-you-would-like-to-hold-the-repository
git clone https://github.com/AtlasPublicPolicy/wetlands-tracker.git
```
- If you don't have Git installed, you can download the repository as a ZIP file from the GitHub page. Click on the "Code" button and select "Download ZIP." After downloading, extract the ZIP file to the directory of your choice.
- Set Up and Activate a Virtual Environment in PowerShell or Command Prompt on Windows, or in a terminal on other systems:
- Create a new virtual environment:
```bash
# Navigate to the project directory:
cd the-path-you-hold-the-repository
# Create a virtual environment:
virtualenv venv
# or
python -m venv venv
```
- Activate the virtual environment:
Windows:
```bash
.\venv\Scripts\activate
```
macOS and Linux:
```bash
source venv/bin/activate
```
- Set up an AWS S3 bucket with two folders (a boto3 sketch for creating these prefixes follows the note below):
- A folder for the scraped data: dashboard-data
- A folder for notice PDFs: full-pdf
NOTE: If you do not want to use an AWS S3 bucket, you can store data locally instead: uncomment the 8th parameter in the configuration, `self.directory = "data_schema/"`, and the line in `main()` that exports tables to your directory, `[main_extractor.dataframe_to_csv(main_tbls[df_name], df_name, config.directory) for df_name in main_tbls]`.
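If you do use S3, the two folders are simply key prefixes. A minimal sketch of creating them with boto3, assuming your AWS credentials are already configured and `your-wetlands-tracker-bucket` is a placeholder bucket name:

```python
import boto3

# Placeholder bucket name -- replace with the bucket you created for this project.
BUCKET_NAME = "your-wetlands-tracker-bucket"

s3 = boto3.client("s3")

# S3 has no real folders; zero-byte objects whose keys end in "/" make the
# prefixes appear as folders in the S3 console.
for prefix in ("dashboard-data/", "full-pdf/"):
    s3.put_object(Bucket=BUCKET_NAME, Key=prefix)
```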
In an active virtual environment:
- Set up the configuration in main.py:
- Create a file named "api_key.env" and provide the following keys (a loading sketch follows the list):
- AZURE_API_KEY=your_azure_api_key
- AZURE_ENDPOINT=https://your-azure-endpoint.com
- AWS_ACCESS_KEY_ID=your_aws_access_key_id
- AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
- OPENAI_API_KEY=your_openai_api_key
- REDIVIS_API_KEY=your_redivis_api_key
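As an illustration only (assuming the keys are loaded with python-dotenv, which may differ from the repository's actual implementation), the values in api_key.env can be read like this:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

# Load the key-value pairs from api_key.env into environment variables.
load_dotenv("api_key.env")

azure_api_key = os.getenv("AZURE_API_KEY")
azure_endpoint = os.getenv("AZURE_ENDPOINT")
openai_api_key = os.getenv("OPENAI_API_KEY")
```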
- Modify the following parameters as needed (an illustrative sketch of these settings follows the list):
- update: Whether to scrape all historical notices or only recently updated ones. 1 = update (recently updated notices only); 0 = first-time scraping (all historical notices); default is 1. Note: Be cautious when scraping all historical notices, as the run may take an extended period and incur high costs for Azure and LLM services.
- n_days: How many days in the past to search for updated notices; a numeric value from 0 to 500; default is 14.
- max_notices: The maximum number of notices (sorted by date) to download.
- district: Which district to scrape: "New Orleans", "Galveston", "Jacksonville", "Mobile", or "all"; default is "all".
- tbl_to_upload: Which tables to upload to Redivis. Any of the tables in ["main", "manager", "character", "mitigation", "location", "fulltext", "summary", "wetland", "embed", "validation", "aws", "geocoded"], "none", or "all"; default is "all".
- price_cap: Price cap (in $) for Azure summarization; default is 5.
- n_sentences: How many sentences the summary should contain; default is 4.
- directory: File directory; default is "data_schema/".
- overwrite_redivis: Overwrite a file with the same name on Redivis; 1 = yes, 0 = no; default is no.
- skipPaid: Skip paid services, including OpenAI and Azure summaries; 1 = skip, 0 = do not skip; default is 0.
- tesseract_path: If you have problems running OCR (Optical Character Recognition), specify the path to tesseract.exe, such as "C:/Program Files/Tesseract-OCR/tesseract.exe"; default is None.
- GPT_MODEL_SET: The GPT model to use; default is "gpt-3.5-turbo-0613".
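For orientation only, the parameters above might be collected in main.py roughly as sketched below; the actual attribute names, defaults, and structure in the repository may differ:

```python
# Illustrative only -- the real configuration object in main.py may be structured differently.
class Config:
    def __init__(self):
        self.update = 1                    # 1 = recently updated notices only; 0 = all historical notices
        self.n_days = 14                   # look back this many days (0-500)
        self.max_notices = 100             # placeholder cap on notices to download
        self.district = "all"              # "New Orleans", "Galveston", "Jacksonville", "Mobile", or "all"
        self.tbl_to_upload = "all"         # which tables to push to Redivis
        self.price_cap = 5                 # $ cap for Azure summarization
        self.n_sentences = 4               # sentences per summary
        self.directory = "data_schema/"    # local output directory (see the S3 note above)
        self.overwrite_redivis = 0         # 1 = overwrite files with the same name on Redivis
        self.skipPaid = 0                  # 1 = skip OpenAI and Azure summaries
        self.tesseract_path = None         # e.g. "C:/Program Files/Tesseract-OCR/tesseract.exe"
        self.GPT_MODEL_SET = "gpt-3.5-turbo-0613"

config = Config()
```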
- Run main.py in the virtual environment:
```bash
(venv) $ python main.py
```
- log.txt: You can find messages, warnings, and errors here.
- error_report.md: This file captures potential problems with the PDF reading process, special notices, regex patterns, and LLM performance; these issues do not stop the run and are not reported in log.txt.
Users are encouraged to report issues directly in the GitHub repository. We plan to maintain this repository at least through 2024. While we welcome pull requests, we cannot guarantee that they will be reviewed or accepted in a timely manner.
Please reach out to us at [email protected] with any questions or comments.