FIREWALL-cafe/great-firewall-notebooks: Exploring automated searches and scraping of Google and Baidu in support of the Firewall Cafe project.

These notebooks are prototypes, research, and sanity checks for the Firewall Cafe project.

Setup

Install these packages at a minimum:

Jupyter Notebooks (or the Anaconda stack)

Some of the notebooks also need:

Selenium
Google Cloud Translate
ipyplot

To run the notebooks that use these packages, set up credentials for Google Cloud Translation and download the Chrome webdriver that matches your installed version of Chrome.
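Before opening a notebook, it can help to check which of the optional packages are importable. The sketch below is one way to do that with the standard library. The import names (`selenium`, `google.cloud.translate`, `ipyplot`) are assumptions based on the package names listed above, and `missing_modules` is a hypothetical helper, not something shipped in this repository.

```python
import importlib.util

# Assumed import names for the optional packages listed in Setup.
OPTIONAL_MODULES = ["selenium", "google.cloud.translate", "ipyplot"]


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises when a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing optional modules:", missing_modules(OPTIONAL_MODULES))
```

An empty list means every optional import resolved. It does not confirm that credentials or the webdriver are configured.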

Prototyping a scraper

1_requests-google-baidu. Reverse-engineering search results.

2_using-google-cloud-translation. Getting some basic automatic translation with Google Translate.

3_compare-languages-Google. Comparing what search results look like in different languages on Google.

4_compare-languages-Baidu. Comparing what search results look like in different languages on Baidu.

5_querying-many-sensitive-words-archive. Testing rate limits to see if Google or Baidu have automatic ban-hammers at a certain rate.
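The rate-limit test in notebook 5 can be pictured as a loop that sends queries at a fixed pace and stops at the first refusal. This is a minimal sketch under stated assumptions, not the notebook's code: `probe_rate_limit` and `fetch` are hypothetical names, and treating HTTP 429 and 503 as "blocked" is an assumption about how a search engine might signal throttling.

```python
import time


def probe_rate_limit(fetch, queries, delay_seconds=1.0,
                     blocked_statuses=(429, 503), sleep=time.sleep):
    """Send queries one at a time and stop at the first blocked response.

    `fetch` is a hypothetical callable that takes a query string and returns
    an HTTP status code. Returns (successful query count, blocking status),
    where the status is None if no query was blocked.
    """
    succeeded = 0
    for query in queries:
        status = fetch(query)
        if status in blocked_statuses:
            return succeeded, status
        succeeded += 1
        sleep(delay_seconds)  # pause between requests to control the rate
    return succeeded, None


# Demo with a stand-in fetch that starts refusing after three requests.
responses = iter([200, 200, 200, 429, 200])
result = probe_rate_limit(lambda query: next(responses),
                          ["a", "b", "c", "d", "e"],
                          delay_seconds=0, sleep=lambda seconds: None)
print(result)  # (3, 429)
```

Passing `fetch` and `sleep` as arguments keeps the loop testable without touching the network. A real run would substitute a function that performs the search request.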

API integration

6_firewall-api. Testing Firewall Cafe API endpoints and demonstrating their use.

7_firewall-babelfish. Demonstrating how to use the Babelfish translate API (if you have a key).

8_image-hashing. Testing different image hashing algorithms.

9_wordpress-node-APIs. Looking at similarities between the old and new Firewall Cafe APIs.
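To give a feel for what notebook 8 explores, here is a sketch of one common perceptual hash, the average hash, written in plain Python. The README does not say which algorithms the notebook compares, so this is only an illustration. The function names and the tiny 2x2 "images" are made up for the example.

```python
def average_hash(pixels):
    """Compute an average hash from a 2D list of grayscale values (0-255).

    Each bit is 1 when the pixel is brighter than the mean of all pixels.
    Real use would first shrink the image to a small grid such as 8x8.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


image = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]  # same picture, uniformly brighter
inverted = [[245, 55], [35, 225]]  # light and dark areas swapped

print(hamming_distance(average_hash(image), average_hash(brighter)))  # 0
print(hamming_distance(average_hash(image), average_hash(inverted)))  # 4
```

A small Hamming distance suggests two images are visually similar even when their bytes differ, which is why hashes like this are useful for spotting duplicate search results.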

Migrations

10_transfer-images-http. A first attempt at getting 10k images from one place to another.

11_extract-images-postgres-dump. Extracting images from a PostgreSQL dump; never got it working.

Data integrity checks

12_data-integrity. Checking that search results are getting entered correctly into the API, and returning as expected when we ask for them.

13_clean-up-searches-API. Deleting searches that incorrectly stored far too many images.

14_wordpress-and-db-check. Taking a closer look at the WordPress API vs. the new API to see if there are discrepancies in image results (they all seem to match).
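A check like the one in notebook 14 comes down to comparing the image results each source returns for the same search. The sketch below assumes each source has already been reduced to a mapping from a search identifier to a list of image URLs. That shape, the sample data, and the name `compare_image_sets` are illustrative assumptions, not the real API response format.

```python
def compare_image_sets(old_results, new_results):
    """Compare image URLs per search between two sources.

    Both arguments map a search identifier (string) to a list of image URLs.
    Returns a dict of search id -> (only in old, only in new) for every
    search whose URL sets differ; order within a list is ignored.
    """
    discrepancies = {}
    for search_id in sorted(set(old_results) | set(new_results)):
        old_urls = set(old_results.get(search_id, []))
        new_urls = set(new_results.get(search_id, []))
        if old_urls != new_urls:
            discrepancies[search_id] = (sorted(old_urls - new_urls),
                                        sorted(new_urls - old_urls))
    return discrepancies


# Made-up sample data: "s1" matches apart from order, "s2" has an extra image.
old = {"s1": ["a.jpg", "b.jpg"], "s2": ["c.jpg"]}
new = {"s1": ["b.jpg", "a.jpg"], "s2": ["c.jpg", "d.jpg"]}

print(compare_image_sets(old, new))  # {'s2': ([], ['d.jpg'])}
```

An empty result corresponds to the "they all seem to match" outcome noted above.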

Folders and files

History: 25 commits

thumbnails
1_requests-google-baidu.ipynb
2_using-google-cloud-translation.ipynb
3_compare-languages-Google.ipynb
4_compare-languages-Baidu.ipynb
5_querying-many-sensitive-words-archive.ipynb
6_firewall-api.ipynb
7_firewall-babelfish.ipynb
8_image-hashing.ipynb
9_wordpress-node-APIs.ipynb
10_transfer-images-http.ipynb
11_extract-images-postgres-dump.ipynb
12_data-integrity.ipynb
13_clean-up-searches-API.ipynb
14_wordpress-and-db-check.ipynb
15_multithreaded-prototype.ipynb
16_google-large-images.ipynb
README.md
translate.py
url_to_hash_1618157001.json
