This project translates Reddit API responses into a PL/pgSQL script which loads the data into a Lemmy database.

In other words, it takes Reddit posts/comments and puts them into Lemmy.

Screenshots

Here's an example of a backup of the now-banned r/GenZhou up and running on a Lemmy test instance:

[Screenshots: community view and post view]

Getting input data

To get the JSON API response for a single post, you can call the official Reddit API (requires an API key), or just append .json to the comments URL, like this:

HTML: https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/
      https://www.reddit.com/r/GenZedong/comments/laucjl

JSON: https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/.json?limit=10000
      https://www.reddit.com/r/GenZedong/comments/laucjl.json?limit=10000

Note the added limit parameter: without it, Reddit aggressively prunes the comment tree and replaces the missing branches with "Load more comments" links.
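
If you are building these URLs in a script rather than by hand, the rewrite can be sketched as below. This is only an illustration of the URL pattern shown above; the function name to_json_url is made up for this example and is not part of RedditLemmyImporter.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def to_json_url(comments_url: str, limit: int = 10000) -> str:
    """Turn a Reddit comments-page URL into its .json equivalent.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(comments_url)
    # Append ".json" to the path unchanged; both the short form
    # (".../laucjl") and the long form (".../china_usa/") appear above.
    path = parts.path + ".json"
    query = urlencode({"limit": limit})
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


print(to_json_url("https://www.reddit.com/r/GenZedong/comments/laucjl"))
print(to_json_url("https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/"))
```

The two printed URLs should match the JSON examples above.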

The response object contains the data for that one post and any replies. You can feed this directly into RedditLemmyImporter. However, if you want to import multiple posts, you can put multiple responses in the same input file, with each one separated by a newline. For example:

~ $ cat urls
https://www.reddit.com/r/GenZedong/comments/tpyft9/why_is_like_half_this_sub_made_of_trans_women/
https://www.reddit.com/r/GenZedong/comments/pet8zc/therapist_trans_stalin_isnt_real_she_cant_hurt/
https://www.reddit.com/r/GenZedong/comments/ttcyok/happy_trans_visibility_day_comrades/
https://www.reddit.com/r/GenZedong/comments/t9kbdm/women_of_genzedong_i_congratulate_you_for_your_day/
~ $ xargs -I URL curl --silent --user-agent "Subreddit archiver" --cookie "REDACTED" "URL.json?limit=10000" < urls > dump.json
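
If you already have the responses in memory (for example, from your own fetching code), the one-response-per-line layout can be produced with a few lines of Python. This is a minimal sketch: write_dump and fake_responses are hypothetical names, and the sample data is a stand-in rather than real Reddit output.

```python
import io
import json
from typing import Iterable, TextIO


def write_dump(responses: Iterable[object], out: TextIO) -> int:
    """Write each response on its own line; return how many were written."""
    count = 0
    for response in responses:
        # json.dumps without an indent emits no raw newlines, so each
        # response occupies exactly one line of the output.
        out.write(json.dumps(response) + "\n")
        count += 1
    return count


# Hypothetical stand-ins for two parsed API responses (not real data).
fake_responses = [
    [{"kind": "Listing", "data": {"children": []}}],
    [{"kind": "Listing", "data": {"children": []}}],
]

buffer = io.StringIO()
written = write_dump(fake_responses, buffer)
lines = buffer.getvalue().splitlines()
print(written, len(lines))
```

To write a real file, pass an open file handle (for instance one opened on dump.json) instead of the in-memory buffer.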

Cloning an entire subreddit

If you need a complete scraping solution, check out this Python script. It pulls posts into a local MongoDB database, which means you can run it as a cron job to keep a local clone of posts as they're made. To export the collection to a dump file, try something like this:

mongoexport --uri="mongodb://localhost:27017/subredditArchiveDB" --collection=GenZedong --out=dump-wrapped.json

/r/GenZhou was scraped by @[email protected] using this method. Data is available up to about a week before it was banned:
https://mega.nz/file/knBwmTJL#PpqO0I3Jv-xw-o7RBWSi0JSScjSV7-4Eb3JR5HzTc5w

Note that the script buries the data we need within a top-level property named json. RedditLemmyImporter can handle this directly using the --json-pointer option. For example:

java -jar redditLemmyImporter-0.3.jar -c genzhouarchive -u archive_bot -o import.sql --json-pointer=/json GenZhouArchive.json
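
To see what a pointer such as /json selects, here is a minimal sketch of RFC 6901 resolution in Python. It illustrates the JSON Pointer idea only and is not RedditLemmyImporter's own implementation; resolve_pointer and the sample line are hypothetical.

```python
import json


def resolve_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against a parsed JSON value."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape in the order RFC 6901 requires: "~1" first, then "~0".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Hypothetical wrapped line: the API response sits under the "json" key.
wrapped_line = '{"_id": "abc", "json": [{"kind": "Listing"}, {"kind": "Listing"}]}'
response = resolve_pointer(json.loads(wrapped_line), "/json")
print(response[0]["kind"])
```

With --json-pointer=/json, each input line is expected to be a wrapper object like the one above, and the pointer names the property that holds the Reddit API response.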

Generating a SQL script using the release binary

Prerequisites: Java 8 or above

Download the jar file from the releases page and run it:

java -jar redditLemmyImporter-0.3.jar -c genzhouarchive -u archive_bot -o import.sql dump.json

In this case we're generating a PL/pgSQL script that will load the data from dump.json into the comm genzhouarchive under the user archive_bot. The script will be written to import.sql. Full command usage:

Usage: redditLemmyImporter [OPTIONS] dump
      dump                   Path to the JSON dump file from the Reddit API. Required.
                             Specify - to read from stdin.
  -c, --comm=name            Target community name. Required.
  -u, --user=name            Target user name. Required.
      --json-pointer=pointer Locate the Reddit API response somewhere within the top-level object in each input line.
                             See RFC 6901 for the JSON Pointer specification.
  -o, --output-file=file     Output file. Prints to stdout if this option isn't specified.
  -h, --help                 Show this help message and exit.
  -V, --version              Print version information and exit.

Generating a SQL script using the source repository

Prerequisites: JDK >=1.8, Maven 3.

Clone the repo and cd to the source tree. Run:

mvn compile
mvn exec:java -Dexec.args="-c genzhouarchive -u archive_bot -o import.sql path/to/dump.json"

(This will pull down dependencies from Maven Central so you must be connected to the internet during the compile step.)

You could also package a release and then follow the instructions from the previous section:

mvn clean package
java -jar target/redditLemmyImporter-0.3-SNAPSHOT.jar -c genzhouarchive -u archive_bot -o import.sql dump.json

Running the SQL script

Copy import.sql to the server running Postgres and run this:

psql --dbname=lemmy --username=lemmy --file=import.sql

Note that this uses the default values for the database name and database username. If you've changed them in your Lemmy configuration then update the values accordingly.

The target comm and target user must already exist in your Lemmy instance or the SQL script will do nothing.

Running the SQL script with Dockerized Lemmy

Copy import.sql to the server running Docker and run this:

<import.sql docker exec -i $(docker ps -qf name=postgres) psql --dbname=lemmy --username=lemmy -

The target comm and target user must already exist in your Lemmy instance or the SQL script will do nothing.