Discard2

Discard2 is a high-fidelity archival tool for the Discord chat platform. It supports downloading channels, threads, servers, and DMs via a command-line interface.

Warning: Discard2 is experimental software and there are known issues. The authors are not responsible for possible negative consequences of using this program, including account suspension and data loss.

Overview

Discard2 is written in TypeScript using Node.js. It consists of two main components: the crawler and the reader. The crawler connects to the Discord servers and downloads the requested data into a specified directory in a format suitable for archival, using a capture tool to save the client-server traffic. The reader reads the data from the archive and converts it to other usable formats.
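
In practice a job therefore has two steps: a crawl that writes an archive directory, followed by a read that converts it. A rough sketch of the workflow (the placeholders and job directory name are illustrative; see Usage below for the real commands):

npm run start -- crawler channel <server-id> <channel-id> -c mitmdump --headless
npm run --silent start -- reader -f raw-jsonl out/<job-directory> > out.jsonl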

Capture tools

Discard2 supports the following capture tools:

  • none - a dummy capture tool which does not save any data (useful only for verifying functionality)
  • mitmdump (mitmproxy) - captures HTTP and WebSocket traffic using a proxy
  • tshark (Wireshark) - captures all traffic using packet capture; not recommended (see below)

While tshark creates higher-fidelity archives, a bug in Wireshark currently makes it impossible to reliably recover data from the packet capture. The mitmdump capture tool is therefore recommended for now.

Setup

To ensure a consistent environment, it is recommended to run Discard2 as a container in Docker or Podman. The following command builds the discard2 container image:

docker build -f Containerfile -t discard2 --target run .

By default, Discard2 creates a new directory for each job in out/.

Alternatively, Discard2 can be set up directly on a Linux system once all dependencies are installed: Node.js, Python, mitmproxy (in the bin/ directory), and the Python packages brotli and mitmproxy. Please reference the Containerfile for how these dependencies are installed on Fedora Linux.
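
For orientation, a manual setup on Fedora might look roughly like the following sketch; the package names and the placement of mitmdump are assumptions, and the Containerfile remains the authoritative reference:

# Illustrative sketch only - consult the Containerfile for the real steps
sudo dnf install nodejs python3 python3-pip
pip install brotli mitmproxy
# the mitmdump binary is expected in the bin/ directory
npm install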

Usage

To use Discard2 in Docker, please prefix all commands with the following line:

docker run --env-file=.env -v $PWD/out:/app/out:Z,U -it

You may replace $PWD/out with an output directory of your choosing.
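
For example, assuming the discard2 image tag from the build step above and that the Discard2 arguments are appended after the image name, the login test from the Crawler section below would look something like:

docker run --env-file=.env -v $PWD/out:/app/out:Z,U -it discard2 crawler profile --capture-tool none --headless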

Crawler

To operate, Discard2's crawler needs to be provided with a user account. Please create a .env file with the following contents:

DISCORD_EMAIL=
DISCORD_PASSWORD=

and fill in the email and password for the account you wish to use. It is currently not recommended to use your primary account: using any unofficial tool may result in account termination, and Discard2 has not yet gone through enough testing.

First, it is recommended to test logging in using the following command:

npm run start -- crawler profile --capture-tool none --headless  

Discard2's crawler supports performing a variety of tasks. For example, to download all messages from channel ID 954365219411460138 in server ID 954365197735317514 sent between 2022-01-01 and 2022-03-18, you would use:

npm run start -- crawler channel 954365197735317514 954365219411460138 --after 2022-01-01 --before 2022-03-18 -c mitmdump --headless

Note that Discord's date search is exclusive, so --after 2022-01-01 only downloads messages beginning with 2022-01-02.
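
For example, to also capture messages sent on 2022-01-01 itself, move the lower bound back one day:

npm run start -- crawler channel 954365197735317514 954365219411460138 --after 2021-12-31 --before 2022-03-18 -c mitmdump --headless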

Full usage of the crawler is available under crawler --help:

Usage: discard2 crawler [options] [command]

Start or resume a crawling job

Options:
  -h, --help                                             display help for command

Commands:
  profile [options]                                      Log in and fetch profile information
  dm [options] <dm-id>                                   Download a single DM
  servers [options]                                      Download all servers
  server [options] <server-id>                           Download a single server
  channel [options] <server-id> <channel-id>             Download a single channel
  thread [options] <server-id> <channel-id> <thread-id>  Download a single thread
  resume [options] <path>                                Resume an interrupted job

Note: To use the tshark capture tool with Docker, you may have to add --cap-add=NET_RAW --cap-add=NET_ADMIN to your Docker command. This is not necessary with Podman.

Warning: When you use the tshark capture tool outside of a container, all (possibly sensitive) traffic on your system gets saved. Only use this capture tool without a container for testing purposes, never publish the resulting captures.

Reader

To convert captures into JSONL suitable for further processing, use:

npm run --silent start -- reader -f raw-jsonl $JOB_DIRECTORY > out.jsonl
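
Each line of the output is a self-contained JSON object, so standard tools such as jq can take it from there; for instance, to pretty-print the first record and see which fields are available:

head -n 1 out.jsonl | jq .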

To import the data into a running Elasticsearch instance, the following command should do the trick:

npm run --silent start -- reader -f elasticsearch $JOB_DIRECTORY | curl --cacert $ELASTICSEARCH_CRT -u elastic:$ELASTICSEARCH_PASS -s -H "Content-Type: application/x-ndjson" -XPOST https://$ELASTICSEARCH_HOST/_bulk --data-binary @-; echo

The currently supported output formats are:

  • raw-print - plain text overview of requests and responses
  • raw-jsonl - machine-readable JSON lines with full request and response data
  • print - plain text log of messages (suitable for grep; see the examples below)
  • elasticsearch - message data in a format for import to an Elasticsearch index
  • derive-urls - URLs of images and attachments for archival by other tools (see the examples below)
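
For instance, the print and derive-urls formats combine naturally with standard Unix tools; the grep pattern and the choice of downloader below are illustrative:

npm run --silent start -- reader -f print $JOB_DIRECTORY | grep "hello"
npm run --silent start -- reader -f derive-urls $JOB_DIRECTORY | wget --input-file=-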

Running tests

npm t

Or in Docker/Podman:

docker build -f Containerfile -t discard2-test --target test .
docker run --cap-add=NET_RAW --cap-add=NET_ADMIN discard2-test

FAQ

Q: Why is the account password included in the job state file?

A: Because the password is also included in the capture itself and cannot be removed unless you derive the capture. Keeping it in the job state file makes its presence more obvious.

Q: How do I get IDs of various objects (channels, servers...)?

A: Open the settings of your Discord client and enable "Developer Mode". Then you can right-click objects such as channels and servers and copy their IDs.

Q: How do I track memory and CPU usage of the browser used during crawling?

A: You can use the pidusage.jsonl file in the job directory. To extract just the memory values (e.g. to plot with other tools), use jq .browser.memory pidusage.jsonl.
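
For example, to turn the memory samples into a numbered two-column CSV for plotting, assuming one JSON object per line:

jq -r .browser.memory pidusage.jsonl | nl -w1 -s, > memory.csv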

Known issues

  • Processing threads is slow and not incremental
  • No attempt is made to fetch all emoji in a server