Merge pull request #91 from tikazyq/develop
Develop
Marvin Zhang authored Jul 31, 2019
2 parents 3f872dc + e47769c commit a12a095
Showing 11 changed files with 187 additions and 172 deletions.
55 changes: 44 additions & 11 deletions README-zh.md
@@ -8,6 +8,8 @@

Chinese | [English](https://github.com/tikazyq/crawlab)

[Installation](#安装) | [Run](#运行) | [Screenshots](#截图) | [Architecture](#架构) | [Integration](#与其他框架的集成) | [Comparison](#与其他框架比较) | [Related Articles](#相关文章) | [Community & Sponsorship](#社区--赞助)

Golang-based distributed web crawler management platform, supporting multiple programming languages including Python, NodeJS, Go, Java and PHP, as well as various web crawler frameworks.

[View Demo](http://114.67.75.98:8080) | [Documentation](https://tikazyq.github.io/crawlab-docs)
@@ -48,7 +50,38 @@ docker run -d --rm --name crawlab \
tikazyq/crawlab:0.3.0
```

Of course, you can also use `docker-compose` for one-click startup, without even having to configure the MongoDB and Redis databases.
Of course, you can also use `docker-compose` for one-click startup, without even having to configure the MongoDB and Redis databases. Create a `docker-compose.yml` file in the current directory and enter the content below.

```yaml
version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: crawlab-master
    environment:
      CRAWLAB_API_ADDRESS: "192.168.99.100:8000"
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis:6379"
    ports:
      - "8080:8080" # frontend
      - "8000:8000" # backend
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"
```

Then execute the command below, and the Crawlab Master Node + MongoDB + Redis will start up. Open `http://localhost:8080` to see the UI.

```bash
docker-compose up
```

@@ -64,43 +97,43 @@ For details of Docker deployment, please refer to the [documentation](https://tikazyq.github.io/crawlab/Installation/Docker.md).

#### Login

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/login.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/login.png?v0.3.0)

#### Home Page

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/home.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/home.png?v0.3.0)

#### Node List

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/node-list.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/node-list.png?v0.3.0)

#### Node Network

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/node-network.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/node-network.png?v0.3.0)

#### Spider List

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-list.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-list.png?v0.3.0)

#### Spider Overview

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-overview.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-overview.png?v0.3.0)

#### Spider Analytics

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-analytics.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-analytics.png?v0.3.0)

#### Spider Files

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-file.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-file.png?v0.3.0)

#### Task Detail - Results

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/task-results.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/task-results.png?v0.3.0_1)

#### Cron Job

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/schedule.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/schedule.png?v0.3.0)

## Architecture

164 changes: 126 additions & 38 deletions README.md
@@ -1,91 +1,179 @@
# Crawlab

![](http://114.67.75.98:8081/buildStatus/icon?job=crawlab%2Fdevelop)
![](https://img.shields.io/badge/version-v0.2.3-blue.svg)
![](https://img.shields.io/badge/version-v0.3.0-blue.svg)
<a href="https://github.com/tikazyq/crawlab/blob/master/LICENSE" target="_blank">
<img src="https://img.shields.io/badge/license-BSD-blue.svg">
</a>

[中文](https://github.com/tikazyq/crawlab/blob/master/README-zh.md) | English

Celery-based web crawler admin platform for managing distributed web spiders regardless of languages and frameworks.
[Installation](#installation) | [Run](#run) | [Screenshot](#screenshot) | [Architecture](#architecture) | [Integration](#integration-with-other-frameworks) | [Compare](#comparison-with-other-frameworks) | [Community & Sponsorship](#community--sponsorship)

Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, PHP and various web crawler frameworks including Scrapy, Puppeteer, Selenium.

[Demo](http://114.67.75.98:8080) | [Documentation](https://tikazyq.github.io/crawlab-docs)

## Pre-requisite
- Go 1.12+
- Node.js 8.12+
## Installation

Two methods:
1. [Docker](https://tikazyq.github.io/crawlab/Installation/Docker.md) (Recommended)
2. [Direct Deploy](https://tikazyq.github.io/crawlab/Installation/Direct.md) (for understanding the internals)

### Pre-requisite (Docker)
- Docker 18.03+
- Redis
- MongoDB 3.6+

### Pre-requisite (Direct Deploy)
- Go 1.12+
- Node 8.12+
- Redis
- MongoDB 3.6+

## Installation
## Run

### Docker

Run the Master Node, for example. `192.168.99.1` is the host machine's IP address in the Docker Machine network, and `192.168.99.100` is the Master Node's IP address.

```bash
docker run -d --rm --name crawlab \
-e CRAWLAB_REDIS_ADDRESS=192.168.99.1:6379 \
-e CRAWLAB_MONGO_HOST=192.168.99.1 \
-e CRAWLAB_SERVER_MASTER=Y \
-e CRAWLAB_API_ADDRESS=192.168.99.100:8000 \
-e CRAWLAB_SPIDER_PATH=/app/spiders \
-p 8080:8080 \
-p 8000:8000 \
-v /var/logs/crawlab:/var/logs/crawlab \
tikazyq/crawlab:0.3.0
```

Of course, you can also use `docker-compose` for one-click startup. By doing so, you don't even have to configure the MongoDB and Redis databases. Create a file named `docker-compose.yml` and enter the content below.

```yaml
version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: crawlab-master
    environment:
      CRAWLAB_API_ADDRESS: "192.168.99.100:8000"
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis:6379"
    ports:
      - "8080:8080" # frontend
      - "8000:8000" # backend
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"
```

Then execute the command below, and the Crawlab Master Node + MongoDB + Redis will start up. Open `http://localhost:8080` in the browser to see the UI.

```bash
docker-compose up
```

For Docker deployment details, please refer to the [documentation](https://tikazyq.github.io/crawlab/Installation/Docker.md).

Three methods:
1. [Docker](https://tikazyq.github.io/crawlab/Installation/Docker.md) (Recommended)
2. [Direct Deploy](https://tikazyq.github.io/crawlab/Installation/Direct.md)
3. [Preview](https://tikazyq.github.io/crawlab/Installation/Direct.md) (Quick start)

## Screenshot

#### Login

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/login.png?v0.3.0)

#### Home Page

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/home.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/home.png?v0.3.0)

#### Node List

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/node-list.png?v0.3.0)

#### Node Network

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/node-network.png?v0.3.0)

#### Spider List

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-list.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-list.png?v0.3.0)

#### Spider Overview

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-overview.png?v0.3.0)

#### Spider Detail - Overview
#### Spider Analytics

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-detail-overview.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-analytics.png?v0.3.0)

#### Spider Detail - Analytics
#### Spider Files

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-detail-analytics.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/spider-file.png?v0.3.0)

#### Task Detail - Results
#### Task Results

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/task-detail-results.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/task-results.png?v0.3.0_1)

#### Cron Schedule
#### Cron Job

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/schedule-generate-cron.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/schedule.png?v0.3.0)

## Architecture

Crawlab's architecture is very similar to Celery's, but a few more modules including Frontend, Spiders and Flower are added to feature the crawling management functionality.
The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which are mainly used for node communication and data storage.

![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/architecture.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/v0.3.0/architecture.png)

### Nodes
The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it begins to execute the crawling task and stores the results in MongoDB. The architecture is much more concise compared with versions before `v0.3.0`: the unnecessary Flower module, which offered node monitoring services, has been removed, and node monitoring is now done through Redis.

Nodes are actually the workers defined in Celery. A node runs and connects to a task queue (Redis, for example) to receive and run tasks. As spiders need to be deployed to the nodes, users should specify their IP addresses and ports before deployment.
### Master Node

### Spiders
The Master Node is the core of the Crawlab architecture and its central control system.

The spider source code and configured crawling rules are stored on the `App`, which needs to be deployed to each `worker` node.
The Master Node offers the following services:
1. Crawling Task Coordination;
2. Worker Node Management and Communication;
3. Spider Deployment;
4. Frontend and API Services;
5. Task Execution (one can regard the Master Node as a Worker Node)

### Tasks
The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. In the meantime, the Master Node synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.

Tasks are triggered and run by the workers. Users can view the task status, logs and results in the task detail page.
### Worker Node

### App
The main functionality of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis `PubSub`. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes to execute.
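
As a rough sketch of this PubSub pattern, the Python snippet below shows how a worker might subscribe to a Redis channel for task messages. The channel name and message format here are illustrative assumptions, not Crawlab's actual protocol (Crawlab itself is written in Go).

```python
# Hypothetical worker-side listener; the channel name and message schema
# are assumptions for illustration only.
import json

import redis

r = redis.Redis(host="localhost", port=6379)

def worker_loop(node_id: str) -> None:
    """Listen for task messages published by the Master Node."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"nodes:{node_id}")  # assumed per-node channel
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscription confirmations
        task = json.loads(message["data"])
        print(f"received task {task.get('task_id')}, executing spider...")
```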

This is a Flask app that provides the necessary APIs for common operations such as CRUD, spider deployment and task running. Each node has to run the Flask app to get spiders deployed on that machine. Simply run `python manage.py app` or `python ./bin/run_app.py` to start the app.
### MongoDB

### Broker
MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.
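
The snippet below is a minimal sketch of how spider files can be stored and fetched through GridFS with `pymongo`; the database name, file names and paths are assumptions for illustration, not Crawlab's own layout.

```python
# Illustrative GridFS round trip: the Master Node stores a zipped spider,
# and a Worker Node retrieves it. All names here are assumptions.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["crawlab_example"]
fs = gridfs.GridFS(db)

# Master side: store the packaged spider.
with open("my_spider.zip", "rb") as f:
    file_id = fs.put(f, filename="my_spider.zip")

# Worker side: fetch the same file and write it to the local spiders path.
data = fs.get(file_id).read()
with open("/tmp/my_spider.zip", "wb") as f:
    f.write(data)
```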

Broker is the same as defined in Celery. It is the queue for running async tasks.
### Redis

Redis is a very popular key-value database. It offers node communication services in Crawlab. For example, nodes execute `HSET` to store their info in a Redis hash named `nodes`, and the Master Node identifies online nodes according to that hash.
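
A minimal sketch of this registration pattern is shown below, assuming JSON-encoded node info and a hypothetical node key; Crawlab's actual field layout may differ.

```python
# Hypothetical node registration via HSET; field names are assumptions.
import json

import redis

r = redis.Redis(host="localhost", port=6379)

# Worker side: register this node's info in the "nodes" hash.
node_info = {"ip": "192.168.99.101", "port": "8000", "master": False}
r.hset("nodes", "node-1", json.dumps(node_info))

# Master side: enumerate registered (online) nodes.
for node_id, info in r.hgetall("nodes").items():
    print(node_id.decode(), json.loads(info))
```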

### Frontend

Frontend is basically a Vue SPA based on [Vue-Element-Admin](https://github.com/PanJiaChen/vue-element-admin) by [PanJiaChen](https://github.com/PanJiaChen). Thanks for his awesome template.
Frontend is an SPA based on [Vue-Element-Admin](https://github.com/PanJiaChen/vue-element-admin). It reuses many Element-UI components to support the corresponding views.

## Integration with Other Frameworks

A task is triggered via `Popen` in Python's `subprocess` module. A Task ID will be defined as a variable `CRAWLAB_TASK_ID` in the shell environment to link the data to the task.

In your spider program, you should store the `CRAWLAB_TASK_ID` value in the database with the key `task_id`. Then Crawlab will know how to link those results to a particular task. For now, Crawlab only supports MongoDB.
A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named `CRAWLAB_TASK_ID`, so that the scraped data can be related to a task. In addition, Crawlab passes another environment variable, `CRAWLAB_COLLECTION`, as the name of the collection in which to store the result data.
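
For example, a Python spider could pick up these variables and write results as sketched below; the MongoDB address and database name are assumptions, and a real spider would use its own connection settings.

```python
# Minimal sketch of storing results so Crawlab can link them to the task.
import os

from pymongo import MongoClient

task_id = os.environ.get("CRAWLAB_TASK_ID")  # set by Crawlab per task
col_name = os.environ.get("CRAWLAB_COLLECTION", "results")  # target collection

col = MongoClient("mongodb://localhost:27017")["crawlab_example"][col_name]

# Each scraped item carries task_id so results can be related to the task.
item = {"url": "https://example.com", "title": "Example", "task_id": task_id}
col.insert_one(item)
```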

### Scrapy

@@ -125,9 +213,9 @@ Crawlab is easy to use, general enough to adapt spiders in any language and any
|Framework | Type | Distributed | Frontend | Scrapyd-Dependent |
|:---:|:---:|:---:|:---:|:---:|
| [Crawlab](https://github.com/tikazyq/crawlab) | Admin Platform | Y | Y | N
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Admin Platform | Y | Y | Y
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | Admin Platform | Y | Y | Y
| [Gerapy](https://github.com/Gerapy/Gerapy) | Admin Platform | Y | Y | Y
| [Scrapyd](https://github.com/scrapy/scrapyd) | Web Service | Y | N | N/A

## Community & Sponsorship
Binary file modified frontend/favicon.ico
1 change: 1 addition & 0 deletions frontend/index.html
@@ -5,6 +5,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="renderer" content="webkit">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
<link rel="icon" href="/static/favicon.ico" type="image/x-icon">
<title>Crawlab</title>
</head>
<body>
Binary file added frontend/public/favicon.ico
16 changes: 16 additions & 0 deletions frontend/public/index.html
@@ -0,0 +1,16 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="renderer" content="webkit">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
<link rel="icon" href="/favicon.ico" type="image/x-icon">
<title>Crawlab</title>
</head>
<body>
<!--<script src=<%= BASE_URL %>/tinymce4.7.5/tinymce.min.js></script>-->
<div id="app"></div>
<!-- built files will be auto injected -->
</body>
</html>
Binary file added frontend/src/assets/favicon.ico
Binary file added frontend/static/favicon.ico