gocrawl is a polite, slim and concurrent web crawler (or spider).
It is currently in a very early stage of active development and unfit for any serious use. It will kill kittens. Don't use it. Not yet.
- Configurable URLs to visit, inspect and query (using a pre-initialized goquery document)
- Crawl delays applied per host
- Obedience to robots.txt rules (using the robotstxt go library)
- Configurable concurrency
- Loggable using the built-in, configurable Go logger
- ...and possibly more, we'll see!
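
The API is still in flux at this early stage, so the following is only a sketch of how these features could be wired together. It assumes an Extender-style interface with a `Visit` callback and option names such as `CrawlDelay`, `MaxVisits` and `LogFlags`; none of these should be treated as the final API.

```go
package main

import (
	"net/http"
	"time"

	"github.com/PuerkitoBio/gocrawl"
	"github.com/PuerkitoBio/goquery"
)

// ExampleExtender reuses the default behaviour for everything except Visit.
type ExampleExtender struct {
	gocrawl.DefaultExtender
}

// Visit receives the HTTP response along with a pre-initialized goquery
// document. Returning (nil, true) lets the crawler harvest links on its own.
func (e *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
	title := doc.Find("title").Text() // query the document like any goquery selection
	_ = title
	return nil, true
}

func main() {
	opts := gocrawl.NewOptions(new(ExampleExtender))
	opts.CrawlDelay = 1 * time.Second // delay applied per host
	opts.MaxVisits = 5                // stop after a handful of pages
	opts.LogFlags = gocrawl.LogError  // route messages through the built-in logger

	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("https://golang.org/")
}
```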
gocrawl does not attempt to detect the staleness of a page, nor does it implement any caching mechanism: if a URL is enqueued to be processed, a request will be made to fetch it. However, it provides hooks where URL analysis can take place to alter this behaviour (based, for example, on the URL's last visited date saved in a persistent store), as sketched below. Instead of trying to do everything and imposing a single way to do it, gocrawl offers ways to manipulate and adapt it to each need (hopefully!).
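
As an illustration of such a hook, the sketch below (building on the `ExampleExtender` above) adds a Filter-style callback that consults a persistent store before allowing a fetch. The `Filter` signature follows the same assumed API, and `lastVisited` is a hypothetical helper standing in for whatever store you use.

```go
// lastVisited is a hypothetical lookup against a persistent store
// (database, key-value store, flat file, ...).
func lastVisited(rawURL string) (time.Time, bool) {
	// Placeholder: always report "never seen".
	return time.Time{}, false
}

// Filter decides whether an enqueued URL should actually be fetched.
func (e *ExampleExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
	if last, ok := lastVisited(ctx.URL().String()); ok && time.Since(last) < 24*time.Hour {
		return false // crawled recently enough: skip the request
	}
	return !isVisited // otherwise, visit URLs not already seen in this run
}
```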
Likewise, there is no prioritization among the URLs to process. It assumes that all enqueued URLs must be visited at some point, and that the order in which they are visited is unimportant.
By default, gocrawl uses net/http's default Client to fetch pages. This client automatically follows redirects, up to 10 times (see the net/http documentation for the Client struct). Eventually, it will be possible to provide a custom Fetcher interface implementation.
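
For reference, the 10-redirect limit comes from net/http itself; a client with a CheckRedirect function overrides that policy, and a future Fetcher implementation could wrap a client configured along these lines (the 3-redirect limit below is only an example):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

func main() {
	// The default Client stops after 10 consecutive redirects.
	// Supplying CheckRedirect changes that policy; here it gives up after 3.
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 3 {
				return errors.New("stopped after 3 redirects")
			}
			return nil
		},
	}

	res, err := client.Get("http://example.com/")
	if err != nil {
		fmt.Println("fetch error:", err)
		return
	}
	defer res.Body.Close()
	fmt.Println("status:", res.Status)
}
```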