Web Crawling and Scraping Framework
- Cache web pages with Redis
- Export the results automatically to MongoDB
- Many HTTP request providers
- Helper to ignore invalid SSL certificates
- Helper to rename JSON property name tokens
- Helper to get or delete N items of a collection
- Common regular expressions to grab data
- Extract YouTube video metadata
Some ready-made regular expressions are included to grab data:
- Cookies: can slice each part of a cookie string
Each provider has its own features. You can choose any of them to fetch a resource.
- HttpClient: Provides a class for sending HTTP requests and receiving HTTP responses from a resource.
- AngleSharp: WebRequester that navigates like a browser.
- WebClient: Provides common methods for sending data to and receiving data from a resource.
- CookieWebClient: Extension of WebClient that includes the cookies of the transaction.
- BetterWebClient: Extension of WebClient with session support through a cookie container, a GZip header, and the HTTP status code.
- Chrome Headless: Requests pages and interprets JavaScript. You must have Google Chrome installed.
A pipeline receives data, transforms it, and sends the result to the next pipeline. You can use as many pipelines as you want.
Call .RunDasherizePipeline()
This pipeline normalizes an object by transforming the JSON key fields, for example:
"some_title" => "some-title"
"someTitle" => "some-title"
Call .RunPrintToConsolePipeline(fields)
This pipeline prints scraped data to stdout. The fields parameter is optional.
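For example, assuming the fields parameter takes the names of the properties to print (the field names below are placeholders):

context.RunPrintToConsolePipeline("text", "author");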
Call .RunSafeUrlsPipeline(result, prefixUrl, fieldArgs[])
This pipeline adds an HTTP prefix to any scraped link, because href attributes often contain only the path and omit the scheme://host:port, for example:
"href="image.jpg" => "http://user-domain.com/image.jpg"
Call .RunYoutubeDetailPipeline()
This pipeline extracts metadata from a YouTube video URL. Attributes such as Id, Author, UploadDate, Title, Description, Thumbnails, Duration, Keywords, Statistics, StreamMediaInfo and ClosedCaptionInfo can be extracted from the video URL.
- Retrieves information about videos, playlists, channels, media streams and closed caption tracks
- Handles all types of videos, including legacy, signed, restricted, non-embeddable and unlisted videos
- Works with media streams of all types -- muxed, embedded adaptive, dash adaptive
- Parses and downloads closed caption tracks
- Provides static methods to validate IDs and to parse IDs from URLs
You must specify the field name that will receive the metadata information
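For example, assuming the overload that takes the target field name ("youtube" below is just a placeholder):

protected override void SuccessPipeline(SpiderContext context)
{
    // "youtube" is the field that will receive the extracted video metadata
    context.RunYoutubeDetailPipeline("youtube");
    context.RunPrintToConsolePipeline();
}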
Call .RunSaveToMongoAsyncPipeline(collection, unique-id-field);
This pipeline upserts data into MongoDB. The collection parameter defines the collection name to save to, and unique-id-field defines the primary key of the document, which determines whether the operation will be an insert or an update.
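For example (the collection name "quotes" and the id field "id" are placeholders for your own values):

protected override void SuccessPipeline(SpiderContext context)
{
    // Upserts each scraped document into the "quotes" collection, keyed by "id"
    context.RunSaveToMongoAsyncPipeline("quotes", "id");
}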
Call .RunSaveToElasticSearchPipeline(type, unique-id-field);
This pipeline indexes data on ElasticSearch. The type parameter defines the type name to save to, and unique-id-field defines the primary key of the document.
The unique-id-field parameter is optional; it will be generated if not specified.
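For example, indexing under a placeholder type name and letting the id be generated:

context.RunSaveToElasticSearchPipeline("quote");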
Call .RunHtmlDecode();
This pipeline converts a string that has been HTML-encoded for HTTP transmission into a decoded string
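As an illustration of the kind of transformation this performs on string fields (shown here with the standard .NET decoder, not the pipeline itself):

using System.Net;

string encoded = "Scraping &amp; parsing &quot;HTML&quot;";
string decoded = WebUtility.HtmlDecode(encoded);
// decoded == "Scraping & parsing \"HTML\""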
- Start a new console project and add the NuGet reference
- PM> Install-Package SpiderSharp
- Define a spider class that inherits from SpiderEngine, ISpiderEngine
- Set the first URL in the constructor
- Set the link to the next page by overriding the FollowPage method
- Implement what you want to scrape by overriding the OnRun method
TIP: you can use this.node to view the HTML node content
public class ScrapQuotesSpider : SpiderEngine, ISpiderEngine
{
    public ScrapQuotesSpider()
    {
        // First page to scrape
        this.SetUrl("http://quotes.toscrape.com");
    }

    protected override string FollowPage()
    {
        // Link to the next page to be scraped
        return $"http://quotes.toscrape.com{this.node.GetHref("ul > li > a")}";
    }

    protected override IEnumerable<dynamic> OnRun()
    {
        // Select every quote block on the page
        var quotes = this.node.SelectNodes("div.quote");

        foreach (var item in quotes)
        {
            // ct.Data is a dynamic bag that receives the scraped fields
            yield return this.Fetch(ct =>
            {
                ct.Data.text = item.GetInnerText("span.text");
                ct.Data.author = item.GetInnerText("small.author");
                ct.Data.tags = item.SelectInnerText("div.tags > a.tag");
            });
        }
    }

    protected override void SuccessPipeline(SpiderContext context)
    {
        context.RunPrintToConsolePipeline();
    }

    protected override void ErrorPipeline(SpiderContext context)
    {
        context.RunEmbedMetadata();
        context.RunPrintToConsolePipeline();
    }
}
- Create an instance of the spider
- Choose at least one pipeline to return the results
- Run the spider
ScrapQuotesSpider spider = new ScrapQuotesSpider();
spider.Run();
The project includes a simple quotes scraper for http://quotes.toscrape.com
This application can be used to inspect and debug a spider with the features below
- set => set the httpclient or anglesharp driver used for requests
- load => load url or local file
- save => save content to local file
- innertext => select innertext using css selector
- attribute => select attribute value using css selector
- innerhtml => select innerhtml using css selector
- outerhtml => select outerhtml using css selector
- links => select links using css selector
- json => select data as json using css selector
- path => select data using css selector and print the json path
- cls => clear screen
- browse => open browser with content
- notepad => open notepad with content
- help => show all commands
- quit => exit program
- Use the set command to choose which driver to use for requests
- Load any URL or local file containing page content
- If you want, use the browse command to see the page content in the browser
- Try to get data with any CSS selector using the commands: innertext, attribute, innerhtml, outerhtml, links, json or path
- You will see the result on console
** You can use the notepad command to see the result of the last css selector command in a text editor, or use the save command to write the content to a file **
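A typical session might look like this (the URL, selectors and argument forms are only illustrative):

set anglesharp
load http://quotes.toscrape.com
innertext span.text
links div.tags > a.tag
browse
quit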
- AngleSharp - The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications
- YoutubeExplode - The ultimate dirty YouTube library