Web_Scraping_VCoin_w_Scrapy-Pandas

image

As an enthusiastic collector of antique coins, I have always been fascinated by the rich history each piece embodies. These coins are not only currency, but also links to our past. Some may have been used in times of war, others for everyday transactions like buying food, books, clothes or acquiring a new home. Each coin holds a story, a glimpse into the lives and times of those who once held them.

This is Part IV of my Web Scraping series. If you want to see the first three parts, here are the links:

Virtual Environment

Pip Installation

python get-pip.py

Add Python to Path

You need to find the location of your Python executable to add it to Path. Generally you can find it under C:\Python; otherwise it is going to look like this:

C:\Users\USER\AppData\Local\Programs\Python

Then search for "Edit the system environment variables", click on Environment Variables, and add this path there.

image

Creating Virtual Environment

python -m venv venv

When you create a virtual environment named venv, you will find a Scripts folder in it, and inside that folder there is a file called activate. This is a batch file, and we will activate our environment with it.

Activating venv

venv\Scripts\activate

Now we are ready to use our virtual environment.
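If you are not sure whether the activation worked, you can ask Python itself. This is a minimal sketch, and the helper name in_virtualenv is my own, not something from the venv documentation. It relies on the fact that sys.prefix and sys.base_prefix differ inside a virtual environment.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter in use:", sys.executable)
```

Run it once before and once after activating venv and compare the two results.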

If you run into an execution policy (“Execution_Policies”) problem, you can run the following command in PowerShell:

Set-ExecutionPolicy RemoteSigned

Installing packages into venv

python -m pip install "package-name"

That’s it. We can install any package we want, in any version we need, without making our environment messier or dealing with problems caused by all the other modules in the same environment.

Deactivating venv

deactivate

When we are done, we can simply close our virtual environment with deactivate.

You can check this documentation for more: Virtual Environments and Packages

Let’s install our packages:

pip install Scrapy
pip install pandas
pip install numpy

What is Scrapy?

Scrapy is a high-level web crawling and scraping framework that helps us extract structured data from websites. It can be used for various purposes, including data mining, monitoring, and automated testing.

Documentation => Scrapy 2.11 documentation

Starting Project

scrapy startproject "project_name"

image

After we start our project, our folder has files like settings.py, where you can configure your spider settings, items.py, pipelines.py, etc. The most important part is the spiders folder. There we configure our spider, and it does what we ask for. We can create different spiders for different jobs. To begin with, I suggest you check the documentation above; it really helps you understand what’s going on.

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")

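You can find examples like this in the documentation. A spider's structure looks like this. The important thing is that name must be unique within the project, because you will use the spider’s name to crawl data.

There is a shortcut to the start_requests method: define a start_urls list, and Scrapy will build the initial requests from it and call parse for each response.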

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)

Crawling with Scrapy

scrapy crawl quotes # the quotes spider will get into action

scrapy crawl quotes -o quotes.json # saves the items the spider yields in JSON format

Crawling via Scrapy Shell

scrapy shell "url" # fetches the page so you can analyze it directly in the shell
  • From the Scrapy shell you can try selectors and see their output right away:
response.css("title::text").get()
# Output: 
'Quotes to Scrape'

That’s it for now. For more information you can check the documentation I linked above; it is enough to give you a deeper understanding.

What is Pandas?

Pandas is an open-source Python library that has changed the game in data analysis and manipulation. Think of pandas as a Swiss Army knife. It’s powerful yet user-friendly, complex yet approachable, and it’s the tool for anyone looking to make sense of data.

With pandas, tasks like reading data from various sources, cleaning it to a usable format, exploring it to find trends, and even visualizing it for presentations are simplified.

Why pandas? Because it streamlines complex processes into one or two lines of code — processes that otherwise would have taken countless steps in traditional programming languages. It’s especially popular in academic research, finance, and commercial data analytics because of its ability to handle large datasets efficiently and intuitively.

Installation

pip install pandas

Example from pandas documentation

import pandas as pd
import numpy as np

df_exe = pd.DataFrame(
    {
        "One": 1.0,
        "Time data": pd.Timestamp("20130102"),
        "Series": pd.Series(1, index=list(range(4)), dtype="float32"),
        "Numpy Array": np.array([3] * 4, dtype="int32"),
        "Catalog": pd.Categorical(["Chair", "Tv", "Mirror", "Sofa"]),
        "F": "example",
    }
)

df_exe

image

df_exe[df_exe["Catalog"]=="Mirror"]

image
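The filter above works in two steps, and it may help to see them separately. This is a small sketch with a made-up table (the Stock column and its numbers are hypothetical). The comparison first produces a boolean Series, and indexing with that Series keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical mini catalog, unrelated to the scraped data.
df = pd.DataFrame({"Catalog": ["Chair", "Tv", "Mirror", "Sofa"], "Stock": [4, 0, 2, 7]})

# Step 1: the comparison returns one True/False value per row.
mask = df["Catalog"] == "Mirror"
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows where it is True.
mirrors = df[mask]
print(mirrors)
```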

We will explore more during the project, so we are done with pandas for now. For detailed information you can check the pandas documentation.

Project Vcoin

|Part 1: Getting the Data

image

When I checked the website, I saw that the listings are structured like this. I decided to get the seller name, the coin (money) description, and the price of each listing, so I inspected the HTML structure to see what to extract.

image

scrapy shell "https://www.vcoins.com/en/coins/world-1945.aspx"

image

response.css("div.item-link a::text").extract()

image

response.css("p.description a::text").extract()

image

response.css("div.prices span.newitemsprice::text").extract()[::2]

image

response.css("div.prices span.newitemsprice::text").extract()[1::2]

image
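The [::2] and [1::2] slices in the last two commands split one flat list into two. Here is a minimal sketch of the idea with plain Python lists. The sample values are made up, and I am assuming the selector returns the currency symbol and the amount as alternating text nodes.

```python
# Hypothetical sample of what the price selector might return for three coins:
# each listing contributes a currency symbol followed by an amount.
raw = ["US$", "125.00", "€", "40.00", "£", "1,250.00"]

symbols = raw[::2]    # items at even positions: 0, 2, 4
amounts = raw[1::2]   # items at odd positions: 1, 3, 5

# zip() pairs each symbol with its amount, stopping at the shorter list.
combined = [symbol + amount for symbol, amount in zip(symbols, amounts)]
print(combined)
```

If a listing is missing one of the two text nodes, every pair after it shifts, so it is worth comparing len(symbols) and len(amounts) before trusting the result.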

This is the data of one page. I want my spider to go through all the available pages and return the data, so I will check the pagination part at the bottom.

image

The spider will keep following pages until there are none left. I first did this project with text output, then switched to CSV (comma-separated values) output for data analysis.

image

| Importing Libraries

import scrapy  # Import the scrapy library
import csv  # Import the csv library

| init

# Define a new spider class which inherits from scrapy.Spider.
class MoneySpider(scrapy.Spider):
    name = "moneyspider_csv"
    page_count = 0  
    money_count = 1 
    start_urls = ["https://www.vcoins.com/en/coins/world-1945.aspx"]

| Start_request

    def start_requests(self):
        self.file = open('money.csv', 'w', newline='', encoding='UTF-8')  # Open a new CSV file in write mode.
        self.writer = csv.writer(self.file)  # Create a CSV writer object.
        self.writer.writerow(['Count', 'Seller', 'Money', 'Price'])  # Write the header row in the CSV file.
        return [scrapy.Request(url=url) for url in self.start_urls]  # Return a list of scrapy.Request objects for each URL.
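To see what csv.writer does with the rows, here is a small sketch that writes to an in-memory buffer instead of money.csv. The rows are hypothetical placeholders, not scraped data. It shows that a field containing a comma is quoted automatically and read back as one field.

```python
import csv
import io

# Hypothetical rows standing in for scraped listings.
rows = [
    (1, "Example Seller A", "Example coin 1950", "US$125.00"),
    (2, "Example Seller B", "Example coin, with a comma", "€40.00"),
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Count", "Seller", "Money", "Price"])
writer.writerows(rows)

text = buffer.getvalue()
print(text)

# Reading the text back shows that the embedded comma survived as one field.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed[2])
```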

| parse

    # This method processes the response from each URL
    def parse(self, response):
        # Extract the text written to the 'Seller' column
        money_names = response.css("div.item-link a::text").extract()
        # Extract the coin descriptions written to the 'Money' column (they often include the year)
        money_years = response.css("p.description a::text").extract()
        # Extract the currency symbols (every other text node, starting at index 0)
        money_symbols = response.css("div.prices span.newitemsprice::text").extract()[::2]
        # Extract the prices (every other text node, starting at index 1)
        money_prices = response.css("div.prices span.newitemsprice::text").extract()[1::2]
        # Combine the currency symbols and prices
        combined_prices = [symbol + price for symbol, price in zip(money_symbols, money_prices)]

|

        # Loop through the extracted items and write each to a row in the CSV file.
        for i in range(len(money_names)):
            self.writer.writerow([self.money_count, money_names[i], money_years[i], combined_prices[i]])
            self.money_count += 1
        # Extract the URL for the next page
        next_page_url = response.css("div.pagination a::attr(href)").extract_first()
        # If there is a URL for the next page, construct the full URL and continue scraping.
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            self.page_count += 1

            # Stop after 10 pages; otherwise follow the next page.
            if self.page_count != 10:
                yield scrapy.Request(url=absolute_next_page_url, callback=self.parse, dont_filter=True)
            else:
                self.file.close()
        else:
            # No further pages: close the CSV file so buffered rows are flushed.
            self.file.close()
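The response.urljoin call in parse turns whatever is in the href into a full URL. As far as I know it builds on the standard library's urljoin, so the behavior can be tried without Scrapy. This is only a sketch: the href values are hypothetical, and the real pagination links on the site may look different.

```python
from urllib.parse import urljoin

page_url = "https://www.vcoins.com/en/coins/world-1945.aspx"

# Hypothetical href values; the real pagination links on the site may differ.
relative_href = "world-1945.aspx?page=2"
root_relative_href = "/en/coins/world-1945.aspx?page=3"
absolute_href = "https://www.vcoins.com/en/coins/world-1945.aspx?page=4"

# Each kind of href is resolved against the URL of the current page.
resolved = [urljoin(page_url, href) for href in (relative_href, root_relative_href, absolute_href)]
for url in resolved:
    print(url)
```

All three forms end up as absolute URLs on the same site, which is why the spider can pass the result straight to scrapy.Request.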

| Output

image

|Part 2: Data Analysis

We successfully extracted the CSV data; now it is time to analyze it.

import pandas as pd
import numpy as np

test = pd.read_csv("money.csv",index_col="Count")
test

image

test.shape
test.info()
test.describe()

image

test.isnull().sum()

image

# Price is still text at this point (symbol + amount), so this sort is alphabetical, not numeric.
test.drop_duplicates().sort_values(by="Price",ascending=False).head(25)

image

# Regular expression(Regex) pattern to match 2, 3, or 4 consecutive digits.
pattern = r'(\b\d{4}\b|\b\d{3}\b|\b\d{2}\b)'

test['Extracted_Year'] = test['Money'].str.extract(pattern, expand=False)

test['Extracted_Year'] = pd.to_numeric(test['Extracted_Year'], errors='coerce').fillna(-1).astype(int)

test.drop_duplicates().sort_values(by='Extracted_Year',ascending=False).head(60)
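It is worth checking what this pattern really picks up, because only the first match in each string is extracted. Below is a small sketch with the standard re module. The second and third descriptions are hypothetical examples I made up to show the edge cases.

```python
import re

pattern = r'(\b\d{4}\b|\b\d{3}\b|\b\d{2}\b)'

# Only the first description appears in this write-up; the others are hypothetical.
samples = [
    "Elizabeth II 1966 Gillick Sovereign MS64",
    "20 Francs 1914",
    "Undated token",
]

results = []
for text in samples:
    # re.search returns only the first match in the string, or None.
    match = re.search(pattern, text)
    results.append(match.group(1) if match else None)
    print(text, "->", results[-1])
```

The first sample gives 1966 as expected, and the grade MS64 is skipped because there is no word boundary between S and 6. The second sample gives 20, the denomination, not the year. Rows like that need a stricter pattern or a manual check before Extracted_Year can be trusted.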

image

def clean_price(price):
    price = price.replace('US$', '').replace('€', '').replace('£', '').replace('NOK', '')
    price = price.replace(',', '').replace('.', '')
    price = price.strip()
    return price

# Apply the cleaning function to Price Column
test['Price'] = test['Price'].apply(clean_price)
test['Price'] = pd.to_numeric(test['Price'], errors='coerce')
test[["Money", "Price"]].drop_duplicates().sort_values(by="Price", ascending=False).head(40)
  • This is not the correct thing to do, but I wanted to show you the cleaning process. I haven’t decided how to sort the values correctly because there are different currencies.
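One possible direction, shown here only as a sketch, is to keep the currency and the amount as two separate values instead of merging everything into one number. The function name split_price is my own, and I am assuming prices look like US$1,250.00, with commas as thousands separators and a dot as the decimal mark. Only the four currency markers handled by clean_price are recognized.

```python
import re
from decimal import Decimal

# Currency markers handled by clean_price above; anything else is left unrecognized.
PRICE_RE = re.compile(r'^\s*(US\$|€|£|NOK)\s*([\d,]+(?:\.\d+)?)\s*$')


def split_price(price: str):
    """Split 'US$1,250.00' into ('US$', Decimal('1250.00')).

    Assumes commas are thousands separators and the dot is the decimal mark.
    Returns (None, None) when the text does not match that shape.
    """
    match = PRICE_RE.match(price)
    if match is None:
        return None, None
    currency, amount = match.groups()
    # Drop only the thousands separators; the decimal point is kept.
    return currency, Decimal(amount.replace(",", ""))


if __name__ == "__main__":
    for sample in ["US$1,250.00", "€40.00", "n/a"]:
        print(sample, "->", split_price(sample))
```

With the two parts separated, values can be compared within one currency, and converting between currencies becomes a separate, explicit step.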

image

test[test["Money"].str.startswith("Elizabeth", na=False)]

test[test["Seller"].str.startswith("Sovereign", na=False)]

image

test[test["Money"].isin(["Elizabeth II 1966 Gillick Sovereign MS64"])]

image

If you want to read about this in simpler language, you can check my Medium article published on Level Up Coding.

LINK => https://levelup.gitconnected.com/web-scraping-series-part-iv-world-coins-with-scrapy-data-analysis-with-pandas-6222bb8d6aa7