Spidermon Field Coverage - Item Validation impossible with nested scrapy.Items #314

Criamos · 2021-09-29T13:36:13Z

Disclaimer: I'm a Python beginner and have been using Scrapy for the past 5 months, so please bear with me since this is the first GitHub Issue I've ever written.

Problem Description

I've tried integrating Spidermon into an existing codebase with ~40 crawlers that use scrapy.Item classes as their data model. While integrating Item Validation (both via schematics and jsonschema), I noticed that Spidermon only seems to be able to "see" the first level of a scrapy.Item, but not the other scrapy.Item classes nested within that Item.

Source code examples

I've tried to illustrate the problem with a simplified abstraction that's close to the Scrapy Tutorial - here's my items.py:

# items.py
import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst  # scrapy.loader.processors is a deprecated alias


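class JoinMultivalues:
    """Output processor for the license field; returns the collected values unchanged."""

    def __init__(self, separator=" "):
        self.separator = separator  # stored, but not used by __call__

    def __call__(self, values):
        return values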


class LicenseItem(scrapy.Item):
    description = Field()


class QuotesItem(scrapy.Item):
    text = Field()
    author = Field()
    tags = Field()
    license = Field(output_processor=JoinMultivalues())


class QuotesItemLoader(ItemLoader):
    default_item_class = QuotesItem
    default_output_processor = TakeFirst()


class LicenseItemLoader(ItemLoader):
    default_item_class = LicenseItem
    default_output_processor = TakeFirst()

The main idea is: within the QuotesItem there's a LicenseItem that should hold a license description. Other scrapy.Items could be nested within the QuotesItem as well, sometimes several layers deep.

This is what a yielded Item looks like in the terminal (please ignore the "raw" formatting of the license description string):

2021-09-29 14:03:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': ['Allen Saunders'],
 'license': [{'description': ['[\'<p class="copyright">\\n                Made with <span '
                 'class="sh-red">❤</span> by <a '
                 'href="https://scrapinghub.com">Scrapinghub</a>\\n            '
                 "</p>']"]}],
 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans'],
 'text': ['“Life is what happens to us while we are making other plans.”']}

The crawler itself uses the Basic Scrapy Monitors and does little more than scrape the Tutorial website: it uses the ItemLoader class (scrapy.loader.ItemLoader) to nest one scrapy.Item within another and yields the QuotesItem at the end.

# quotes_spider.py
import scrapy
from scrapy.loader import ItemLoader

from ..items import QuotesItem, LicenseItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'SPIDERMON_SPIDER_CLOSE_MONITORS': 'spidermon.contrib.scrapy.monitors.SpiderCloseMonitorSuite',
        'SPIDERMON_MIN_ITEMS': 1,
        'SPIDERMON_VALIDATION_MODELS': 'tutorial.tutorial.validators',
        'SPIDERMON_ADD_FIELD_COVERAGE': True,
        'SPIDERMON_FIELD_COVERAGE_RULES': {
            'QuotesItem/text': 0.9,
            'QuotesItem/author': 0.8,
            'QuotesItem/tags': 1,
            'QuotesItem/license/description': 1
        }
    }

    def parse(self, response, **kwargs):
        for quote in response.css('div.quote'):
            item_loader = ItemLoader(item=QuotesItem(), response=response)
            # Implementation via nested dictionaries
            # quotes_item = {
            #     'text': quote.css('span.text::text').get(),
            #     'author': quote.css('small.author::text').get(),
            #     'tags': quote.css('div.tags a.tag::text').getall(),
            #     'nested_dict': {
            #         'nested_field_1': 'can you read this?',
            #         'nested_field_2': 'I hope you can',
            #         'we_must_go_deeper': {
            #             'description': 'but do we really have to?',
            #             "4th_level": {
            #                 "five_levels_deep": "we can't turn back now!"
            #             }
            #         }
            #     }
            # }
            # yield quotes_item

            # Implementation #2 via ItemLoaders - see items.py:
            item_loader.add_value('text', quote.css('span.text::text').get())
            item_loader.add_value('author', quote.css('small.author::text').get())
            item_loader.add_value('tags', quote.css('div.tags a.tag::text').getall())

            license_loader = ItemLoader(item=LicenseItem(), response=response)
            license_raw = response.xpath('//footer/div/p[@class="copyright"]').getall()
            license_description = str(license_raw)
            license_loader.add_value('description', license_description)

            item_loader.add_value('license', license_loader.load_item())
            yield item_loader.load_item()

Output examples

This is what the field coverage output looks like:

'spidermon_field_coverage/QuotesItem/author': 1.0,
 'spidermon_field_coverage/QuotesItem/license': 1.0,
 'spidermon_field_coverage/QuotesItem/tags': 1.0,
 'spidermon_field_coverage/QuotesItem/text': 1.0,
 'spidermon_item_scraped_count': 20,
 'spidermon_item_scraped_count/QuotesItem': 20,
 'spidermon_item_scraped_count/QuotesItem/author': 20,
 'spidermon_item_scraped_count/QuotesItem/license': 20,
 'spidermon_item_scraped_count/QuotesItem/tags': 20,
 'spidermon_item_scraped_count/QuotesItem/text': 20,

My expectation/hope was that I'd be able to "look inside" the LicenseItem and spidermon would show me spidermon_item_scraped_count/QuotesItem/license/description. But as you can see above, spidermon stops at the depth of QuotesItem/license. Spidermon's field coverage monitor can't look inside my license-Item and therefore fails while trying to access the description-field with the following output:

2021-09-29 14:03:15 [quotes] ERROR: [Spidermon] 
======================================================================
FAIL: Field Coverage Monitor/test_check_if_field_coverage_rules_are_met
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/criamos/PycharmProjects/spidermonTesting/venv/lib/python3.9/site-packages/spidermon/contrib/scrapy/monitors.py", line 275, in test_check_if_field_coverage_rules_are_met
    self.assertTrue(len(failures) == 0, msg=msg)
AssertionError: 
The following items did not meet field coverage rules:
QuotesItem/license/description (expected 1, got 0)

Item Validators - validators.py

(the commented-out parts were different approaches until I realized that the problem must lie elsewhere)

# validators.py
from schematics import Model
from schematics.types import ListType, ModelType, StringType


class LicenseItemValidator(Model):
    description = StringType()


class QuoteItemValidator(Model):
    text = StringType(required=True)
    author = StringType(required=True)
    tags = ListType(StringType)
    # license = DictType(field=StringType, coerce_key=BaseType)
    # license = ModelType(model_spec=LicenseItemValidator, required=True)
    license = ListType(ModelType(LicenseItemValidator))

Expected Behaviour:

If I yield a normal Python dictionary (see "Implementation via nested dictionaries" in my quotes_spider.py example above), I get the following output:

'spidermon_item_scraped_count': 20,
 'spidermon_item_scraped_count/dict': 20,
 'spidermon_item_scraped_count/dict/author': 20,
 'spidermon_item_scraped_count/dict/nested_dict': 20,
 'spidermon_item_scraped_count/dict/nested_dict/nested_field_1': 20,
 'spidermon_item_scraped_count/dict/nested_dict/nested_field_2': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper/4th_level': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper/4th_level/five_levels_deep': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper/description': 20,
 'spidermon_item_scraped_count/dict/tags': 20,
 'spidermon_item_scraped_count/dict/text': 20,

This works as expected: the nested dictionaries are accessible to Spidermon.

Now I'm all out of ideas since the only approach I currently see as a solution is to "flatten" all the sub-Items into a big scrapy.Item-structure. This could totally be my fault and I'm simply using spidermon and schematics wrong here, but if anyone could confirm or deny if this is intended behaviour or not, it would be really appreciated. Thank you in advance for taking the time to read this wall of text (and thank you for developing scrapy / spidermon!)
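A possible alternative to flattening, as a minimal sketch only: the yielded item shown earlier wraps the nested LicenseItem in a list, whereas the plain-dict variant nests mappings directly. The helper below converts an item into plain nested dicts and collapses a one-element list that holds a mapping. `unwrap_nested` is a hypothetical name, not part of Scrapy or Spidermon, and whether Spidermon then reports `license/description` for the converted item has not been verified here. The sample data stands in for a real scrapy.Item.

```python
from collections.abc import Mapping


def unwrap_nested(value):
    """Hypothetical helper: turn nested mappings into plain dicts and
    collapse a one-element list whose only element is a mapping."""
    if isinstance(value, Mapping):
        return {key: unwrap_nested(inner) for key, inner in value.items()}
    if isinstance(value, list):
        if len(value) == 1 and isinstance(value[0], Mapping):
            return unwrap_nested(value[0])
        return [unwrap_nested(inner) for inner in value]
    return value


# Stand-in for the item shape shown in the terminal output above.
sample = {
    "author": ["Allen Saunders"],
    "license": [{"description": ["<p>...</p>"]}],
    "tags": ["fate", "life"],
}
converted = unwrap_nested(sample)
print(converted["license"])
```

Lists of plain strings such as `tags` and `author` are left as they are; only the list around the nested mapping is removed. Yielding the converted dict instead of the Item would presumably change the stat prefix from `QuotesItem` to `dict`, as in the dictionary output above.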

mushtaqak · 2022-06-10T13:14:58Z

@Criamos I think this is expected. As of now, Spidermon does not support nested items inside a list, such as the license field in your case, so we cannot apply coverage rules to license.description.
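To illustrate the difference between the two item shapes in this thread, here is a minimal sketch of a field counter that descends into mappings only. It is an assumption made for illustration, not Spidermon's actual implementation, and `count_field_paths` is a hypothetical name. The two sample items mimic the list-wrapped shape produced by the ItemLoader and the directly nested shape of the plain-dict variant.

```python
from collections import Counter
from collections.abc import Mapping


def count_field_paths(item, prefix, counts):
    """Record one hit per field path, descending into mappings only.

    Illustrative assumption: values that are lists are counted as a
    single field and never inspected further.
    """
    for key, value in item.items():
        path = f"{prefix}/{key}"
        counts[path] += 1
        if isinstance(value, Mapping):
            count_field_paths(value, path, counts)
    return counts


# Shape from the ItemLoader example: the nested item sits inside a list.
list_wrapped = {"author": ["Allen Saunders"], "license": [{"description": ["..."]}]}
# Shape from the plain-dict example: the nested mapping is a direct value.
direct = {"author": "Allen Saunders", "license": {"description": "..."}}

wrapped_counts = count_field_paths(list_wrapped, "QuotesItem", Counter())
direct_counts = count_field_paths(direct, "dict", Counter())

print(sorted(wrapped_counts))  # no 'QuotesItem/license/description' entry
print(sorted(direct_counts))   # includes 'dict/license/description'
```

Under this assumption, the traversal stops at `QuotesItem/license` because the value there is a list, which matches the coverage output reported in the issue.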
