Disclaimer: I'm a Python beginner and have been using Scrapy for the past 5 months, so please bear with me since this is the first GitHub Issue I've ever written.
Problem Description
I've tried integrating spidermon into an existing codebase with ~40 crawlers that use scrapy.Items as their data model. While integrating Item Validation (both via schematics and jsonschema), I noticed that spidermon only seems to be able to "see" the first level of a scrapy.Item (class: scrapy.Item), but not the other scrapy.Item classes nested within that Item.
Source code examples
I've tried to illustrate the problem with a simplified abstraction that's close to the Scrapy Tutorial - here's my items.py:
The main idea is: within the QuotesItem there's a LicenseItem that should hold a license description. The QuotesItem could also contain other nested scrapy.Items, sometimes several layers deep.
This is what a yielded Item looks like in the terminal (please ignore the "raw" formatting of the license description string):
2021-09-29 14:03:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': ['Allen Saunders'],
 'license': [{'description': ['[\'<p class="copyright">\\n Made with <span ''class="sh-red">❤</span> by <a ''href="https://scrapinghub.com">Scrapinghub</a>\\n ' "</p>']"]}],
 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans'],
 'text': ['“Life is what happens to us while we are making other plans.”']}
The crawler itself uses the Basic Scrapy Monitors and does little beyond scraping the Tutorial website: it uses the scrapy.ItemLoader class to nest one scrapy.Item within another and yields the QuotesItem at the end.
Output examples
This is what the field coverage output looks like:
My expectation/hope was that I'd be able to "look inside" the LicenseItem and that spidermon would show me spidermon_item_scraped_count/QuotesItem/license/description. Instead, spidermon stops at the depth of QuotesItem/license. Spidermon's field coverage monitor can't look inside my license Item and therefore fails when it tries to access the description field, with the following output:
2021-09-29 14:03:15 [quotes] ERROR: [Spidermon]
======================================================================
FAIL: Field Coverage Monitor/test_check_if_field_coverage_rules_are_met
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/criamos/PycharmProjects/spidermonTesting/venv/lib/python3.9/site-packages/spidermon/contrib/scrapy/monitors.py", line 275, in test_check_if_field_coverage_rules_are_met
    self.assertTrue(len(failures) == 0, msg=msg)
AssertionError:
The following items did not meet field coverage rules:
QuotesItem/license/description (expected 1, got 0)
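To make the failure easier to reason about, here is a minimal sketch of a per-field counter. It is not spidermon's actual code: `count_fields` is a hypothetical helper, and it assumes (based only on the behaviour described in this thread) that the counter descends into plain dict values and nothing else. Under that assumption, a nested item wrapped in a list is never entered, so no `license/description` key is recorded.

```python
from collections import Counter


def count_fields(item, prefix, stats):
    """Record one counter key per field path; descend only into plain dicts.

    Hypothetical stand-in for a field coverage counter, not spidermon's API.
    """
    for name, value in item.items():
        path = f"{prefix}/{name}"
        stats[path] += 1
        # Assumption: only dict values are walked; lists are treated as leaves.
        if isinstance(value, dict):
            count_fields(value, path, stats)


# Same shape as the yielded item in the log: every value is wrapped in a list.
list_wrapped = {"author": ["Allen Saunders"], "license": [{"description": ["..."]}]}
# The same data as plain nested dicts, without the list wrappers.
plain_nested = {"author": "Allen Saunders", "license": {"description": "..."}}

wrapped_stats = Counter()
count_fields(list_wrapped, "QuotesItem", wrapped_stats)

nested_stats = Counter()
count_fields(plain_nested, "QuotesItem", nested_stats)

print(sorted(wrapped_stats))  # no 'QuotesItem/license/description' key
print(sorted(nested_stats))   # includes 'QuotesItem/license/description'
```

With the list-wrapped shape, the deepest key recorded is `QuotesItem/license`, which matches the "expected 1, got 0" message for `QuotesItem/license/description`.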
Item Validators - validators.py
(the commented-out parts were different approaches until I realized that the problem must lie elsewhere)
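Expected Behaviour:
If I yield a normal Python dictionary (see "Implementation via dict class" in my quotes_spider example above), I get the following output:
This works as expected: the nested dictionaries are accessible by spidermon.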
Now I'm all out of ideas, since the only solution I currently see is to "flatten" all the sub-Items into one big scrapy.Item structure. This could well be my fault, and I may simply be using spidermon and schematics incorrectly, but if anyone could confirm whether this is intended behaviour, I would really appreciate it. Thank you in advance for taking the time to read this wall of text (and thank you for developing scrapy / spidermon!)
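For reference, the "flatten" workaround could look roughly like the sketch below. `flatten_item` and the `_` separator are hypothetical choices, not part of scrapy or spidermon. The sketch relies on the nested values being mappings, which holds for plain dicts and, as far as I know, for scrapy.Item instances as well.

```python
from collections.abc import Mapping


def flatten_item(item, parent_key="", sep="_"):
    """Collapse nested mappings into a single level with joined key names.

    Hypothetical helper; the separator and naming scheme are assumptions.
    """
    flat = {}
    for name, value in item.items():
        key = f"{parent_key}{sep}{name}" if parent_key else name
        if isinstance(value, Mapping):
            # Recurse so that arbitrarily deep nesting ends up at the top level.
            flat.update(flatten_item(value, key, sep))
        else:
            flat[key] = value
    return flat


nested = {
    "author": "Allen Saunders",
    "license": {"description": "Made with love by Scrapinghub"},
}
flat = flatten_item(nested)
print(flat)
```

The trade-off is that the flattened item needs its own field declarations (for example a `license_description` field), so every coverage rule and validator has to use the joined names.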
@Criamos I think this is expected. As of now, spidermon does not support nested items inside a list, such as the license field in your case, so we cannot apply coverage rules to license.description.
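If the list wrapper is the obstacle, one thing to try is unwrapping single-element lists before the item is yielded. The sketch below is a plain-Python illustration with a hypothetical helper, `unwrap_single`; whether this is enough to make the coverage rule pass is not confirmed anywhere in this thread. In Scrapy itself, the `TakeFirst` output processor from `itemloaders.processors` does a similar job at load time, since an ItemLoader collects values into lists by default.

```python
def unwrap_single(value):
    """Replace single-element lists with their only element, recursively.

    Hypothetical helper for illustration; lists with several entries
    (such as tags) are left untouched.
    """
    if isinstance(value, list) and len(value) == 1:
        return unwrap_single(value[0])
    if isinstance(value, dict):
        return {key: unwrap_single(inner) for key, inner in value.items()}
    return value


scraped = {
    "author": ["Allen Saunders"],
    "license": [{"description": ["Made with love by Scrapinghub"]}],
    "tags": ["fate", "life"],
}
unwrapped = unwrap_single(scraped)
print(unwrapped)
```

After unwrapping, `license` is a nested dict rather than a list containing a dict, which is the shape the dict-based spider in the issue reports as working.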