Turning rdflib jsonld into a "full processor" (a.o. for schema.org compliance) #2692

Closed
ghost opened this issue Jan 28, 2024 · 3 comments

ghost commented Jan 28, 2024

Cloned from RDFLib/rdflib-jsonld#62 :

The JSON-LD 1.1 draft spec mentions different levels of processing for JSON-LD: https://w3c.github.io/json-ld-syntax/#processor-levels

A pure JSON processor can only parse JSON-LD serialized directly as JSON, whereas a full processor can also extract and parse JSON-LD embedded in HTML.

It would be great if rdflib-jsonld supported this. It would make rdflib-jsonld usable for HTML documents that follow the schema.org guidelines for embedding (meta)data in HTML pages, as described in their getting started guide: https://schema.org/docs/gs.html.

Together with the RDFa and microdata parsers, this could then work as a fully RDF-based version of the Structured Data Testing tool from Google: https://search.google.com/structured-data/testing-tool.


ghost commented Jan 28, 2024

I'm looking into implementing this using RDFLib/rdflib-jsonld#63 as a starting point. In that code, HTML parsing is attempted only after JSON parsing fails, but I'm looking at choosing the parser based on the content type of the source instead.
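
A minimal sketch of what choosing by content type could look like, as opposed to falling back to HTML after a JSON failure. The function name `choose_parse_mode` and the constant `HTML_CONTENT_TYPES` are hypothetical and not part of rdflib; the two HTML media types are the ones named in the implementation summary further down this thread.

```python
# Hypothetical helper, not rdflib API: decide how to parse a source
# from its Content-Type instead of trying JSON first and HTML second.
HTML_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}


def choose_parse_mode(content_type):
    """Return "html" or "json" for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return "html" if media_type in HTML_CONTENT_TYPES else "json"
```

Anything that is not recognised as HTML, including a missing content type, is treated as JSON here; that default is an assumption of the sketch.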

I see that the JSON-LD test suite in "test/jsonld/1.1" already provides a number of tests which just need to be enabled.

ghost changed the title from "Turning rdflib-jsonld into a "full processor" (a.o. for schema.org compliance)" to "Turning rdflib jsonld into a "full processor" (a.o. for schema.org compliance)" on Jan 29, 2024

ghost commented Apr 1, 2024

FYI: I'm occasionally working on this at https://github.com/wallberg-umd/rdflib/tree/issue-2692-embedded-jsonld-draft . I've added the basic functionality and enabled the existing tests. I'm now working through making the tests pass.

wallberg added a commit to wallberg/rdflib that referenced this issue May 8, 2024
See https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents
and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .

Implementation summary:

rdflib.plugins.parsers.jsonld.JsonLDParser.parse
* add docstring
* change parameter list from **kwargs to explicit list
* add optional extract_all_scripts parameter
* get the fragment identifier from source.getSystemId()
* add fragment_id and extract_all_scripts parameters to the call to source_to_json
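
As an illustration of the fragment identifier step, here is a small sketch using the standard library. It assumes the system ID is a URL string such as the one returned by source.getSystemId(); the helper name `split_fragment` is hypothetical and this is not necessarily how the commit implements it.

```python
from urllib.parse import urldefrag


def split_fragment(system_id):
    """Split a system ID into (url_without_fragment, fragment_id or None)."""
    url, fragment = urldefrag(system_id)
    # urldefrag yields an empty string when there is no fragment;
    # map that to None so callers can test "is a fragment present?".
    return url, fragment or None
```

For example, `split_fragment("http://example.org/page.html#data")` separates the document URL from the `data` identifier, which could then be passed on as `fragment_id`.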

rdflib.plugins.shared.jsonld.util.source_to_json
* add docstring
* add optional fragment_id and extract_all_scripts parameters
* change the return value to a tuple with the extracted JSON document and value of the HTML base element
* if source.content_type is "text/html" or "application/xhtml+xml" then parse source as HTML and extract the appropriate script element(s) and the HTML base element
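
The extraction described above can be sketched roughly as follows. This is an illustration only: it uses the standard library's html.parser rather than whatever HTML parser the commit uses, and `JsonLdScriptCollector` and `extract_json_ld` are hypothetical names. Only the `fragment_id` and `extract_all_scripts` parameters and the (json, base) return shape come from the summary.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Collect JSON-LD script elements and the first <base href>."""

    def __init__(self):
        super().__init__()
        self.base = None
        self.scripts = []        # (id attribute, script text) pairs
        self._current_id = None
        self._buffer = None      # None while outside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "base" and self.base is None:
            self.base = attributes.get("href")
        elif tag == "script":
            media_type = (attributes.get("type") or "").split(";", 1)[0]
            if media_type.strip().lower() == "application/ld+json":
                self._current_id = attributes.get("id")
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append((self._current_id, "".join(self._buffer)))
            self._buffer = None


def extract_json_ld(html, fragment_id=None, extract_all_scripts=False):
    """Return (json_document, base_href) extracted from an HTML string."""
    collector = JsonLdScriptCollector()
    collector.feed(html)
    collector.close()
    if fragment_id is not None:
        # A fragment identifier targets the script with that id.
        texts = [t for sid, t in collector.scripts if sid == fragment_id][:1]
    elif extract_all_scripts:
        texts = [t for _, t in collector.scripts]
    else:
        texts = [t for _, t in collector.scripts[:1]]
    if not texts:
        raise ValueError("no matching JSON-LD script element found")
    documents = [json.loads(t) for t in texts]
    if fragment_id is None and extract_all_scripts:
        # Merge all scripts into one list, flattening top-level arrays.
        merged = []
        for doc in documents:
            merged.extend(doc if isinstance(doc, list) else [doc])
        return merged, collector.base
    return documents[0], collector.base
```

The sketch ignores details the JSON-LD API specification covers, such as error codes and comment handling inside script elements.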

Testing

test/jsonld/test_onedotone.py
* enable all existing html tests (except html/f004-in)
* if inputpath ends with ".html" (with optional fragment identifier) then invoke runner.do_test_html

For more information on the failing html/f004-in test, see https://lists.w3.org/Archives/Public/public-json-ld-wg/2024May/0000.html .

test/jsonld/runner.py
* add new do_test_html function

Note that the html test cases from the JSON-LD Test Suite combine testing
for JSON-LD extraction from the HTML with testing for other algorithms (e.g. compact/flatten),
which rdflib does not currently support. In order to test extraction only and ignore
the compact/flatten algorithms, do_test_html performs a graph comparison using
rdflib.compare.isomorphic, without serializing back to JSON.

wallberg commented May 8, 2024

I've completed an initial implementation for this issue, see https://github.com/wallberg/rdflib/tree/issue-2692-embedded-jsonld .

It contains one breaking change: when rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.

I can think of other ways to return the base without breaking the current return value:

  1. Return json when processing a json document and (json, base) when processing an html document.
  2. Add an optional parameter to return (json, base) instead of json.
  3. Continue returning only json, but add an optional parameter which will receive the value of base.

I'd like to get some feedback on the preferred approach before submitting the PR.
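
To make the three alternatives easier to compare, here is a rough sketch of each signature. Everything in it is hypothetical: `_parse` is a stand-in for the real extraction, the `source` dict replaces rdflib's input source object, and the `_v1`/`_v2`/`_v3` suffixes, `return_base`, and `base_out` are names invented for illustration.

```python
import json


def _parse(source):
    """Stand-in for the real extraction. `source` is a dict holding a JSON
    string under "json" and, for HTML input, a "base" value."""
    return json.loads(source["json"]), source.get("base")


# Option 1: the return type depends on the kind of input document.
def source_to_json_v1(source):
    doc, base = _parse(source)
    return (doc, base) if source.get("is_html") else doc


# Option 2: an opt-in flag switches the return value to (json, base).
def source_to_json_v2(source, return_base=False):
    doc, base = _parse(source)
    return (doc, base) if return_base else doc


# Option 3: always return json; a caller-supplied dict receives the base.
def source_to_json_v3(source, base_out=None):
    doc, base = _parse(source)
    if base_out is not None:
        base_out["base"] = base
    return doc
```

Options 2 and 3 leave existing callers untouched, while option 1 changes the return type only for inputs that previously could not be parsed at all.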

A note on the current status of validation:

  1. task test passes
  2. task lint passes
  3. task mypy fails, I believe due to existing problems in main
  4. poetry run python -m mypy rdflib/plugins/parsers/jsonld.py rdflib/plugins/shared/jsonld/context.py rdflib/plugins/shared/jsonld/util.py test/jsonld/runner.py test/jsonld/test_context.py test/jsonld/test_onedotone.py passes

nicholascar added a commit that referenced this issue Jul 26, 2024
Co-authored-by: Ashley Sommer <[email protected]>
Co-authored-by: Nicholas Car <[email protected]>