Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load html #347

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

Load html #347

wants to merge 18 commits into from

Conversation

gkellogg
Copy link
Collaborator

Load JSON-LD from HTML documents.

  • Adds options parameter to documentLoader
  • Uses xmldom, if loaded.
  • Adds util.ParseContentTypeHeader
  • Adds documentLoader implementations for xhr and node (still requires tests).
  • Set default for expandAllScripts to true for flatten and toRdf.

@gkellogg
Copy link
Collaborator Author

Some tests still required for XHR and Node documentLoaders.

Some work needed to optionally append text/html (and application/xhtml+xml) to accept headers if we can parser HTML.

Perhaps some abstracted parsing support to handle other HTML parsers?

Otherwise, please see if you agree with the general direction of this work.

@@ -865,7 +871,10 @@ jsonld.get = async function(url, options) {
load = jsonld.documentLoader;
}

const remoteDoc = await load(url);
// FIXME: unescape frag?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is necessary; it's not to pass tests.

continue;
}
try {
remoteDoc.document.push(JSON.parse(script.textContent));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that textContent always decodes entities, so there are tests we can't pass. Any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWLIW, I threw together a JSBin that uses example 10 to test parsing via jsonld.js 2.0.2:
https://jsbin.com/rewaxiquki/edit?html,console

There aren't any entity decoding issues, so I think this issue is probably a bug in xmldom's implementation.

Here's another one which works using jsdom:
https://runkit.com/embed/9qrv8dl5g2bs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I use htmlparser2 in my implementation, and it doesn't seem to be decoding entities, so it allows all tests to pass.
So I also think that this is a bug/feature in xmldom.

lib/jsonld.js Outdated
@@ -934,6 +995,20 @@ jsonld.documentLoaders = {};
jsonld.documentLoaders.node = require('./documentLoaders/node');
jsonld.documentLoaders.xhr = require('./documentLoaders/xhr');

// Optional DOM parser
try {
jsonld.domParser = require('xmldom').DOMParser;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will support xmldom, if it's loaded. Are there other parsers that should be included? How to configure and test?

@@ -15,6 +15,7 @@ const REGEX_LINK_HEADER = /\s*<([^>]*?)>\s*(?:;\s*(.*))?/;
const REGEX_LINK_HEADER_PARAMS =
/(.*?)=(?:(?:"([^"]*?)")|([^"]*?))\s*(?:(?:;\s*)|$)/g;

// FIXME: conditinally support text/html
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With HTML support, we should include text/html and application/ld+json, but not sure best way to do that. It will also mess up some documentLoader tests?

@@ -30,7 +34,7 @@ module.exports = ({
const queue = new RequestQueue();
return queue.wrapLoader(loader);

async function loader(url) {
async function loader(url, options) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There don't seem to be standalone XHR documentLoader tests, as there are node tests.

@gkellogg
Copy link
Collaborator Author

Perhaps the webpack build is failing because it doesn't include xmldom? Should there be a way to condition tests on this? How?

@davidlehn davidlehn added this to the JSON-LD 1.1 milestone Dec 31, 2019
@gkellogg gkellogg marked this pull request as ready for review January 5, 2020 21:46
lib/documentLoaders/node.js Outdated Show resolved Hide resolved
lib/documentLoaders/xhr.js Outdated Show resolved Hide resolved
lib/documentLoaders/xhr.js Outdated Show resolved Hide resolved
lib/jsonld.js Outdated Show resolved Hide resolved
lib/jsonld.js Outdated Show resolved Hide resolved
lib/jsonld.js Outdated Show resolved Hide resolved
@@ -934,6 +995,27 @@ jsonld.documentLoaders = {};
jsonld.documentLoaders.node = require('./documentLoaders/node');
jsonld.documentLoaders.xhr = require('./documentLoaders/xhr');

// Optional DOM parser
try {
jsonld.domParser = require('xmldom').DOMParser || class NoDOMParser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidlehn -- can you comment here? I don't think we can easily support this pattern with webpack. Can you suggest an alternative path forward? Instead of a require here, the user may need to have installed another package themselves that registered a DOM parser with jsonld in a similar way we do with RDF parsers. If so -- we should copy that pattern since it's already used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: PR #341 removes xmldom as a dependency.

This PR here (#347) should be updated to take that into account (specifically line 1022 here).

@dlongley
Copy link
Member

dlongley commented Jan 6, 2020

Thanks @gkellogg! This is great. A few minor changes needed plus one architectural thing we need some feedback from @davidlehn on (regarding how to pull in a dom parser).

@gkellogg gkellogg requested a review from dlongley January 6, 2020 20:23
@gkellogg gkellogg force-pushed the load-html branch 2 times, most recently from c1c0221 to d1c1290 Compare January 26, 2020 20:59
@gkellogg gkellogg force-pushed the load-html branch 2 times, most recently from b0f158d to a45ef6d Compare January 29, 2020 01:01
lib/util.js Outdated Show resolved Hide resolved
@gkellogg
Copy link
Collaborator Author

gkellogg commented Mar 3, 2020

@davidlehn are we waiting for something else to merge this?

lib/jsonld.js Outdated Show resolved Hide resolved
@davidlehn
Copy link
Member

@gkellogg I was waiting on me to look into how to improve the way the xmldom lib is used. Otherwise I think patch is good. As Dave's comment mentioned, might want some sort of registered parser pattern. Need to check how the current code works with bundlers and such.

gkellogg and others added 18 commits April 9, 2020 14:58
* Adds options parameter to documentLoader
* Uses xmldom, if loaded.
* Adds util.ParseContentTypeHeader
* Adds documentLoader implementations for xhr and node (still requires tests).
Skip HTML tests if there is  no DOMParser, or loading the module raises an exception.
Allows Karma tests to pass.
Co-Authored-By: David I. Lehn <[email protected]>
- Updated tests switch from "invalid script element" to "loading
  document failed".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants