Fix #377: Implemented support for adding the tokens to parsed nodes. #388

faubulous · 2024-07-03T15:15:56Z

I extended the factory methods namedNode, blankNode, literalNode and variable to also accept an optional token as additional parameter and pass it to the created Term instance.

Then I changed the Parser to provide the tokens as an additional argument when parsing a file.

The default exported DataFactory does, however, not pass the token to the factory methods and thus creates nodes without additional added tokens as a default behaviour. This means that the current behaviour is not changed and all existing tests pass.

Using the exported factory methods for one can create a custom DataFactory which can be passed to the parser as an argument using the existing API. This way the exported tokens can either be a) passed as they are or b) modified and stored with the nodes as neded.

…g a custom data factory.

RubenVerborgh

Thanks @faubulous, off to a good start.

Gist of my comments:

Let's make the second argument a context object { token } for extensibility.
Let's expose this factory as-is, with the second optional argument.
The base Term constructor should do the work of assigning context / token. (Preferably context actually; Term should not look inside context.)
Let's not use explicit undefined; only null or ''.

Open questions:

Reading mechanism for context/token?

RubenVerborgh · 2024-07-05T11:22:47Z

src/N3DataFactory.js

+  namedNode: iri => namedNode(iri),
+  blankNode: name => blankNode(name),
+  variable: name => variable(name),
+  literal: (name, datatype) => literal(name, datatype),


I'd be happy leaving these as they are actually. It doesn't hurt to have the second optional argument, and that way, we can expose it to other parsers.

If I understand you correctly, you want to pass the default constructor functions to the DataFactory as they are. This would mean, however, that the default behavior of the DataFactory changes to always creating Terms with an initialized { token } context. Which in turn will break some tests and is probably not needed in standard situations.

I was hesitant to put a condition into the Parser to check wether to pass the token/context argument to the factory or not because this would have an impact on the performance. The solution to simply ignore the second argument in the DataFactory seemed like a good compromise and allows the API users to define their own behavior regarding the context variable of a node, such as offsetting positions or adding additional data if needed.

So should I revert this to the default functions and have the context always initialized with a token?

the default behavior of the DataFactory changes to always creating Terms with an initialized { token } context.

That's fine; objects can always have optional fields.

Which in turn will break some tests

Those tests might be too strict; we can loosen them.

I was hesitant to put a condition into the Parser to check wether to pass the token/context argument to the factory

Indeed; let's definitely not.

The solution to simply ignore the second argument in the DataFactory

Aah okay. This actually shows me that we need 2 DataFactory classes then. One that discards the context, and one that keeps it. Let's make both first-class citizens.

RubenVerborgh · 2024-07-05T11:24:56Z

src/N3DataFactory.js

+  constructor(id, token) {
+    super(id);
+
+    this.token = token;
+  }
+


Suggested change

constructor(id, token) {

super(id);

this.token = token;

}

constructor(id, token = null) {

super(id);

this.token = token;

}

I'm nitting here, but I want things to be explicit. No undefined.

However, even better would be if the Term constructor simply took token; then we don't need any of this.

And let's not make it token, let's make it context, with let { token } = context.

That's OK. So will the default be context = { token = null } or context = null?

I am personally hesitant to initialize token with an empty string as this is a different datataype as the Token object and it might indicate that one should expect a non-empty string when it is initalized which is misleading.

But I'll do as you wish. Just let me know.

OK. Saw the answer in the comment below. Will go with DEFAULT_CONTEXT.

RubenVerborgh · 2024-07-05T11:25:37Z

src/N3DataFactory.js

@@ -92,7 +105,7 @@ export class Literal extends Term {

  // ### The datatype IRI of this literal
  get datatype() {
-    return new NamedNode(this.datatypeString);
+    return new NamedNode(this.datatypeString, this.token);


I'm hesitant to pass the token along here. I don't think we should.

RubenVerborgh · 2024-07-05T11:26:21Z

src/N3DataFactory.js

@@ -273,7 +291,8 @@ export function termToId(term, nested) {
 // ## Quad constructor
 export class Quad extends Term {
  constructor(subject, predicate, object, graph) {
-    super('');
+    super(undefined);


Never. Either null or '', but never explicitly undefined.

Alright, no problem.

However, I'm just curious to learn why. Am I right that you want the variable to be initialized with something so that it is visible on an API level? I have to admit that I am not that deep into JavaScript to know all the consequences. It was my reasoning that when passing undefined the variable might not occupy any memory as interpreters might just delete it and thus having the API behave as before.

It was my reasoning that when passing undefined the variable might not occupy any memory

That's not how JavaScript engines work today; the will still explicitly assign the field.

However, I'm just curious to learn why.

Mainly for explicitness. undefined is what happens when we don't pass anything, forget stuff, uninitialized, etc. null signifies a choice. Choices are good.

RubenVerborgh · 2024-07-05T11:26:51Z

src/N3Parser.js

@@ -194,7 +194,7 @@ export default class N3Parser {
    case '[':
      // Start a new quad with a new blank node as subject
      this._saveContext('blank', this._graph,
-                        this._subject = this._blankNode(), null, null);
+                        this._subject = this._blankNode(undefined, token), null, null);


not undefined

How should I initialize it then? Please advise.

null, or allow a constructor that makes the first argument optional, like this._blankNode({ token }). Then we can do a string test on the first argument.

RubenVerborgh · 2024-07-05T11:26:55Z

src/N3Parser.js

@@ -206,7 +206,7 @@ export default class N3Parser {
      if (!this._n3Mode)
        return this._error('Unexpected graph', token);
      this._saveContext('formula', this._graph,
-                        this._graph = this._blankNode(), null, null);
+                        this._graph = this._blankNode(undefined, token), null, null);


idem (for all below)

RubenVerborgh · 2024-07-05T11:32:27Z

The more I think about it, it should be:

constructor(whatever, context = DEFAULT_CONTEXT)
and
this.context = context;

with DEFAULT_CONTEXT a frozen object { token: null } that we recycle.

faubulous · 2024-07-05T20:51:06Z

Thank you for your quick review @RubenVerborgh! I implemented most of your suggested changes except exporting the DataFactory as is because this will change the default behavior of the API and break some unit tests. Please advise.

bergos · 2024-07-06T18:55:21Z

I only had a brief look as I am on vacation and traveling, so forgive me if I missed something.

It looks like you are adding arguments to the factory methods, which are defined in the RDF/JS spec. That may cause problems if someone else comes up with the same idea but with different properties. If we want to move forward in that direction, we must extend the specification to avoid any conflicts. It would be good if the spec would cover that topic, but it may take a bit longer and it's understandable if one want to move forward faster.

I would recommend another approach that doesn't require extending the spec. Just pass a flag to the constructor of the parser, and if it's set, set the properties after creating the Term instance with the factory like this:

const term = factory.namedNode(iri)
term.source = { url, line }

Also, please have a look at the additional properties page in the wiki of the spec. I requested creating that page with source maps or DOM element linking in mind. It would be great if we could define a reusable standard structure for it. I have something like this in mind with all properties optional:

term.source = {
  url, // file or http URL
  line, // line number
  position, // start position of the token
  text, // text of the token
  element // DOM element for RDFa
}

I'm back in the week of 22nd of July. It would be great if we could keep the discussion open till the end of that week.

RubenVerborgh · 2024-07-07T12:19:58Z

@bergos

That may cause problems if someone else comes up with the same idea but with different properties.

Indeed, and thanks for looking at this.

That's the main reason why I am proposing an object as an optional second argument, so { token } rather than directly token or even a string. This should give us the extensibility we're looking for.

Just pass a flag to the constructor of the parser, and if it's set, set the properties after creating the Term instance

I could also agree with this; just not a big fan of external classes managing the internal state of another class. Plus, some factories could create frozen objects etc… I think the first approach, with an optional object, is actually the safest all things considered.

Also, please have a look at the additional properties page in the wiki of the spec. I requested creating that page with source maps or DOM element linking in mind. It would be great if we could define a reusable standard structure for it.

+1 to this

I'm back in the week of 22nd of July. It would be great if we could keep the discussion open till the end of that week.

We'll wait for you, mate. Would be good to have an RDF/JS issue open to track this then. Cheers!

bergos · 2024-07-23T23:06:13Z

@RubenVerborgh

That's the main reason why I am proposing an object as an optional second argument, so { token } rather than directly token or even a string. This should give us the extensibility we're looking for.

I agree with that. If we extend the interface, we will almost certainly use an additional object argument.

I dug a bit into the code. The debugger showed me that the token object has the following properties: type, value, prefix, line, start, end. That makes sense for the parser as an internal object, but as an additional property to the term object, I prefer something more self-describing for that context.

My proposal:

token -> source or origin
type -> textType or tokenType
value -> text or token
prefix -> keep
line -> keep
start -> keep
end -> keep

RubenVerborgh · 2024-07-24T12:09:00Z

Thanks @bergos!

Based on your comment, may I suggest to @faubulous:

source
tokenType
token
prefix
line
start
end

We should still retain backward compatibility with those who expect value/type to be there.

faubulous · 2024-08-22T12:23:21Z

Dear @RubenVerborgh, could you please summarise what changes you expect me to make? From the comments I gather that I need to refactor the token argument into an object of conforming to this interface:

{
  source,
  tokenType,
  token,
  prefix,
  line,
  start,
  end,
}

Is this correct?

RubenVerborgh · 2024-08-23T13:58:24Z

@faubulous That's the one!

bergos · 2024-08-23T19:08:14Z

@faubulous @RubenVerborgh I expected source to be a property with an object value that contains all other properties like this:

{
  source: {
    tokenType,
    token,
    prefix,
    line,
    start,
    end
  }
}

RubenVerborgh · 2024-08-23T21:38:43Z

@bergos Doesn't fully make sense to me; some of those are not source attributes. Perhaps:

{
  tokenType,
  token,
  prefix,
  source: {
    line,
    start,
    end,
  },
}

bergos · 2024-08-24T09:54:27Z

@RubenVerborgh All of them are source attributes, as a different serialization may generate different tokens, and prefixes are also defined as part of the serialization.

Maybe a short overview of how I think additional attributes could be used in the future makes it easier to understand my view:

Terms and Quads have a lifecycle. They are generated by a source, they can be shown somewhere and bound to UI components, and they can be written to a target, which can be another intermediate data model object or the final serialization.

An example UI application could be a Turtle to Turtle mapper with a Quad table component that lists the complete content of the source document. A prefix table component can be used to define the prefixes of the target document. On a click on Quad of the table component, the source and target tokens are highlighted. Behind the scenes, the parser attaches a source property to all Term objects. The Quad table component attaches a binding property to all Term objects. That property refers to the DOM cells in the table. The serializer attaches a target property. It contains the same kind of properties as the parser generates.

Using only the Term and Quad objects for this use case may not be optimal. Additional data structures like maps are required for performance reasons. But that may vary based on the use case. The important point here is: how could additional properties be used? My goal would be to get some kind of standard for the properties. It should be possible to use them with different libraries. Therefore, we should have a look at the whole lifecycle of data model objects.

RubenVerborgh · 2024-08-24T10:40:48Z

I'll leave the decision to others.

faubulous · 2024-08-27T14:12:25Z

@bergos I've implemented most of the requested changes. However, I want to make sure that I got it right. The context object should be structured like this:

{
  source: {
    tokenType, // token.type
    token, // token.value
    prefix, // token.prefix
    line, // token.line
    start, // token.start
    end // token.end
  }
}

RubenVerborgh · 2024-08-27T14:28:43Z

I'm very unsure now why we have an object with a single property wrapping an object with all the other properties.
We're now creating twice the objects, and object creation is expensive. I think this will cost us in performance, so wouldn't merge before we have assessed that.

faubulous · 2024-08-27T14:56:40Z

Agree. I would personally prefer the original proposal of: { source: token }

Where the token object is provided as-is in a wrapped context. We could also rename token to source, but I would not like to map the token to another format as this is expensive. At least we would spare ourselves the transformation step for the original token object.

However, this also requires the Parser to provide { source: token } to the namedNode factory method every time. This is a performance overhead even if we do not pass the object along to the term in the default DataFactory. This is why I originally only provided the token object without modification as this is the fastest way to do it. Then the custom DataFactory can take care of the transformation and the associated performance overhead.. the default factory would simply ignore the parameter.

With this approach, every application that needs another information / structure can provide a custom DataFactory to implement all required transformations or mappings like this:

new Parser({
    factory: {
       namedNode: (iri, token) => namedNode(iri, { source: { ... }, target: { ... } }),
       ....
    }
  });

faubulous added 2 commits July 3, 2024 15:52

Implemented support for adding the parser tokens to parsed nodes usin…

7a872f8

…g a custom data factory.

Added unit test for token enabled DataFactory.

1e175f7

jeswr self-requested a review July 5, 2024 02:58

RubenVerborgh self-requested a review July 5, 2024 11:27

RubenVerborgh reviewed Jul 5, 2024

View reviewed changes

RubenVerborgh assigned faubulous Jul 5, 2024

RubenVerborgh added feature-request semver.minor labels Jul 5, 2024

Implemented some changes requested by @RubenVerborgh.

1409dd1

RubenVerborgh marked this pull request as draft July 7, 2024 12:20

RubenVerborgh requested a review from bergos July 7, 2024 12:20

RubenVerborgh added the input-needed label Jul 7, 2024

RubenVerborgh requested review from RubenVerborgh and removed request for RubenVerborgh August 24, 2024 10:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #377: Implemented support for adding the tokens to parsed nodes. #388

Fix #377: Implemented support for adding the tokens to parsed nodes. #388

faubulous commented Jul 3, 2024

RubenVerborgh left a comment

RubenVerborgh Jul 5, 2024

faubulous Jul 5, 2024 •

edited

Loading

RubenVerborgh Jul 7, 2024

RubenVerborgh Jul 5, 2024

faubulous Jul 5, 2024

faubulous Jul 5, 2024 •

edited

Loading

RubenVerborgh Jul 5, 2024

faubulous Jul 5, 2024

RubenVerborgh Jul 5, 2024

faubulous Jul 5, 2024

RubenVerborgh Jul 7, 2024

RubenVerborgh Jul 5, 2024

faubulous Jul 5, 2024

RubenVerborgh Jul 7, 2024

RubenVerborgh Jul 5, 2024

RubenVerborgh commented Jul 5, 2024

faubulous commented Jul 5, 2024

bergos commented Jul 6, 2024

RubenVerborgh commented Jul 7, 2024 •

edited

Loading

bergos commented Jul 23, 2024

RubenVerborgh commented Jul 24, 2024 •

edited

Loading

faubulous commented Aug 22, 2024

RubenVerborgh commented Aug 23, 2024

bergos commented Aug 23, 2024

RubenVerborgh commented Aug 23, 2024

bergos commented Aug 24, 2024

RubenVerborgh commented Aug 24, 2024

faubulous commented Aug 27, 2024 •

edited

Loading

RubenVerborgh commented Aug 27, 2024

faubulous commented Aug 27, 2024

Fix #377: Implemented support for adding the tokens to parsed nodes. #388

Are you sure you want to change the base?

Fix #377: Implemented support for adding the tokens to parsed nodes. #388

Conversation

faubulous commented Jul 3, 2024

RubenVerborgh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

faubulous Jul 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

faubulous Jul 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RubenVerborgh commented Jul 5, 2024

faubulous commented Jul 5, 2024

bergos commented Jul 6, 2024

RubenVerborgh commented Jul 7, 2024 • edited Loading

bergos commented Jul 23, 2024

RubenVerborgh commented Jul 24, 2024 • edited Loading

faubulous commented Aug 22, 2024

RubenVerborgh commented Aug 23, 2024

bergos commented Aug 23, 2024

RubenVerborgh commented Aug 23, 2024

bergos commented Aug 24, 2024

RubenVerborgh commented Aug 24, 2024

faubulous commented Aug 27, 2024 • edited Loading

RubenVerborgh commented Aug 27, 2024

faubulous commented Aug 27, 2024

faubulous Jul 5, 2024 •

edited

Loading

faubulous Jul 5, 2024 •

edited

Loading

RubenVerborgh commented Jul 7, 2024 •

edited

Loading

RubenVerborgh commented Jul 24, 2024 •

edited

Loading

faubulous commented Aug 27, 2024 •

edited

Loading