A package named "", ||:symbol #63

Open
Gleefre opened this issue Jul 4, 2024 · 10 comments

Comments

@Gleefre

Gleefre commented Jul 4, 2024

The standard doesn't explicitly prohibit a package named "", which means that it is allowed.

It is not clear, however, what ||:symbol means, or whether :symbol may be interpreted as ||:symbol by defining "" as a global nickname for the "KEYWORD" package (as Allegro CL does).

A simple test (from https://plaster.tymoon.eu/view/4474):

(let ((pkg (make-package "" :use nil)))
  (export (intern "SYMBOL" pkg) pkg)
  (package-name (symbol-package (read-from-string "||:symbol"))))

; SBCL, CMUCL, CCL =>  ""
; ECL, CLASP, CLISP, MKCL, Corman Lisp, LispWorks, JSCL  =>  "KEYWORD"
; ACL >> A package named "" already exists
; ABCL >> java.lang.IndexOutOfBoundsException: fromIndex: 1 > toIndex: 0

Currently ACL interprets :symbol as ||:symbol, defining a global nickname "" for the keyword package to implement the special syntax for keywords.

SBCL, CMUCL and CCL interpret :symbol and ||:symbol differently, treating the second one as a symbol lookup in the package named "".

ECL, CLASP, CLISP, MKCL, Corman Lisp, LispWorks, JSCL interpret ||:symbol as :symbol - as a keyword.

It is not clear what ABCL does, since it just crashes trying to read ||:symbol.
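
For anyone who wants to check another implementation, here is a minimal sketch that wraps the test above in a classifier. The function name and the result keywords are my own labels, not anything from the standard, and I am assuming that a crash such as ABCL's surfaces as a condition of type ERROR that HANDLER-CASE can catch.

```lisp
;; Hypothetical helper: classify how this implementation reads ||:symbol.
;; The result keywords are illustrative labels only.
(defun classify-empty-name-read ()
  (handler-case
      (let* ((existing (find-package ""))
             (pkg (or existing (make-package "" :use '()))))
        (unwind-protect
             (if (eq pkg (find-package "KEYWORD"))
                 ;; "" already names KEYWORD (the behaviour reported for ACL).
                 :nickname-of-keyword
                 (progn
                   (export (intern "SYMBOL" pkg) pkg)
                   (let ((sym (let ((*readtable* (copy-readtable nil)))
                                (read-from-string "||:symbol"))))
                     (cond ((not (symbolp sym)) :other)
                           ((eq (symbol-package sym) pkg) :empty-package)
                           ((keywordp sym) :keyword)
                           (t :other)))))
          ;; Only delete the package if this function created it.
          (unless existing (delete-package pkg))))
    ;; Assumption: reader or package failures are signalled as ERROR.
    (error () :error)))
```

Calling (classify-empty-name-read) should return one of the labels above; which one depends on the implementation, as the table shows.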

@Gleefre
Author

Gleefre commented Jul 6, 2024

From a discussion on #commonlisp IRC channel:

  • Defining a new global nickname for the #:KEYWORD package, as Allegro CL does, is not standard-compliant.
    See CLHS 11.1.2 Figure 11-2 "Standardized Package Names"
    This figure lists all the nicknames of the #:KEYWORD package - none.

  • The reader algorithm (CLHS 2.2) seems to imply that multiple escape characters (namely #\|) are resolved before interpreting a token. That suggests that ||:xxxx is going to form a token :XXXX which then should be interpreted as a keyword as per CLHS 2.3.5.

  • That being said, it would be somewhat useful to define ||:xxxx as a special syntax for a symbol in a package named by an empty string, since there is no other way to read in such a symbol (without using #. syntax). More precisely, if a multiple escape character was used before the first package marker in a token, it should not be interpreted as a keyword symbol, but as a symbol in a package, possibly named by an empty string.

  • See this commit in SBCL's repo which added that feature. See also this comment in CCL's codebase. As we can see, historically MCL had this feature as well. Also note that CMUCL has accidentally included this feature when backporting float reader from SBCL, as can be seen in this commit.
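
To make the rule proposed in the third bullet concrete, here is a rough sketch of the decision it implies. It is not taken from SBCL or CCL; the function name and result keywords are hypothetical, and it works on a plain string with the standard syntax types for #\\, #\| and #\: hard-coded rather than consulting a readtable.

```lisp
;; Hypothetical sketch: decide how a token's text would be classified
;; if an escape before the first package marker suppresses the keyword
;; interpretation. Only the first unescaped colon is examined.
(defun classify-symbol-token (text)
  (let ((in-multiple-escape nil)   ; currently between two | characters?
        (escape-seen nil)          ; any escape character seen so far?
        (chars-before-marker 0)    ; token characters accumulated so far
        (i 0)
        (n (length text)))
    (loop
      (when (>= i n) (return :unqualified))
      (let ((ch (char text i)))
        (cond ((char= ch #\\)
               ;; Single escape: the next character becomes alphabetic.
               (setf escape-seen t)
               (incf chars-before-marker)
               (incf i 2))
              ((char= ch #\|)
               (setf escape-seen t
                     in-multiple-escape (not in-multiple-escape))
               (incf i))
              ((and (char= ch #\:) (not in-multiple-escape))
               ;; First unescaped package marker.
               (return (cond ((plusp chars-before-marker) :qualified)
                             (escape-seen :empty-package-name)
                             (t :keyword))))
              (t
               (incf chars-before-marker)
               (incf i)))))))
```

Under this rule (classify-symbol-token ":foo") gives :KEYWORD, while (classify-symbol-token "||:foo") gives :EMPTY-PACKAGE-NAME, which is the distinction SBCL, CMUCL and CCL appear to make.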

@Gleefre
Author

Gleefre commented Jul 6, 2024

Also note that not defining such an extension breaks print-read consistency, since a symbol in a package named "" can no longer be printed readably.

Such a symbol is currently printed either as ||::foo and ||:bar, or as ::foo and :bar. Without such an extension, all of these can only be read as keywords. Note also that ::foo and ||::foo have unspecified consequences as per CLHS 2.3.5.

See this paste for an example: https://plaster.tymoon.eu/view/4480
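
A small sketch of the round-trip check, in case the paste goes away. This is not the code from the paste; the function name is made up, and the result is implementation-dependent by design (T, NIL, or :ERROR if printing or reading signals an error).

```lisp
;; Hypothetical helper: does an internal symbol of the package named ""
;; survive PRIN1 followed by READ? Returns the answer and the printed text.
(defun empty-package-round-trip-p ()
  (handler-case
      (let* ((created (null (find-package "")))
             (pkg (or (find-package "") (make-package "" :use '())))
             (sym (intern "FOO" pkg)))
        (unwind-protect
             (with-standard-io-syntax
               ;; WITH-STANDARD-IO-SYNTAX binds *PRINT-READABLY* to T;
               ;; rebind it so the printer does not refuse outright.
               (let* ((*print-readably* nil)
                      (text (prin1-to-string sym)))
                 (values (eq sym (read-from-string text)) text)))
          ;; Only delete the package if this function created it.
          (when created (delete-package pkg))))
    (error () (values :error nil))))
```

Where "" is a nickname of KEYWORD the check trivially succeeds, since FOO is then just a keyword.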

@Gleefre
Author

Gleefre commented Jul 7, 2024

There is a similar problem with potential numbers (see CLHS 2.3.1.1.1).

The example of interest here is 5||, which is specifically said to be interpreted as a symbol. However, the reader algorithm implies that it should be read as a token "5", where the character #\5 has its usual syntactic qualities, and thus this token should be interpreted as a number.

Also, the following remark:

In each case, removing the escape character (or characters) would cause the token to be a potential number.

seems to imply that escape characters are actually part of the token, although it could just be a wording issue.

(The token in question is said to be a potential number, which means that a potential number is a token. It is also previously said that "a potential number cannot contain any escape characters", which implies that generally a token can contain escape characters, and thus escape characters are included in the token.)

P.S. The same can be said about ||. which reads as a symbol |.| and not a consing dot.
Or at least all implementations I can test it on seem to agree on that -- I wasn't able to find it being specifically mentioned in the hyperspec.
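
Here is a quick way to observe these cases. The helper name is hypothetical; it reads with the standard readtable and reports only whether the result is a symbol or a number. I have tried this kind of check only on the implementations mentioned in this thread, so treat the expectations as reported behaviour rather than a guarantee.

```lisp
;; Hypothetical helper: read TEXT with standard syntax and report
;; the broad kind of object that comes back.
(defun read-kind (text)
  (let ((object (let ((*readtable* (copy-readtable nil))
                      (*read-base* 10))
                  (read-from-string text))))
    (cond ((symbolp object) :symbol)
          ((numberp object) :number)
          (t :other))))

;; "5" is a number, while "5||" and "\\5" (that is, \5) are symbols
;; per CLHS 2.3.1.1.1. "||." is the case I could not find in the spec.
(mapcar #'read-kind '("5" "5||" "\\5" "||."))
```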

@informatimago

informatimago commented Jul 8, 2024

Nope. 5|| is explicitly rejected as syntax for potential numbers by https://www.lispworks.com/documentation/lw61/CLHS/Body/02_caaa.htm

This is to stay consistent with things like 5|x| or 5|2| which must both be interpreted as symbols.

Having an explicit rule overrides any default interpretation by the algorithm.

@Gleefre
Author

Gleefre commented Jul 8, 2024

I don't claim that 5|| is undefined -- it is indeed specifically said to be read as a symbol. However, the reason the spec gives for not interpreting it as a number does not apply to 5||:

An escape character robs the following character of all syntactic qualities, forcing it to be strictly alphabetic[2] and therefore unsuitable for use in a potential number.

This is not true for 5||, as the only character in the token "5" is not an escaped character and thus keeps all of its syntactic qualities.

What I do claim is that this means the reader algorithm is not correctly defined and thus needs to be changed.

@Gleefre
Author

Gleefre commented Jul 8, 2024

A few passages, including the one that I have already cited earlier from CLHS 2.3.1.1.1, suggest that escape characters should be part of the token. Here's another passage that supports that (this time from CLHS 2.3.3):

If a token consists solely of dots (with no escape characters), then <...>

@informatimago

A few passages, including the one that I have already cited earlier from CLHS 2.3.1.1.1, suggest that escape characters should be part of the token. Here's another passage that supports that (this time from CLHS 2.3.3):

If a token consists solely of dots (with no escape characters), then <...>

"As IF".

But indeed, tokens are not mere strings. A token must remember the character trait of each character. And indeed, to handle the 5|| rule, a token must also remember that it has occurrences of multiple-escapes even if they're empty, so it needs at least one flag in addition to the characters and traits. https://www.lispworks.com/documentation/lw61/CLHS/Body/02_adb.htm

Keeping the escapes themselves is not really useful and makes parsing the token more difficult, once the token type is determined from the traits.

For example, I rephrase 2.3.3 as "if a token consists only of dots with the character trait of "dot", then ...".

... -> only #\. characters with the character trait dot.
... -> only #\. characters, but one with the character trait alphabetic (the escaped one).

And yes, the reader algorithm is not formally specified down to these details, because there are various implementation choices possible to implement things like 2.3.3 or 2.3.1.1.1.

In the context of WSCL, the question is whether there is any ambiguity in the specification, not whether the specification allows various implementations (all behaving the same way).

So far I have not seen you demonstrate any ambiguity, i.e. strings that could be interpreted in different ways when processed by the Lisp reader as specified (applying all the rules).

@Gleefre
Author

Gleefre commented Jul 8, 2024

a token must also remember that it has occurrences of multiple-escapes

And that contradicts the hyperspec, as CLHS 2.2 Reader Algorithm explicitly specifies how tokens are constructed (the phrases used are "y is used to begin a token", "Y is appended to the token being built", etc.).

The very fact that one part of the spec (namely 2.3.1.1.1) contradicts another (2.2) already makes it a WSCL issue, since any "rephrasing" already creates ambiguity.

Note that the whole thing with numbers and the consing dot is only "supporting evidence", that is intended to provide context to the main issue -- interpretation of ||:xxxx and ||::xxxx.

I have not seen you demonstrate any ambiguity, i.e. strings that could be interpreted in different ways

By the way, the strings "||:xxxx" and "||::xxxx" are exactly what you are asking for. I believe I have already demonstrated that there are three different possible interpretations, based on different "rephrasings" of the contradicting parts of the specification.

@Gleefre
Author

Gleefre commented Jul 8, 2024

Keeping the escapes themselves is not really useful and makes parsing the token more difficult, once the token type is determined from the traits.

Just in case, by saying that escape characters are part of the token, I don't necessarily mean that they must be put directly into the token being accumulated. It would be enough to, for example, keep track of escaped "intervals" of the form [a, b), with "empty" pairs of multiple escape characters appending an "empty interval" of the form [x, x). [ note: AFAICT this is what Eclector does ]
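
To illustrate the interval idea, here is a rough sketch. It is not Eclector's code; the function name is hypothetical, it assumes the standard readtable case (:UPCASE), treats #\\ and #\| as the only escape characters, and silently ignores an unterminated multiple escape.

```lisp
;; Hypothetical sketch: accumulate a token from TEXT and record which
;; index ranges [start, end) of the token were escaped. An empty ||
;; pair contributes an empty interval (x . x).
(defun accumulate-token (text)
  (let ((chars '())
        (intervals '())
        (open nil)      ; start index of the current |...| region, if any
        (count 0)       ; number of characters in the token so far
        (i 0)
        (n (length text)))
    (loop while (< i n)
          do (let ((ch (char text i)))
               (cond ((char= ch #\\)
                      (when (< (1+ i) n)
                        ;; Outside |...| a single escape is its own interval.
                        (unless open
                          (push (cons count (1+ count)) intervals))
                        (push (char text (1+ i)) chars)
                        (incf count))
                      (incf i 2))
                     ((char= ch #\|)
                      (if open
                          (progn (push (cons open count) intervals)
                                 (setf open nil))
                          (setf open count))
                      (incf i))
                     (t
                      ;; Unescaped characters are upcased, escaped ones kept.
                      (push (if open ch (char-upcase ch)) chars)
                      (incf count)
                      (incf i)))))
    (values (coerce (nreverse chars) 'string)
            (nreverse intervals))))
```

With this bookkeeping, (accumulate-token "5") and (accumulate-token "5||") both yield the token "5", but the second also yields the interval list ((1 . 1)), which is enough to tell the two apart when interpreting the token.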

For example, I rephrase 2.3.3 as "if a token consists only of dots with the character trait of "dot", then ...".

This rephrasing would mean that ||. must be read as a consing dot, and not as a symbol. This is not the case for all implementations I tested it on -- sbcl, cmucl, ccl, allegro cl, ecl, clasp, abcl, clisp, mkcl, lispworks, corman cl, jscl.

@Gleefre
Author

Gleefre commented Jul 8, 2024

"As IF".

Well, there are many other examples. Here's what I have found so far (including previous ones for completeness):

CLHS 2.1.4 Character Syntax Types:

Constituent and escape characters are accumulated to make a token <...>

CLHS 2.3.1.1.1 Escape Characters and Potential Numbers:

A potential number cannot contain any escape characters.

<...> removing the escape character (or characters) would cause the token to be a potential number.

Note: potential number is a special kind of token (CLHS 2.3.1.1 Potential Numbers as Tokens):

A token is a potential number if <...>

CLHS 2.3.3 The Consing Dot:

If a token consists solely of dots (with no escape characters), then <...>

CLHS 2.4.1 Left-Parenthesis:

If a token that is just a dot not immediately preceded by an escape character is read after some object then

CLHS 2.4.8.4 Sharpsign Asterisk:

Neither a single escape nor a multiple escape is permitted in this token.
