Separating the C and CN parsing #342

kmemarian · 2024-06-22T22:52:55Z

This creates a separate lexer and parser for CN.

There are a few questions for @cp526 and @dc-mak marked with TODO(Christopher/Dhruv) in the code.

And there is a hack for dealing with the parsing of C type-names discussed in the commit message of c96a1ad.

This is still a work in progress: the CN parser doesn't produce pretty error messages yet, so some CI and tutorial tests fail.

I'm keeping `c_parser.mly` and `c_lexer.mll` unchanged for now to avoid destroying `c_parser_error.messages` and to keep CN in a working state.


This does not duplicate the C grammar. Instead, when the CN lexer sees a matching `<` ... `>` pair, it attempts to parse a C type-name between the angle brackets (using a new entry point to the C parser) and produces a single `LT_CTYPE_GT` token for the CN parser. This may fail, because the opening `<` may in fact be a relational operator. In that case the CN lexer recovers by re-lexing from the top.

This first attempt avoids changing the existing CN syntax. If we instead change the syntax to have dedicated delimiters around C type-names, we won't need the re-lexing recovery.

Most of the CN CI tests now work (the failures are caused by the current lack of pretty error messages in the CN parser), but this needs proper testing...
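The recovery strategy described above can be sketched as a small self-contained OCaml model. This is only an illustration, not the code in `cn_lexer.mll`: the `token` type, `parse_type_name` (a stand-in for the new C parser entry point that accepts just a handful of spellings), and the hand-written `lex` function are all hypothetical.

```ocaml
(* Toy model of "try a C type-name, else fall back to LT and re-lex".
   Every name here is an assumption made for illustration. *)
type token =
  | LT
  | GT
  | IDENT of string
  | LT_CTYPE_GT of string (* stands in for a parsed C type-name *)

(* Hypothetical stand-in for the C parser's type-name entry point:
   it accepts a few fixed spellings and raises Failure otherwise. *)
let parse_type_name (s : string) : string =
  match String.trim s with
  | ("int" | "char" | "unsigned int" | "struct node") as ty -> ty
  | _ -> failwith "not a C type-name"

let is_ident_char = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' -> true
  | _ -> false

let rec lex (s : string) (i : int) : token list =
  let n = String.length s in
  if i >= n then []
  else
    match s.[i] with
    | ' ' -> lex s (i + 1)
    | '>' -> GT :: lex s (i + 1)
    | '<' -> (
        (* Try to read "<...>" as a type-name; None means "give up". *)
        let attempt =
          match String.index_from_opt s (i + 1) '>' with
          | Some j when j > i + 1 ->
              let inner = String.sub s (i + 1) (j - i - 1) in
              if String.contains inner '<' then None
              else (
                try Some (parse_type_name inner, j + 1)
                with Failure _ -> None)
          | _ -> None
        in
        match attempt with
        | Some (ty, next) -> LT_CTYPE_GT ty :: lex s next
        (* Recovery: emit a plain LT and re-lex the text after it. *)
        | None -> LT :: lex s (i + 1))
    | c when is_ident_char c ->
        let j = ref i in
        while !j < n && is_ident_char s.[!j] do incr j done;
        IDENT (String.sub s i (!j - i)) :: lex s !j
    | c -> failwith (Printf.sprintf "unexpected character %c" c)
```

With this model, `lex "Owned<struct node>" 0` yields a single `LT_CTYPE_GT` token after the identifier, while `lex "a < b > c" 0` falls back to separate `LT` and `GT` tokens because `b` is not accepted as a type-name.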

cp526 · 2024-06-24T10:54:53Z

Thanks very much!

AFAICS CN_bool and cn_function can go away. Not clear to me at the moment whether boolean might have a use if there's a clash of names between C (integer) booleans and CN (proper) booleans.

cp526 · 2024-06-24T10:59:56Z

Regarding the question of parsing C-types within CN: maybe the lexer hack is less problematic than it seems. Only a small set of language constructs can take C-types (Owned, array_shift, and maybe nothing else?), and those all require parentheses following the <ctype> part for the remaining arguments. I wonder whether that makes the syntax unambiguous wrt the C-type angle-bracket notation, so there's never a need to re-lex?

dc-mak

Lgtm, though this ctype issue is gonna be annoying.

dc-mak · 2024-06-24T11:02:46Z

parsers/c/cn_lexer.mll

+let hexadecimal_constant = hexadecimal_prefix hexadecimal_digit+
+
+(* C23 binary constant, but omitting ' separators for now *)
+(* TODO(Christopher/Dhruv): add support for ' separators to be in line with C23 ? *)


Yes please! #337
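For reference, a minimal sketch of what accepting `'` separators in a binary constant could look like, written as a plain OCaml function rather than as an ocamllex rule. The function name `parse_binary_constant` is hypothetical. It assumes the C23 rule that a separator may only appear between two digits, and it ignores integer overflow.

```ocaml
(* Hypothetical helper: parse a binary constant such as 0b1010'1100.
   A ' is accepted only directly after a digit and must be followed by
   another digit; anything else yields None. *)
let parse_binary_constant (s : string) : int option =
  let n = String.length s in
  if n < 3 || s.[0] <> '0' || (s.[1] <> 'b' && s.[1] <> 'B') then None
  else
    let rec go i acc prev_digit =
      if i = n then (if prev_digit then Some acc else None)
      else
        match s.[i] with
        | '0' -> go (i + 1) (acc * 2) true
        | '1' -> go (i + 1) ((acc * 2) + 1) true
        | '\'' when prev_digit -> go (i + 1) acc false
        | _ -> None
    in
    go 2 0 false
```

Inputs with a separator directly after the prefix, doubled separators, or a trailing separator are rejected.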

dc-mak · 2024-06-24T11:04:50Z

parsers/c/cn_lexer.mll

+(* TODO(Christopher/Dhruv): do you care about following C11 closely for this
+    (e.g. with respect to the universal character stuff), or should we simplify
+    like we already do for uppercase names. *)


I vote simplify - we can always add it later if requested.

dc-mak · 2024-06-24T11:13:26Z

parsers/c/cn_lexer.mll

+  | '<' ([^'='][^'<' '>']+ as str) '>'
+      {
+        try
+          LT_CTYPE_GT (C_parser.type_name_eof C_lexer.lexer (Lexing.from_string str))
+        with
+          | _ ->
+              relexbuf_opt := Some (Lexing.from_string (str ^ ">"));
+              LT
+      }
+


This is an amazing hack. But I do worry it might be a bit too hacky. Can it be more conservative? Specifically, I wanted to have <- available as a token. We do need to discuss this properly though @cp526

The idea of ...<...> was to have a C++-like syntax for type parameters, which would be good to keep if we can. What's the use of <-?

I think types always start with a letter (or maybe _), so you could probably change it to only try for a type if the < is followed by one of these. I think that should allow you to have <- later if you wanted it.

Otoh, I don't think this would work if you ever wanted types that have instantiations in them, but we probably don't have those at the moment.
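The more conservative check suggested above can be sketched as a small predicate. The name `starts_like_type_name` is hypothetical, and the sketch assumes no whitespace between `<` and the type-name.

```ocaml
(* Hypothetical helper: given the text right after '<', decide whether it
   could begin a C type-name, i.e. whether it starts with a letter or '_'.
   Anything else (such as '-' or '=') leaves '<' as an ordinary token. *)
let starts_like_type_name (rest : string) : bool =
  String.length rest > 0
  &&
  match rest.[0] with
  | 'a' .. 'z' | 'A' .. 'Z' | '_' -> true
  | _ -> false
```

Under this check `<-` and `<=` never trigger the type-name attempt. An input such as `a <b> c` still does, so the fallback to `LT` remains necessary.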

dc-mak · 2024-06-24T11:15:07Z

parsers/c/cn_lexer.mll

+  "boolean", BOOL; (* TODO(Christopher/Dhruv): are all these variants of BOOL needed? *)
+  "CN_bool", BOOL; (* TODO(Christopher/Dhruv): are all these variants of BOOL needed? *)


These two can go.

dc-mak · 2024-06-24T11:17:00Z

> Thanks very much!
>
> AFAICS CN_bool and cn_function can go away. Not clear to me at the moment whether boolean might have a use if there's a clash of names between C (integer) booleans and CN (proper) booleans.

We're still using cn_function for constants! https://rems-project.github.io/cn-tutorial/#_constants

cp526 · 2024-06-24T11:56:47Z

> > Thanks very much!
> >
> > AFAICS CN_bool and cn_function can go away. Not clear to me at the moment whether boolean might have a use if there's a clash of names between C (integer) booleans and CN (proper) booleans.
>
> We're still using cn_function for constants! https://rems-project.github.io/cn-tutorial/#_constants

Yes, that's true.

dc-mak · 2024-06-24T12:26:17Z

parsers/c/cn_lexer.mll

+  "bool", BOOL; (* shared with C23 *)
+  "boolean", BOOL; (* TODO(Christopher/Dhruv): are all these variants of BOOL needed? *)
+  "CN_bool", BOOL; (* TODO(Christopher/Dhruv): are all these variants of BOOL needed? *)
+  "cn_function", FUNCTION; (* TODO(Christopher/Dhruv): is this variant still needed? *)


This stays.

kmemarian added 7 commits June 22, 2024 20:32

Adding a CN lexer and a copy of the CN grammar.

7b20666

I'm keeping `c_parser.mly` and `c_lexer.mll` unchanged for now to avoid destroying `c_parser_error.messages` and to keep CN in a working state.

Adding a separate module for CN's tokens (and using it in the new Lex…

c0c2dd0

…er/Parser)

CN: renaming some tokens, the CN_ prefixes are no longer needed

579220b

CN: polishing token declarations in the parser

44e5a05

CN: using the new lexer/parser -- BREAKING CN (no parsing of ctypes) …

58ce851

…[skip ci]

Removing CN parts from the C lexer

e6d2c3b

kmemarian added the cn label Jun 22, 2024

kmemarian self-assigned this Jun 22, 2024

kmemarian mentioned this pull request Jun 22, 2024

[CN] Feature: add logical implication to the spec language #329

Closed

dc-mak reviewed Jun 24, 2024


dc-mak mentioned this pull request Jul 11, 2024

[CN] Allow ' in numeric constants #337

Open

yav mentioned this pull request Aug 8, 2024

Pp imrovements #470

Merged
