Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cabal exact print #65

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

jappeace
Copy link

@jappeace jappeace commented Jul 17, 2024

@Ericson2314
Copy link
Contributor

  1. Does GenericPackageDescription have enough information? I'm very out of the loop on this, but I feared it might not. (Feel free to just link me something that already exists answering this.)

  2. ExactPrintMeta is a nasty way to encode this information, because it makes it very hard to modify the AST while preserving comments (other edits would invalidate stuff in the map --- unless you have cool ideas I am missing). Maybe it is good intermediate step, but I highly doubt its what we want to end up with. This choice should be justified.

From a quick glance at the mega-issue, the approach of using Distribution.Fields does seem more promising. It would be good to discuss why that approach is not taking if you don't want to do it.

@jmct
Copy link
Contributor

jmct commented Jul 19, 2024

@andreabedini, it would be useful to have your thoughts on this as someone that's considered this stuff deeply before.

@gbaz
Copy link
Collaborator

gbaz commented Jul 19, 2024

Having reviewed everything I see there's already a lot of work in this draft PR: https://github.com/haskell/cabal/pull/9436/files

It would be good to clarify if this proposal is to continue to extend that work, or to do something else inspired by it, or something else. It looks like the first to me!

If so, it would be good to clarify A) what remains to be done given that work already existing, and B) if this approach has been tested in terms of what happens if we parse, modify, and then exact-print. Are there examples of this -- and where in that code is the algorithm that sort of "merges" the changes back into the exact-printed structure?

Reviewing that PR I think I recall that the code looked very promising, and answers a lot of questions that the proposal itself isn't in-depth enough to get to.

@jappeace
Copy link
Author

@gbaz
sorry I forgot to directly link, but that's the PR indeed, I added it to the proposal.

B) if this approach has been tested in terms of what happens if we parse, modify, and then exact-print. Are there examples of this

There are no tests for this right now, we should add tests for this, please list which modifications are important and I'll add an automated test for each of them.

and where in that code is the algorithm that sort of "merges" the changes back into the exact-printed structure?

The current implementation will "improvise" and fall back to the pretty printer.
I think what it should do is keep track of improvisations and relatively shift subsequent exact occurrences.

it does something like this right now but it's kind of adhoc:
https://github.com/haskell/cabal/pull/9436/files#diff-53cbe1fb815e26b11060bd5c78c372013a079192b8acad5c1c6044ecc647dcefR160

and it isn't quite right, because it'll find negative relative rows and do nothing with that information, but it indicates those improvisations have occurred: https://github.com/haskell/cabal/pull/9436/files#diff-53cbe1fb815e26b11060bd5c78c372013a079192b8acad5c1c6044ecc647dcefR137

For comment placement you can do the same relative shifts per row, but for columns you may need to keep right shifting them untill you find place (under modification)

@jappeace
Copy link
Author

@Ericson2314

Hi eric,

Does GenericPackageDescription have enough information? I'm very out of the loop on this, but I feared it might not. (Feel free to just link me something that already exists answering this.)

as far as I can tell it's the root type used for dealing with .cabal, for example it's used in readSourcePackageCabalFile, and they benchmark for the parseGenericPackage function:
https://github.com/haskell/cabal/blob/705b6ebcaed649ed1a30b138b4348d24722d610a/cabal-benchmarks/bench/CabalBenchmarks.hs#L22
up till now I've added some fields for things that aren't used by cabal (for example, exact positions and comments), but I think the main cabal library uses that type as well to do most things.

ExactPrintMeta is a nasty way to encode this information, because it makes it very hard to modify the AST while preserving comments

I'm open to suggestions! Comments specifically are underdeveloped in its current form because it was the zurich hack prototype which I barely got working 😄

(other edits would invalidate stuff in the map

Invalidation works kinda great for exact positions themselves (eg for fields and sections).

Perhaps using a map wasn't the best idea for comments,
it should be something like Map Ycoordinate (XCoordinate, Text).
y can be relatively calculated like I mentioned above (keeping track of "improvisations"),
and x you try to place on a row, if there is no space you just keep shifting right until you find space.
That way you can modify GenericPackageDescription and preserve the comment.

@gbaz
Copy link
Collaborator

gbaz commented Jul 23, 2024

Thanks for the updates and comments. On the thing you added with remaining work, I think a proposal we accept would need some sense of how we can architect thing to solve those problems, since they're pretty fundamental:

Redo how common stanza's are handled (they're currently "merged" into sections directly, which is unrecoverable).
add support for comma printing,
add support for braces.
add support for conditional branches.

Specifically, I think that adding support for common stanzas and conditionals is going to be both very tricky and very important. Meanwhile, adding support for braces can be out of scope -- i.e. if braces are translated away, I think that's livable -- they're a basically unused feature. And commas should be straightforward.

@andreabedini
Copy link

It is great to see you pushing forward with this @jappeace. I know you have invested significant time on it already and we should not let that effort go unnoticed. I also liked working with you on this at ZuriHac, if only that weekend had been a few days longer!

With respect to the proposal. Thanks for mentioning my Field-based proposal, but I think you misunderstand it.

A related effort is to build combinators that allow modifyng the Field type directly. This would depracate the GenericPackage structure and make an alternative structure available. [..] I suppose the idea is to completly replace GenericPackageDescription with this Field type.

Replacing GenericPackageDescription (GPD) has never been on the cards. My approach, like yours, was to make the minimal change that is a step forward. Therefore what I have been working on was intentionally ... anticlimactic: forget about the exact-print of GPD and start with the exact-print of Field. The reasoning is that it is Field that actually describes the syntax cabal files (and other files) are written in.

A value of type Field a (where a is a type parameter used to annotate parsed values) does correspond 1-to-1 to what is written in a cabal file, with the non-trivial caveat that comments and the whitespaces have already been stripped by the lexer. GPD, ProjectConfig and InstalledPackageInfo get all "parsed" from this lower-level Field type.

To make a long story short: my plan was to get the lexer to keep whitespace and comments so I could exactprint Field. Changing the lexer was easy but the lexer and the Field parser are interlevead and I never managed to adjust the parser side. I thought at ZuriHac you did succeed in fixing the parser side? You might have already integrated the two approaches in your PR, I will need to check.

In any case, the two approaches are not incompatible or in alternative. I believe you must have gone past the obstacles I have encountered, which now leaves me to decide what to do with my efforts :-)

Unlike the Field effort, this proposal only focusess only on getting exact printing to work with a minimal footprint. We don't want to do any additional refactoring.

Maybe I misunderstand the plan but I am not sure whether GPD can even be mapped to the concrete grammar. @gbaz rightly mentions the common stanzas, which get inlined directly during the Field to GPD parsing. How do we keep the knowledge of common stanzas in GPD without a significant refactoring?1

There are some constraints I think we should keep in mind.

  1. I believe whatever framework of functionality should be part of Cabal-syntax.
  2. The API should be well integrated with the rest (where it makes sense. Cabal-syntax is a bit of a kitchen sink already).
  3. Ideally the set of parseable cabal files should not change, although pragmatically I would say it's ok to only check those on Hackage.
  4. Operations like cabal update and cabal build --dry-run should not get slower.
  5. The memory demand on workloads like hackage-server should not increase.

All this might well within reach, you just need to prove it :P

Footnotes

  1. I do not mean to give a negative connotation to "significant refactoring", likely it would be beneficial. Nevertheless I think it is unavoidable.

@jappeace
Copy link
Author

jappeace commented Aug 7, 2024

@andreabedini

In any case, the two approaches are not incompatible or in alternative. I believe you must have gone past the obstacles I have encountered, which now leaves me to decide what to do with my efforts :-)

I strongly suspect this approach I'm taking will work, but I don't wish to discourage you.

To make a long story short: my plan was to get the lexer to keep whitespace and comments so I could exactprint Field.

Could you elaborate on how this deals with changes made? For example, if hls wants to add a dependency to a library.

I thought at ZuriHac you did succeed in fixing the parser side? You might have already integrated the two approaches in your PR, I will need to check.

yes it passes the tests. Like mentioned in the proposal, it's not complete however, not all comment positions are captured.
I pushed all changes I made on zurich hack to pr linked in this proposal, for example:

https://github.com/haskell/cabal/pull/9436/files#diff-cd54b0e388dc16fd5e820e41341f642c3a526d143a6768014478e56984a92751L138

I believe whatever framework of functionality should be part of Cabal-syntax.

I'm not familiar with the cabal package structure, why is this desired?
I think the current implementation already abides by this.

The API should be well integrated with the rest (where it makes sense. Cabal-syntax is a bit of a kitchen sink already).

Could you elaborate on this?

Ideally the set of parseable cabal files should not change, although pragmatically I would say it's ok to only check those on Hackage.

I already mentioned this I think.
https://github.com/haskellfoundation/tech-proposals/pull/65/files#diff-e5972723ec21675420683d52a3ddc437d65024bd60428c4a50ff09a0ba53c87aR171

The memory demand on workloads like hackage-server should not increase.

I mentioned in the proposal it should not crash, because no increase at all seems a bit stringent. Please review:
https://github.com/haskellfoundation/tech-proposals/pull/65/files#diff-e5972723ec21675420683d52a3ddc437d65024bd60428c4a50ff09a0ba53c87aR393

Operations like cabal update and cabal build --dry-run should not get slower.

Why do we want this?

@jappeace
Copy link
Author

jappeace commented Aug 7, 2024

@gbaz I added some ideas for both common stanzas and conditionals.

@gbaz
Copy link
Collaborator

gbaz commented Aug 8, 2024

Thanks for the updates. Starting to shape up.

@jappeace jappeace changed the title Jappie cabal exact print Cabal exact print Aug 10, 2024
@Ericson2314
Copy link
Contributor

@jappeace Even with relative offsets, the separate map idea sounds very hard to work with to me. I still strongly lean towards the Field approach. I don't yet see any downsides to it, just upsides.

@gbaz
Copy link
Collaborator

gbaz commented Aug 16, 2024

Why do we want this?

We want operations to not get slower because cabal needs to parse the full tarball of cabal files to build its database to solve, etc. If parsing is slow, then it becomes amplified at scale, and makes end-users frustrated when common commands take longer.

@gbaz
Copy link
Collaborator

gbaz commented Aug 16, 2024

in terms of "Refactor GenericPackageDescription so that the parser no longer merges these sections and instead stores them as proper records. Then make the callsites smart enough to deal with common stanzas." I think this will be very painful. There are a lot of callsites and a lot of usages -- especially since so many fields can be within common stanzas. A remembering and "smart back-merge" algo seems much more lightweight.

As far as this approach vs fields, the advantage to this gpd approach that I see is that it would let people modify the gpd, which is a much more "natural" structure to the data, as opposed to modifying fields directly. So its like letting people manipulate an actual datastructure vs the json-encoding of that structure.

In fact, roughly, Fields : GPD :: Json : ADT

@Bodigrim
Copy link
Collaborator

I greatly appreciate the effort, so please consider my remarks below not as a critique against but as an encouragement to make the proposal stronger.

In a world of infinite resources having an exact printer is better than not having one. But in the real world I'd suggest to adopt a problem-based approach. What is a practical problem which an exact printer can solve, which cannot be reasonably solved by other, already available methods? There is no such business ask as "bidirectional parsing and printing" and I hardly imagine an HF sponsor immediately excited about by such achievement.

The proposal lists automated addition of dependencies, expansion of modules and formatting. But we do have (maybe, imperfect) automated tools for all of this already. Would it be easier to accomplish such tasks with an exact printer? How exactly? To pick a particular instance, what kind of impact would an exact printer have for cabal-add?

As much as I can tell from a cursory glance, GHC exact printer needs lots of maintenance. Pretty much every release of GHC requires non-trivial amount of work in this area. We'd better have a very good justification to invest into one for Cabal.

it's kindoff annoying because the reference implementation
already uses this, but somehow people aren't connecting the dots.
@Bodigrim
Copy link
Collaborator

For example a newly introduced cabal-add, would need to take into account any
syntax change to cabal, forever.

I don't quite understand this, cabal-add does not parse Cabal files on its own.

@jappeace
Copy link
Author

It looks like parsing to me because it's working directly on Field type which is what the parser in cabal does.
but I wrote that because I saw stuff like detectLeadingComma this:
https://github.com/Bodigrim/cabal-add/blob/master/src/Distribution/Client/Add.hs#L393

@Bodigrim
Copy link
Collaborator

cabal-add uses detectLeadingComma to detect whether a Cabal file uses either leading or trailing commas for build-depends, so that it can mimic the convention in a new entry. How would exact printing help with this? It seems to me it will only make this property harder to detect.

It would help me to see a specific example how cabal-add would look like if the proposal is implemented. Having [Field Position] is already quite powerful, and we can convert it to [Field ByteString] without much work, so I'm somewhat unsure whether full-fledged exact print annotations bring enough additional value.

@jappeace
Copy link
Author

the type we wish to expose isn't Field 😅
I added an example to clarify why.

There are many ways of capturing the cruft of commas and spacing, the one I proposed here is just an imperfect but hopefully decent solution,
I think it'll make manipulating cabal files much easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants