Add json #219

Open
wants to merge 2 commits into
base: main

@Gad Gad commented Dec 26, 2024

The proposed PR addresses issue #34, at least in part.
Having not found a suitable Python library, I added a JsonConverter class that is independent of PlainTextConverter.
In a nutshell:

  • parses the document with Python's json.load()
  • converts nested dictionaries into dotted object paths that start with the filename (e.g. `json_org_example2.menu.id` for `{menu: {id: ...}}` in `json_org_example2.json`)
  • converts JSON arrays into Markdown ordered lists, since JSON arrays are ordered
  • adds some Markdown formatting
  • adds back ticks around each key/value pair to keep Markdown interpreters from treating them as keywords
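
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function name `json_to_md` and its parameters are hypothetical, and capping headings at six `#` characters is an assumption made here because Markdown defines only six heading levels.

```python
import json
from typing import Any


def json_to_md(value: Any, prefix: str, level: int = 1) -> str:
    """Render parsed JSON as Markdown, naming every value by its dotted path."""
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}"
            if isinstance(child, (dict, list)):
                # Nested containers become headings (assumption: cap at 6 levels).
                lines.append(f"{'#' * min(level, 6)} {path}")
                lines.append(json_to_md(child, path, level + 1))
            else:
                # Back ticks keep the renderer from interpreting key or value.
                lines.append(f"`{path}:{child!r}`")
    elif isinstance(value, list):
        # JSON arrays are ordered, so they map to an ordered list.
        for index, child in enumerate(value):
            path = f"{prefix}[{index}]"
            if isinstance(child, (dict, list)):
                lines.append(f"{index}. {path}")
                lines.append(json_to_md(child, path, level))
            else:
                lines.append(f"{index}. `{path}:{child!r}`")
    else:
        lines.append(f"`{prefix}:{value!r}`")
    return "\n".join(lines)


doc = json.loads('{"menu": {"id": "file", "items": ["New", "Open"]}}')
print(json_to_md(doc, "json_org_example2"))
```

The second argument plays the role of the filename without its extension, so every emitted path starts with it.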

hatch test / pre-commit ran, OK.

Tested against the json.org examples. Output is available here: https://gist.github.com/Gad/4de412dabb73c6a20b0a098089226814

@Gad
Author

Gad commented Dec 26, 2024 via email

if extension.lower() not in [".json"]:
    return None

# TODO : check similar extensions and/or mime type

I see a TODO here. Are you working on the MIME type check? Thanks!

@Viddesh1

Hello @Gad

Thanks for the pull request and the gist; both are really helpful. https://gist.github.com/Gad/4de412dabb73c6a20b0a098089226814

I like this recursive function: `def _json_traversal(self, d: Union[dict, list], level: int, prefix: str) -> str`

The output currently looks like this after converting to Markdown:

# lib.glossary
`lib.glossary.title:'example glossary'`
## lib.glossary.GlossDiv
`lib.glossary.GlossDiv.title:'S'`
### lib.glossary.GlossDiv.GlossList
#### lib.glossary.GlossDiv.GlossList.GlossEntry
`lib.glossary.GlossDiv.GlossList.GlossEntry.ID:'SGML'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.SortAs:'SGML'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.GlossTerm:'Standard Generalized Markup Language'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.Acronym:'SGML'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.Abbrev:'ISO 8879:1986'`
##### lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef
`lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.para:'A meta-markup language, used to create markup languages such as DocBook.'`
###### lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso
---
0. `lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso[0]:'GML'`
1. `lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso[1]:'XML'`
---
`lib.glossary.GlossDiv.GlossList.GlossEntry.GlossSee:'markup'`

I think this is hard to read. Some space between the entries would help throughout, like below.
What do you think?

# lib.glossary
`lib.glossary.title:'example glossary'`

## lib.glossary.GlossDiv
`lib.glossary.GlossDiv.title:'S'`

### lib.glossary.GlossDiv.GlossList

#### lib.glossary.GlossDiv.GlossList.GlossEntry

Or why not like this, which is more human-readable?

# Glossary

## Title: example glossary

### Section: S

#### GlossEntry

- **ID**: SGML  
- **SortAs**: SGML  
- **GlossTerm**: Standard Generalized Markup Language  
- **Acronym**: SGML  
- **Abbreviation**: ISO 8879:1986  

##### GlossDef

- **Description**: A meta-markup language, used to create markup languages such as DocBook.  
- **See Also**:
  - GML
  - XML  

#### GlossSee

- Related Term: markup

Or this would also be good:

lib
└── glossary
    ├── title: example glossary
    └── GlossDiv
        ├── title: S
        └── GlossList
            └── GlossEntry
                ├── ID: SGML
                ├── SortAs: SGML
                ├── GlossTerm: Standard Generalized Markup Language
                ├── Acronym: SGML
                ├── Abbrev: ISO 8879:1986
                ├── GlossDef
                │   ├── para: A meta-markup language, used to create markup languages such as DocBook.
                │   └── GlossSeeAlso
                │       ├── GML
                │       └── XML
                └── GlossSee: markup

Also, a test case is missing.

Thanks and Best Regards!
Viddesh

@Gad
Author

Gad commented Dec 26, 2024

Thank you for your feedback, Viddesh.

Regarding your first comment: yes, checking the MIME type instead of, or in addition to, the file extension is possible for files without a .json extension (as long as your /etc/mime.types includes the desired extensions, afaik).
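
A minimal sketch of such a check, assuming Python's standard `mimetypes` module is used. The helper name `looks_like_json` and the `JSON_MIME_TYPES` set are hypothetical, not part of the PR. Note that `mimetypes.guess_type` guesses from the file name through mapping tables (on some systems loaded from files such as /etc/mime.types), not from the file contents.

```python
import mimetypes
import os

# Assumption: only this MIME type is treated as JSON in this sketch.
JSON_MIME_TYPES = {"application/json"}


def looks_like_json(path: str) -> bool:
    """Return True if the extension or the guessed MIME type indicates JSON."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".json":
        return True
    # guess_type returns (type, encoding); type is None when nothing matches.
    mime_type, _encoding = mimetypes.guess_type(path)
    return mime_type in JSON_MIME_TYPES


print(looks_like_json("data.json"))
print(looks_like_json("notes.txt"))
```

Extra extensions can be registered at runtime with `mimetypes.add_type("application/json", ext)`, which avoids depending on the system-wide mapping files.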

Regarding your second comment, my understanding (correct me if I'm wrong) is that this is mainly the job of the Markdown renderer. For example, the tree structure you suggest is up to the renderer, usually through a plugin, although the Markdown syntax would have to be modified to facilitate it (e.g. the tree-like plugins I found rely on the Markdown source being indented). Granted, one could generate such a structure programmatically and enclose it within three backticks. The result would surely be prettier, but wouldn't it make this lib more difficult to use in its context ("for indexing, text analysis")? I had a hard time settling on a syntax that is both useful for this purpose and readable after rendering.
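
For illustration, generating such a tree programmatically and enclosing it within three backticks could look roughly like this. The function names `tree_lines` and `json_tree_block` are hypothetical and nothing here is part of the PR; it is a sketch of the idea only.

```python
from typing import Any, List


def tree_lines(value: Any, indent: str = "") -> List[str]:
    """Return box-drawing tree lines for a parsed JSON value."""
    if isinstance(value, dict):
        items = list(value.items())
    elif isinstance(value, list):
        items = [(None, child) for child in value]
    else:
        return [f"{indent}└── {value}"]
    lines = []
    for position, (key, child) in enumerate(items):
        last = position == len(items) - 1
        branch = "└── " if last else "├── "
        # Children of the last item get plain spaces, others a vertical bar.
        child_indent = indent + ("    " if last else "│   ")
        if isinstance(child, (dict, list)):
            label = str(key) if key is not None else f"[{position}]"
            lines.append(f"{indent}{branch}{label}")
            lines.extend(tree_lines(child, child_indent))
        else:
            label = f"{key}: {child}" if key is not None else str(child)
            lines.append(f"{indent}{branch}{label}")
    return lines


def json_tree_block(value: Any, root: str) -> str:
    """Wrap the tree in a fenced block so the renderer leaves it untouched."""
    fence = "`" * 3
    return "\n".join([fence, root, *tree_lines(value), fence])


doc = {"glossary": {"title": "example glossary", "GlossSeeAlso": ["GML", "XML"]}}
print(json_tree_block(doc, "lib"))
```

The trade-off raised above still applies: the fenced tree is easier on the eye, but the full dotted path of each value is no longer present on its own line.
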
Best regards,
Gad

@Viddesh1

Hello @Gad ,

Thanks for your reply.

If there is a plugin or parser that walks the JSON and returns the tree structure, that would be good to have.
I think that, after converting a JSON file to Markdown, what happens next depends on the specific use case for the
resulting Markdown file. For example, users may write a parser for specific keys and values, or do some data analysis for training a RAG model.

Yes, rendering depends on which output format the end user wants. The parser that you created is nice, and the Markdown preview is also fine.

If you add a newline after each heading, the output will not feel as congested. See below:
Line 1244: _md += "%s %s.%s\n\n" % ("#" * level, prefix, str(key))

# lib.glossary

`lib.glossary.title:'example glossary'`
## lib.glossary.GlossDiv

`lib.glossary.GlossDiv.title:'S'`
### lib.glossary.GlossDiv.GlossList

#### lib.glossary.GlossDiv.GlossList.GlossEntry

`lib.glossary.GlossDiv.GlossList.GlossEntry.ID:'SGML'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.SortAs:'SGML'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.GlossTerm:'Standard Generalized Markup Language'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.Acronym:'SGML'`
`lib.glossary.GlossDiv.GlossList.GlossEntry.Abbrev:'ISO 8879:1986'`
##### lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef

`lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.para:'A meta-markup language, used to create markup languages such as DocBook.'`
###### lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso

---
0. `lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso[0]:'GML'`
1. `lib.glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso[1]:'XML'`
---
`lib.glossary.GlossDiv.GlossList.GlossEntry.GlossSee:'markup'`
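
The effect of the suggested format string can be checked in isolation with a small sketch. The helper name `heading` is hypothetical, and the cap at six `#` characters is an addition made here (Markdown defines only six heading levels), not something taken from the PR.

```python
def heading(level: int, prefix: str, key: object) -> str:
    """Build one heading line followed by a blank line."""
    # Same format string as the suggested line; min(level, 6) is an
    # assumption added in this sketch, not part of the PR.
    return "%s %s.%s\n\n" % ("#" * min(level, 6), prefix, str(key))


md = heading(1, "lib", "glossary") + "`lib.glossary.title:'example glossary'`\n"
print(md)
```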

Regards!
Viddesh
