rag with pdf #37

Open
navr32 opened this issue Jan 15, 2025 · 3 comments

Comments

@navr32

navr32 commented Jan 15, 2025

Hi Leon, very nice app.
I tried it locally in a conda env with Python 3.10.12.
Everything starts and runs OK, I think.
If I load llava and an image, it works when I ask questions about the image.
But with PDFs the results are very bad.
I tried with the hover.pdf you provided for testing. I browsed to it, the PDF loaded and showed up in the left panel, so I assume it was ingested into the database correctly.
When I ask about this file in the chat, the PDF filename in the left panel is reset. Why? That seems odd, because with an image the file stays registered in the list.
The answer about the subject of the file is also bad, and if I ask for specific information from the file, the retrieval is very bad.
I also tried a PDF containing a table of numbers and dates, and the problem is the same: the retrieval sees only a very small part of all the dates and numbers.
Have you looked at Tika? I think it is better suited for a RAG system. Tika is able to index many types of files.
Thanks, have a nice day.

@Leon-Sander
Owner

Leon-Sander commented Jan 15, 2025

The PDF is removed from the panel because it has already been ingested into the database, so there is no need for it to stay there.
The image stays because you can only chat with the image while it is available. If it is removed, you continue a normal chat without the image.

The RAG quality depends on many factors, for example:

  • How well the data can be extracted from the PDF
  • Which embedding model you are using
  • The chunk size you are using
  • The number of documents you retrieve from the embedding database
  • The prompt you use to retrieve the data

The parameters mentioned above that influence the retrieval process can be set in the config.yaml file.
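
To illustrate why chunk size matters for something like a table of dates and numbers, here is a minimal sketch of a character-based chunker with overlap. It is not the splitter this app uses; the function name `split_into_chunks` and its parameters are made up for illustration, and the actual option names in config.yaml may differ.

```python
def split_into_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper for illustration only, not the app's own splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    # A tiny stand-in for a table extracted from a PDF (made-up data).
    table = "2024-01-05 120\n2024-01-06 135\n2024-01-07 150\n"
    for size in (12, 30):
        print(size, split_into_chunks(table, chunk_size=size, overlap=4))
```

With a small chunk size, a single row (date plus value) can be cut across two chunks, so no single retrieved chunk contains the full fact. Larger chunks or more overlap keep rows together at the cost of more text per retrieved document.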

I have not worked with tika yet, looks interesting.

@navr32
Author

navr32 commented Jan 16, 2025

Hi Leon, many thanks for the feedback.
I had assumed you had already optimized the RAG for the models set in the settings.
I will try some new combinations of embedding model, chunk size, and so on.
I have already done some tests with openwebui. In the end I saw that the vector knowledge base is often better with bigger embedding models, and that chunking always causes problems for precise data retrieval. So I will see how it goes with yours.

@Leon-Sander
Owner

Optimization depends on your hardware and use case. The default settings are chosen so that most people should be able to run the app on their hardware. If you have better hardware or don't mind longer waiting times, the numbers can be increased or different models can be used.

An easy way to see which text chunks are created from the PDF is to write the chunks produced in the pdf_handler to a txt file, for example, and then examine it. That way you can check whether the relevant text could be extracted from the PDF at all. Alternatively, load the database and look into it.
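
A rough sketch of such a dump, assuming you already have the chunks as a list of strings at some point in the pdf_handler. The helper name `dump_chunks` and the output path are hypothetical; where exactly to call it depends on the handler code.

```python
from pathlib import Path


def dump_chunks(chunks: list[str], out_path: str) -> Path:
    """Write each chunk to a text file with a numbered header for manual inspection.

    Hypothetical debugging helper, not part of the app.
    """
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            # The header shows the chunk index and its raw length in characters.
            f.write(f"--- chunk {i} ({len(chunk)} chars) ---\n")
            f.write(chunk.strip() + "\n\n")
    return out
```

Calling `dump_chunks(chunks, "chunks_debug.txt")` right after splitting and then searching the file for a date or number you expect makes it quick to see whether that value survived extraction.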
The next step would be to see whether your query actually retrieves the relevant document, and to find out why or why not.
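
For the retrieval check, one possible sketch is to separate "the text was never extracted" from "the text was extracted but not retrieved". It assumes you can get both the full list of stored chunks and the chunks returned for a query as plain strings; `diagnose_retrieval` is a hypothetical helper and uses a simple substring match, not the embedding search itself.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace, since PDF extraction often adds odd spacing."""
    return " ".join(text.split()).lower()


def diagnose_retrieval(expected: str, all_chunks: list[str], retrieved: list[str]) -> str:
    """Report where an expected phrase gets lost in the pipeline.

    Returns one of: "not_extracted", "extracted_but_not_retrieved", "retrieved".
    """
    needle = _normalize(expected)
    if not any(needle in _normalize(c) for c in all_chunks):
        # The phrase is in no stored chunk: extraction or chunking is the problem.
        return "not_extracted"
    if not any(needle in _normalize(c) for c in retrieved):
        # The phrase exists in the database but the query did not bring it back.
        return "extracted_but_not_retrieved"
    return "retrieved"
```

If the result is "not_extracted", changing the embedding model or the number of retrieved documents will not help. If it is "extracted_but_not_retrieved", those retrieval settings are the place to experiment.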
