rag with pdf #37

navr32 · 2025-01-15T11:55:39Z

Hi Leon, very nice app.
I tried it locally in a conda env with Python 3.10.12.
Everything starts and runs fine, I think.
If I load llava and an image, it works when I ask questions about the image.
But with PDFs the results are very bad.
I tried with the hover.pdf you provided for testing. I browsed to it; the PDF is loaded and shown in the left panel, so I assume it was ingested into the database correctly.
When I ask about this file in the chat, the PDF filename in the left panel is reset. Why is that? I am curious, because with an image the file stays registered in the list.
The answer about the file's subject is wrong, and if I ask for precise information from the file, the retrieval is very poor.
I also tried a PDF containing a table of numbers with dates, and the problem is the same: the retrieval sees only a very small part of the dates and numbers.
Have you had a look at Tika? I think it is better suited to a RAG system. Tika is able to index many types of files.
Thanks, have a nice day.

Leon-Sander · 2025-01-15T14:08:04Z

The pdf is removed from the panel because it has already been ingested into the database, so there is no need for it to stay there.
The image stays there because you can only chat with the image while it is available. If it is removed, you continue a normal chat without the image.

The RAG quality depends on many factors, for example:

How well the data can be extracted from the pdf
Which embedding model you are using
The chunk size you're using
The number of documents you're retrieving from the embedding database
The prompt you are using to retrieve the data

The parameters mentioned above that influence the retrieval process can be set in the config.yaml file.
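To illustrate why chunk size matters for tabular data, here is a minimal sketch of fixed-size character chunking. It is not the app's actual splitter: `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names used only for this example, and the real keys in config.yaml may differ.

```python
def split_into_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window moves each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# A tiny made-up table of dates and numbers, one row per line.
table = "2025-01-01 10\n2025-01-02 20\n2025-01-03 30\n"

small = split_into_chunks(table, chunk_size=8)   # rows get cut in the middle
large = split_into_chunks(table, chunk_size=64)  # whole table in one chunk

for chunk in small:
    print(repr(chunk))
```

With the small chunk size, a date and its number end up in different chunks, so a retrieved chunk can contain only part of a row. A larger chunk size or some overlap keeps rows together, at the cost of more text per retrieved document.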

I have not worked with tika yet, looks interesting.

navr32 · 2025-01-16T13:32:45Z

Hi Leon, many thanks for the feedback.
I had assumed you had already optimized the RAG for the models set in the settings.
I will try some new combinations of embedding model, chunk size, and so on.
I have already done some tests with openwebui. In the end I found that the vector knowledge base is often better with bigger embedding models, and that chunking always causes problems for precise data retrieval. So I will see how it goes with yours.

Leon-Sander · 2025-01-16T15:14:33Z

Optimization depends on your hardware and use case. The default settings are chosen so that most people should be able to run the app on their hardware. If you have better hardware or don't care about waiting times, the numbers can be increased or different models can be used.

An easy way to see which text chunks are created from the pdf is to write the chunks created in the pdf_handler to a txt file, for example, and then examine it. That lets you check whether the relevant text could be extracted from the pdf at all. Alternatively, load the database and look into it.
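A minimal sketch of such a dump, assuming the chunks are available as a list of strings at some point in the pdf_handler. The helper name `dump_chunks` and the output filename are made up for illustration; where exactly to call it depends on the handler's code.

```python
from pathlib import Path


def dump_chunks(chunks, out_path="chunks_debug.txt"):
    """Write each chunk to a text file with a header line (hypothetical helper)."""
    path = Path(out_path)
    with path.open("w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            f.write(f"--- chunk {i} ({len(chunk)} chars) ---\n")
            f.write(chunk.strip() + "\n\n")
    return path


# Example with made-up chunks; in practice pass the list produced from the pdf.
written = dump_chunks(["2025-01-01 10", "2025-01-02 20"], "chunks_debug.txt")
print(f"wrote {written}")
```

Opening the file and searching for a date or number you expect makes it quick to see whether the text survived extraction and how it was split.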
The next step would be to see whether your query retrieves the relevant document, and to find out why or why not.
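One way to run that check is sketched below, assuming you can get the retrieved chunks for a question as a ranked list of strings. `find_hit_rank` is a hypothetical helper and the sample chunks are invented.

```python
def find_hit_rank(retrieved_chunks, expected_text):
    """Return the 1-based rank of the first chunk containing expected_text, else None."""
    needle = expected_text.casefold()
    for rank, chunk in enumerate(retrieved_chunks, start=1):
        if needle in chunk.casefold():
            return rank
    return None


# Made-up retrieval result for a question about the value on 2025-01-02.
retrieved = [
    "Introduction and general notes",
    "2025-01-01 10\n2025-01-02 20",
]

rank = find_hit_rank(retrieved, "2025-01-02")
print("found at rank", rank) if rank else print("not retrieved")
```

If the expected text appears in the dumped chunks but is not retrieved, the embedding model or the number of retrieved documents is the likely place to look. If it is missing from the chunks as well, the problem is in the pdf extraction.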
