rag with pdf #37
Comments
The PDF is removed because it has already been ingested into the database, so there is no need for it to stay there. The RAG quality depends on many factors, for example:
The above-mentioned parameters that influence the retrieval process can be set in the config.yaml file. I have not worked with Tika yet, but it looks interesting.
Hi Leon, many thanks for the feedback.
Optimization depends on your hardware and use case. The settings are chosen so that most people should be able to run the app on their hardware. If you have better hardware or don't mind longer waiting times, you can increase the numbers or use different models. An easy way to see which text chunks are created from the PDF is to write the chunks produced in the pdf_handler to a txt file and then examine it. That way you can check whether the relevant text could be extracted from the PDF at all. Alternatively, load the database and look into it.
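A minimal sketch of the "write the chunks to a txt file" idea could look like the following. The function name `dump_chunks` and the output filename are hypothetical and not part of the project; the sketch only assumes that the chunks are available as a list of strings at some point in the pdf_handler.

```python
from pathlib import Path


def dump_chunks(chunks, out_path="chunks_debug.txt"):
    """Write each text chunk to a plain-text file for manual inspection.

    `chunks` is assumed to be a list of strings (hypothetical; adapt this
    to whatever the PDF handling code actually produces).
    """
    out_file = Path(out_path)
    with out_file.open("w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            # A header line makes chunk boundaries and sizes easy to spot.
            f.write(f"--- chunk {i} ({len(chunk)} chars) ---\n")
            f.write(chunk.strip() + "\n\n")
    return out_file
```

Calling `dump_chunks(chunks)` right after the chunks are created and then opening the resulting file shows whether the dates and numbers from a table actually made it into the extracted text, and how they were split across chunks.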
Hi Leon, very nice app.
I gave it a try locally in a conda env with Python 3.10.12.
Everything starts and runs OK, I think.
If I use llava with an image, it works when I ask questions about the image.
But with a PDF the results are very bad.
I tried the hover.pdf that you provided for testing. I browsed to it, and the PDF was loaded and shown in the left panel, so I think it was ingested into the database correctly.
When I ask about this file in the chat, the PDF's filename in the left pane is reset. Why? This is curious, because with an image the file stays registered in the list.
The answer about the file's subject is bad, and if I ask for specific information from the file, the retrieval is very bad.
I also tried a table of numbers with dates in a PDF, and the problem is the same: the retrieval only sees a very small part of the dates and numbers.
Have you had a look at Tika? I think it would be better for a RAG system. Tika is able to index many types of files.
Thanks, have a nice day.