Seeking Recommendations for Improving Academic Reference Extraction
Hi everyone,
I’ve been working on a system to assist with writing grants. In the past, I’ve used Marker, which has been fairly effective for extracting specific elements like equations and tables. However, for this project, I need high-quality references from academic papers. If anyone knows of a model particularly good at this, I’d greatly appreciate your recommendations.
Challenges with Marker
Regarding Marker, I’ve noticed limitations with its metadata handling. Each processed PDF generates two JSON files, with the global metadata.json aggregating information across all documents. This is the primary file we use.
However, updating this metadata has been challenging due to the diverse formats of academic publications. Marker primarily detects numbered citations like [1], [2], etc., so the metadata for some papers ends up blank. In some cases the extracted references are reduced to bare markers such as `- [1]`, and Marker then fails to recognize them as references at all.
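As a rough illustration of a possible workaround for that failure mode, a post-processing pass could scan Marker's markdown output for bare numbered markers and stitch the text that follows each marker into a reference entry. This is my own fallback sketch, not part of Marker:

```python
import re

def extract_numbered_refs(markdown: str) -> dict[int, str]:
    """Heuristic fallback: split a references section on [n] markers.

    Assumes each reference starts on its own line with "[n]" (optionally
    preceded by "- ") and may continue over following lines.
    """
    # Match markers like "[1]" or "- [12]" at the start of a line
    pattern = re.compile(r"^(?:-\s*)?\[(\d+)\]\s*(.*)$", re.MULTILINE)
    refs: dict[int, str] = {}
    matches = list(pattern.finditer(markdown))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        # Join the marker's trailing text with any continuation lines
        body = (m.group(2) + " " + markdown[start:end]).split()
        refs[int(m.group(1))] = " ".join(body)
    return refs
```

This obviously only helps for the numbered styles Marker already half-detects; author-year styles would need a different heuristic.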
Other papers generate richer JSON files due to:
Inclusion of mathematical equations (marked with LaTeX notation).
Well-structured content that Marker can effectively extract.
Citation formats that Marker is able to recognize.
But so far I have been unable to predict whether a given paper will work well with Marker or not.
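Since predicting failures up front seems hard, the next best thing may be flagging them after a batch run by scanning the aggregated metadata for blank reference fields. The `references` key below is an assumption about the metadata layout, not something Marker guarantees, so adjust it to the actual structure of your metadata.json:

```python
import json
from pathlib import Path

def papers_missing_references(metadata: dict) -> list[str]:
    """Return names of documents whose metadata has no usable references.

    Assumes metadata maps document name -> per-paper dict with an
    optional "references" list (an assumed layout; adapt to yours).
    """
    missing = []
    for name, info in metadata.items():
        refs = info.get("references") or []
        # Treat bare markers like "[1]" with no trailing text as missing too
        usable = [r for r in refs if len(r.strip()) > 4]
        if not usable:
            missing.append(name)
    return missing

def load_and_check(path: str) -> list[str]:
    """Convenience wrapper around the aggregated metadata.json file."""
    return papers_missing_references(json.loads(Path(path).read_text()))
```

Papers flagged this way could then be routed to a second extractor (e.g. GROBID) instead of re-running everything.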
Efforts So Far
I’ve tried using the --use_llm option, but it hasn’t significantly improved reference extraction.
Call for Suggestions
If anyone has suggestions for improving this, perhaps through pre-processing techniques or by integrating LLMs trained specifically for academic tasks, I'd be keen to contribute. I understand this might introduce a slowdown, so it would likely need to be an optional feature. Reference extraction seems to be a hot topic in the RAG world more broadly: https://ds4sd.github.io/docling/ is still missing references, and HuggingFaceM4/idefics2-8b also falls short in my tests. GROBID, on the other hand, is fairly good at references, so there may be a possibility there. If others are having similar issues, or have methods that solve this, please post them here.
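If GROBID turns out to be the right fallback, its `processReferences` service returns TEI XML with one `biblStruct` element per detected reference, so pulling titles (or full fields) out is straightforward. The endpoint URL and multipart field name below match GROBID's documented REST API as I understand it, but double-check against your GROBID version:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def titles_from_tei(tei_xml: str) -> list[str]:
    """Extract reference titles from TEI XML as produced by GROBID."""
    root = ET.fromstring(tei_xml)
    titles = []
    for bibl in root.iter("{http://www.tei-c.org/ns/1.0}biblStruct"):
        title = bibl.find(".//tei:title", TEI_NS)
        if title is not None and title.text:
            titles.append(title.text.strip())
    return titles

# Calling a locally running GROBID would look roughly like this
# (untested sketch; requires the `requests` package and GROBID on :8070):
#
#   import requests
#   with open("paper.pdf", "rb") as f:
#       r = requests.post("http://localhost:8070/api/processReferences",
#                         files={"input": f})
#   titles = titles_from_tei(r.text)
```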
Looking forward to your insights!