References in different formats - Marker only uses numbered citations like [1], [2] #461

Open
flight505 opened this issue Jan 4, 2025 · 0 comments


Seeking Recommendations for Improving Academic Reference Extraction

Hi everyone,

I’ve been working on a system to assist with writing grants. In the past, I’ve used Marker, which has been fairly effective for extracting specific elements like equations and tables. However, for this project, I need high-quality references from academic papers. If anyone knows of a model particularly good at this, I’d greatly appreciate your recommendations.

Challenges with Marker

Regarding Marker, I’ve noticed limitations with its metadata handling. Each processed PDF generates two JSON files, with the global metadata.json aggregating information across all documents. This is the primary file we use.

However, updating this metadata has been challenging due to the diverse formats of academic publications. Marker primarily detects numbered citations like [1], [2], etc., so the metadata for papers that use other citation styles ends up blank. In some cases the extracted references appear only as bare markers like "- [1]", and Marker then fails to recognize them as references at all.

Other papers generate richer JSON files due to:

  1. Inclusion of mathematical equations (marked with LaTeX notation).
  2. Well-structured content that Marker can effectively extract.
  3. Citation formats that Marker is able to recognize.

So far I haven't found a way to predict whether a given paper will work well with Marker or not.
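One cheap heuristic I've considered (sketch only, the function name and threshold are mine): since Marker mainly recognizes numbered citations, scanning a paper's extracted plain text for [n]-style markers gives a rough prediction of whether Marker's reference detection will apply. This assumes you already have the text from any PDF-to-text tool.

```python
import re

# Matches numbered citation markers such as [1], [42], [123].
CITATION_RE = re.compile(r"\[\d{1,3}\]")

def likely_numbered_citations(text: str, min_hits: int = 5) -> bool:
    """Return True if the text contains enough [n]-style markers to
    suggest the paper uses numbered citations (threshold is arbitrary)."""
    return len(CITATION_RE.findall(text)) >= min_hits
```

It obviously won't catch author-year styles, but it could at least route those papers to a different extraction path up front.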

Efforts So Far

I’ve tried using the --use_llm option, but it hasn’t significantly improved reference extraction.

Call for Suggestions

If anyone has suggestions for improving this, perhaps through pre-processing techniques or by integrating additional LLMs trained specifically for academic tasks, I'd be keen to contribute. I understand this might introduce a slowdown, so it would likely need to be an optional feature. Reference extraction seems to be a hot topic in the RAG world: https://ds4sd.github.io/docling/ is still missing references, and I have tried HuggingFaceM4/idefics2-8b, which also falls short. GROBID is fairly good at references, so perhaps there is a possibility there. If others are having similar issues, have methods that solve this, or something new entirely, please post it.
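For anyone wanting to try the GROBID route: GROBID's web service returns references as TEI XML, which is straightforward to flatten with the standard library. A minimal sketch (the `parse_tei_references` helper and its output shape are mine; it only pulls the first title and date per entry, and assumes a GROBID instance on the default port 8070 for the commented-out call):

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def parse_tei_references(tei_xml: str) -> list:
    """Flatten <biblStruct> entries from GROBID TEI output into dicts.

    Takes the first <title> and <date> found in each entry; real TEI
    output has richer structure (authors, journal, pages) worth keeping.
    """
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.iter("{http://www.tei-c.org/ns/1.0}biblStruct"):
        title_el = bibl.find(".//tei:title", TEI_NS)
        date_el = bibl.find(".//tei:date", TEI_NS)
        refs.append({
            "title": title_el.text if title_el is not None else None,
            "year": date_el.get("when") if date_el is not None else None,
        })
    return refs

# Calling a locally running GROBID service (untested sketch):
# import requests
# with open("paper.pdf", "rb") as f:
#     resp = requests.post(
#         "http://localhost:8070/api/processReferences",
#         files={"input": f},
#     )
# refs = parse_tei_references(resp.text)
```

The parsed dicts could then be merged back into Marker's metadata.json for papers where Marker's own reference extraction came up blank.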

Looking forward to your insights!
