References in different formats - Marker only uses numbered citations like [1], [2] #461

Open
flight505 opened this issue Jan 4, 2025 · 0 comments


Seeking Recommendations for Improving Academic Reference Extraction

Hi everyone,

I’ve been working on a system to assist with writing grants. In the past, I’ve used Marker, which has been fairly effective for extracting specific elements like equations and tables. However, for this project, I need high-quality references from academic papers. If anyone knows of a model particularly good at this, I’d greatly appreciate your recommendations.

Challenges with Marker

Regarding Marker, I’ve noticed limitations with its metadata handling. Each processed PDF generates two JSON files, with the global metadata.json aggregating information across all documents. This is the primary file we use.

However, updating this metadata has been challenging due to the diverse formats of academic publications. Marker primarily detects numbered citations like [1], [2], etc., so the metadata for papers that use other citation styles ends up blank. In some cases the extracted references appear only as bare markers like "- [1]", and Marker then fails to recognize them as references at all.

Other papers generate richer JSON files due to:

  1. Inclusion of mathematical equations (marked with LaTeX notation).
  2. Well-structured content that Marker can effectively extract.
  3. Citation formats that Marker is able to recognize.

So far I haven't found a way to predict whether a given paper will work well with Marker or not.
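One cheap heuristic I've considered (sketch only, the function name and threshold are mine): since Marker mainly recognizes numbered citations, scanning a paper's extracted plain text for [n]-style markers gives a rough prediction of whether Marker's reference detection will apply. This assumes you already have the text from any PDF-to-text tool.

```python
import re

# Matches numbered citation markers such as [1], [42], [123].
CITATION_RE = re.compile(r"\[\d{1,3}\]")

def likely_numbered_citations(text: str, min_hits: int = 5) -> bool:
    """Return True if the text contains enough [n]-style markers to
    suggest the paper uses numbered citations (threshold is arbitrary)."""
    return len(CITATION_RE.findall(text)) >= min_hits
```

It obviously won't catch author-year styles, but it could at least route those papers to a different extraction path up front.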

Efforts So Far

I’ve tried using the --use_llm option, but it hasn’t significantly improved reference extraction.

Call for Suggestions

If anyone has suggestions for improving this, perhaps through pre-processing techniques or by integrating additional LLMs trained specifically for academic tasks, I'd be keen to contribute. I understand this might introduce a slowdown, so it would likely need to be an optional feature. Reference extraction seems to be a hot topic in the RAG world: https://ds4sd.github.io/docling/ is still missing references, and I have tried HuggingFaceM4/idefics2-8b, which also falls short. GROBID is fairly good at references, so perhaps there is a possibility there. If others are having similar issues, have methods that solve this, or something new entirely, please post it.
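For anyone wanting to try the GROBID route: GROBID's web service returns references as TEI XML, which is straightforward to flatten with the standard library. A minimal sketch (the `parse_tei_references` helper and its output shape are mine; it only pulls the first title and date per entry, and assumes a GROBID instance on the default port 8070 for the commented-out call):

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def parse_tei_references(tei_xml: str) -> list:
    """Flatten <biblStruct> entries from GROBID TEI output into dicts.

    Takes the first <title> and <date> found in each entry; real TEI
    output has richer structure (authors, journal, pages) worth keeping.
    """
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.iter("{http://www.tei-c.org/ns/1.0}biblStruct"):
        title_el = bibl.find(".//tei:title", TEI_NS)
        date_el = bibl.find(".//tei:date", TEI_NS)
        refs.append({
            "title": title_el.text if title_el is not None else None,
            "year": date_el.get("when") if date_el is not None else None,
        })
    return refs

# Calling a locally running GROBID service (untested sketch):
# import requests
# with open("paper.pdf", "rb") as f:
#     resp = requests.post(
#         "http://localhost:8070/api/processReferences",
#         files={"input": f},
#     )
# refs = parse_tei_references(resp.text)
```

The parsed dicts could then be merged back into Marker's metadata.json for papers where Marker's own reference extraction came up blank.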

Looking forward to your insights!
