Re-evaluate LLM approach on value correctness #15

Open
fl4p opened this issue Sep 22, 2024 · 1 comment

Comments

@fl4p
Owner

fl4p commented Sep 22, 2024

Results from the new benchmark comparing actual min/typ/max field values:

num *EQUAL* *VALUES*:
                                           total     
                           tabular_parse:  278 (100%)   37   35    9   31   29    1   39    7    6   25   30   29
    ocr_text2_claude-3-5-sonnet-20240620:  236 ( 85%)   30   31    6   29   21    0   34    7    6   20   26   26
                       ocr_text2_llama-3:  180 ( 65%)   24   21    8   21   22    0   26    5    5   15   16   17
                       text2_gpt-4o-mini:  157 ( 56%)   20   25    2   18   16    0   23    1    1    8   21   22
                   ocr_text2_gpt-4o-mini:  149 ( 54%)   23   21    3   22   14    0   15    5    3   10   16   17

tabular_parse serves as the reference: it uses no LLM and has been carefully hand-crafted, so we can assume that most of its values are correct.
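For illustration, counting equal values against the reference could look roughly like the sketch below. The data layout (`{field: (min, typ, max)}` with `None` for missing values), the field names, and the tolerance are assumptions made for this example, not the benchmark's actual code.

```python
import math

# Assumed layout: {field_name: (min, typ, max)}; None marks a missing value.
def count_equal_values(reference, candidate, rel_tol=1e-6):
    """Count min/typ/max values in `candidate` that match `reference`."""
    equal = 0
    for field, ref_values in reference.items():
        cand_values = candidate.get(field, (None, None, None))
        for ref, cand in zip(ref_values, cand_values):
            if ref is None or cand is None:
                continue  # a missing value never counts as equal
            if math.isclose(ref, cand, rel_tol=rel_tol):
                equal += 1
    return equal

# Hypothetical field names and numbers, for demonstration only.
reference = {"Rds_on": (None, 1.5e-3, 1.9e-3), "Qg": (None, 80e-9, None)}
candidate = {"Rds_on": (None, 1.5e-3, 2.4e-3), "Qg": (None, 80e-9, None)}
print(count_equal_values(reference, candidate))  # 2
```

Note that this count alone does not distinguish a wrong value from a missing one; both simply fail to increase the score.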

A wrongly extracted value is much worse than a missing one. Missing values produce nan power values, which are easy to spot, whereas a wrong value yields a plausible-looking number, so the mistake goes unnoticed in the results of the power calc.
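A small sketch of why this matters. The formula below (P = I² · R) and all numbers are stand-ins chosen for illustration; the project's actual power calc may look different.

```python
import math

def conduction_loss(current_a, rds_on_ohm):
    # Illustrative formula only (P = I^2 * R), not the project's power calc.
    return current_a ** 2 * rds_on_ohm

missing = conduction_loss(10.0, math.nan)  # value was not extracted
wrong = conduction_loss(10.0, 1.9e-3)      # e.g. a neighbouring value picked by mistake
correct = conduction_loss(10.0, 1.5e-3)

print(math.isnan(missing))  # True: the gap is visible in the output
print(wrong, correct)       # both are plausible numbers; nothing flags the wrong one
```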

Analysis shows that the LLM either takes values from neighbouring fields or returns values that appear completely random.
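One way to separate the two failure modes could be to check whether a wrong value appears anywhere else in the reference. The sketch below assumes the same hypothetical `{field: (min, typ, max)}` layout and treats "neighbour" loosely as "any other reference value of the same datasheet", which is a simplification.

```python
import math

def classify_value(field, index, value, reference, rel_tol=1e-6):
    """Return 'missing', 'equal', 'neighbour' or 'unknown' for one extracted value."""
    if value is None:
        return "missing"
    ref = reference[field][index]
    if ref is not None and math.isclose(ref, value, rel_tol=rel_tol):
        return "equal"
    # Look for the same number in any other slot of the reference.
    for other_field, other_values in reference.items():
        for i, other in enumerate(other_values):
            if (other_field, i) == (field, index) or other is None:
                continue
            if math.isclose(other, value, rel_tol=rel_tol):
                return "neighbour"
    return "unknown"  # not found anywhere in the reference

# Hypothetical field names and numbers.
reference = {"Qg": (None, 80e-9, 110e-9), "Qgd": (None, 12e-9, None)}
print(classify_value("Qgd", 1, 80e-9, reference))   # neighbour
print(classify_value("Qgd", 1, 3.3e-9, reference))  # unknown
```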

The converterapi pdf2txt (or pdf2ocr2txt?) step seems to extract table contents column-wise rather than row-wise, which might explain the neighbour confusion.
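To make the effect concrete, here is a toy table flattened both ways. The table contents are made up, and this does not reproduce the converter's actual output; it only shows how column-wise ordering separates a parameter name from its values.

```python
# Made-up datasheet table; "-" marks an empty cell.
table = [
    ["Parameter", "Min", "Typ", "Max"],
    ["Qg",        "-",   "80",  "110"],
    ["Qgd",       "-",   "12",  "-"],
]

row_wise = [cell for row in table for cell in row]
column_wise = [cell for column in zip(*table) for cell in column]

print(" ".join(row_wise))
# Parameter Min Typ Max Qg - 80 110 Qgd - 12 -
print(" ".join(column_wise))
# Parameter Qg Qgd Min - - Typ 80 12 Max 110 -
```

In the column-wise text, "80" and "12" sit next to each other with no parameter name nearby, so assigning each value to the right field requires counting positions.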

The random values might come from LLM exhaustion or non-deterministic effects?
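A possible way to probe the non-determinism hypothesis would be to repeat the same extraction several times and tally the answers. In the sketch below, `extract` is a hypothetical zero-argument callable wrapping one LLM request; the canned answers only stand in for real model output.

```python
from collections import Counter

def repeat_extraction(extract, n_runs=5):
    """Call `extract()` n_runs times and tally the returned values."""
    return Counter(extract() for _ in range(n_runs))

# Stand-in for a real LLM call: replays canned answers (made-up numbers).
answers = iter([80e-9, 80e-9, 12e-9, 80e-9, 80e-9])
tally = repeat_extraction(lambda: next(answers), n_runs=5)

is_stable = len(tally) == 1  # True only if every run returned the same value
print(tally.most_common(), is_stable)
```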

@piotrdelikat
Collaborator

Could you provide an example of a datasheet name/results where this is taking place?
