Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

document splitter update #8745

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

nickprock
Copy link
Contributor

Related Issues

Proposed Changes:

Updated code as illustrated in the issue.

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@nickprock nickprock requested review from a team as code owners January 17, 2025 14:08
@nickprock nickprock requested review from dfokina and anakin87 and removed request for a team January 17, 2025 14:08
@nickprock
Copy link
Contributor Author

Hi @anakin87 ,
on my local environment the code pass all checks but here there are some files (I haven't worked on they) that don't pass the format checker.
I don't understand.

@anakin87
Copy link
Member

Sometimes it's related to linter updates... I'll take a look later...

@github-actions github-actions bot added the topic:DX Developer Experience label Jan 17, 2025
@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 12833435732

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 91.306%

Totals Coverage Status
Change from base Build 12829583920: 0.0%
Covered Lines: 8853
Relevant Lines: 9696

💛 - Coveralls

@anakin87 anakin87 force-pushed the dev/DocumentSplitter_update branch from 2deb45a to cb2bc32 Compare January 17, 2025 17:34
@anakin87
Copy link
Member

anakin87 commented Jan 17, 2025

  1. I understood the format problem, that I will fix in another PR (chore: update ruff version in pre-commit hook #8746)
  2. I'm having a hard time testing this bugfix: I'm not sure that we are creating duplicate IDs with the current code, so I'll ask @julian-risch to take a look

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good to me already! Thank you @nickprock for contributing to Haystack!
There is a small bug in the code, which is also the reason why @anakin87 couldn't produce any duplicate documents. We're currently re-using meta throughout multiple iterations of the for loop, which we shouldn't. In particular, overwriting meta in meta = deepcopy(meta) is problematic.

If you do something like the following, that problem will be fixed:

copied_meta = deepcopy(meta)
copied_meta["page_number"] = splits_pages[i]
copied_meta["split_id"] = i
copied_meta["split_idx_start"] = split_idx
doc = Document(content=txt, meta=copied_meta)

Here is an example of a unit test you could please add to test/components/preprocessors/test_document_splitter.py

def test_duplicate_pages_get_different_doc_id(self):
    splitter = DocumentSplitter(split_by="page", split_length=1)
    doc1 = Document(content="This is some text.\fThis is some text.\fThis is some text.\fThis is some text.")
    splitter.warm_up()
    result = splitter.run(documents=[doc1])

    assert len({doc.id for doc in result["documents"]}) == 4 

@nickprock
Copy link
Contributor Author

Thanks @julian-risch , I'll work on it tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:core topic:DX Developer Experience topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DocumentSplitter updates Document's meta data after initializing the Document
4 participants