[Question]: on the PropertyGraphIndex, is there an easy way to exclude metadata when building an index from documents? #17535

Open
1 task done
chartsNgraphs opened this issue Jan 16, 2025 · 4 comments
Labels
question Further information is requested

Comments

@chartsNgraphs

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Whenever I use PropertyGraphIndex to build a graph using an LLM, it always includes the metadata. This seems pretty wasteful, since the metadata is not useful to the knowledge graph and it wastes a lot of LLM tokens.

I'd like my graph NOT to include triplets like "source, is, 12348873" and "file path, is, home/folder". How can this be easily achieved?

@chartsNgraphs chartsNgraphs added the question Further information is requested label Jan 16, 2025

dosubot bot commented Jan 16, 2025

Currently, there is no specific configuration or parameter in the PropertyGraphIndex that allows for the exclusion of specific metadata fields during the index building process. The default behavior includes metadata fields as defined in the DEFAULT_PROPS_SCHEMA, which includes fields like file_path, triplet_source_id, and others [1].

To achieve the exclusion of certain metadata, you might need to customize or extend the PropertyGraphIndex by modifying the code to filter out unwanted metadata fields before they are processed into the graph. This could involve altering the logic where metadata is handled or creating a custom preprocessing step to remove specific metadata entries.
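A preprocessing step of the kind described above can be sketched without touching the index internals. This is only an illustration: the helper name `drop_metadata_keys`, the plain-dict representation, and the sample values are assumptions, not part of LlamaIndex. It removes the listed keys from a metadata dict before any documents are built from it.

```python
from typing import Iterable


def drop_metadata_keys(metadata: dict, unwanted: Iterable[str]) -> dict:
    """Return a copy of `metadata` without the keys listed in `unwanted`."""
    unwanted = set(unwanted)
    return {key: value for key, value in metadata.items() if key not in unwanted}


# Hypothetical metadata, shaped like what a file reader might attach.
raw_metadata = {"file_path": "home/folder", "source": "12348873", "title": "Notes"}
cleaned = drop_metadata_keys(raw_metadata, ["file_path", "source"])
print(cleaned)
```

The original dict is left untouched, so the full metadata stays available for anything else that needs it.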


@chartsNgraphs
Author

Under what circumstances is it desirable to pollute a graph with a massive community of metadata nodes?

If max_triplets_per_chunk is 5, this means at least every 6th triplet is something like "source, is, file path". Won't this degrade retriever performance?
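The "every 6th" estimate follows from simple arithmetic, assuming exactly one metadata triplet is added per chunk on top of the LLM-extracted ones. That count is an assumption; the real number depends on the extractors and metadata in use. The function name `metadata_share` is hypothetical.

```python
from fractions import Fraction


def metadata_share(llm_triplets: int, metadata_triplets: int = 1) -> Fraction:
    """Fraction of a chunk's triplets that come from metadata."""
    return Fraction(metadata_triplets, llm_triplets + metadata_triplets)


# A chunk that hits the cap of 5 extracted triplets: 1 in 6 is metadata.
print(metadata_share(5))  # 1/6
# A chunk where the LLM finds only 2 triplets: the share rises to 1 in 3.
print(metadata_share(2))  # 1/3
```

Because the cap is an upper bound, chunks that yield fewer extracted triplets push the metadata share above one sixth, which is why the estimate is "at least".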


dosubot bot commented Jan 16, 2025

Including a large number of metadata nodes in a graph can be desirable when the metadata provides valuable context or additional information that enhances the understanding or utility of the graph. For example, metadata can be useful for:

  1. Provenance Tracking: Metadata nodes can help track the origin and history of data, which is important for auditing and verifying the integrity of information.

  2. Contextual Enrichment: Metadata can provide context that aids in the interpretation of the data, such as timestamps, authorship, or source details.

  3. Facilitating Complex Queries: In some cases, metadata can enable more complex queries that require additional context or filtering based on metadata attributes.

However, as you pointed out, a high density of metadata nodes can indeed degrade retriever performance by increasing the complexity and size of the graph. This can lead to slower query times and higher computational costs. To mitigate these issues, it's important to balance the inclusion of metadata with the performance requirements of the application. Techniques such as filtering out non-essential metadata, optimizing the graph structure, or using efficient retrieval mechanisms can help manage the trade-offs between metadata richness and performance.


@logan-markewich
Collaborator

@chartsNgraphs The "source" triplets come from the node-relationship extractor that is included by default. Try excluding it.

The LLM also reads node/document metadata by default. If this is not helpful, exclude it.

Example for both suggestions:

# Hide metadata keys from the LLM
document.excluded_llm_metadata_keys = [...]
# can also exclude embedding keys
document.excluded_embed_metadata_keys = [...]

# Pass only the LLM path extractor, so the default node-relationship extractor is left out
from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SimpleLLMPathExtractor

kg_extractor = SimpleLLMPathExtractor(
    llm=llm,
    max_paths_per_chunk=10,
    num_workers=4,
    show_progress=False,
)

index = PropertyGraphIndex.from_documents(documents, kg_extractors=[kg_extractor])
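To hide every metadata key rather than listing them by hand, the first suggestion can be applied in a loop before the index is built. This is a minimal sketch: `hide_all_metadata` is a hypothetical helper name, and the `SimpleNamespace` object only stands in for a real document so the snippet runs without LlamaIndex installed. It assumes each document exposes a `metadata` dict alongside the two `excluded_*_metadata_keys` attributes shown above.

```python
from types import SimpleNamespace


def hide_all_metadata(documents, hide_from_embeddings=False):
    """Mark every metadata key on each document as excluded for the LLM."""
    for doc in documents:
        keys = list(doc.metadata)
        doc.excluded_llm_metadata_keys = keys
        if hide_from_embeddings:
            # Separate list so later edits to one do not affect the other.
            doc.excluded_embed_metadata_keys = list(keys)
    return documents


# Stand-in for a loaded document (hypothetical values).
doc = SimpleNamespace(
    metadata={"file_path": "home/folder", "source": "12348873"},
    excluded_llm_metadata_keys=[],
    excluded_embed_metadata_keys=[],
)
hide_all_metadata([doc])
print(doc.excluded_llm_metadata_keys)
```

With real documents, the helper would be called on the list that is then passed to `PropertyGraphIndex.from_documents`.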
