[Question]: on the PropertyGraphIndex, is there an easy way to exclude metadata when building an index from documents? #17535

chartsNgraphs · 2025-01-16T19:32:17Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

Whenever I use PropertyGraphIndex to build a graph using an LLM, it always includes the metadata. This seems pretty wasteful, since the metadata is not useful to the knowledge graph and it wastes a lot of LLM tokens.

I'd like my graph NOT to include triplets like "source, is, 12348873" and "file path, is, home/folder". How can this be easily achieved?

dosubot · 2025-01-16T19:34:15Z

Currently, there is no specific configuration or parameter in the PropertyGraphIndex that allows for the exclusion of specific metadata fields during the index building process. The default behavior includes metadata fields as defined in the DEFAULT_PROPS_SCHEMA, which includes fields like file_path, triplet_source_id, and others ^[1].

To achieve the exclusion of certain metadata, you might need to customize or extend the PropertyGraphIndex by modifying the code to filter out unwanted metadata fields before they are processed into the graph. This could involve altering the logic where metadata is handled or creating a custom preprocessing step to remove specific metadata entries.
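A minimal sketch of such a preprocessing step, assuming each document's metadata is a plain dict. The helper name `strip_metadata` and the `title` key are hypothetical and not part of LlamaIndex; the other keys and values are taken from the question above.

```python
from typing import Any, Dict, Iterable


def strip_metadata(metadata: Dict[str, Any], drop_keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `metadata` without the keys listed in `drop_keys`."""
    drop = set(drop_keys)
    return {key: value for key, value in metadata.items() if key not in drop}


# Illustrative metadata; "title" is an invented key used only for this example.
raw = {"file_path": "home/folder", "source": "12348873", "title": "Quarterly report"}
cleaned = strip_metadata(raw, ["file_path", "source"])
print(cleaned)
```

The original dict is left untouched, so the dropped fields remain available for anything else that needs them.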

To continue talking to Dosu, mention @dosu.

chartsNgraphs · 2025-01-16T20:02:10Z

Under what circumstances is it desirable to pollute a graph with a massive community of metadata nodes?

If max_triplets_per_chunk is 5, this means at least every 6th node is something like "source, is, file path". Won't this degrade retriever performance?

dosubot · 2025-01-16T20:05:36Z

Including a large number of metadata nodes in a graph can be desirable when the metadata provides valuable context or additional information that enhances the understanding or utility of the graph. For example, metadata can be useful for:

Provenance Tracking: Metadata nodes can help track the origin and history of data, which is important for auditing and verifying the integrity of information.
Contextual Enrichment: Metadata can provide context that aids in the interpretation of the data, such as timestamps, authorship, or source details.
Facilitating Complex Queries: In some cases, metadata can enable more complex queries that require additional context or filtering based on metadata attributes.

However, as you pointed out, a high density of metadata nodes can indeed degrade retriever performance by increasing the complexity and size of the graph. This can lead to slower query times and higher computational costs. To mitigate these issues, it's important to balance the inclusion of metadata with the performance requirements of the application. Techniques such as filtering out non-essential metadata, optimizing the graph structure, or using efficient retrieval mechanisms can help manage the trade-offs between metadata richness and performance.

logan-markewich · 2025-01-17T03:43:53Z

@chartsNgraphs The source triplets are coming from the node relationships extractor, which is included by default. Try excluding it.

The LLM also reads node/document metadata by default. If this is not helpful, exclude it.

Example for both suggestions:

document.excluded_llm_metadata_keys = [...]
# can also exclude embedding keys
document.excluded_embed_metadata_keys = [...]

from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SimpleLLMPathExtractor

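# passing kg_extractors explicitly replaces the default extractors,
# so the node relationships extractor is no longer run
kg_extractor = SimpleLLMPathExtractor(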
    llm=llm,
    max_paths_per_chunk=10,
    num_workers=4,
    show_progress=False,
)

index = PropertyGraphIndex.from_documents(documents, kg_extractors=[kg_extractor])
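
When there are many documents, the exclusion can be applied in a loop before building the index. The sketch below is an illustration, not part of LlamaIndex: `hide_metadata_from_llm` is a hypothetical helper, and `SimpleNamespace` objects stand in for real documents so the snippet runs on its own. It assumes each document exposes `metadata` and `excluded_llm_metadata_keys`, as used in the example above.

```python
from types import SimpleNamespace


def hide_metadata_from_llm(documents, keys=None):
    """Mark metadata keys as excluded from the text the LLM sees.

    With keys=None, every metadata key present on each document is hidden.
    Keys that are already excluded are kept and not duplicated.
    """
    for doc in documents:
        to_hide = list(doc.metadata) if keys is None else list(keys)
        already = list(doc.excluded_llm_metadata_keys)
        doc.excluded_llm_metadata_keys = already + [k for k in to_hide if k not in already]
    return documents


# Stand-in documents; real code would pass the loaded documents instead.
docs = [
    SimpleNamespace(
        text="Alice founded Acme.",
        metadata={"file_path": "home/folder", "source": "12348873"},
        excluded_llm_metadata_keys=[],
    )
]
hide_metadata_from_llm(docs)
print(docs[0].excluded_llm_metadata_keys)
```

Unlike deleting the keys, this leaves the metadata on the document, so it stays available for filtering or display while being kept out of the extraction prompt.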

chartsNgraphs added the "question" label (Further information is requested) on Jan 16, 2025
