Handle explain type of query #112

Closed
kevinxin90 opened this issue Mar 19, 2021 · 14 comments

@kevinxin90
Contributor

{
    "message": {
        "query_graph": {
            "nodes": {
                "a": {
                    "category": "biolink:Disease",
                    "id": "MESH:D015464"
                },
                "b": {
                    "category": "biolink:ChemicalSubstance",
                    "id": "CHEBI:45783"
                },
                "c": {
                    "category": "biolink:Gene"
                }
            },
            "edges": {
                "ac": {
                    "subject": "a",
                    "object": "c"
                },
                "bc": {
                    "subject": "c",
                    "object": "b"
                }
            }
        }
    },
    "knowledge_graph": {
        "nodes": [],
        "edges": []
    },
    "results": []
}
@kevinxin90 kevinxin90 added the bug Something isn't working label Mar 19, 2021
@kevinxin90 kevinxin90 added this to the 2021-3-26 milestone Mar 19, 2021
@kevinxin90 kevinxin90 self-assigned this Mar 19, 2021
@andrewsu
Member

Three great examples of "explain" queries from Sui

image

@colleenXu
Collaborator

colleenXu commented Jun 11, 2021

This is not working as expected.


For example, we would expect KCNMA1 -(e0)-> biolink:NamedThing <-(e1)- TAAR1 to do the following:

  1. one-hop KCNMA1 -(e0)-> biolink:NamedThing
  2. one-hop TAAR1 -(e1)-> biolink:NamedThing
  3. either (@andrewsu, we have to pick the desired behavior):
    A. filter so we only keep the answers that came from BOTH one-hops (remove the rest from nodes/edges/results)
    B. OR do nothing. Return the output of both one-hops as the "answers" (Edited AS 2021-06-11: strike out this option)
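
Option A above can be sketched roughly as follows. This is a minimal illustration, not BTE's implementation: the `(subject, predicate, object)` tuple shape, the function name `intersect_one_hops`, and the `X:` identifiers are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# An edge is modelled as (subject_id, predicate, object_id). This shape is an
# assumption for illustration, not BTE's internal record format.
Edge = Tuple[str, str, str]


def intersect_one_hops(hop_results: Dict[str, List[Edge]]) -> Dict[str, List[Edge]]:
    """Keep only edges whose object node was reached by every one-hop."""
    if not hop_results:
        return {}
    # One set of reached object IDs per query edge.
    reached: List[Set[str]] = [
        {obj for _, _, obj in edges} for edges in hop_results.values()
    ]
    shared = set.intersection(*reached)
    return {
        qedge: [edge for edge in edges if edge[2] in shared]
        for qedge, edges in hop_results.items()
    }


# Made-up object identifiers, used only to show the shape of the filtering.
e0 = [
    ("HGNC:6284", "biolink:related_to", "X:1"),
    ("HGNC:6284", "biolink:related_to", "X:2"),
]
e1 = [
    ("HGNC:17734", "biolink:related_to", "X:2"),
    ("HGNC:17734", "biolink:related_to", "X:3"),
]
kept = intersect_one_hops({"e0": e0, "e1": e1})
print(kept)
```

Only the edges pointing at `X:2`, the one node reached from both genes, survive. A real implementation would also have to compare nodes after ID resolution rather than by raw string.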

Instead, something seems to be going on that makes the query (TRAPI, listed in the next comment) take a long time (like >30 minutes). Andrew tried separately running the one-hops (1 and 2) above (also listed in the next comment), and both were quick (<6 seconds each).


I think these logs from my console are relevant. I used a JSON viewer to help me read the path parts. This is my interpretation of the logs:

  • At the beginning of the query process, it looks like BioThings_Explorer_TRAPI or bte_trapi_query_graph_handler picks two sets of queries, for 4 one-hops total.
  • The first set of queries (key 0) looks like the one-hops 1 and 2 above.
  • The second set of queries (key 1): I don't understand how it works with its "prev_edge".
    • But I see sub-queries being done from Diseases to other things, which isn't one-hops 1 and 2...
    • Which makes me think that BTE is interpreting those query paths as 2 giant multi-hops (KCNMA1 -> NamedThing -> NamedThing, and TAAR1 -> NamedThing -> NamedThing). Then the sub-queries I'm seeing are the NamedThing -> NamedThing steps.

LOGS:

 biothings-explorer-trapi:query_graph ALL PATHS {"0":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}],"1":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}]} +0ms

  biothings-explorer-trapi:main query paths constructed: {"0":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}],"1":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}]} +1ms

  biothings-explorer-trapi:main Query depth is 2 +1ms

@colleenXu
Collaborator

colleenXu commented Jun 11, 2021

TRAPI query for KCNMA1 -(e0)-> biolink:NamedThing <-(e1)- TAAR1

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["HGNC:6284"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories":["biolink:NamedThing"]
                },
                "n2": {
                    "ids":["HGNC:17734"],
                    "categories":["biolink:Gene"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}


Fast One Hop 1: KCNMA1-(e0)-> biolink:NamedThing

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["HGNC:6284"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories":["biolink:NamedThing"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

Fast One Hop 2: TAAR1-(e1)-> biolink:NamedThing

{
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "categories":["biolink:NamedThing"]
                },
                "n2": {
                    "ids":["HGNC:17734"],
                    "categories":["biolink:Gene"]
               }
            },
            "edges": {
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}

@colleenXu
Collaborator

colleenXu commented Jun 22, 2021

This is a special kind of Explain-query we also want to support (see TRAPI query below): ChemicalSubstance celecoxib (PUBCHEM.COMPOUND:2662) -> PTGS1 (HGNC:9604). It's from a Translator standup meeting.

The minimal expected behavior is:

  1. BTE queries ChemicalSubstance PUBCHEM.COMPOUND:2662 -> Gene.
  2. BTE then "filters": only the answer node PTGS1 (HGNC:9604) and the edges/results that include that answer node are kept. All other entries in the message's knowledge_graph.nodes, knowledge_graph.edges, and results are removed. A logs entry should mention that this happened.
  3. If there are no edges/results after the filtering is done, that's fine. Keep an empty object for edges and an empty array for results. The response will still contain message.query_graph, message.knowledge_graph.nodes, and logs.
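
The filtering in steps 2 and 3 could look something like the sketch below. It assumes knowledge-graph edges are held in a dict keyed by edge ID, and that the set of equivalent IDs for the answer node (here HGNC:9604 and NCBIGene:5742) has already been produced by ID resolution. The function name, the log wording, and the `EXAMPLE:other_gene` edge are hypothetical.

```python
from typing import Any, Dict, Set

# Edge ID -> edge record. The record fields are an assumption for illustration.
KGEdges = Dict[str, Dict[str, str]]


def filter_to_answer(kg_edges: KGEdges, answer_ids: Set[str]) -> Dict[str, Any]:
    """Keep edges whose object is the fixed answer node and log what happened."""
    kept = {
        edge_id: edge
        for edge_id, edge in kg_edges.items()
        if edge["object"] in answer_ids
    }
    log = {
        "level": "INFO",
        "message": f"Explain filter kept {len(kept)} of {len(kg_edges)} edges",
    }
    # An empty dict is returned as-is when nothing survives (step 3).
    return {"edges": kept, "logs": [log]}


kg_edges = {
    "CHEBI:41423-biolink:decreases_activity_of-NCBIGene:5742": {
        "subject": "CHEBI:41423",
        "predicate": "biolink:decreases_activity_of",
        "object": "NCBIGene:5742",
    },
    # Hypothetical edge to some other gene, which the filter should drop.
    "CHEBI:41423-biolink:related_to-EXAMPLE:other_gene": {
        "subject": "CHEBI:41423",
        "predicate": "biolink:related_to",
        "object": "EXAMPLE:other_gene",
    },
}
answer_ids = {"HGNC:9604", "NCBIGene:5742"}  # assumed output of ID resolution

filtered = filter_to_answer(kg_edges, answer_ids)
empty = filter_to_answer(kg_edges, {"EXAMPLE:absent"})
print(filtered["logs"][0]["message"])
print(empty["edges"])
```

The same predicate would need to be applied to results and to knowledge_graph.nodes; only the edge case is shown here.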

Currently, BTE only does step 1, plus ID resolution of the gene ID in the query.


Expected edges in the answer: For the example query, I would expect only the following edges to exist in the Response:

  • CHEBI:41423-biolink:physically_interacts_with-NCBIGene:5742
  • CHEBI:41423-biolink:related_to-NCBIGene:5742
  • CHEBI:41423-biolink:decreases_activity_of-NCBIGene:5742 (a control answer!)
  • CHEBI:41423-biolink:directly_interacts_with-NCBIGene:5742
  • CHEBI:41423-biolink:increases_activity_of-NCBIGene:5742
  • CHEBI:41423-biolink:affects_activity_of-NCBIGene:5742
  • CHEBI:41423-biolink:increases_degradation_of-NCBIGene:5742 (a control answer!)
  • CHEBI:41423-biolink:condition_associated_with_gene-NCBIGene:5742

The TRAPI query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

@colleenXu
Collaborator

This functionality is high priority since it's come up in standup queries and the demo (Workflow D, maybe Workflow C).

@andrewsu
Member

andrewsu commented Jul 8, 2021

Note that @ericz1803 found this repo https://github.com/kevinxin90/explain.js from Kevin that handles the special case of explain queries with one intermediate node (used at https://biothings.io/explorer/explain). It is based on @biothings-explorer/call-apis and @biothings-explorer/smartapi-kg, so may be useful to consult when implementing explain queries in the main application. In fact, it could be that the short-term solution to this ticket would be to integrate this code into the main app, leaving the longer-term generalized query handler to handle longer paths and more complex query topologies.

@andrewsu
Member

andrewsu commented Jul 8, 2021

one-hop explain query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

two-hop explain query (version 1):

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        }
    }
}

two-hop explain query (version 2):

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}
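
As a rough way to recognize the queries above programmatically, the sketch below treats a query graph as an "explain" query when at least two of its nodes carry `ids`. That definition is an assumption drawn from the examples in this thread, and the helper names are hypothetical. It also shows that the two two-hop versions differ only in the direction of `e1`.

```python
from typing import Any, Dict, FrozenSet, List, Set

QueryGraph = Dict[str, Any]


def pinned_nodes(query_graph: QueryGraph) -> List[str]:
    """Return the keys of query nodes that have at least one fixed ID."""
    return sorted(
        key for key, node in query_graph["nodes"].items() if node.get("ids")
    )


def is_explain_query(query_graph: QueryGraph) -> bool:
    """Assumed definition: two or more query nodes are pinned to IDs."""
    return len(pinned_nodes(query_graph)) >= 2


def topology(query_graph: QueryGraph) -> Set[FrozenSet[str]]:
    """Edges as unordered node pairs, ignoring subject/object direction."""
    return {
        frozenset((edge["subject"], edge["object"]))
        for edge in query_graph["edges"].values()
    }


def two_hop(e1_subject: str, e1_object: str) -> QueryGraph:
    """Build the two-hop query from the thread with a chosen e1 direction."""
    return {
        "nodes": {
            "n0": {"ids": ["PUBCHEM.COMPOUND:2662"],
                   "categories": ["biolink:ChemicalSubstance"]},
            "n1": {"categories": ["biolink:Disease"]},
            "n2": {"ids": ["HGNC:9604"], "categories": ["biolink:Gene"]},
        },
        "edges": {
            "e0": {"subject": "n0", "object": "n1"},
            "e1": {"subject": e1_subject, "object": e1_object},
        },
    }


version_1 = two_hop("n1", "n2")
version_2 = two_hop("n2", "n1")
print(pinned_nodes(version_2), is_explain_query(version_2))
print(topology(version_1) == topology(version_2))
```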

@marcodarko
Contributor

#112 (comment)
csgene.txt
These are the results I'm getting using the new query handler algorithm, just wanna make sure it's looking OK. Going through some of the queries here as I read it.

@marcodarko
Contributor

marcodarko commented Jul 23, 2021

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}

This is the new result for this two hop query above, new logs will explain the process hopefully.
twohop.txt

@colleenXu
Collaborator

colleenXu commented Aug 6, 2021

For Workflow D:

  • We probably won't return direct edges between all the chem IDs and the disease ID, in Workflow D.3.
  • We won't return answers for Workflow D.4 / D.5 because it uses Protein/connects SmallMolecule-MolecularActivity (we don't have those edges).
  • We won't return answers for Workflow D.6 because it uses Protein and takes too long (if you replace all Proteins with Genes, I can currently run it but it takes ~ 8 min)

Note:

  • I would not use Kevin's original Explain example. I cannot get answers from the first hop of the query...
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "ids": ["MESH:D015464"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

@colleenXu
Collaborator

colleenXu commented Aug 6, 2021

some queries with 2 intermediates:

Note: This test query should have this path as a result:
ChemicalSubstance PUBCHEM.COMPOUND:2662 <-> Disease MONDO:0002974 <-> Pathway REACT:R-HSA-109704 <-> HGNC:17947.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
               "n2": {
                    "categories":["biolink:Pathway"]
               },
                "n3": {
                    "categories":["biolink:Gene"],
                    "ids":["HGNC:17947"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e2": {
                    "subject": "n2",
                    "object": "n3"
                }
            }
        }
    }
}

This path should exist: Pathway REACT:R-HSA-1368082 <-> Gene NCBIGene:1374 <-> ChemicalSubstance CHEBI:35553 <-> Disease MONDO:0009287

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Pathway"],
                    "ids": ["REACT:R-HSA-1368082"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "categories": ["biolink:ChemicalSubstance"]
                },
                "n3": {
                    "categories": ["biolink:Disease"],
                    "ids": ["MONDO:0009287"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e03": {
                    "subject": "n3",
                    "object": "n2"
                }
            }
        }
    }
}
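
One way to check that an expected path shows up in a response is sketched below. It assumes each result has `node_bindings` of the form `{qnode: [{"id": ...}]}`, and the helper names and mock results are hypothetical. The expected IDs are the ones named for the first test query above. A real response may bind resolved IDs instead (for example a CHEBI ID in place of the PUBCHEM.COMPOUND one), so a real check would need to compare equivalent identifiers.

```python
from typing import Any, Dict, List

Result = Dict[str, Any]


def binds_path(result: Result, expected: Dict[str, str]) -> bool:
    """True if the result binds every query node to its expected ID."""
    bindings = result.get("node_bindings", {})
    return all(
        any(binding.get("id") == curie for binding in bindings.get(qnode, []))
        for qnode, curie in expected.items()
    )


def has_path(results: List[Result], expected: Dict[str, str]) -> bool:
    """True if any result in the list contains the expected path."""
    return any(binds_path(result, expected) for result in results)


expected_path = {
    "n0": "PUBCHEM.COMPOUND:2662",
    "n1": "MONDO:0002974",
    "n2": "REACT:R-HSA-109704",
    "n3": "HGNC:17947",
}

# Mock results for illustration only. These are not real BTE output.
mock_results = [
    {"node_bindings": {
        "n0": [{"id": "PUBCHEM.COMPOUND:2662"}],
        "n1": [{"id": "EXAMPLE:other_disease"}],
        "n2": [{"id": "REACT:R-HSA-109704"}],
        "n3": [{"id": "HGNC:17947"}],
    }},
    {"node_bindings": {
        "n0": [{"id": "PUBCHEM.COMPOUND:2662"}],
        "n1": [{"id": "MONDO:0002974"}],
        "n2": [{"id": "REACT:R-HSA-109704"}],
        "n3": [{"id": "HGNC:17947"}],
    }},
]
print(has_path(mock_results, expected_path))
```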

@colleenXu
Collaborator

What the results object should look like:

{
  "node_bindings": {
    "n0": [{"id": "CHEBI:41423"}],
    "n1": [{"id": "MONDO:0004247"}],
    "n2": [{"id": "NCBIGene:5742"}]
  },
  "edge_bindings": {
    "e0": [{"id": "CHEBI:41423-biolink:related_to-MONDO:0004247"}],
    "e1": [{"id": "NCBIGene:5742-biolink:related_to-MONDO:0004247"}]
  }
}

For this query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}
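
A results object of that shape could be assembled by joining the two hops on the shared intermediate node `n1`, roughly as sketched below. The tuple edge format, the function names, and the `EXAMPLE:unshared_disease` edge are assumptions for illustration. The edge ID format (`subject-predicate-object`) follows the IDs shown in this thread.

```python
from typing import Any, Dict, List, Tuple

# (subject_id, predicate, object_id); an assumed shape for illustration.
Edge = Tuple[str, str, str]


def edge_id(edge: Edge) -> str:
    """Edge ID in the subject-predicate-object form used in this thread."""
    return "-".join(edge)


def build_results(e0_edges: List[Edge], e1_edges: List[Edge]) -> List[Dict[str, Any]]:
    """Pair every e0 edge (n0 -> n1) with every e1 edge (n2 -> n1) sharing n1."""
    results = []
    for s0, p0, o0 in e0_edges:
        for s1, p1, o1 in e1_edges:
            if o0 != o1:
                continue  # the two hops do not meet at the same n1
            results.append({
                "node_bindings": {
                    "n0": [{"id": s0}],
                    "n1": [{"id": o0}],
                    "n2": [{"id": s1}],
                },
                "edge_bindings": {
                    "e0": [{"id": edge_id((s0, p0, o0))}],
                    "e1": [{"id": edge_id((s1, p1, o1))}],
                },
            })
    return results


e0_edges = [
    ("CHEBI:41423", "biolink:related_to", "MONDO:0004247"),
    # Hypothetical edge with no partner on the gene side; it yields no result.
    ("CHEBI:41423", "biolink:related_to", "EXAMPLE:unshared_disease"),
]
e1_edges = [("NCBIGene:5742", "biolink:related_to", "MONDO:0004247")]

results = build_results(e0_edges, e1_edges)
print(len(results))
```

The nested loop is quadratic in the number of edges. Indexing one side by its `n1` ID first would be the obvious improvement for larger hops.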

@colleenXu
Collaborator

colleenXu commented Sep 21, 2021

Note: Kevin's opening query, reformatted, now has results that look as expected. The query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["MESH:D015464"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "ids": ["CHEBI:45783"],
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        }
    }
}
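
Part of that reformatting is mechanical: the opening query uses singular `id` / `category` node fields, while the later queries in this thread use list-valued `ids` / `categories`. The sketch below covers only that field rename, under the assumption that nothing else about the node needs to change. The function name is hypothetical. The reformatted query above also renames the node and edge keys and swaps biolink:ChemicalSubstance for biolink:SmallMolecule, which this sketch does not attempt.

```python
import copy
from typing import Any, Dict

QueryGraph = Dict[str, Any]

# Old singular field name -> new list-valued field name.
FIELD_RENAMES = (("id", "ids"), ("category", "categories"))


def upgrade_node_fields(query_graph: QueryGraph) -> QueryGraph:
    """Return a copy with id/category rewritten as ids/categories lists."""
    upgraded = copy.deepcopy(query_graph)  # leave the caller's graph untouched
    for node in upgraded.get("nodes", {}).values():
        for old, new in FIELD_RENAMES:
            if old in node:
                value = node.pop(old)
                node[new] = value if isinstance(value, list) else [value]
    return upgraded


# Node and edge definitions copied from the opening query of this issue.
old_query_graph = {
    "nodes": {
        "a": {"category": "biolink:Disease", "id": "MESH:D015464"},
        "b": {"category": "biolink:ChemicalSubstance", "id": "CHEBI:45783"},
        "c": {"category": "biolink:Gene"},
    },
    "edges": {
        "ac": {"subject": "a", "object": "c"},
        "bc": {"subject": "c", "object": "b"},
    },
}

new_query_graph = upgrade_node_fields(old_query_graph)
print(new_query_graph["nodes"]["a"])
```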

@colleenXu
Collaborator

The new query-handler handles these cases; I checked this while testing the code.
