-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ai): add document qa and document summary #2054
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2054 +/- ##
==========================================
+ Coverage 90.49% 90.54% +0.04%
==========================================
Files 141 145 +4
Lines 22488 22603 +115
==========================================
+ Hits 20351 20466 +115
Misses 2137 2137
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a small comment; Otherwise, looks good to me.
This commit adds sdk support for the newly GA endpoints Document Question Answering and Document Summary, which are part of the AI API.
The utility methods can concatenate the pieces of the answer text to the full text of the answer, and combine the reference lists to a single list, eliminating duplicates.
10bd266
to
bf03552
Compare
>>> client.ai.tools.documents.summarize(ids=[123]) | ||
""" | ||
body = { | ||
"items": ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do something like the following to get a list of ids:
identifiers = IdentifierSequence.load(ids=id, external_ids=external_id, instance_ids=instance_id)
identifiers_dict = identifiers.as_dicts()
>>> client.ai.tools.documents.ask(question="What model pump was used?", ids=[123]) | ||
""" | ||
body = { | ||
"question": question, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same identifier trick here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great additions! I look forward to testing out 🥳
ids: Sequence[int] | None = None, | ||
external_ids: SequenceNotStr[str] | None = None, | ||
instance_ids: Sequence[NodeId] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use "Identifier trick" as mentioned above, supporting single inputs will automatically be supported. I suggest using parameter names in singular case:
ids: Sequence[int] | None = None, | |
external_ids: SequenceNotStr[str] | None = None, | |
instance_ids: Sequence[NodeId] | None = None, | |
id: int | Sequence[int] | None = None, | |
external_id: str | Sequence[str] | None = None, | |
instance_id: NodeId | Sequence[NodeId] | None = None, |
instance_ids: Sequence[NodeId] | None = None, | ||
ignore_unknown_ids: bool = False, | ||
) -> list[Summary]: | ||
"""Summarize a document using a Large Language Model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing API doc link
ids (Sequence[int] | None): Internal ids of documents to summarize. | ||
external_ids (SequenceNotStr[str] | None): External ids of documents to summarize. | ||
instance_ids (Sequence[NodeId] | None): Instance ids of documents to summarize. | ||
ignore_unknown_ids (bool): Whether to skip documents that can't be summarized, without throwing an error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this right? ignore_unknown_ids
is typically used when the given ids cant be found
ignore_unknown_ids (bool): Whether to skip documents that can't be summarized, without throwing an error | ||
|
||
Returns: | ||
list[Summary]: No description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing description of return type
external_ids: SequenceNotStr[str] | None = None, | ||
instance_ids: Sequence[NodeId] | None = None, | ||
ignore_unknown_ids: bool = False, | ||
) -> list[Summary]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make a SummaryList
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you shed some light on why that is useful? I come from Go and have learned the hard way to prefer primitive types, but I see it's very common in the sdk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I agree with you, but users of the SDK expect certain features from our "list of resource" classes such as:
get
method for hash table backed item lookup- pretty-printing in notebooks
- easy conversion to pandas data frame
- load and dump methods to/from dict and yaml
...all of which come for free by inheriting from the base Cognite list class
|
||
|
||
@dataclass | ||
class AnswerReference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments above.
|
||
content: list[AnswerContent] | ||
|
||
def get_full_answer_text(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Property perhaps?
def get_full_answer_text(self) -> str: | |
@property | |
def full_answer(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should __str__
return the full answer perhaps?
Get the full answer text. This is the concatenation of the texts from | ||
all the content objects. | ||
""" | ||
return "".join([content.text for content in self.content]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to create a temporary list:
return "".join([content.text for content in self.content]) | |
return "".join(content.text for content in self.content) |
""" | ||
return "".join([content.text for content in self.content]) | ||
|
||
def get_all_references(self) -> list[AnswerReference]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a property (or cached_property)
def get_all_references(self) -> list[AnswerReference]: | |
@property | |
def all_references(self) -> list[AnswerReference]: |
@@ -95,6 +98,9 @@ def __init__(self, *args: Any, **kwargs: Any) -> None: | |||
# - Add spacing above and below | |||
# - Use `spec=MyAPI` only for "top level" | |||
# - Use `spec_set=MyNestedAPI` for all nested APIs | |||
self.ai = MagicMock(spec=AIAPI) | |||
self.ai.tools = MagicMock(spec=AIToolsAPI) | |||
self.ai.tools.documents = MagicMock(spec_set=AIDocumentsAPI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.ai.tools.documents = MagicMock(spec_set=AIDocumentsAPI) | |
self.ai.tools.documents = MagicMock(spec_set=AIDocumentsAPI) | |
Description
This commit adds sdk support for the newly GA endpoints Document Question Answering and Document Summary, which are part of the AI API.
Checklist:
If a new method has been added it should be referenced in cognite.rst in order to generate docs based on its docstring.