Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ai): add document qa and document summary #2054

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

tommythorsen
Copy link
Contributor

@tommythorsen tommythorsen commented Dec 3, 2024

Description

This commit adds sdk support for the newly GA endpoints Document Question Answering and Document Summary, which are part of the AI API.

Checklist:

  • Tests added/updated.
  • Documentation updated. Documentation is generated from docstrings - these must be updated according to your change.
    If a new method has been added it should be referenced in cognite.rst in order to generate docs based on its docstring.
  • Changelog updated in CHANGELOG.md.
  • Version bumped. If triggering a new release is desired, bump the version number in _version.py and pyproject.toml per semantic versioning.

@tommythorsen tommythorsen requested review from a team as code owners December 3, 2024 19:29
Copy link

codecov bot commented Dec 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.54%. Comparing base (fa7f991) to head (bf03552).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2054      +/-   ##
==========================================
+ Coverage   90.49%   90.54%   +0.04%     
==========================================
  Files         141      145       +4     
  Lines       22488    22603     +115     
==========================================
+ Hits        20351    20466     +115     
  Misses       2137     2137              
Files with missing lines Coverage Δ
cognite/client/_api/ai/__init__.py 100.00% <100.00%> (ø)
cognite/client/_api/ai/tools/__init__.py 100.00% <100.00%> (ø)
cognite/client/_api/ai/tools/documents.py 100.00% <100.00%> (ø)
cognite/client/_api_client.py 90.18% <ø> (ø)
cognite/client/_cognite_client.py 93.81% <100.00%> (+0.13%) ⬆️
cognite/client/_version.py 100.00% <100.00%> (ø)
cognite/client/data_classes/__init__.py 100.00% <100.00%> (ø)
cognite/client/data_classes/ai.py 100.00% <100.00%> (ø)
cognite/client/testing.py 100.00% <100.00%> (ø)

Copy link

@kandeh kandeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a small comment; Otherwise, looks good to me.

tommythorsen and others added 11 commits December 4, 2024 18:02
This commit adds sdk support for the newly GA endpoints Document
Question Answering and Document Summary, which are part of the AI API.
The utility methods can concatenate the pieces of the answer text to the
full text of the answer, and combine the reference lists to a single
list, eliminating duplicates.
>>> client.ai.tools.documents.summarize(ids=[123])
"""
body = {
"items": (

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do something like the following to get a list of ids:

identifiers = IdentifierSequence.load(ids=id, external_ids=external_id, instance_ids=instance_id)
identifiers_dict = identifiers.as_dicts()

>>> client.ai.tools.documents.ask(question="What model pump was used?", ids=[123])
"""
body = {
"question": question,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same identifier trick here

Copy link
Contributor

@haakonvt haakonvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great additions! I look forward to testing out 🥳

Comment on lines +16 to +18
ids: Sequence[int] | None = None,
external_ids: SequenceNotStr[str] | None = None,
instance_ids: Sequence[NodeId] | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use "Identifier trick" as mentioned above, supporting single inputs will automatically be supported. I suggest using parameter names in singular case:

Suggested change
ids: Sequence[int] | None = None,
external_ids: SequenceNotStr[str] | None = None,
instance_ids: Sequence[NodeId] | None = None,
id: int | Sequence[int] | None = None,
external_id: str | Sequence[str] | None = None,
instance_id: NodeId | Sequence[NodeId] | None = None,

instance_ids: Sequence[NodeId] | None = None,
ignore_unknown_ids: bool = False,
) -> list[Summary]:
"""Summarize a document using a Large Language Model.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing API doc link

ids (Sequence[int] | None): Internal ids of documents to summarize.
external_ids (SequenceNotStr[str] | None): External ids of documents to summarize.
instance_ids (Sequence[NodeId] | None): Instance ids of documents to summarize.
ignore_unknown_ids (bool): Whether to skip documents that can't be summarized, without throwing an error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? ignore_unknown_ids is typically used when the given ids cant be found

ignore_unknown_ids (bool): Whether to skip documents that can't be summarized, without throwing an error

Returns:
list[Summary]: No description.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing description of return type

external_ids: SequenceNotStr[str] | None = None,
instance_ids: Sequence[NodeId] | None = None,
ignore_unknown_ids: bool = False,
) -> list[Summary]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make a SummaryList

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you shed some light on why that is useful? I come from Go and have learned the hard way to prefer primitive types, but I see it's very common in the sdk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I agree with you, but users of the SDK expect certain features from our "list of resource" classes such as:

  • get method for hash table backed item lookup
  • pretty-printing in notebooks
  • easy conversion to pandas data frame
  • load and dump methods to/from dict and yaml
    ...all of which come for free by inheriting from the base Cognite list class



@dataclass
class AnswerReference:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments above.


content: list[AnswerContent]

def get_full_answer_text(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Property perhaps?

Suggested change
def get_full_answer_text(self) -> str:
@property
def full_answer(self) -> str:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should __str__ return the full answer perhaps?

Get the full answer text. This is the concatenation of the texts from
all the content objects.
"""
return "".join([content.text for content in self.content])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to create a temporary list:

Suggested change
return "".join([content.text for content in self.content])
return "".join(content.text for content in self.content)

"""
return "".join([content.text for content in self.content])

def get_all_references(self) -> list[AnswerReference]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a property (or cached_property)

Suggested change
def get_all_references(self) -> list[AnswerReference]:
@property
def all_references(self) -> list[AnswerReference]:

@@ -95,6 +98,9 @@ def __init__(self, *args: Any, **kwargs: Any) -> None:
# - Add spacing above and below
# - Use `spec=MyAPI` only for "top level"
# - Use `spec_set=MyNestedAPI` for all nested APIs
self.ai = MagicMock(spec=AIAPI)
self.ai.tools = MagicMock(spec=AIToolsAPI)
self.ai.tools.documents = MagicMock(spec_set=AIDocumentsAPI)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.ai.tools.documents = MagicMock(spec_set=AIDocumentsAPI)
self.ai.tools.documents = MagicMock(spec_set=AIDocumentsAPI)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants