Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract content urls #4604

Merged

Conversation

akolson
Copy link
Member

@akolson akolson commented Jul 11, 2024

Summary

Description of the change(s) you made

This pr implements the logic responsible for the extraction of the content urls for content nodes. It also perform code clean-up and optimization tasks

Manual verification steps performed

NA

Does this introduce any tech-debt items?

No

References

Closes #4455

Comments


Contributor's Checklist

PR process:

  • If this is an important user-facing change, PR or related issue the CHANGELOG label been added to this PR. Note: items with this label will be added to the CHANGELOG at a later time
  • If this includes an internal dependency change, a link to the diff is provided
  • The docs label has been added if this introduces a change that needs to be updated in the user docs?
  • If any Python requirements have changed, the updated requirements.txt files also included in this PR
  • Opportunities for using Google Analytics here are noted
  • Migrations are safe for a large db

Studio-specifc:

  • All user-facing strings are translated properly
  • The notranslate class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
  • All UI components are LTR and RTL compliant
  • Views are organized into pages, components, and layouts directories as described in the docs
  • Users' storage used is recalculated properly on any changes to main tree files
  • If there new ways this uses user data that needs to be factored into our Privacy Policy, it has been noted.

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@akolson akolson marked this pull request as ready for review July 15, 2024 18:56
@akolson akolson requested a review from bjester July 15, 2024 18:56
nodes = self._extract_data(response)
cache = [
RecommendationsCache(
request=request,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicating the request for each node returned isn't a scalable way to handle the cache. We should distill the request to something smaller and still represents the uniqueness of each request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now generate a unique hash for each request from the params and json body and store that instead

contentcuration/contentcuration/utils/recommendations.py Outdated Show resolved Hide resolved
contentcuration/contentcuration/utils/recommendations.py Outdated Show resolved Hide resolved
contentcuration/contentcuration/utils/recommendations.py Outdated Show resolved Hide resolved
contentcuration/contentcuration/utils/recommendations.py Outdated Show resolved Hide resolved
contentcuration/contentcuration/utils/recommendations.py Outdated Show resolved Hide resolved
@akolson akolson requested a review from bjester July 24, 2024 20:45
Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking improved! I left some comments about the override_threshold and its implications on all-the-things

contentcuration/contentcuration/utils/recommendations.py Outdated Show resolved Hide resolved
contentcuration/automation/models.py Outdated Show resolved Hide resolved
@akolson akolson requested a review from bjester July 30, 2024 13:59
Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Just some small things regarding the override_threshold handling and the cache

request_hash = self._generate_request_hash(request)
data = list(
RecommendationsCache.objects.filter(request_hash=request_hash)
.order_by('rank')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.order_by('rank')
.order_by('override_threshold', 'rank')

return self.backend.make_request(embed_topics_request)
request_hash = self._generate_request_hash(request)
data = list(
RecommendationsCache.objects.filter(request_hash=request_hash)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RecommendationsCache.objects.filter(request_hash=request_hash)
RecommendationsCache.objects.filter(request_hash=request_hash, override_threshold=request.params.get('override_threshold', False))

nodes = self._extract_data(response)
if len(nodes) > 0:
node_ids = [node['contentnode_id'] for node in nodes]
recommendations = list(ContentNode.objects.filter(id__in=node_ids))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may not want to preemptively cast this to a list to get all the objects. Returning a queryset is very flexible.

On the frontend, we'll need access to the data, either directly from the API or via the public API, as long as we have the data points required here: https://github.com/learningequality/studio/blob/unstable/contentcuration/contentcuration/frontend/channelEdit/vuex/contentNode/actions.js#L52C55-L52C82

@akolson akolson requested a review from bjester August 1, 2024 13:55
Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some defensiveness when writing to the cache would be worthwhile.

I'm also concerned about the code that builds the recommendations queryset.

null=True,
blank=True,
related_name='recommendations',
on_delete=models.SET_NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning for setting this to null on deletion of the node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of your comment here i guess it defeats to purpose setting it null and keeping the record. I think a better approach would be to cascade the deletion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think cascade deletion should be fine.

Comment on lines 213 to 226
# Get the channel_id from PublicContentNode based on matching node_id from ContentNode
channel_id_subquery = PublicContentNode.objects.filter(
self._normalize_uuid(F('id')) == self._normalize_uuid(OuterRef('node_id'))
).values('channel_id')[:1]

# Get main_tree_id from Channel based on channel_id obtained from channel_id_subquery
main_tree_id_subquery = Channel.objects.filter(
self._normalize_uuid(F('id')) == self._normalize_uuid(Subquery(channel_id_subquery))
).values('main_tree_id')[:1]

# Annotate main_tree_id onto ContentNode
recommendations = ContentNode.objects.filter(id__in=node_ids).annotate(
main_tree_id=Subquery(main_tree_id_subquery)
).values('id', 'node_id', 'main_tree_id', 'parent_id')
Copy link
Member

@bjester bjester Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look correct to me. Are there tests covering this code?

channel_id_subquery has an OuterRef to node_id but is used in a Channel queryset subquery, which means the OuterRef should apply to the Channel model, which doesn't have a node_id.

Reiterating again that the PublicContentNode.id is the same as ContentNode.node_id, which seems like the intent in the first query.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some changes to the query and added tests

override_threshold=override_threshold,
) for node in nodes if node['contentnode_id'] not in existing_cache
]
RecommendationsCache.objects.bulk_create(new_cache, ignore_conflicts=True)
Copy link
Member

@bjester bjester Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, having ignore_conflicts should mean you can omit checking existing_cache, because:

On databases that support it (all but Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. (ref)

So with the unique_together expression on request_hash and contentnode_id, it should ignore failing to create those on the overridden request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have a unit test to verify the previously described behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote tests for the cache that confirmed the above when ignore_conflicts is True or False. The only difference is that when ignore_conflicts=False, an exception is raised, brining the creation to a halt. So I think we can safely remove the existing_cache check as long as ignore_conflicts=True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! sounds perfect

@akolson akolson requested a review from bjester August 7, 2024 14:47
Comment on lines +215 to +241
channel_cte = With(
Channel.objects.annotate(
channel_id=self._cast_to_uuid(F('id'))
).filter(
Exists(
PublicContentNode.objects.filter(
id__in=cast_node_ids,
channel_id=OuterRef('channel_id')
)
)
).values(
'main_tree_id',
tree_id=F('main_tree__tree_id'),
).distinct()
)

recommendations = channel_cte.join(
ContentNode.objects.filter(node_id__in=node_ids),
tree_id=channel_cte.col.tree_id
).with_cte(channel_cte).annotate(
main_tree_id=channel_cte.col.main_tree_id
).values(
'id',
'node_id',
'main_tree_id',
'parent_id',
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjester, we should be good now I think?

Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @akolson!

@akolson akolson merged commit 3d2f924 into learningequality:search-recommendations Aug 12, 2024
13 checks passed
@akolson akolson deleted the embed-content-logic branch October 31, 2024 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants