Extract content urls #4604
Conversation
contentcuration/automation/migrations/0002_auto_20240711_1938.py
Outdated
nodes = self._extract_data(response)
cache = [
    RecommendationsCache(
        request=request,
Duplicating the request for each node returned isn't a scalable way to handle the cache. We should distill the request to something smaller that still represents the uniqueness of each request.
We now generate a unique hash for each request from the params and JSON body and store that instead.
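A minimal sketch of what that hashing could look like (the `params`/`json` attributes and the `_generate_request_hash` name follow the discussion above; the exact request shape is an assumption, not the PR's implementation):

```python
import hashlib
import json


def _generate_request_hash(self, request):
    # Assumption: the request exposes `params` (query params) and `json` (body) dicts.
    # Serialize deterministically so equivalent requests produce the same hash.
    payload = json.dumps(
        {'params': request.params, 'json': request.json},
        sort_keys=True,
        default=str,
    )
    return hashlib.md5(payload.encode('utf-8')).hexdigest()
```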
Looking improved! I left some comments about the override_threshold and its implications on all the things.
Looks great. Just some small things regarding the override_threshold handling and the cache.
request_hash = self._generate_request_hash(request)
data = list(
    RecommendationsCache.objects.filter(request_hash=request_hash)
    .order_by('rank')
Suggested change:
-    .order_by('rank')
+    .order_by('override_threshold', 'rank')
return self.backend.make_request(embed_topics_request)
request_hash = self._generate_request_hash(request)
data = list(
    RecommendationsCache.objects.filter(request_hash=request_hash)
Suggested change:
-    RecommendationsCache.objects.filter(request_hash=request_hash)
+    RecommendationsCache.objects.filter(request_hash=request_hash, override_threshold=request.params.get('override_threshold', False))
nodes = self._extract_data(response)
if len(nodes) > 0:
    node_ids = [node['contentnode_id'] for node in nodes]
    recommendations = list(ContentNode.objects.filter(id__in=node_ids))
We may not want to preemptively cast this to a list to get all the objects. Returning a queryset is very flexible.
On the frontend, we'll need access to the data, either directly from the API or via the public API, as long as we have the data points required here: https://github.com/learningequality/studio/blob/unstable/contentcuration/contentcuration/frontend/channelEdit/vuex/contentNode/actions.js#L52C55-L52C82
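A sketch of the direction being suggested here, assuming the surrounding method simply returns the recommendations (not the PR's final implementation):

```python
node_ids = [node['contentnode_id'] for node in nodes]
# Return the lazy queryset instead of materializing it with list();
# callers can still chain .filter(), .annotate(), or .values() before evaluation.
return ContentNode.objects.filter(id__in=node_ids)
```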
Some defensiveness when writing to the cache would be worthwhile.
I'm also concerned about the code that builds the recommendations queryset.
contentcuration/automation/models.py
Outdated
null=True,
blank=True,
related_name='recommendations',
on_delete=models.SET_NULL,
What's the reasoning for setting this to null on deletion of the node?
In light of your comment here, I guess it defeats the purpose to set it null and keep the record. I think a better approach would be to cascade the deletion?
Yeah I think cascade deletion should be fine.
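A hedged sketch of the agreed change on the cache model's foreign key (the field options follow the diff above; the field name and the rest of the model are assumptions):

```python
contentnode = models.ForeignKey(
    ContentNode,
    null=True,
    blank=True,
    related_name='recommendations',
    on_delete=models.CASCADE,  # drop cache rows when the referenced node is deleted
)
```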
# Get the channel_id from PublicContentNode based on matching node_id from ContentNode
channel_id_subquery = PublicContentNode.objects.filter(
    self._normalize_uuid(F('id')) == self._normalize_uuid(OuterRef('node_id'))
).values('channel_id')[:1]

# Get main_tree_id from Channel based on channel_id obtained from channel_id_subquery
main_tree_id_subquery = Channel.objects.filter(
    self._normalize_uuid(F('id')) == self._normalize_uuid(Subquery(channel_id_subquery))
).values('main_tree_id')[:1]

# Annotate main_tree_id onto ContentNode
recommendations = ContentNode.objects.filter(id__in=node_ids).annotate(
    main_tree_id=Subquery(main_tree_id_subquery)
).values('id', 'node_id', 'main_tree_id', 'parent_id')
This does not look correct to me. Are there tests covering this code? channel_id_subquery has an OuterRef to node_id but is used in a Channel queryset subquery, which means the OuterRef should apply to the Channel model, which doesn't have a node_id. Reiterating again that PublicContentNode.id is the same as ContentNode.node_id, which seems like the intent in the first query.
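A small sketch of the scoping issue being described (the UUID normalization from the diff above is omitted here for brevity; this is illustrative, not the fix that was ultimately adopted):

```python
from django.db.models import OuterRef, Subquery

# OuterRef resolves against the queryset the subquery is embedded in.
# Used directly inside a ContentNode queryset, OuterRef('node_id') refers to
# ContentNode.node_id, matching PublicContentNode.id as intended:
channel_id_subquery = PublicContentNode.objects.filter(
    id=OuterRef('node_id')
).values('channel_id')[:1]

annotated = ContentNode.objects.filter(id__in=node_ids).annotate(
    channel_id=Subquery(channel_id_subquery)
)

# But when that same subquery is nested inside a Channel queryset (as in
# main_tree_id_subquery above), the OuterRef now resolves against Channel,
# which has no node_id field.
```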
Made some changes to the query and added tests
        override_threshold=override_threshold,
    ) for node in nodes if node['contentnode_id'] not in existing_cache
]
RecommendationsCache.objects.bulk_create(new_cache, ignore_conflicts=True)
Theoretically, having ignore_conflicts should mean you can omit checking existing_cache, because:

"On databases that support it (all but Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values." (ref)

So with the unique_together expression on request_hash and contentnode_id, it should ignore failing to create those on the overridden request.
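A sketch of what dropping the existing_cache check could look like, assuming the unique_together constraint on (request_hash, contentnode) described above (the constructor kwargs mirror the diff where visible and are otherwise assumptions):

```python
new_cache = [
    RecommendationsCache(
        request_hash=request_hash,
        contentnode_id=node['contentnode_id'],
        rank=node.get('rank'),
        override_threshold=override_threshold,
    )
    for node in nodes  # no membership check against existing_cache needed
]
# Rows that violate the (request_hash, contentnode) unique constraint are
# silently skipped by the database instead of raising IntegrityError.
RecommendationsCache.objects.bulk_create(new_cache, ignore_conflicts=True)
```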
It would be great to have a unit test to verify the previously described behavior.
I wrote tests for the cache that confirmed the above when ignore_conflicts is True or False. The only difference is that when ignore_conflicts=False, an exception is raised, bringing the creation to a halt. So I think we can safely remove the existing_cache check as long as ignore_conflicts=True.
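A minimal sketch of tests along those lines (the test class name, the create_test_contentnode fixture helper, and the field values are assumptions; the assertions mirror the behavior described above):

```python
from django.db import IntegrityError, transaction
from django.test import TestCase


class RecommendationsCacheBulkCreateTest(TestCase):
    def setUp(self):
        # Assumption: a helper exists to create a ContentNode to reference.
        self.node = create_test_contentnode()

    def test_duplicates_are_skipped_when_ignore_conflicts_is_true(self):
        first = RecommendationsCache(request_hash='abc123', contentnode=self.node, rank=1.0)
        RecommendationsCache.objects.bulk_create([first])

        duplicate = RecommendationsCache(request_hash='abc123', contentnode=self.node, rank=1.0)
        # Same (request_hash, contentnode) pair: skipped rather than raising.
        RecommendationsCache.objects.bulk_create([duplicate], ignore_conflicts=True)

        self.assertEqual(RecommendationsCache.objects.filter(request_hash='abc123').count(), 1)

    def test_duplicates_raise_when_ignore_conflicts_is_false(self):
        first = RecommendationsCache(request_hash='abc123', contentnode=self.node, rank=1.0)
        RecommendationsCache.objects.bulk_create([first])

        duplicate = RecommendationsCache(request_hash='abc123', contentnode=self.node, rank=1.0)
        with self.assertRaises(IntegrityError), transaction.atomic():
            RecommendationsCache.objects.bulk_create([duplicate], ignore_conflicts=False)
```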
Nice! Sounds perfect.
channel_cte = With(
    Channel.objects.annotate(
        channel_id=self._cast_to_uuid(F('id'))
    ).filter(
        Exists(
            PublicContentNode.objects.filter(
                id__in=cast_node_ids,
                channel_id=OuterRef('channel_id')
            )
        )
    ).values(
        'main_tree_id',
        tree_id=F('main_tree__tree_id'),
    ).distinct()
)

recommendations = channel_cte.join(
    ContentNode.objects.filter(node_id__in=node_ids),
    tree_id=channel_cte.col.tree_id
).with_cte(channel_cte).annotate(
    main_tree_id=channel_cte.col.main_tree_id
).values(
    'id',
    'node_id',
    'main_tree_id',
    'parent_id',
)
@bjester, we should be good now I think?
Nice work @akolson!
Merged commit 3d2f924 into learningequality:search-recommendations
Summary
Description of the change(s) you made
This PR implements the logic responsible for extracting the content URLs for content nodes. It also performs code clean-up and optimization tasks.
Manual verification steps performed
NA
Does this introduce any tech-debt items?
No
References
Closes #4455
Comments
Contributor's Checklist

PR process:
- Has the CHANGELOG label been added to this PR? Note: items with this label will be added to the CHANGELOG at a later time
- Has the docs label been added if this introduces a change that needs to be updated in the user docs?
- Are the requirements.txt files also included in this PR?

Studio-specific:
- Has the notranslate class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)?
- … the pages, components, and layouts directories as described in the docs

Testing:

Reviewer's Checklist
This section is for reviewers to fill out.
- … (yarn and pip)