Get feed by user-defined slug #358
Comments
tl;dr: I think you can use feed tags for this, details below.
Love this phrase :)
Indeed, it would not be a good idea to use user_title, since it is in some ways "special"; among others it may be used (together with the feed's original title) to implement title sort (e.g. dropping "The" etc.); this is one of the main reasons I did not deprecate it in favor of tags.
This is probably another reason for not using user_title like this, since you may still want "human readable" titles that are different from the original ones, but are also not slugs. So, if reader were to have an API for exactly this use case, it'd be something separate, like Feed.slug, set_feed_slug(url, slug), get_feed(slug=slug).

That said, I think you can achieve this by using feed tags, since they are indexed. In the example below, for my database with 177 feeds, get_feed_by_slug(slug) is only twice as slow as get_feed(url).

from __future__ import annotations


def _make_slug_tag(reader, slug):
    return reader.make_plugin_reserved_name('slugs', slug)


def _get_slug_tags(reader, resource):
    prefix = _make_slug_tag(reader, '')
    # filter tags by prefix would make this faster
    # https://github.com/lemon24/reader/issues/309
    rv = list()
    for tag in reader.get_tag_keys(resource):
        if not tag.startswith(prefix):
            continue
        rv.append(tag)
    return prefix, rv


def get_feed_slug(reader, feed) -> str | None:
    prefix, tags = _get_slug_tags(reader, feed)
    if not tags:
        return None
    assert len(tags) == 1, (feed, tags)
    return tags[0].removeprefix(prefix)


def get_feed_by_slug(reader, slug) -> Feed | None:
    tag = _make_slug_tag(reader, slug)
    feeds = list(reader.get_feeds(tags=[tag]))
    if not feeds:
        return None
    assert len(feeds) == 1, (slug, [f.url for f in feeds])
    return feeds[0]


def set_feed_slug(reader, feed, slug: str | None):
    feed = reader.get_feed(feed)
    tag = _make_slug_tag(reader, slug)

    if slug:
        already_has_slug = False
        for other_feed in reader.get_feeds(tags=[tag]):
            if feed.url == other_feed.url:
                already_has_slug = True
                continue
            raise ValueError(f"slug {slug!r} already in use by {other_feed.url!r}")
        # NOTE: technically a race condition / TOCTOU bug
        if already_has_slug:
            return

    _, other_tags = _get_slug_tags(reader, feed)
    for other_tag in other_tags:
        reader.delete_tag(feed, other_tag, missing_ok=True)

    if slug:
        reader.set_tag(feed, tag)

        # "handle" race condition; ensures(?) at most one feed has the slug,
        # with the risk that none of them has it in case of race condition.
        for other_feed in reader.get_feeds(tags=[tag]):
            if feed.url == other_feed.url:
                continue
            reader.delete_tag(other_feed, tag, missing_ok=True)

    # an alternative way of dealing with the race condition is
    # to use global tags, but then we'd have to "garbage-collect"
    # slugs of feeds that have been deleted (in an after_feeds_update_hook),
    # although that may lead to a different sort of race condition.


if __name__ == '__main__':
    import timeit
    from reader import make_reader

    reader = make_reader('db.sqlite')
    url = 'https://death.andgravity.com/_feed/index.xml'

    set_feed_slug(reader, url, 'one')
    print(
        get_feed_slug(reader, url),
        getattr(get_feed_by_slug(reader, 'one'), 'url', None),
    )

    try:
        set_feed_slug(reader, 'https://xkcd.com/atom.xml', 'one')
        raise Exception("shouldn't get here")
    except ValueError as e:
        # catch only the expected error, so the sentinel above is not swallowed
        print('error:', e)

    set_feed_slug(reader, url, 'two')
    print(
        get_feed_slug(reader, url),
        getattr(get_feed_by_slug(reader, 'two'), 'url', None),
    )
    print()

    def repeat(stmt):
        times = timeit.repeat(stmt, repeat=1000, number=1, globals=globals())
        print(f"{stmt}: {min(times):.6f}s min, {sum(times)/len(times):.6f}s avg")

    print(reader.get_feed_counts().total, 'feeds')
    repeat("reader.get_feed(url)")
    repeat("get_feed_by_slug(reader, 'two')")

(For reference, this is somewhat related to #159 (comment) that discusses having a surrogate key, except I didn't think it would be user-defined there.)
Thanks for the suggestion about tags. I didn't think of using them because there can be multiple ones; I'll have to look at the schema again to convince myself I'm not doing something abominable. Good that we're talking.
The only issue I see with using tags is the race condition mentioned in the code comment, but I think the approach of (accidentally, rarely) deleting both should be acceptable, especially for a single-user system.

BTW (and no pressure! :), if the approach above ends up being OK for you, I'd be happy to merge it as a reader plugin, either experimental or built-in (the difference is mainly in the amount of testing). The get/set_feed_slug() and get_feed_by_slug() "methods" can be monkey-patched onto the reader instance as enclosure_dedupe does here. (In theory I'd be fine with this being built into reader itself too, which would also get rid of the race condition, but I think it'd probably be overkill.)
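The monkey-patching idea can be sketched roughly as follows. This is a minimal illustration, not the plugin itself: `init_reader` and `FakeReader` are hypothetical names, and the three helpers here are dictionary-backed stand-ins for the tag-based functions above, so that the block runs without a database.

```python
from functools import partial


# Stand-ins for the tag-based helpers from the earlier comment.
# They keep slugs in a plain dict (an assumption made for this sketch only).
def get_feed_slug(reader, feed):
    return reader._slugs.get(feed)


def get_feed_by_slug(reader, slug):
    return next((f for f, s in reader._slugs.items() if s == slug), None)


def set_feed_slug(reader, feed, slug):
    owner = get_feed_by_slug(reader, slug) if slug else None
    if owner is not None and owner != feed:
        raise ValueError(f"slug {slug!r} already in use by {owner!r}")
    if slug:
        reader._slugs[feed] = slug
    else:
        reader._slugs.pop(feed, None)


def init_reader(reader):
    """Hypothetical plugin entry point: bind the helpers to this instance."""
    reader.get_feed_slug = partial(get_feed_slug, reader)
    reader.get_feed_by_slug = partial(get_feed_by_slug, reader)
    reader.set_feed_slug = partial(set_feed_slug, reader)


class FakeReader:
    """Hypothetical minimal object standing in for a Reader instance."""

    def __init__(self):
        self._slugs = {}


reader = FakeReader()
init_reader(reader)

url = 'https://example.com/feed.xml'
reader.set_feed_slug(url, 'example')
print(reader.get_feed_slug(url), reader.get_feed_by_slug('example'))
```

With the real library, a callable like `init_reader` would presumably be handed to `make_reader()` through its `plugins` argument, with the stand-ins replaced by the tag-based functions; treat that wiring as an assumption until checked against the plugin documentation.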
Hmmm, so no, looking at the database I can't use a tag for what I want, as there's only an index on the composite key (feed, key) -- nothing on the value of the key/tag, and even if I went with the hack of a unique key, linear scan of the Would a PR to this effect be considered?
My suggestion was to use tag keys with You are right in that there is a scan involved, since there's no index on the feed_tags key column alone. My results were reasonably fast because I only have a small number of feed tags (177 feeds * ~2 tags/feed). However, indexes on tag keys can and should be added, since they exist for this kind of access pattern; I opened #359 for this, and may work on it reasonably soon.
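To illustrate why an index on the key column alone matters, here is a small sketch using plain sqlite3. The table is a simplified stand-in for feed_tags: the column list, the sample tag keys, and the index name `feed_tags_by_key` are assumptions made for the example, not reader's actual schema or the change tracked in #359.

```python
import sqlite3

db = sqlite3.connect(':memory:')

# Simplified stand-in table; only the composite primary key is indexed.
db.execute("""
    CREATE TABLE feed_tags (
        feed TEXT NOT NULL,
        key TEXT NOT NULL,
        value TEXT NOT NULL,
        PRIMARY KEY (feed, key)
    )
""")
db.executemany(
    "INSERT INTO feed_tags VALUES (?, ?, ?)",
    [
        (f"https://example.com/{i}.xml", f".plugin.slugs.slug-{i}", "null")
        for i in range(1000)
    ],
)

QUERY = "SELECT feed FROM feed_tags WHERE key = ?"


def query_plan(db):
    """Return the planner's description of how QUERY would be executed."""
    rows = db.execute("EXPLAIN QUERY PLAN " + QUERY, ("any-key",)).fetchall()
    # the human-readable detail is the last of the four columns
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(db)

# Hypothetical index on the key column alone.
db.execute("CREATE INDEX feed_tags_by_key ON feed_tags (key)")
plan_after = query_plan(db)

print("before:", plan_before)
print("after: ", plan_after)
```

Without the extra index the planner has to scan every row to find a key; with it, the lookup becomes an index search, which is what keeps lookups by slug cheap as the number of tags grows.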
Yes, definitely, but I'm somewhat attached to the "slug" naming, since I think it best conveys the idea that it's used as an identifier – "the part of a URL that identifies a page in human-readable keywords" – and it also matches the publishing term.
Added a cleaned up version of the code in #358 (comment) as a plugin recipe. Closing this for now, feel free to reopen if needed.
Hello, I'm trying to write my dream feed reader with your library, and one obstacle I'm facing now is that I'm used to giving short unique IDs (or names) to feeds and accessing them with that ID. That's because URLs are often long and messy, and author titles can be missing or inconvenient (with spaces, among other things). Initially it looked as if the user_title attribute might work for that, but alas, the database schema has no index on the column, so that would be a linear search each time.

Then I thought I could add the index myself in my code -- but the Reader class provides no direct access to the database connection, so I can't do it with a plugin. I could simply open the database after it's created and add the index behind Reader's back, but that would be really ugly. So, what approach do you think I should take? Should I ask you to add the index 😁, or is there something else I can do now?

Thanks,
--
Ian