Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get feed by user-defined slug #358

Closed
nobrowser opened this issue Oct 24, 2024 · 6 comments
Closed

Get feed by user-defined slug #358

nobrowser opened this issue Oct 24, 2024 · 6 comments

Comments

@nobrowser
Copy link

Hello, I'm trying to write my dream feed reader with your library, and one obstacle I'm facing now is that I'm used to giving short unique IDs (or names) to feeds and accessing them with that ID. That's because URLs are often long and messy, and author titles can be missing or inconvenient (with spaces, among other things). Initially it looked as if the user_title attribute might work for that, but alas, the database schema has no index on the column, so that would be a linear search each time.

Then I thought I could add the index myself in my code -- but the Reader class provides no direct access to the database connection, so I can't do it with a plugin. I could simply open the database after it's created and add the index behind Reader's back, but that would be really ugly. So, what approach do you think I should take? Should I ask you to add the index 😁 , or is there something else I can do now?

Thanks,

--
Ian

@lemon24
Copy link
Owner

lemon24 commented Oct 26, 2024

tl;dr: I think you can use feed tags for this, details below.

my dream feed reader

Love this phrase :)

it looked as if the user_title attribute might work for that

Indeed, it would not be a good idea to use user_title, since it is in some ways "special"; among others it may be used (together with the feed's original title) to implement title sort (e.g. dropping "The" etc.); this is one of the main reasons I did not deprecate it in favor of tags.

URLs are often long and messy, and author titles can be missing or inconvenient (with spaces, among other things)

This is probably another reason for not using user_title like this, since you may still want "human readable" titles that are different from the original ones, but are also not slugs.

So, if reader were to have an API for exactly this use case, it'd be something separate, like Feed.slug, set_feed_slug(url, slug), get_feed(slug=slug).


That said, I think you can achieve this by using feed tags, since they are indexed.

In the example below (click to expand), for my database with 177 feeds, get_feed_by_slug(slug) is only twice as slow as get_feed(url).
from __future__ import annotations

def _make_slug_tag(reader, slug):
   return reader.make_plugin_reserved_name('slugs', slug)

def _get_slug_tags(reader, resource):
    prefix = _make_slug_tag(reader, '')
    # filter tags by prefix would make this faster
    # https://github.com/lemon24/reader/issues/309
    rv = list()
    for tag in reader.get_tag_keys(resource):
        if not tag.startswith(prefix):
            continue
        rv.append(tag)
    return prefix, rv
    
def get_feed_slug(reader, feed) -> str | None:
    prefix, tags = _get_slug_tags(reader, feed)
    if not tags:
        return None
    assert len(tags) == 1, (feed, tags)
    return tags[0].removeprefix(prefix) 

def get_feed_by_slug(reader, slug) -> Feed | None:
    tag = _make_slug_tag(reader, slug)
    feeds = list(reader.get_feeds(tags=[tag]))
    if not feeds:
        return None
    assert len(feeds) == 1, (slug, [f.url for f in feeds])
    return feeds[0]

def set_feed_slug(reader, feed, slug: str | None):
    feed = reader.get_feed(feed)
    tag = _make_slug_tag(reader, slug)
    
    if slug:
        already_has_slug = False
        for other_feed in reader.get_feeds(tags=[tag]):
            if feed.url == other_feed.url:
                already_has_slug = True
                continue
            raise ValueError(f"slug {slug!r} already in use by {other_feed.url!r}")

        # NOTE: technically a race condition / TOCTOU bug

        if already_has_slug:
            return
        
    _, other_tags = _get_slug_tags(reader, feed)
    for other_tag in other_tags:
        reader.delete_tag(feed, other_tag, missing_ok=True)
        
    if slug:
        reader.set_tag(feed, tag)
        
        # "handle" race condition; ensures(?) at most one feed has the slug,
        # with the risk that none of them has it in case of race condition.
        
        for other_feed in reader.get_feeds(tags=[tag]):
            if feed.url == other_feed.url:
                continue
            reader.delete_tag(other_feed, tag, missing_ok=True)

    # an alternative way of dealing with the race condition is
    # to use global tags, but then we'd have to "garbage-collect"
    # slugs of feeds that have been deleted (in an after_feeds_update_hook),
    # although that may lead to a different sort of race condition.
    
if __name__ == '__main__':
    import timeit
    from reader import make_reader
    
    reader = make_reader('db.sqlite')
    url = 'https://death.andgravity.com/_feed/index.xml'
    
    set_feed_slug(reader, url, 'one')
    
    print(
        get_feed_slug(reader, url),
        getattr(get_feed_by_slug(reader, 'one'), 'url', None),
    )

    try:
        set_feed_slug(reader, 'https://xkcd.com/atom.xml', 'one')
        raise Exception("shouldn't get here")
    except Exception as e:
        print('error:', e)
    
    set_feed_slug(reader, url, 'two')

    print(
        get_feed_slug(reader, url),
        getattr(get_feed_by_slug(reader, 'two'), 'url', None),
    )

    print()
    
    def repeat(stmt):
        times = timeit.repeat(stmt, repeat=1000, number=1, globals=globals())
        print(f"{stmt}: {min(times):.6f}s min, {sum(times)/len(times):.6f}s avg")
    
    print(reader.get_feed_counts().total, 'feeds')
    repeat("reader.get_feed(url)")
    repeat("get_feed_by_slug(reader, 'two')")

(For reference, this is somewhat related to #159 (comment) that discusses having a surrogate key, except I didn't think it would be user-defined there.)

@lemon24 lemon24 changed the title access feeds by user_title Get feed by user-defined slug Oct 26, 2024
@nobrowser
Copy link
Author

Thanks for the suggestion about tags. I didn't think of using them somehow because of the fact that there can be multiple ones, I'll have to look at the schema again to convince myself I'm not doing something abominable.

Good that we're talking.

--
Ian

@lemon24
Copy link
Owner

lemon24 commented Oct 27, 2024

that there can be multiple [tags]

The only issue I see with using tags is the race condition mentioned in the code comment, but I think the approach of (accidentally, rarely) deleting both should be acceptable, especially for a single-user system.

BTW (and no pressure! :), if the approach above ends up being OK for you, I'd happy to merge it as a reader plugin, either experimental or built-in (the difference is mainly in the amount of testing). The get/set_feed_slug() and get_feed_by_slug() "methods" can be monkey-patched onto the reader instance as enclosure_dedupe does here.

(In theory I'd be fine with this being built into reader itself too, which would also get rid of the race condition, but I think it'd probably be overkill.)

@nobrowser
Copy link
Author

Hmmm, so no, looking at the database I can't use a tag for what I want, as there's only an index on the composite key (feed, key) -- nothing on the value of the key/tag, and even if I went with the hack of a unique key, linear scan of the feed_tags table would still be needed. So I really need a new (unique) attribute, which indeed would be a kind of "surrogate key". To avoid any confusion I propose to call it short_name -- it doesn't have to be a slug as the term is normally used.

Would a PR to this effect be considered?

@lemon24
Copy link
Owner

lemon24 commented Oct 28, 2024

My suggestion was to use tag keys with get_feeds(tags=[tag]), with the value being ignored/unused (also see the collapsed code at the end of my first comment); for slug My Feed, the matching feed tag would be .plugin.slugs.My Feed.

You are right in that there is a scan involved, since there's no index on the feed_tags key column alone. My results were reasonably fast because I only have a ~small amount of feed tags (177 feeds * ~2 tags/feed). However, indexes on tag keys can and should be added, since they exists for this kind of access pattern; I opened #359 for this, and may work on it reasonably soon.

Would a PR to this effect [new feed attribute] be considered?

Yes, definitely, but I'm somewhat attached to the "slug" naming, since I think it best conveys the idea that it's used as an identifier – "the part of a URL that identifies a page in human-readable keywords" – and it also matches the publishing term.

@lemon24
Copy link
Owner

lemon24 commented Nov 3, 2024

Added a cleaned up version of the code in #358 (comment) as a plugin recipe.

Closing this for now, feel free to reopen if needed.

@lemon24 lemon24 closed this as completed Nov 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants