Get feed by user-defined slug #358

nobrowser · 2024-10-24T23:58:14Z

Hello, I'm trying to write my dream feed reader with your library, and one obstacle I'm facing now is that I'm used to giving short unique IDs (or names) to feeds and accessing them with that ID. That's because URLs are often long and messy, and author titles can be missing or inconvenient (with spaces, among other things). Initially it looked as if the user_title attribute might work for that, but alas, the database schema has no index on the column, so that would be a linear search each time.

Then I thought I could add the index myself in my code -- but the Reader class provides no direct access to the database connection, so I can't do it with a plugin. I could simply open the database after it's created and add the index behind Reader's back, but that would be really ugly. So, what approach do you think I should take? Should I ask you to add the index 😁 , or is there something else I can do now?

Thanks,

--
Ian

The text was updated successfully, but these errors were encountered:

lemon24 · 2024-10-26T09:53:41Z

tl;dr: I think you can use feed tags for this, details below.

my dream feed reader

Love this phrase :)

it looked as if the user_title attribute might work for that

Indeed, it would not be a good idea to use user_title, since it is in some ways "special"; among others it may be used (together with the feed's original title) to implement title sort (e.g. dropping "The" etc.); this is one of the main reasons I did not deprecate it in favor of tags.

URLs are often long and messy, and author titles can be missing or inconvenient (with spaces, among other things)

This is probably another reason for not using user_title like this, since you may still want "human readable" titles that are different from the original ones, but are also not slugs.

So, if reader were to have an API for exactly this use case, it'd be something separate, like Feed.slug, set_feed_slug(url, slug), get_feed(slug=slug).

That said, I think you can achieve this by using feed tags, since they are indexed.

In the example below (click to expand), for my database with 177 feeds, get_feed_by_slug(slug) is only twice as slow as get_feed(url).

from __future__ import annotations

def _make_slug_tag(reader, slug):
   return reader.make_plugin_reserved_name('slugs', slug)

def _get_slug_tags(reader, resource):
    prefix = _make_slug_tag(reader, '')
    # filter tags by prefix would make this faster
    # https://github.com/lemon24/reader/issues/309
    rv = list()
    for tag in reader.get_tag_keys(resource):
        if not tag.startswith(prefix):
            continue
        rv.append(tag)
    return prefix, rv
    
def get_feed_slug(reader, feed) -> str | None:
    prefix, tags = _get_slug_tags(reader, feed)
    if not tags:
        return None
    assert len(tags) == 1, (feed, tags)
    return tags[0].removeprefix(prefix) 

def get_feed_by_slug(reader, slug) -> Feed | None:
    tag = _make_slug_tag(reader, slug)
    feeds = list(reader.get_feeds(tags=[tag]))
    if not feeds:
        return None
    assert len(feeds) == 1, (slug, [f.url for f in feeds])
    return feeds[0]

def set_feed_slug(reader, feed, slug: str | None):
    feed = reader.get_feed(feed)
    tag = _make_slug_tag(reader, slug)
    
    if slug:
        already_has_slug = False
        for other_feed in reader.get_feeds(tags=[tag]):
            if feed.url == other_feed.url:
                already_has_slug = True
                continue
            raise ValueError(f"slug {slug!r} already in use by {other_feed.url!r}")

        # NOTE: technically a race condition / TOCTOU bug

        if already_has_slug:
            return
        
    _, other_tags = _get_slug_tags(reader, feed)
    for other_tag in other_tags:
        reader.delete_tag(feed, other_tag, missing_ok=True)
        
    if slug:
        reader.set_tag(feed, tag)
        
        # "handle" race condition; ensures(?) at most one feed has the slug,
        # with the risk that none of them has it in case of race condition.
        
        for other_feed in reader.get_feeds(tags=[tag]):
            if feed.url == other_feed.url:
                continue
            reader.delete_tag(other_feed, tag, missing_ok=True)

    # an alternative way of dealing with the race condition is
    # to use global tags, but then we'd have to "garbage-collect"
    # slugs of feeds that have been deleted (in an after_feeds_update_hook),
    # although that may lead to a different sort of race condition.
    
if __name__ == '__main__':
    import timeit
    from reader import make_reader
    
    reader = make_reader('db.sqlite')
    url = 'https://death.andgravity.com/_feed/index.xml'
    
    set_feed_slug(reader, url, 'one')
    
    print(
        get_feed_slug(reader, url),
        getattr(get_feed_by_slug(reader, 'one'), 'url', None),
    )

    try:
        set_feed_slug(reader, 'https://xkcd.com/atom.xml', 'one')
        raise Exception("shouldn't get here")
    except Exception as e:
        print('error:', e)
    
    set_feed_slug(reader, url, 'two')

    print(
        get_feed_slug(reader, url),
        getattr(get_feed_by_slug(reader, 'two'), 'url', None),
    )

    print()
    
    def repeat(stmt):
        times = timeit.repeat(stmt, repeat=1000, number=1, globals=globals())
        print(f"{stmt}: {min(times):.6f}s min, {sum(times)/len(times):.6f}s avg")
    
    print(reader.get_feed_counts().total, 'feeds')
    repeat("reader.get_feed(url)")
    repeat("get_feed_by_slug(reader, 'two')")

(For reference, this is somewhat related to #159 (comment) that discusses having a surrogate key, except I didn't think it would be user-defined there.)

nobrowser · 2024-10-27T05:26:57Z

Thanks for the suggestion about tags. I didn't think of using them somehow because of the fact that there can be multiple ones, I'll have to look at the schema again to convince myself I'm not doing something abominable.

Good that we're talking.

--
Ian

lemon24 · 2024-10-27T12:45:11Z

that there can be multiple [tags]

The only issue I see with using tags is the race condition mentioned in the code comment, but I think the approach of (accidentally, rarely) deleting both should be acceptable, especially for a single-user system.

BTW (and no pressure! :), if the approach above ends up being OK for you, I'd happy to merge it as a reader plugin, either experimental or built-in (the difference is mainly in the amount of testing). The get/set_feed_slug() and get_feed_by_slug() "methods" can be monkey-patched onto the reader instance as enclosure_dedupe does here.

(In theory I'd be fine with this being built into reader itself too, which would also get rid of the race condition, but I think it'd probably be overkill.)

nobrowser · 2024-10-28T00:25:26Z

Hmmm, so no, looking at the database I can't use a tag for what I want, as there's only an index on the composite key (feed, key) -- nothing on the value of the key/tag, and even if I went with the hack of a unique key, linear scan of the feed_tags table would still be needed. So I really need a new (unique) attribute, which indeed would be a kind of "surrogate key". To avoid any confusion I propose to call it short_name -- it doesn't have to be a slug as the term is normally used.

Would a PR to this effect be considered?

lemon24 · 2024-10-28T07:13:49Z

My suggestion was to use tag keys with get_feeds(tags=[tag]), with the value being ignored/unused (also see the collapsed code at the end of my first comment); for slug My Feed, the matching feed tag would be .plugin.slugs.My Feed.

You are right in that there is a scan involved, since there's no index on the feed_tags key column alone. My results were reasonably fast because I only have a ~small amount of feed tags (177 feeds * ~2 tags/feed). However, indexes on tag keys can and should be added, since they exists for this kind of access pattern; I opened #359 for this, and may work on it reasonably soon.

Would a PR to this effect [new feed attribute] be considered?

Yes, definitely, but I'm somewhat attached to the "slug" naming, since I think it best conveys the idea that it's used as an identifier – "the part of a URL that identifies a page in human-readable keywords" – and it also matches the publishing term.

lemon24 · 2024-11-03T13:18:38Z

Added a cleaned up version of the code in #358 (comment) as a plugin recipe.

Closing this for now, feel free to reopen if needed.

lemon24 changed the title ~~access feeds by user_title~~ Get feed by user-defined slug Oct 26, 2024

lemon24 mentioned this issue Oct 28, 2024

Feed and entry tags should be indexed by key #359

Closed

lemon24 added enhancement API labels Oct 28, 2024

lemon24 added a commit that referenced this issue Nov 3, 2024

Add feed slugs plugin recipe. #358

99ad809

lemon24 closed this as completed Nov 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get feed by user-defined slug #358

Get feed by user-defined slug #358

nobrowser commented Oct 24, 2024

lemon24 commented Oct 26, 2024

nobrowser commented Oct 27, 2024

lemon24 commented Oct 27, 2024

nobrowser commented Oct 28, 2024

lemon24 commented Oct 28, 2024 •

edited

Loading

lemon24 commented Nov 3, 2024

Get feed by user-defined slug #358

Get feed by user-defined slug #358

Comments

nobrowser commented Oct 24, 2024

lemon24 commented Oct 26, 2024

nobrowser commented Oct 27, 2024

lemon24 commented Oct 27, 2024

nobrowser commented Oct 28, 2024

lemon24 commented Oct 28, 2024 • edited Loading

lemon24 commented Nov 3, 2024

lemon24 commented Oct 28, 2024 •

edited

Loading