Pagination not implemented for project_views #20

Open
floatingpurr opened this issue Jun 7, 2022 · 4 comments

Comments

@floatingpurr

floatingpurr commented Jun 7, 2022

Hi, I was trying to wrap my head around the Wikimedia REST API when I found this repo. There appears to be a pagination problem, at least with the project_views method. Steps to reproduce:

import datetime

from mwviews.api import PageviewsClient

p = PageviewsClient(user_agent="<[email protected]> Foo Bar")
a = p.project_views(['it.wikipedia'], granularity='hourly', start='2016010100', end='2022060700')

len(a)  # 56,377 items
# However, most of those are empty. The library inserts every date in the range,
# but the API does not return data for the whole time span.

# E.g.:
a[datetime.datetime(2021, 2, 11, 15, 0)] #  {'it.wikipedia': None}

Data for these dates is definitely available from the API. The problem is that the endpoint returns at most 5,000 items per call, along with an undocumented pagination token (see here for more). There is probably a way to pass that token back to the API to get the next page, but I am still trying to figure out how.
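
Until the pagination token is documented, one possible workaround is to keep each request under the 5,000-item cap by splitting the range into smaller windows. The sketch below is only an illustration and not part of this library: `hourly_windows` and `fmt` are hypothetical helpers, and it assumes that the cap applies per request and that both `start` and `end` are inclusive.

```python
import datetime as dt


def hourly_windows(start, end, max_items=5000):
    """Yield (window_start, window_end) pairs covering [start, end].

    Both bounds are treated as inclusive (an assumption about the API),
    and each window spans at most `max_items` hourly timestamps.
    """
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    one_hour = dt.timedelta(hours=1)
    span = dt.timedelta(hours=max_items)
    cursor = start
    while cursor <= end:
        # Last hour of this window, clipped to the overall end.
        window_end = min(cursor + span - one_hour, end)
        yield cursor, window_end
        cursor = window_end + one_hour


def fmt(ts):
    """Format a datetime as the YYYYMMDDHH string used in the call above."""
    return ts.strftime("%Y%m%d%H")


windows = list(hourly_windows(dt.datetime(2016, 1, 1, 0),
                              dt.datetime(2022, 6, 7, 0)))

# Hypothetical usage with the client from the snippet above (not run here):
# merged = {}
# for s, e in windows:
#     merged.update(p.project_views(['it.wikipedia'], granularity='hourly',
#                                   start=fmt(s), end=fmt(e)))
```

This trades one large request for several smaller ones, so it does not rely on the undocumented token at all. Whether the merged result is complete still depends on the backend returning every item inside each window.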

@milimetric
Contributor

The limits are there for performance reasons. The pageviews data has grown in different ways to respond to lots of different user needs, and we need to step back and reorganize it so it's much easier to work with. Ideally, what you're asking for would be available in a projectview yearly dump file, and the API would point that out when you hit one of its limits. But this isn't something we can patch onto this client code; we have to make some changes on the backend. Thank you for pointing out the problem.

@floatingpurr
Author

Thanks for sharing your thoughts on this.

I do understand the performance limitations; that is why pagination exists. I opened this issue because the backend already returns a next token for pagination (just guessing, as there is no spec for it). However, it's not clear how to use it. I hoped to find something more about that mechanism in this library :)

Side question: are there dump files with metrics data?

@milimetric
Contributor

@floatingpurr: this API is being moved to a new codebase as we speak, and I honestly don't know if the pagination is even implemented. I wrote the previous version of the API and in that code, pagination was not possible because the data was coming out of Cassandra. If you asked it for a specific time range, there was no way to say "first 50" or "next 50" or anything like that. So stay tuned to this task for progress and details if you like, and if I end up learning about it I'll comment here. But it's low priority for my team right now.

As for dumps, yes there are dumps, but we're also in the middle of a cleanup there. Sorry for the bad state of everything, I'm trying to raise the priority of this right now: https://dumps.wikimedia.org/other/analytics/

So there's "pageviews complete", which should have monthly and per-project aggregates but does not yet, making the "complete" part of the name quite ironic. This is definitely something we're trying to make better.

@floatingpurr
Author

floatingpurr commented Jun 8, 2022

Hi @milimetric, thank you again! I understand the situation. No worries at all. I was just trying to exploit what's currently available :)
