Query for set information #2
Comments
Notes: Recursive full download needs to return a list of found categories (ns=14) after each page downloaded. Should I just add the second return value for the list of categories, or switch to OOP? |
Is OrderedSet a specific thing? Also, what do you mean by "add the second return value for the list of categories or switch to OOP"? And do you want the newly downloaded files to be flat in the top-level category folder, or nested? |
How do gcmcontinue and grccontinue work? When are you using them, and what value do you give them? |
I'm describing the changes that need to happen to the main logic in the Currently the category is specified to the MediaWiki API by the gcmcontinue and grccontinue are pagination tokens returned in MediaWiki API responses when a generator is used and the results don't fit on a single page. In the download scripts, the token is populated from the response to the previous request so that all pages are downloaded, but it can also be provided on the command line to restart a previous set of downloads from the middle. The parameter varies by script. For main.py, the generator CategoryMembers is used, so the parameter is GCMcontinue. For incremental.py, the generator RecentChanges is used, so the parameter is GRCcontinue. https://yugipedia.com/api.php |
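To make the pagination behaviour above concrete, here is a minimal sketch of a continuation loop. It is not the code in main.py or incremental.py: `iter_result_batches` and the `fetch` callable (anything that sends the given parameters to api.php and returns the decoded JSON) are hypothetical names introduced for illustration. It assumes the response carries a `continue` object whose values are merged into the next request, which is where gcmcontinue or grccontinue would appear.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Params = Dict[str, str]
Response = Dict[str, Any]


def iter_result_batches(
    fetch: Callable[[Params], Response],  # hypothetical: performs one API request
    params: Params,
    resume: Optional[Params] = None,
) -> Iterator[Response]:
    """Yield one decoded response per page of results until none remain.

    `resume` can hold a token saved from an earlier run, for example
    {"gcmcontinue": "..."}, to restart a download from the middle.
    """
    continuation: Params = dict(resume or {})
    while True:
        # Merge the continuation values into the base query for this request.
        response = fetch({**params, **continuation})
        yield response
        # Assumption: a missing or empty "continue" object means the last page.
        continuation = response.get("continue") or {}
        if not continuation:
            return
```

Passing `resume={"gcmcontinue": token}` would correspond to supplying the token on the command line, while omitting it starts from the first page of results.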
I got it to work with recursion and a second return value but it doesn't look that nice right now so I'm considering restructuring it. The biggest issue so far is actually getting an identifier for the top category for preventing loops. The generator=categorymembers doesn't return any info about the category itself. I figure you can use pageid or title to track if there is looping. So in that case it looks like you either have to
What do you think? Do you have a preference? Otherwise I'll probably go with making another request for every category. |
Feel free to restructure. I already anticipated it would be necessary and there's actually very little code in this repository. The only interface that needs to be respected for full downloads is the command-line interface. Everything else is an implementation detail. |
Collecting set information is the last piece for YAML Yugi to exceed parity with other solutions. Unlike the other data collected so far, which are contained in flat categories, sets are indexed on Yugipedia in hierarchical categories. This means that instead of a target category for sets directly containing an article about a set, categories may be nested. When querying the MediaWiki API, only the immediate members of a category are returned, including the names of child categories, but the members of those child categories are not returned. Therefore, new code is required in order to download entire category hierarchies and subscribe to updates on them. Category hierarchies are allowed to contain cycles, and while this is not expected of the categories for sets, our code should be correct even if cycles are encountered and not fall into an infinite loop.
Design
Either create or extend the current full download script to recursively download a targeted category, without falling into infinite loops. For example, after fetching https://yugipedia.com/api.php?action=query&redirects=true&generator=categorymembers&prop=revisions&rvprop=content&format=json&formatversion=2&gcmlimit=50&gcmtitle=Category:Yu-Gi-Oh!_Master_Duel_sets, the ns=14 category items in the response should be stored in an ordered set for additional follow-up requests once the current category is completely downloaded.
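The traversal described above could be sketched roughly as follows. This is only an illustration, not the repository's implementation: `crawl_category_tree` and `list_members` are hypothetical names, and `list_members(title)` is assumed to return every member of one category (across all pages of results) as dicts with `ns` and `title` keys. A plain dict stands in for the ordered set, since Python dicts preserve insertion order, and categories are tracked by title so that a cycle is never followed twice.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, Iterable, List, Tuple

CATEGORY_NS = 14  # namespace number of category pages, per the ns=14 items above

Member = Dict[str, Any]


def crawl_category_tree(
    root: str,
    list_members: Callable[[str], Iterable[Member]],  # hypothetical fetch helper
) -> Tuple[List[str], Dict[str, Member]]:
    """Download a category and all of its descendants exactly once each.

    Returns the categories in discovery order and the non-category members
    keyed by title.
    """
    seen: Dict[str, None] = {root: None}  # dict used as an insertion-ordered set
    pending: Deque[str] = deque([root])
    articles: Dict[str, Member] = {}
    while pending:
        category = pending.popleft()
        # Finish the current category before following up on its children.
        for member in list_members(category):
            title = member["title"]
            if member.get("ns") == CATEGORY_NS:
                if title not in seen:  # skip categories already queued or done
                    seen[title] = None
                    pending.append(title)
            else:
                # An article can sit in several categories; keep the first copy.
                articles.setdefault(title, member)
    return list(seen), articles
```

Tracking by title assumes the titles in the response match the form used for the root category; tracking by pageid instead would need the root's pageid from somewhere, which is the extra request discussed in the comments. The returned category list is also what a later incremental run would need cached.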
To subscribe to incremental updates, the existing script can be used, but each time, it should be called with all the known descendant categories cached from the last full download, in addition to the top-level category itself. This is because the MediaWiki API only provides the immediate parent categories of an article, not all ancestor categories.
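As a rough sketch of why the cached descendants matter, the snippet below builds the category list for an incremental run and checks a changed page against it. The helper names `tracked_categories` and `is_tracked` are hypothetical, and the page shape (a `categories` list of dicts with a `title` key) is an assumption about what a query that includes the page's categories would return, not something confirmed by the existing script.

```python
from typing import Any, Dict, Iterable, List


def tracked_categories(root: str, cached_descendants: Iterable[str]) -> List[str]:
    """Top-level category first, then cached descendants, without duplicates."""
    return list(dict.fromkeys([root, *cached_descendants]))


def is_tracked(page: Dict[str, Any], tracked: Iterable[str]) -> bool:
    """True if any immediate parent category of the page is being tracked.

    Only immediate parents are available, so `tracked` must already contain
    every descendant category, not just the top-level one.
    """
    parents = {category["title"] for category in page.get("categories", [])}
    return not parents.isdisjoint(tracked)
```

With only the top-level category in `tracked`, an article filed under a nested category would be missed, which is the case the cached list is meant to cover.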
Subtasks