Adds feature displaying project view count #1101
base: dev
Conversation
Thanks @shulim, this is a great start. I've added a few comments, though I haven't tested or looked at the parsing functionality yet. You should be able to fix the test by adding the migration.
Hey @shulim! Thanks for all the work on this! It looks good, I just have one comment / suggestion. Should we mention to users that the view count is only cumulative starting at a certain date (the date of the oldest log file)? I think this would prevent confusion from authors who are wondering why their project only has 3 views if it's been on PhysioNet for 4 years. Maybe this could be a circle question mark hover element such as the one after Roger's name in your photo that says "Note: the view count is cumulative as of July 20, 2020", or as simple as writing "16 views starting July 20, 2020" instead of just "16 views". Anyways, nice work!

EDIT: Maybe you could also explain in the hover element how the view count is determined? Or would this cause users to try and hack the system to up their project's view counts disingenuously?
Hi @Lucas-Mc, thanks for the helpful feedback! I’ll work on a circle question mark hover element to add a note on when the views started being counted.
I think adding a quick explanation would be nice! I can add a note about views being counted as unique IP address visits to any of the project’s pages on a given day (will think more on the wording as I go, but if you have any suggestions, I’ll work them in).
I’m not sure if I can accurately predict users’ responses to this—though I currently think that this wouldn’t be a large concern, I’m also open to further discussion on this, maybe once I add the question mark element. Thanks again!
Thanks, @shulim. This is a great start! One general issue is that this requires that each log file contains requests from exactly one calendar day, and that log rotation happens exactly at midnight.

It's better not to rely on such assumptions; instead, we should look at the date/time of each request, add up the number of requests for each date, and update the corresponding counters in the database.

At present, the system is configured to run logrotate at midnight local time, which means that if everything is working, a new log file will be started a few seconds or minutes after midnight. But there are a lot of reasons that might not always be the case, and we don't want to rely on it.

Some details about the database model (models.py, line 2636):
Don't use DO_NOTHING. Usually the correct value for on_delete is CASCADE; sometimes something else is more appropriate. DO_NOTHING is dangerous and for experts only.

Neither 'core_project' nor 'date' should have a default value (and None doesn't make sense anyway, since these fields are not nullable). We probably want to say that these objects should be unique for a given project and date:
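A sketch of what that might look like (illustrative, not the PR's exact code; the `viewcount`/`running_viewcount` fields follow names mentioned later in this thread):

```python
from django.db import models

class Metrics(models.Model):
    # No defaults here: both fields are required and non-nullable.
    core_project = models.ForeignKey('project.CoreProject',
                                     on_delete=models.CASCADE)
    date = models.DateField()
    viewcount = models.PositiveIntegerField(default=0)
    running_viewcount = models.PositiveIntegerField(default=0)

    class Meta:
        # One Metrics row per project per date.
        unique_together = (('core_project', 'date'),)
```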
Some details about the parsing (get_metrics_data.py, line 98):
Reading the entire log file into a gigantic string, and then splitting that into a list of strings, is not efficient. It's much better to read the file one line at a time:
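For example (a minimal sketch, using one of the fixture log files from this PR):

```python
# Iterating over the file object yields one line at a time,
# without loading the whole file into memory.
with open('physionet_access.log.1') as f:
    for line in f:
        parts = line.split()
        # ... process one request record at a time ...
```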
(line 106)
Why do you want to exclude requests with a query string ("?")? If there are special cases then explain what those are. Also, the expression used to match project pages looks too loose: matching '/content' without a trailing slash, for example, would also recognize paths that aren't project pages at all.
(line 117)
I imagine the print was for debugging; we don't want to print anything here.
Thanks @bemoody for the detailed feedback! I'll look at this and get back to you next week on the suggested modifications.
@bemoody Thank you again for your suggestions! I've incorporated most of them, and I have a few questions on some:
Thank you, that makes sense! I originally had the log parser keeping track of different dates within a single log file, and I can go back to that. I am, however, a little concerned about ensuring that logs don't get double-counted. If, for example, I have information in my table already from a log with entries on August 5, I don't want to accidentally call the command on the same log file and re-add the Aug 5 entries to the total view count. The way I currently prevent this is to find the latest date for a slug and make sure that the additional views I add are for a date after the last update. If I do not assume that there is only one log file for a particular date, do you have any suggestions for how to go about preventing double-counting? One thing that comes to mind is storing the IP addresses in the Metrics table so that I only count an IP address viewing a project page once per day, though I'm not sure if that could be a privacy concern. That would also significantly change how I store/access the data in multiple other places, but I'd be happy to work on that if it helps address my not-so-sound assumptions.
Sounds good, I changed to CASCADE. I plan to commit this, along with the other changes from your feedback, all together after sorting out the other parts here.
When I originally had 'core_project' and 'date' without default values, I was prompted to enter default values when attempting to make the migration. Do you have any thoughts on how to address this?
Thanks, that makes sense--I added this in the Metrics class.
Thank you for the suggestion; I replaced my original code with these lines.
I'm trying to avoid counting logs that contain requests similar to the following: `"GET /content/?topic=brain+tumor&orderby=relevance-desc&types=0&types=1&types=2&types=3 HTTP/1.1"`. I intend to exclude queries like the one above from being counted as a view. However, I'd be open to other suggestions on getting around this, if applicable.
Thanks, I added a slash so that the parser now searches for paths with `/content/`.
Noted. Instead of `if line:` and the print statement, I put `continue` because I don't want a single weirdly formatted line to raise something general like an InvalidLogError. I'm not sure if this is the right move, though, so please let me know your thoughts on this.
> Thank you, that makes sense! I originally had the log parser keeping track
> of different dates within a single log file, and I can go back to that. I
> am, however, a little concerned about ensuring that logs don't get
> double-counted.
Yeah, that's not a bad idea, but it'd be difficult to implement
in a reliable way. Maybe we could add a model that simply
records *which log files have been processed*, perhaps identified
by the file's size and modification time? Or even just
identified by the earliest/latest timestamps in the file?
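A minimal sketch of such a tracking model (hypothetical names; the PR later used a `MetricsLogData` model storing the filename and creation timestamp):

```python
from django.db import models

class ProcessedLogFile(models.Model):
    # Enough metadata to recognize a log file we've already parsed,
    # so its visits are never counted twice.
    filename = models.CharField(max_length=256)
    size = models.BigIntegerField()
    modified = models.DateTimeField()

    class Meta:
        unique_together = (('filename', 'size', 'modified'),)
```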
> When I originally had 'core_project' and 'date' without default values, I
> was prompted to enter default values when attempting to make the migration.
> Do you have any thoughts on how to address this?
I think that should only happen if you're adding new
columns (fields) to an existing table (model). Since this table
doesn't exist yet, a default value isn't necessary. So if you
delete the project/migrations/0045_metrics.py file and re-run the
makemigrations command, it shouldn't complain.
>> Why do you want to exclude requests with a query string ("?")? If there
>> are special cases then explain what those are.
> I'm trying to avoid counting logs that contain requests similar to the
> following:
>
> `"GET /content/?topic=brain+tumor&orderby=relevance-desc&types=0&types=1&types=2&types=3 HTTP/1.1"`
>
> I intend to exclude queries like the one above from being counted as a view.
> However, I'd be open to other suggestions on getting around this, if
> applicable.
Ah, I see... and that URL would get interpreted as a project with
a very long name, so you'd get a bunch of junk entries in 'data'.
I'm not sure that this really matters, since we only save the
results for PublishedProjects that actually exist. But you could
write something like:
    path = parts[6].split('?')[0]
    slug = path.split('/')[2]
to ignore any query string and just extract the project slug. In
general, any request *may* contain query parameters and we want
to count them either way.
>> I imagine the print was for debugging; we don't want to print anything
>> here.
> Noted. Instead of `if line:` and the print statement, I put `continue`
> because I don't want a single weirdly formatted line to raise something
> general like an InvalidLogError. I'm not sure if this is the right move,
> though, so please let me know your thoughts on this.
Yeah, I think if there are lines you can't parse then ignoring
them is probably the best strategy in general. Maybe it'd be
good to print a warning message just showing the number of
unparseable lines?
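A sketch of that strategy (the file name and the field-count check are illustrative):

```python
unparseable = 0
with open('physionet_access.log.1') as infile:
    for line in infile:
        parts = line.split()
        if len(parts) < 7:
            # Skip a malformed line rather than aborting the whole run.
            unparseable += 1
            continue
        # ... normal per-line processing ...
if unparseable:
    print('Warning: skipped %d unparseable line(s)' % unparseable)
```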
@bemoody thanks for your response! I believe I've addressed all the points we discussed and committed them, so if you get a chance, please let me know what you think.
Sounds good! I ended up storing the filename and creation timestamp in a new `MetricsLogData` model.
This works great, thanks!
Makes sense, incorporated this.
Okay, did this!
Thanks! Great improvements, although I still see some small issues here.

In the Metrics class, to mark fields as "unique together" you need to place the "unique_together" attribute inside a "Meta" class. See the Affiliation class for an example of how this is done.

In the update_metrics function, then, we want to add to the existing Metrics objects, if they exist, rather than creating new objects. To put it another way, we want the Metrics object to indicate the number of hits that actually occurred on August 13, even though some of those will have been recorded in the "August 12" and some in the "August 13" log file. (And we don't want to have two Metrics objects for August 13, because then we don't know which object to look at.)

Finally, we don't want to create a MetricsLogData object until after we've parsed the file and saved the results - if the process is interrupted before the results are saved, it shouldn't record the file as having been processed. The best way would be to do all of the database updates - creating and modifying Metrics objects, and creating the MetricsLogData object - as a single atomic transaction. In theory we could do this:
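A sketch of that approach (assuming the PR's `log_parser`, `update_metrics`, and `MetricsLogData` names; the field names passed to `create` are guesses):

```python
from django.db import transaction

# filename and creation_datetime are assumed to come from handle().
with transaction.atomic():
    # Everything in this block either all succeeds or is all rolled
    # back, so the file is only recorded as processed if the parsed
    # results were actually saved.
    data = log_parser(filename)
    update_metrics(data)
    MetricsLogData.objects.create(filename=filename,
                                  creation_datetime=creation_datetime)
```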
But it's better not to hold database locks for a long time, so it'd be better to:
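That is, parse the file first, outside any transaction, and keep only the database updates inside a short atomic block (a sketch under the same assumptions as above):

```python
from django.db import transaction

# Slow part: parse the log with no transaction open and no locks held.
data = log_parser(filename)

# Fast part: apply all the database updates atomically.
with transaction.atomic():
    update_metrics(data)
    MetricsLogData.objects.create(filename=filename,
                                  creation_datetime=creation_datetime)
```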
Thank you @bemoody for the continued feedback! I've tried to address each point below.
Oops, not sure how I missed that. Fixed it!
Oh yeah, that's important. I'm now updating the existing Metrics objects when they already exist, instead of creating duplicates.
That's neat, thank you! I've followed that general outline in my latest commit.
One small issue in handle(): the way it computes the date is rather convoluted, and also would be wrong for one hour every year. :)
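Presumably the hour in question is the DST change: deriving a date by adding or subtracting seconds breaks on days that are 23 or 25 hours long, whereas plain date arithmetic does not (a sketch of the distinction, not the original code):

```python
import time
from datetime import date, timedelta

# Fragile: a local-time day is not always 24 * 3600 seconds long
# (it is 23 or 25 hours on DST-change days).
yesterday_fragile = date.fromtimestamp(time.time() - 24 * 3600)

# Robust: let the date type do the arithmetic.
yesterday = date.today() - timedelta(days=1)
```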
I was confused about what you're trying to measure here. When you say "view count", I would assume that means the number of times a page was viewed (or, really, the number of times the page was loaded by a web browser). But that's not what you're counting here; you're trying to count the number of distinct visitors in each one-day period (assuming that, roughly, one person corresponds to one IP address). You did explain this, but I missed it in my first few times reading.

So, some of the issues I've mentioned don't make a lot of sense in that context. Counting distinct values is different from counting events. But please try to more clearly express what you mean, both in terms of the variable names used in the code and how the information is presented on the project page.

I may have more to add after I've thought about this a little more, but thanks for your work so far!
@bemoody Thank you for the suggestion; I've updated the code accordingly.
Thank you for your thoughts on this! I currently consider “views” and “[distinct] visitors (per day)” to be synonymous, at least from my less-technical perspective. I think that counting the total number of times a page was loaded by a web browser, if one person did it multiple times in a day, might make it too easy to “hack” the system (Lucas’s concern mentioned earlier), whereas refreshing the count every day for distinct visitors would hopefully preserve most of the page’s activity, excluding repeat visitors within a day.

From reading up on how different sites count views, it seems like each has its own definition of the term, so I settled on this early on after discussing briefly with you, Tom, and Lucas. I also believe that "views" is the most commonly used term for metrics related to what I am counting, so I thought that this might make more sense than saying "50 visitors" or another less commonly used term. However, I realize the importance of clarity and am open to refining/revising this definition.

Edit: I accidentally closed this, but I'm still open for feedback!
I agree that trying to estimate the number of distinct visitors is a more useful measurement of project "popularity" than simply looking at the number of page requests or bytes downloaded. (Page requests and bytes downloaded are also useful metrics that I think we should be collecting for other reasons - and no one number can tell you the whole story - but we can add these and other things later, once we have the groundwork in place.)

I don't agree that we should worry about people "gaming" the system, because anything we do here can be gamed. Anybody can have 1000 different IP addresses in one day, if they want to. Nobody should put much stock in these numbers; what's really important are questions like how many courses are using the data, and how many peer-reviewed papers are being published.

One remaining issue here is that the logic of update_metrics doesn't deal well with the few seconds after midnight. The way update_metrics is currently written, all of the visits from the current log file are added together and associated with the latest date seen for that project. If someone is actively downloading the project at midnight (which is usually the case for mimic-cxr), the latest date for the project will be today; if not, the latest date will be yesterday. Furthermore, the update function adds to the existing "running_viewcount" but replaces the existing "viewcount". This results in very weird and inconsistent information being stored in the database. For example, if we assume log rotation happens 15 seconds after midnight, and project X has:
the database would end up looking like this:
...205 visits have "vanished", and there is no entry for 2020-08-20 at all. Moreover, since those 5 visitors just after midnight were probably also represented among the 200 and the 150, we don't really want to count those as 5 additional visits anyway.

One possibility would be to revert to logic like you had before, where we assume one log file = one day, and associate all visits in the log file with a single date. In effect, treating the log rotation (rather than local midnight) as the start/end of the bucket. This isn't a great long-term solution but should make things work "well enough for now."

The better solution, in my view, would be to define visits in a way that is independent of the time of day - one request at 23:55 followed by one at 00:05 should be considered one visit, not two. Doing that would require either saving more information after each run, or parsing two log files at a time to obtain overlapping windows. Either way would require a decent amount of work, so I don't want to suggest that you try to implement it at this stage.
@bemoody Thanks for the detailed feedback, and I'm sorry for the delay in getting back to you. I went forward with the one log file = one day approach: although the few visitors who visit shortly after midnight may be counted with the previous day's total, the count displayed won't be affected.

In addition to any feedback you have on this, is there anything else that might need attention before merging? I'm planning to work on a small notice to indicate the day that views started being counted. Thank you again!
@shulim We'd like to make a push to merge this! Please could you rebase on dev to deal with the conflict? It is less of a scary merge than it initially seems. We (well @cx1111) recently refactored the project models. We should (I think) just need to move the following chunk (physionet-django/project/models.py, lines 2632 to 2653 in 946c9f3) to one of the new model files.
Bump! Message recently received from one of our dataset contributors:
This is a work in progress that starts to address #738. The aim is to display a view count for each project, available to whoever accesses the project page to get a sense of how much a project has been referenced so far. The view count data is obtained from parsing NGINX logs. Here is what it looks like so far:

[Screenshot: published project page showing the new view count line]
Here are the changes/new files:

- `physionet-django/project/models.py`: creates a Metrics table that stores daily and running total view counts for each project per date.
- `physionet-django/project/templates/project/published_project.html`: adds a line to display the latest view count, as shown in the picture.
- `physionet-django/project/views.py`: handles the Metrics view.
- `physionet-django/project/management/commands/get_metrics_data.py`: a management task that can be run regularly to update view counts as each log is produced daily (can be called using `python manage.py get_metrics_data [path_to_file]`). The function `log_parser` parses the logs to count unique views (defined here as coming from different IP addresses). `update_metrics` takes data from `log_parser` to update the Metrics table in the database. `validate_log_file` makes sure the log file can/should be parsed.
- `physionet-django/project/management/commands/test_get_metrics_data.py`: tests the `get_metrics_data` command using 3 sample log files. Can be run using `python manage.py test project.management.commands.test_get_metrics_data` (the test only works if `resetdb` and `loaddemo` are done beforehand—not sure if this is an issue. I attempted to create a `setUp` function that could reset the database before each test, but was so far unable to do it).
- `physionet-django/project/fixtures/physionet_access.log.1`, `physionet-django/project/fixtures/physionet_access.log.2`, and `physionet-django/project/fixtures/physionet_access.log.3` are sample log files I generated using `log-generator` (https://pypi.org/project/log-generator/). The IP addresses and user agents are random, and the user names were taken from the PhysioNet demo. I wasn’t sure where to put these logs, so I put them in `fixtures`—please let me know if there might be a better place or way to store them.

Please let me know your thoughts on this approach or any comments/questions you may have. Thank you!
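For reference, a minimal sketch of the unique-view definition described above - one view per distinct IP address per project per day (hypothetical code assuming the common NGINX log layout, not the PR's actual implementation):

```python
from collections import defaultdict

# (slug, date) -> set of client IP addresses seen that day
visitors = defaultdict(set)

with open('physionet_access.log.1') as f:
    for line in f:
        parts = line.split()
        if len(parts) < 7:
            continue  # ignore unparseable lines
        ip = parts[0]
        # parts[3] looks like "[20/Aug/2020:14:32:01"
        day = parts[3].lstrip('[').split(':')[0]
        path = parts[6].split('?')[0]  # drop any query string
        if path.startswith('/content/'):
            slug = path.split('/')[2]
            visitors[(slug, day)].add(ip)

# Daily view count = number of distinct IPs per project per day.
viewcounts = {key: len(ips) for key, ips in visitors.items()}
```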