P.k.g. phone home #4609

Closed
StefanKarpinski opened this issue Oct 22, 2013 · 47 comments
Labels
needs decision A decision on this change is needed packages Package management and loading

Comments

@StefanKarpinski
Member

It would be nice to have some way of knowing what packages people are using and some way of estimating Julia installs. We could potentially achieve this by having Pkg phone home when doing Pkg.update – i.e. send a list of installed packages and system version info to a server for logging. I wouldn't want to do that in any underhanded, sneaky sort of way, but opt-in doesn't seem likely to generate much data. Any thoughts on this? Good idea, bad idea? How would we do it in a way that's transparent and not sneaky but is likely to get us a reasonable amount of representative data? Note that while we don't currently have any way of getting this information, GitHub already does since they know what users and IP addresses are doing git pull against METADATA.jl. So in principle, this is already information users are sharing – just not with us.

@johnmyleswhite
Member

This would be great information to have. It would be good to hear how R's CRAN servers are dealing with this. They maintained their own log files forever about package downloads, but only recently decided to share aggregate statistics with the public. We obviously don't run servers, so we have to push information out rather than retain it, but that may not make much of a difference to people.

@JeffBezanson
Member

Very tricky. One thing I can think of is to prompt "Are you ok with Pkg sending anonymous usage data?" when Pkg.init() runs. I hate prompting the user, but it's hard to deal with this otherwise since neither opt-in nor opt-out works very well. Honestly the very best option is probably to switch to using our own servers, but that of course is quite a hassle (understatement!)

@johnmyleswhite
Member

Also worth noting: Hadley Wickham's CRANtastic tried to use an even more opt-in mechanism in which you had to explicitly call a function to send information to his server. Basically no one ever did this and he got no useful information out of it.

@StefanKarpinski
Member Author

> Also worth noting: Hadley Wickham's CRANtastic tried to use an even more opt-in mechanism in which you had to explicitly call a function to send information to his server. Basically no one ever did this and he got no useful information out of it.

I'm not sure what else could possibly have been expected. This is why making this opt-in is pretty much a non-starter, although a one-time opt-in is more likely to produce some data than an each-time opt-in like that.

@StefanKarpinski
Member Author

> Honestly the very best option is probably to switch to using our own servers, but that of course is quite a hassle (understatement!)

I really like having things hosted on GitHub. Managing METADATA.jl via pull-requests and being able to give maintainers write access to the repo is golden. Otherwise we would have to do all of that ourselves, which would be awful. We could host a read-only proxy and set up ~/.julia/METADATA to pull from there but still push to GitHub. That would be significantly easier and allow us to log the interaction. I'm not entirely sure that's more ok from a privacy perspective than just making a separate "phone home" call that people can opt out of, however.

@WestleyArgentum
Member

I never thought I'd say this, but https://enterprise.github.com/

@StefanKarpinski
Member Author

I think setting up a proxy would be easier, not to mention cheaper.

@mlubin
Member

mlubin commented Oct 22, 2013

@IainNZ and I have been putting google analytics on our readthedocs pages. Seems like a more noninvasive approach.

@staticfloat
Member

@StefanKarpinski I'm not sure what a proxied METADATA.jl would give us. When a user installs a package we don't do anything special to METADATA.jl, right? Once METADATA.jl is cloned to the user's computer, the computer doesn't touch METADATA.jl until it updates, and vanilla git doesn't send the git server any information about what we've done on the client side when it updates. So I don't see how we would get any information about what a user has installed that way.

As long as we want to support users being able to do things like install packages from multiple sources (not just our GitHub repos, etc....) and maintain the fantasy that ~/.julia/ is "just a bunch of git repos" with a little bookkeeping on the side, I think the best way to do this is to indeed have the client side report rather than have the server-side try and listen in, because we explicitly want people to be able to use servers other than our own, and we'd have to proxy every package we wanted to monitor.

As far as user-experience for checking in goes, I think it's important to be 100% functional from the get-go, without any "initial setup" or anything. So I'd envision something like this:

A default julia installation will send statistics back to the motherland on certain events (Pkg.update() perhaps?), but will notify the user either on startup or Pkg.update(). Perhaps a notice under the Julia banner that says something like:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-rc1+89
 _/ |\__'_|_|_|\__'_|  |  Commit c6a5caf* 2013-10-21 16:56:45 UTC
|__/                   |  x86_64-apple-darwin12.4.0

NOTE: This installation will send package usage statistics to <url> when updating
  Run Pkg.statistics_disable() to disable this, or Pkg.statistics_enable() to
  silence this message.  See http://julialang.org/privacy for more information

julia>

This way, if someone installs Julia and wants to go from 0 to curing cancer in 20 seconds, they don't have to wade through a bunch of installation setup; they can get to computing quickly. At the same time, privacy-conscious users are immediately told about this in a non-invasive way, and we get fine-grained control over exactly what we report from users' machines.

@mlubin
Member

mlubin commented Oct 22, 2013

Will this also report the names of local repositories installed in ~/.julia that might not be in METADATA? That would be a bit troubling.

@staticfloat
Member

We could do a check, either client-side (comparing to locally cached METADATA.jl) or server-side (comparing to globally-published METADATA.jl) to filter out unpublished packages. That would be pretty simple.
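A minimal sketch of the client-side variant of this check, written in Python purely for illustration (the real implementation would live in Julia's Pkg). The function name `filter_published` and the sample package names are hypothetical; the only idea taken from the comment above is "report a package only if it also appears in the locally cached METADATA listing".

```python
def filter_published(installed, metadata_names):
    """Keep only installed packages that are also listed in the public registry.

    `installed` and `metadata_names` are plain lists of package names; how they
    are obtained (e.g. by listing directories) is outside this sketch.
    """
    published = set(metadata_names)
    return sorted(name for name in installed if name in published)


# Hypothetical sample data: one private package that must not be reported.
installed = ["Winston", "JuMP", "MySecretProject"]
metadata = ["JuMP", "Winston", "DataFrames"]

report = filter_published(installed, metadata)
print(report)  # ['JuMP', 'Winston']
```

Doing the filtering on the client means the names of unpublished packages never leave the machine, which is the concern raised above; a server-side filter would still receive them.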

@ivarne
Member

ivarne commented Oct 22, 2013

I think the best option is to give the warning and disabling instructions when a .julia dir is created on Pkg.init or somewhere in the installer. Then we show the actual data that will be transmitted every time we transmit data. Maybe on Pkg.update we could show the data first and wait before the rest of the command is executed, so it can be aborted using ^C. That way we keep ourselves honest about what we collect, and limit the amount. We should only collect data about packages in METADATA, but the number of other packages on the system might also be useful.
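One way the "show the data, then wait so it can be aborted" idea could look, sketched in Python for illustration only. Everything here is an assumption: the name `preview_then_send`, the five-second default, and the `send` callback standing in for whatever transport would actually be used.

```python
import json
import time


def preview_then_send(payload, send, delay_seconds=5, sleep=time.sleep, out=print):
    """Print exactly what would be sent, wait, then send unless interrupted.

    `send` is a hypothetical callback that transmits the payload. `sleep` and
    `out` are injectable so the behaviour can be exercised without waiting.
    """
    out("The following data will be sent:")
    out(json.dumps(payload, indent=2, sort_keys=True))
    out(f"Press Ctrl-C within {delay_seconds} s to cancel.")
    try:
        sleep(delay_seconds)
    except KeyboardInterrupt:
        # The user aborted during the grace period: send nothing.
        out("Cancelled; nothing was sent.")
        return False
    send(payload)
    return True


# Demonstration with no delay and an in-memory "server".
sent = []
ok = preview_then_send({"packages": ["JuMP"]}, sent.append, delay_seconds=0)
```

Because the printed text is generated from the same `payload` object that is passed to `send`, what the user sees cannot drift from what is transmitted.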

@staticfloat
Member

The concern I have about doing it on Pkg.init() is that I personally never manually type Pkg.init(), I always type something like Pkg.add("Winston"), which outputs multiple pages of text as packages are downloaded, built, etc..., and it's difficult right now for a user (such as myself) to differentiate from all the text what belongs to a package installation process, what belongs to Julia's Pkg systems, and what is something else entirely.

If our Pkg.add() operations were significantly more concise and well-organized I would feel better about it, but right now everything just kind of spits out onto stdout. The coloring of stderr and INFO output is wonderful, but it's not enough yet.


@StefanKarpinski
Member Author

This is shaping into a pretty reasonable plan. I think maybe just an interactive prompt the first time someone runs Pkg.update() would be best, with an example of the data we would send and instructions for how to opt-out later even if they decide to opt-in now – Pkg.phone_home(true|false) seems like a reasonable interface. I generally frown on anything interactive, but it seems like really the only way to do this. It's a one-time yes/no prompt, so it's pretty limited. On each Pkg.update() I think we can just print a line saying

INFO: Sending output of Pkg.summary() to $server [opt-out by running Pkg.phone_home(false)]

Or maybe we should just print the whole summary every time, since a one-line notice may sound more sinister than simply showing everything that is sent.

@staticfloat
Member

I'm just concerned about automated scripts getting hung up. I often install Julia and then immediately try to run a script that might auto install a package or two.

@ihnorton
Member

How about 'storing' the preference in the existence of a file, so that the automated scripts can just touch that?

@staticfloat
Member

I guess I could just call Pkg.phone_home(true) as well. Nothing to see here, folks.

@StefanKarpinski
Member Author

Yes, I think that's a good idea. Non-existence of the file causes prompting, while the contents being "true" or "false" indicate the opt-in and opt-out states.
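The three states described here (no file, "true", "false") could be read and written roughly as follows. This is a Python sketch for illustration only; the file name `phone_home`, the helper names, and the choice to treat unrecognized contents as "never asked" are all assumptions, not anything decided in this thread.

```python
import tempfile
from pathlib import Path


def read_preference(path):
    """Return True (opted in), False (opted out), or None (never asked)."""
    path = Path(path)
    if not path.exists():
        return None  # missing file: the user has not been prompted yet
    text = path.read_text().strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    return None  # assumption: unrecognized contents count as "never asked"


def write_preference(path, opted_in):
    """Persist the choice so later runs (and scripts) skip the prompt."""
    Path(path).write_text("true\n" if opted_in else "false\n")


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    pref_file = Path(tmp) / "phone_home"  # hypothetical file name
    before = read_preference(pref_file)   # None: file does not exist yet
    write_preference(pref_file, False)
    after = read_preference(pref_file)    # False: explicit opt-out
```

An automated script can then settle the question ahead of time simply by writing "true" or "false" to the file before calling any package operation.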

@mlubin
Member

mlubin commented Oct 23, 2013

Just to be clear, this isn't for 0.2, right?

@StefanKarpinski
Member Author

No, definitely not.

@stevengj
Member

An interactive prompt currently won't work in IJulia because of JuliaLang/IJulia.jl#42, although this should be fixable for special-purpose prompts (the difficulty was redirecting stdin in general).

@StefanKarpinski
Member Author

We could detect whether STDOUT is a TTY and, when it isn't, print an error instead of prompting, indicating that the user needs to run Pkg.phone_home(true) or Pkg.phone_home(false). Or we could treat "unopted" as opt-out when STDOUT isn't a TTY.
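The second option (treat "unopted" as opt-out when there is no terminal) can be expressed as a small decision function. The sketch below is in Python for illustration; `should_send` and `ask_on_terminal` are hypothetical names, and the stored preference is assumed to be `True`, `False`, or `None` for "never asked".

```python
import sys


def should_send(preference, is_tty, ask):
    """Decide whether to send statistics.

    preference: True (opted in), False (opted out), or None (never asked).
    is_tty:     whether standard output is attached to a terminal.
    ask:        callable returning a bool; only invoked on a terminal.
    """
    if preference is not None:
        return preference       # an explicit choice always wins
    if not is_tty:
        return False            # "unopted" without a TTY is treated as opt-out
    return ask()                # interactive session: prompt once


def ask_on_terminal():
    reply = input("Send anonymous package statistics? [y/N] ")
    return reply.strip().lower() in ("y", "yes")


# In real use the second argument would be sys.stdout.isatty(); it is fixed to
# False here so the demonstration never blocks waiting for input.
print(should_send(None, False, ask_on_terminal))  # False
```

With this rule a script running under cron or in a notebook never hangs on a prompt and never sends data the user has not agreed to.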

@staticfloat
Member

> Or we could treat "unopted" as opt-out when STDOUT isn't a TTY.

I like this idea the best.

Something Viral and I were just talking about is being able to get install base numbers from something like this. E.g. do we really need to support OSX 10.6? How many people are using Julia on Ubuntu? It would be neat to have basic stuff like what we report from versioninfo() reported as well. We could even keep track of how many people live on the bleeding edge and how many people stay at released versions!
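To make the kind of question above concrete, here is a rough Python sketch of how collected reports might be tallied on the receiving side. The report fields (`os`, `release`) and the sample entries are invented for illustration; nothing in this thread fixes a report format.

```python
from collections import Counter


def summarize(reports):
    """Count reports per operating system and per release channel."""
    by_os = Counter(r["os"] for r in reports)
    by_channel = Counter(
        "release" if r["release"] else "development" for r in reports
    )
    return by_os, by_channel


# Made-up sample reports, one per installation.
reports = [
    {"os": "Ubuntu", "release": True},
    {"os": "Ubuntu", "release": False},
    {"os": "OSX 10.6", "release": True},
]

by_os, by_channel = summarize(reports)
print(by_os["Ubuntu"], by_channel["development"])  # 2 1
```

Counts like these only estimate installs if each machine is counted once, so some form of de-duplication would be needed in practice.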

@StefanKarpinski
Member Author

I agree. I think that the benefits to the community as a whole would be enormous. Just knowing what to focus on is a huge benefit. It means we can allocate our efforts more efficiently and sanely.

@pao
Member

pao commented Nov 1, 2013

My wishlist:

  1. A clear, properly written privacy policy.
  2. An optional way to inspect what data is sent before it is sent (at least once, as an example of what we collect--and again if the data sent is changed).
  3. I keep thinking there's a third thing but I can't seem to come up with it. I'll come back if it comes to me.

@StefanKarpinski
Member Author

If we're going to do this, it would be nice to have some way of uniquely identifying machines and/or users. Hashing a MAC address might work. If they are set, hashing the value of git config --global user.email and/or git config --global github.user could work.
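A minimal sketch of the hashing idea, in Python for illustration. The function name `anonymous_id`, the choice of SHA-256, and the truncation to 16 hex characters are all assumptions. Note that Python's `uuid.getnode()` falls back to a random number when no hardware address can be found, so the machine identifier below is not guaranteed to be stable on every system.

```python
import hashlib
import uuid


def anonymous_id(value):
    """Map an identifying string to a short, fixed-length hex digest."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[:16]  # truncation length is an arbitrary choice


# Hardware address as 12 hex digits (may be random if none is available).
mac = format(uuid.getnode(), "012x")
machine_id = anonymous_id(mac)

# The same scheme applied to a user-level value such as an email address
# (placeholder value, not read from git here).
user_id = anonymous_id("someone@example.com")
```

The same input always yields the same identifier, which is what allows repeat reports from one machine or user to be counted once without storing the raw value.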

@MikeInnes
Member

Somewhat related: GitHub just made data about clones over the last two weeks available – you could scrape this data and show it on pkg.julialang.org without the privacy concerns.

@JeffBezanson
Member

That's awesome. We've always wanted that data. Amazing fact of the moment: the serialization/deserialization performance issue is by itself one of the 10 most viewed pages in the repo.

@staticfloat
Member

Note that this data seems to only go back for two weeks or so. So we've got 645 UNIQUE clones, and 6,675 UNIQUE views in a fortnight, which is pretty impressive, at least to me. ;)

@JeffBezanson
Member

Yes, not too shabby. Of course I'm immediately greedy and want a much bigger window of data :)

@staticfloat
Member

I'll start working on TimeMachine.jl. Let's just hope The Eschaton doesn't notice.

@IainNZ
Member

IainNZ commented Aug 12, 2014

Whoa, I'll start pulling that data (if it's API-exposed)

@IainNZ
Member

IainNZ commented Aug 12, 2014

I'll try hitting this: https://github.com/JuliaOpt/JuMP.jl/graphs/clone-activity-data and seeing what I can get

@staticfloat
Member

I was just about to say:

"shhhhhh: https://github.com/JuliaLang/julia/graphs/clone-activity-data"


@IainNZ
Member

IainNZ commented Aug 12, 2014

Hope I don't get IP banned forever for doing that 350 times, lol

@IainNZ
Member

IainNZ commented Aug 28, 2014

If anyone knows how to scrape that with Requests, let me know. You need to be logged in to see it, and it's not a real API endpoint, so I don't see a way (or know how) to use a token.

@staticfloat
Member

I'm getting pretty close, as long as you don't mind giving this script your GitHub username. It logs in correctly, and I believe will allow us to get at the data we need, but unfortunately Requests.jl has a bug which needs to get fixed before we can actually use this.

@samuelcolvin
Contributor

GitHub helpfully supplies the underlying data as JSON at

https://github.com/JuliaLang/julia/graphs/clone-activity-data?_=1409921223000

so no complicated scraping is required.

@ivarne
Member

ivarne commented Sep 5, 2014

@samuelcolvin I just get access denied when I download that using wget. In a logged-in browser it works fine, but that means you'd have to manually download everything in the browser and copy it somewhere for analysis. 350 manual fetches in the browser seems like something you'd use scraping software to help with.

@StefanKarpinski
Member Author

It must be possible to automatically authenticate as well. We do this when using the GitHub APIs from Julia.

@samuelcolvin
Contributor

Not sure it's possible if you're not using api.github... I'll try and see.

@IainNZ
Member

IainNZ commented Sep 5, 2014

No, I'm pretty sure @staticfloat's way is the only way here; the API is different, and I make use of that extensively.

@samuelcolvin
Contributor

Yeah, this doesn't work, so I'm pretty sure we have to do it the ugly way or wait for the proper API:

curl -H "Authorization: token <my api token>" "https://github.com/samuelcolvin/JuliaByExample/graphs/clone-activity-data?_=1409921223000"

@tkelman
Contributor

tkelman commented Sep 14, 2016

@StefanKarpinski why was this closed? Incorporating some form of usage analytics in Pkg3 would be a good idea if possible.

@StefanKarpinski
Member Author

Sure, but I don't think we need an issue for that. We can reopen if you like.

@tkelman
Contributor

tkelman commented Sep 14, 2016

If Pkg3 ends up being developed primarily in a separate repo we can move it to a new issue there.
