P.k.g. phone home #4609

StefanKarpinski · 2013-10-22T17:53:48Z

It would be nice to have some way of knowing what packages people are using and some way of estimating Julia installs. We could potentially achieve this by having Pkg phone home when doing Pkg.update – i.e. send a list of installed packages and system version info to a server for logging. I wouldn't want to do that in any underhanded, sneaky sort of way, but opt-in doesn't seem likely to generate much data. Any thoughts on this? Good idea, bad idea? How would we do it in a way that's transparent and not sneaky but is likely to get us a reasonable amount of representative data? Note that while we don't currently have any way of getting this information, GitHub already does since they know what users and IP addresses are doing git pull against METADATA.jl. So in principle, this is already information users are sharing – just not with us.


johnmyleswhite · 2013-10-22T17:57:11Z

This would be great information to have. It would be good to hear how R's CRAN servers are dealing with this. They maintained their own log files forever about package downloads, but only recently decided to share aggregate statistics with the public. We obviously don't run servers, so we have to push information out rather than retain it, but that may not make much of a difference to people.

JeffBezanson · 2013-10-22T17:59:35Z

Very tricky. One thing I can think of is prompt "Are you ok with Pkg sending anonymous usage data?" when Pkg.init() runs. I hate prompting the user, but it's hard to deal with this otherwise since neither opt-in nor opt-out works very well. Honestly the very best option is probably to switch to using our own servers, but that of course is quite a hassle (understatement!)

johnmyleswhite · 2013-10-22T18:04:22Z

Also worth noting: Hadley Wickham's CRANtastic tried to use an even more opt-in mechanism in which you had to explicitly call a function to send information to his server. Basically no one ever did this and he got no useful information out of it.

StefanKarpinski · 2013-10-22T18:06:53Z

> Also worth noting: Hadley Wickham's CRANtastic tried to use an even more opt-in mechanism in which you had to explicitly call a function to send information to his server. Basically no one ever did this and he got no useful information out of it.

I'm not sure what else could possibly have been expected. This is why making this opt-in is pretty much a non-starter, although a one-time opt-in is more likely to produce some data than an each-time opt-in like that.

StefanKarpinski · 2013-10-22T18:10:08Z

> Honestly the very best option is probably to switch to using our own servers, but that of course is quite a hassle (understatement!)

I really like having things hosted on GitHub. Managing METADATA.jl via pull requests and being able to give maintainers write access to the repo is golden. Otherwise we would have to do all of that ourselves, which would be awful. We could host a read-only proxy and set up ~/.julia/METADATA to pull from there but still push to GitHub. That would be significantly easier and allow us to log the interaction. I'm not entirely sure that's more ok from a privacy perspective than just making a separate "phone home" call that people can opt out of, however.

WestleyArgentum · 2013-10-22T18:12:03Z

I never thought I'd say this, but https://enterprise.github.com/

StefanKarpinski · 2013-10-22T18:28:56Z

I think setting up a proxy would be easier, not to mention cheaper.

mlubin · 2013-10-22T20:05:57Z

@IainNZ and I have been putting google analytics on our readthedocs pages. Seems like a more noninvasive approach.

staticfloat · 2013-10-22T20:16:59Z

@StefanKarpinski I'm not sure what a proxied METADATA.jl would give us. When a user installs a package we don't do anything special to METADATA.jl, right? Once METADATA.jl is cloned to the user's computer, the computer doesn't touch METADATA.jl until it updates, and vanilla git doesn't send the git server any information about what we've done on the client side when it updates, so I don't see how we would get any information related to what a user has installed that way.

As long as we want to support users being able to do things like install packages from multiple sources (not just our GitHub repos, etc....) and maintain the fantasy that ~/.julia/ is "just a bunch of git repos" with a little bookkeeping on the side, I think the best way to do this is to indeed have the client side report rather than have the server-side try and listen in, because we explicitly want people to be able to use servers other than our own, and we'd have to proxy every package we wanted to monitor.

As far as user-experience for checking in goes, I think it's important to be 100% functional from the get-go, without any "initial setup" or anything. So I'd envision something like this:

A default julia installation will send statistics back to the motherland on certain events (Pkg.update() perhaps?), but will notify the user either on startup or Pkg.update(). Perhaps a notice under the Julia banner that says something like:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-rc1+89
 _/ |\__'_|_|_|\__'_|  |  Commit c6a5caf* 2013-10-21 16:56:45 UTC
|__/                   |  x86_64-apple-darwin12.4.0

NOTE: This installation will send package usage statistics to <url> when updating
  Run Pkg.statistics_disable() to disable this, or Pkg.statistics_enable() to
  silence this message.  See http://julialang.org/privacy for more information

julia>

This way, if someone installs Julia and wants to go from 0 to curing cancer in 20 seconds, they don't have to wade through a bunch of installation setup; they can get to computing quickly. Simultaneously, privacy-conscious users are immediately told about this in a non-invasive way, and we get fine-grained control over exactly what we report from users' machines.

mlubin · 2013-10-22T20:26:14Z

Will this also report the names of local repositories installed in ~/.julia that might not be in METADATA? That would be a bit troubling.

staticfloat · 2013-10-22T20:32:00Z

We could do a check, either client-side (comparing to locally cached METADATA.jl) or server-side (comparing to globally-published METADATA.jl) to filter out unpublished packages. That would be pretty simple.
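The client-side variant of that check can be sketched in a few lines. This is only an illustration in Python, not Pkg code: `filter_published` and both example lists are hypothetical, and it assumes the locally cached METADATA.jl can be read as a plain list of package names.

```python
def filter_published(installed, metadata_names):
    """Keep only installed packages whose names appear in the registry.

    `installed` and `metadata_names` are plain lists of package names;
    how they are obtained from ~/.julia is left out of this sketch.
    """
    published = set(metadata_names)
    return sorted(name for name in installed if name in published)


# Hypothetical example data: one local package is not in the registry.
installed = ["Winston", "JuMP", "MyPrivateThing"]
metadata = ["JuMP", "Winston", "DataFrames"]
print(filter_published(installed, metadata))
```

Filtering on the client means the names of unpublished packages never leave the machine, which seems closer to the concern raised above than filtering on the server.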

ivarne · 2013-10-22T20:57:24Z

I think the best option is to give the warning and disabling instructions when a .julia dir is created on Pkg.init or somewhere in the installer. Then we show the actual data that will be transmitted every time we transmit data. Maybe on Pkg.update we will have the option to show the data first, and wait until the rest of the command is executed, so it can be aborted using ^C. That way we keep ourselves honest about what we collect, and limit the amount. We should only collect data about packages in METADATA, but the number of other packages on the system might be useful also.

staticfloat · 2013-10-22T21:01:36Z

The concern I have about doing it on Pkg.init() is that I personally never
manually type Pkg.init(), I always type something like Pkg.add("Winston"),
which outputs multiple pages of text as packages are downloaded, built,
etc..., and it's difficult right now for a user (such as myself) to
differentiate from all the text what belongs to a package installation
process, what belongs to Julia's Pkg systems, and what is something else
entirely.

If our Pkg.add() operations were significantly more concise and
well-organized I would feel better about it, but right now everything just
kind of spits out onto stdout. The coloring of stderr and INFO output is
wonderful, but it's not enough yet.


StefanKarpinski · 2013-10-23T00:08:34Z

This is shaping into a pretty reasonable plan. I think maybe just an interactive prompt the first time someone runs Pkg.update() would be best, with an example of the data we would send and instructions for how to opt-out later even if they decide to opt-in now – Pkg.phone_home(true|false) seems like a reasonable interface. I generally frown on anything interactive, but it seems like really the only way to do this. It's a one-time yes/no prompt, so it's pretty limited. On each Pkg.update() I think we can just print a line saying

INFO: Sending output of Pkg.summary() to $server [opt-out by running Pkg.phone_home(false)]

Or maybe we should just print it out every time since that may sound more sinister than just showing the whole thing.

staticfloat · 2013-10-23T00:13:24Z

I'm just concerned about automated scripts getting hung up. I often install
julia and then immediately try to run a script that might auto install a
package or two.

ihnorton · 2013-10-23T00:21:37Z

How about 'storing' the preference in the existence of a file, so that the automated scripts can just touch that?

staticfloat · 2013-10-23T00:34:00Z

I guess I could just call Pkg.phone_home(true) as well. Nothing to see here, folks.

StefanKarpinski · 2013-10-23T00:34:10Z

Yes, I think that's a good idea. Non-existence of the file causes prompting, while the contents being "true" or "false" indicate the opt-in and opt-out states.
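A minimal sketch of that three-state preference file, written in Python purely for illustration. The function names and the file name `phone_home` are hypothetical, and treating unrecognized contents as "never asked" is an assumption, not something settled in this thread.

```python
import os
import tempfile


def read_preference(path):
    """Return True (opted in), False (opted out), or None (never asked)."""
    if not os.path.exists(path):
        return None  # no file yet: the user should be prompted
    with open(path) as f:
        text = f.read().strip()
    if text == "true":
        return True
    if text == "false":
        return False
    return None  # assumption: unrecognized contents count as never asked


def write_preference(path, opted_in):
    """Record the user's choice as the literal text "true" or "false"."""
    with open(path, "w") as f:
        f.write("true" if opted_in else "false")


# Demonstrate the three states in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "phone_home")  # hypothetical file name
    print(read_preference(path))
    write_preference(path, False)
    print(read_preference(path))
```

An automated script could then write "true" or "false" to the file before its first package operation and never hit the prompt.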

mlubin · 2013-10-23T00:59:25Z

Just to be clear, this isn't for 0.2, right?

StefanKarpinski · 2013-10-23T01:10:27Z

No, definitely not.

stevengj · 2013-10-23T21:00:52Z

An interactive prompt currently won't work in IJulia because of JuliaLang/IJulia.jl#42, although this should be fixable for special-purpose prompts (the difficulty was redirecting stdin in general).

StefanKarpinski · 2013-10-23T21:07:20Z

We could detect whether STDOUT is a TTY and, if it isn't, print an error indicating that the user needs to run Pkg.phone_home(true) or Pkg.phone_home(false) instead of prompting. Or we could treat "unopted" as opt-out when STDOUT isn't a TTY.
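The second option (unopted means opt-out when there is no terminal) might look roughly like this. It is a Python sketch under stated assumptions: `should_send`, the prompt wording, and the `ask` parameter are all hypothetical, and the preference is passed in as True, False, or None for "unopted".

```python
import io
import sys


def should_send(preference, stream=None, ask=input):
    """Decide whether to send statistics.

    preference: True (opted in), False (opted out), or None (unopted).
    stream:     the output stream to test for a terminal (default: stdout).
    ask:        callable used to prompt; injectable so scripts can stub it.
    """
    stream = sys.stdout if stream is None else stream
    if preference is not None:
        return preference  # an explicit choice always wins
    if not stream.isatty():
        return False  # unopted and non-interactive: treat as opt-out
    answer = ask("Send anonymous package usage data? [y/N] ")
    return answer.strip().lower() in ("y", "yes")


# A StringIO stands in for redirected output; it is never a TTY,
# so an unopted user is not prompted and nothing is sent.
print(should_send(None, stream=io.StringIO()))
print(should_send(True, stream=io.StringIO()))
```

With this shape, a script run with redirected output never blocks on a prompt, and the question is only ever asked in an interactive session.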

staticfloat · 2013-11-01T00:22:22Z

> Or we could treat "unopted" as opt-out when STDOUT isn't a TTY.

I like this idea the best.

Something Viral and I were just talking about is being able to get install base numbers from something like this. E.g. do we really need to support OSX 10.6? How many people are using Julia on Ubuntu? It would be neat to have basic stuff like what we report from versioninfo() reported as well. We could even keep track of how many people live on the bleeding edge and how many people stay at released versions!

StefanKarpinski · 2013-11-01T03:18:31Z

I agree. I think that the benefits to the community as a whole would be enormous. Just knowing what to focus on is a huge benefit. It means we can allocate our efforts more efficiently and sanely.

pao · 2013-11-01T12:37:44Z

My wishlist:

1. A clear, properly written privacy policy.
2. An optional way to inspect what data is sent before it is sent (at least once, as an example of what we collect, and again if the data sent is changed).

I keep thinking there's a third thing but I can't seem to come up with it. I'll come back if it comes to me.

StefanKarpinski · 2013-11-01T13:17:28Z

If we're going to do this, it would be nice to have some way of uniquely identifying machines and/or users. Hashing a MAC address might work. If they are set, hashing the value of git config --global user.email and/or git config --global github.user could work.
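As a rough illustration of the hashing idea, here is a Python sketch. The names `anonymous_id` and `machine_id` and the salt string are hypothetical. It uses SHA-256 from the standard library and `uuid.getnode()`, which returns the hardware address as a 48-bit integer, or a random 48-bit number if no address can be read.

```python
import hashlib
import uuid


def anonymous_id(value, salt="pkg-stats-example"):
    """Return a hex SHA-256 digest of a salted string value."""
    data = f"{salt}:{value}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def machine_id():
    """Hash this machine's MAC address into a stable identifier."""
    mac = uuid.getnode()  # 48-bit integer; random if no MAC is readable
    return anonymous_id(f"{mac:012x}")


print(machine_id())
print(anonymous_id("someone@example.com"))  # placeholder email address
```

A plain hash of a MAC address is only weakly anonymous: there are at most 2^48 possible addresses, so the digests can in principle be reversed by enumeration. Whether that is acceptable would be a question for the privacy policy.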

MikeInnes · 2014-08-12T16:32:40Z

Somewhat related: Github just made data about clones over the last two weeks available – you could scrape this data and show it on pkg.julialang.org without the privacy concerns.

JeffBezanson · 2014-08-12T16:44:45Z

That's awesome. We've always wanted that data. Amazing fact of the moment: the serialization/deserialization performance issue is by itself one of the 10 most viewed pages in the repo.

staticfloat · 2014-08-12T16:46:30Z

Note that this data seems to only go back for two weeks or so. So we've got 645 UNIQUE clones, and 6,675 UNIQUE views in a fortnight, which is pretty impressive, at least to me. ;)

JeffBezanson · 2014-08-12T16:48:03Z

Yes, not too shabby. Of course I'm immediately greedy and want a much bigger window of data :)

staticfloat · 2014-08-12T16:51:20Z

I'll start working on TimeMachine.jl. Let's just hope The Eschaton doesn't notice.

IainNZ · 2014-08-12T16:55:33Z

Wooah, I'll start pulling that data (if it's API-exposed)

IainNZ · 2014-08-12T17:08:28Z

:( https://twitter.com/alindeman/status/499239929604743169 no API yet

IainNZ · 2014-08-12T17:10:40Z

I'll try hitting this: https://github.com/JuliaOpt/JuMP.jl/graphs/clone-activity-data and seeing what I can get

staticfloat · 2014-08-12T17:12:12Z

I was just about to say:

"shhhhhh: https://github.com/JuliaLang/julia/graphs/clone-activity-data"


IainNZ · 2014-08-12T17:14:15Z

Hope I don't get IP banned forever for doing that 350 times, lol

IainNZ · 2014-08-28T21:16:52Z

If anyone knows how to scrape that with Requests, let me know. You need to be logged in to see it, and it's not a real API endpoint, so I don't see a way (or know how) to use a token.

staticfloat · 2014-08-29T01:28:16Z

I'm getting pretty close, as long as you don't mind giving this script your GitHub username. It logs in correctly, and I believe will allow us to get at the data we need, but unfortunately Requests.jl has a bug which needs to get fixed before we can actually use this.

samuelcolvin · 2014-09-05T12:51:17Z

github helpfully supply the underlying data in JSON at

https://github.com/JuliaLang/julia/graphs/clone-activity-data?_=1409921223000

so no complicated scraping required.

ivarne · 2014-09-05T13:29:37Z

@samuelcolvin I just get access denied when I download that using wget. In a logged-in browser it works fine, but that means you'd have to manually download everything in the browser and copy it somewhere for analysis. 350 manual fetches in the browser seems like something you'd use scraping software to help with.

StefanKarpinski · 2014-09-05T13:46:20Z

It must be possible to automatically authenticate as well. We do this when using the GitHub APIs from Julia.

samuelcolvin · 2014-09-05T13:48:07Z

Not sure it's possible if you're not using api.github... I'll try and see.

IainNZ · 2014-09-05T14:26:19Z

No, I'm pretty sure @staticfloat's way is the only way here. The API is different, and I make use of that extensively.

samuelcolvin · 2014-09-05T16:28:58Z

ye, this doesn't work, so I'm pretty sure we have to do it the ugly way or wait for the proper api

curl -H "Authorization: token <my api token>" https://github.com/samuelcolvin/JuliaByExample/graphs/clone-activity-data?_=1409921223000

tkelman · 2016-09-14T22:18:16Z

@StefanKarpinski why was this closed? Incorporating some form of usage analytics in Pkg3 would be a good idea if possible.

StefanKarpinski · 2016-09-14T22:21:09Z

Sure, but I don't think we need an issue for that. We can reopen if you like.

tkelman · 2016-09-14T22:23:36Z

If Pkg3 ends up being developed primarily in a separate repo we can move it to a new issue there.

staticfloat mentioned this issue Nov 19, 2013

Roadmap for 0.3 #4853

Closed

21 tasks

staticfloat mentioned this issue Jan 9, 2014

Reproducible system crash on mac os x 10.6.8 #5329

Closed

StefanKarpinski closed this as completed Sep 13, 2016

staticfloat mentioned this issue Mar 17, 2017

Pkg3: telemetry JuliaLang/Juleps#29

Open
