
Rejection sampling variational inference #819

Open · wants to merge 43 commits into base: master

Conversation

@cavaunpeu (Contributor)

WIP. Addresses #379.

@dustinvtran (Member)

Thanks for working on this! Ping me whenever you'd like some feedback.

@cavaunpeu (Contributor, Author)

Will do! It's very WIP for now. I want to get a thing working, then clean it up considerably.

@cavaunpeu (Contributor, Author) commented Jan 21, 2018

@dustinvtran, @naesseth

Hi. A few updates/questions:

  • I implemented a KucukelbirOptimizer as given in Equation 9 of the paper. (Too lazy to search for its real name, I called it this.) I put it in edward/optimizers/sgd.py and conformed its API to that of the other optimizer types that VariationalInference.initialize expects; it's a simple interface with an apply_gradients method. Eventually, this is something to make more robust and merge upstream into TensorFlow, I'd think (if it's not there already). I've added an integration test as well. (A sketch of the step-size rule appears below.)

  • My next step is to implement a ReparameterizedRejectionSampler object. This doesn't need to know anything about VI. The motivation is the ability to test it separately.

After this, everything should be on the right track.
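
To make the rule concrete, here is a minimal numpy sketch of the ADVI-style adaptive step size from Kucukelbir et al.; the s_n and n state mirrors the self.s_n and self.n attributes in this PR, while the class name, signature, and default constants are illustrative rather than the exact code here.

```python
import numpy as np

class AdaptiveStepSize:
    """Minimal sketch of an ADVI-style adaptive step size (after Kucukelbir et al.).

    rho_n = eta * n^(-1/2 + eps) / (tau + sqrt(s_n)),
    s_n   = alpha * g_n^2 + (1 - alpha) * s_{n-1},  with s_1 = g_1^2.
    """

    def __init__(self, eta=0.1, tau=1.0, alpha=0.1, eps=1e-16):
        self.eta, self.tau, self.alpha, self.eps = eta, tau, alpha, eps
        self.s_n = None  # running average of squared gradients
        self.n = 0       # iteration counter

    def apply_gradients(self, params, grads):
        self.n += 1
        grads = np.asarray(grads, dtype=float)
        # s_1 = g_1^2; afterwards an exponential moving average of g^2.
        if self.s_n is None:
            self.s_n = grads ** 2
        else:
            self.s_n = self.alpha * grads ** 2 + (1.0 - self.alpha) * self.s_n
        # Per-coordinate step size that decays with the iteration count.
        rho = self.eta * self.n ** (-0.5 + self.eps) / (self.tau + np.sqrt(self.s_n))
        return params - rho * grads  # plain gradient-descent update
```

Here apply_gradients takes numpy arrays for simplicity; the PR's version conforms to the grads_and_vars interface that VariationalInference.initialize expects.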

Thus far, I've worked on build_rejection_sampling_loss_and_gradients, as a counterpart to build_reparam_loss_and_gradients. Would the gradients given in the paper extend easily to a

  • build_rejection_sampling_kl_loss_and_gradients
  • build_rejection_sampling_entropy_loss_and_gradients

as well? I haven't taken the time to think this through.

How does this all sound? Thanks!

self.s_n = s_n
self.n = n

def apply_gradients(self, grads_and_vars, global_step=None):
@cavaunpeu (Contributor, Author) commented on this diff:

@dustinvtran

I'd appreciate it if you could glance at this method as well: my integration test passes on some days and fails on others, with zero changes to my code. Promise 🤞.

@cavaunpeu (Contributor, Author) commented Jan 30, 2018

Hey @dustinvtran, @naesseth. I've:

  • Removed the "Alp" sampler, and specified 'rmsprop' with a decaying learning rate.
  • Unit-tested a GammaRejectionSampler object (the underlying reparameterization is sketched below).
  • Unit-tested build_rejection_sampling_loss_and_gradients.

In the latter two, I pinned results to those computed in this notebook and vetted them thoroughly.

My gradients are still exploding. Might you have a chance to give some 👀 this week?
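
For concreteness, here is a rough numpy sketch of the Marsaglia-Tsang reparameterized rejection sampler that RSVI builds on for the Gamma distribution (valid for shape alpha >= 1; smaller shapes use the usual shape-augmentation trick). The class and method names are illustrative, not the PR's GammaRejectionSampler API.

```python
import numpy as np

class GammaRejectionSamplerSketch:
    """Marsaglia-Tsang reparameterization z = h(eps, alpha) for Gamma(alpha, beta), alpha >= 1."""

    def __init__(self, alpha, beta=1.0):
        self.alpha, self.beta = alpha, beta

    def h(self, eps):
        # z = (alpha - 1/3) * (1 + eps / sqrt(9*alpha - 3))^3 / beta
        a = self.alpha
        return (a - 1.0 / 3.0) * (1.0 + eps / np.sqrt(9.0 * a - 3.0)) ** 3 / self.beta

    def h_inverse(self, z):
        # Recover eps from an accepted z; needed to evaluate gradient terms at samples.
        a = self.alpha
        return np.sqrt(9.0 * a - 3.0) * ((self.beta * z / (a - 1.0 / 3.0)) ** (1.0 / 3.0) - 1.0)

    def sample(self, rng=np.random):
        # Naive accept/reject loop following Marsaglia & Tsang (2000).
        d = self.alpha - 1.0 / 3.0
        c = 1.0 / np.sqrt(9.0 * d)
        while True:
            eps = rng.standard_normal()
            v = (1.0 + c * eps) ** 3
            if v <= 0.0:
                continue
            u = rng.uniform()
            if np.log(u) < 0.5 * eps ** 2 + d - d * v + d * np.log(v):
                return eps, d * v / self.beta  # accepted noise and the sample z = h(eps, alpha)
```

Roughly speaking, z enters the model terms while eps is what the reparameterization path differentiates through h.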

@ghost commented Feb 7, 2018

Hey @dustinvtran. Following up :)

@dustinvtran (Member)

Apologies for the delay! Busy with ICML stuff due this Friday. Maybe ping me this weekend? :)

@ghost commented Feb 7, 2018

Ah, no worries! Will do.

The PR is in a good state, I think. I just need to close the gap between "unit tests for one iteration of the RSVI gradient calculations pass" and "it just works". My hunch is that there's some slight "magic" with learning rates, supports, or the like that I'm missing.

@dustinvtran (Member) left a comment

This is great work! I think the testing as-is is fine.

I like your API for rejection sampling for Gamma. In future work after merging this, it would be nice to move upstream to TensorFlow distributions. There, we can think a bit harder about how to incorporate various forms of reparameterized samplers into tf.contrib.distributions.Gamma.

Edward 2.0's ed.klqp will incorporate the loss function you wrote down here.

@@ -144,6 +143,7 @@ def run(self, variables=None, use_coordinator=True, *args, **kwargs):

for _ in range(self.n_iter):
info_dict = self.update()
print(info_dict)
@dustinvtran (Member):

rm?

@@ -123,7 +123,6 @@ def run(self, variables=None, use_coordinator=True, *args, **kwargs):
Passed into `initialize`.
"""
self.initialize(*args, **kwargs)
@dustinvtran (Member):

add back newline? unrelated to PR

@@ -32,7 +32,7 @@ class KLpq(VariationalInference):

with respect to $\\theta$.

In conditional inference, we infer $z` in $p(z, \\beta
In conditional inference, we infer $z$ in $p(z, \\beta
@dustinvtran (Member):

This is unrelated to this PR. Can you make a new PR to fix this?

tf.summary.scalar("loss/reg_penalty", reg_penalty,
collections=[inference._summary_key])

g_rep = tf.gradients(rep, var_list)
@dustinvtran (Member):

Can you explain why you need the multiple gradient calls and not just one? This seems inefficient.
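
For illustration only (toy tensors, not code from this PR): since differentiation is linear, the three tf.gradients calls over rep, cor, and q_entropy could in principle be collapsed into a single backward pass over their sum, at the cost of losing the per-term breakdown that is handy for debugging.

```python
import tensorflow as tf  # TF1-style graph API, as used in Edward

# Toy stand-ins for the three loss pieces; in the PR they come from
# build_rejection_sampling_loss_and_gradients.
theta = tf.Variable([1.0, 2.0])
rep = tf.reduce_sum(theta ** 2)           # reparameterization term (toy)
cor = tf.reduce_sum(tf.sin(theta))        # score-function correction term (toy)
q_entropy = tf.reduce_sum(tf.log(theta))  # entropy term (toy)
var_list = [theta]

# Three backward passes, mirroring the diff:
g_rep = tf.gradients(rep, var_list)
g_cor = tf.gradients(cor, var_list)
g_entropy = tf.gradients(q_entropy, var_list)
grads_three = [tf.add_n(list(gs)) for gs in zip(g_rep, g_cor, g_entropy)]

# One backward pass on the summed surrogate; identical by linearity of the gradient:
grads_one = tf.gradients(rep + cor + q_entropy, var_list)
```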

@cavaunpeu (Contributor, Author)

Hey @dustinvtran. Thanks for the feedback. I left this in "debug" state (print statements, etc.), as it doesn't yet work. Specifically, the main integration test, _test_poisson_gamma, does not pass: the gradients explode to np.nan.

Any insight as to why this might be? I'm guessing I'm lacking some expertise in getting this to work. NB: the "single pass gradient computation" integration test, _test_build_rejection_sampling_loss_and_gradients, does pass (and has been thoroughly cross-checked with the Blei-lab notebook).

Also, I think there are other code-organization considerations before this is ready to merge. For instance, what happens when we have more than one latent variable, where some require a rejection sampler and others don't?

In short: I think I need some help :)

@cavaunpeu (Contributor, Author)

Ping, @dustinvtran :) What's the best way forward? Have a moment to review in the coming days? Anyone else I should reach out to?

Cheers :)

@dustinvtran (Member)

Not sure I know enough about the algorithm to help, unfortunately. What happens if you try 1000 samples per iteration? Maybe @naesseth can reply?

@ghost commented Feb 22, 2018

Cool! Will ping them here.

@naesseth @slinderman have a moment to help get this merged? If you're still in NYC, happy to come meet in person as well, so as to get this one done!

@slinderman

Hi @williamabrwolf, I’m traveling now but I’ll be back in NYC mid March. Happy to meet then if you’d like. I’ll have limited cycles between now and then but I’ll try to take a look at the code here. Perhaps Christian can also help out in the meantime. Glad to see you’re integrating this into Edward!

@ghost commented Feb 22, 2018

@slinderman mid-March works great. Grateful for the help, and happy to meet then if this is still unresolved. [email protected] is me. Cheers :).

@naesseth

I'll try and take a look early next week, currently traveling.

@ghost commented Mar 10, 2018

@slinderman @naesseth Hey all, are you back in NYC? I'd love to meet for two hours and get this to a place where we can merge soon.

@ghost commented Mar 21, 2018

Hey @slinderman, @naesseth. Following up :). Back in the city? Happy to travel to you to make this happen.

Cheers!

@slinderman commented Mar 21, 2018 via email

g_cor = tf.gradients(cor, var_list)
g_entropy = tf.gradients(q_entropy, var_list)

grad_summands = zip(*[g_rep, g_cor, g_entropy])
Review comment on this diff:

Can we try dropping g_cor from this summand and see if tests still pass?

Expected behavior: pass at a higher tolerance, but not blow up.

This is a possible culprit re: why gradients are exploding in running _test_poisson_gamma.

With a reasonably small step size, maybe 100 epochs.

Worth keeping an eye on g_entropy:

  • First, try g_rep and g_entropy
  • Next, try just g_rep

Print all the gradient terms from the notebook as well.
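
One way to run that ablation, assuming g_rep, g_cor, and g_entropy are the per-variable gradient lists from the diff above; the helper name and its flags are hypothetical, not part of this PR.

```python
import tensorflow as tf

def combine_gradient_terms(g_rep, g_cor, g_entropy, use_cor=True, use_entropy=True):
    """Sum per-variable gradient terms, optionally dropping pieces to see which one explodes."""
    combined = []
    for rep_g, cor_g, ent_g in zip(g_rep, g_cor, g_entropy):
        terms = [rep_g]
        if use_cor:
            terms.append(cor_g)
        if use_entropy:
            terms.append(ent_g)
        combined.append(tf.add_n(terms))
    return combined

# The experiments suggested above:
# grads = combine_gradient_terms(g_rep, g_cor, g_entropy, use_cor=False)                     # g_rep + g_entropy
# grads = combine_gradient_terms(g_rep, g_cor, g_entropy, use_cor=False, use_entropy=False)  # g_rep only
# grads_and_vars = list(zip(grads, var_list))
```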

@cavaunpeu (Contributor, Author):

"With a reasonably small step size, maybe 100 epochs." --> i.e. it should pass "with a reasonably small step size, and run for maybe 100 epochs."
