
Process CCD Notebook #34

Open
kadrlica opened this issue May 16, 2018 · 10 comments

@kadrlica (Contributor)

This is mostly my attempt to run processEimage.py on the Twinkles data. It has ended up not being as easy as I hoped.

@drphilmarshall (Contributor) commented May 16, 2018

"It has ended up not being as easy as I hoped."

First contender for Stack Club tagline? :-)

Do you want to submit a PR and use it to point out some sticking points? Might be good for all of us watching to appreciate the challenges. Thanks for taking a crack at this!

@kadrlica (Contributor, Author) commented May 16, 2018

The current version of the notebook can be found here, but I'm not sure it's ready for a PR. The notebook is not fully functional, and the GitHub PR interface shows changes to the raw JSON, which I don't find especially enlightening (though maybe I'm missing something?).

Some of the sticking points:

  • processEimage.py currently relies on Python 2, but the Stack versions on JupyterLab are all Python 3. I've created my own version of processEimage.py that updates the Python version, and it appears to work, but it can't be distributed with the notebook. In the notebook I instead interface with the ProcessEimageTask API directly, which is clunky (see the sketch after this list).
  • Use of the old-style astrometry_net_data reference catalogs, which are not installed in current versions of the Stack (UNRESOLVED).
  • Larger memory use than a tiny JupyterLab instance can support (this can be solved by choosing medium).
  • My general opinion is that running something that takes several minutes to complete doesn't seem optimal for the notebook format.
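
For reference, a minimal sketch of the in-notebook task invocation mentioned in the first bullet, assuming ProcessEimageTask is importable from the obs_lsstSim package; the module path, repo paths, and data ID are assumptions, not values taken from the notebook:

```python
# Hypothetical sketch: drive the Gen2 command-line task from Python instead of
# the Python 2-only processEimage.py script. Module path and data ID are assumed.
from lsst.obs.lsstSim.processEimage import ProcessEimageTask  # assumed location

# parseAndRun accepts the same arguments the command-line script would:
# an input repo, an output repo, and a data ID selection.
ProcessEimageTask.parseAndRun(args=[
    "input_data",                                  # placeholder input repo
    "--output", "output_data",                     # placeholder output repo
    "--id", "visit=230", "raft=2,2", "sensor=1,1", # example Twinkles-like data ID (assumed)
])
```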

@drphilmarshall (Contributor)

Thanks, Alex.

  • processEimage.py relying on Python 2 sounds like a bad sign: it could be falling behind the Stack, perhaps through disuse. Is processCcd.py itself Python 3-compatible? If so, shelving the processEimage.py part of the tutorial and using a different (and raw) test image might be the way to go, I guess. Would that solve the reference catalog problem as well?

  • The memory issue looks serious. Suppose we insist on being able to run processCcd.py from a tutorial notebook, @SimonKrughoff: what would the implications be for the NCSA accounts? Is medium an allowable option for Stack Club members?

A couple of minutes of runtime is awkwardly long for a live run-through in front of an audience, but for a notebook that you work through at home I think it's fine (as long as the text tells you that you'll need to wait!). The other thing we could do is have the results of a pre-baked run on hand, which would be helpful for comparing results and would also let the user skip the time-consuming part and get straight to the later section where the results of the task are displayed and explained.
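
A minimal sketch of what that "pre-baked" path might look like in the notebook, assuming a previously populated Gen2 output repository; the repo path and data ID below are placeholders, not actual Stack Club data:

```python
# Hypothetical: read a previously processed exposure instead of re-running the task.
from lsst.daf.persistence import Butler

butler = Butler("pre_baked_output")            # assumed path to a pre-run output repo
dataId = {"visit": 903334, "ccd": 16}          # example HSC-style data ID (assumed)
calexp = butler.get("calexp", dataId=dataId)   # calibrated exposure produced by processCcd
src = butler.get("src", dataId=dataId)         # source catalog from the same run
print(len(src), "sources measured on visit", dataId["visit"])
```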

Regarding PR threads for notebooks: I agree it's not ideal that we can only make line comments on the raw notebook code, but the PR thread does at least provide a "view" button where you can see the rendered notebook (so it saves you pasting links to the file in the branch). And then of course it enables the usual code review conversation in place, as well as tracking all the contributions made. I'd say that notebook is ready for a PR, no problem; you should just flag it as not yet ready to merge in the initial comment. (Personally I like the workflow where you make the PR as soon as the dev branch is made in the base repo, so that the branch has an explanation for its existence.)

@kadrlica (Contributor, Author)

As far as I know, the Twinkles and HSC datasets are the best curated. My personal thinking was that it was better to use simulated LSST images than real HSC data. If we use the HSC dataset, we should be able to follow the v15.0 tutorial line for line, but I'm not sure we want to do that.

The Python 3 compatibility fix is a one-line change to the shebang. I've already submitted an issue on this here. We also might be able to get away with something smaller than medium, but I haven't checked.
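
For concreteness, the kind of one-line change being described, assuming the script's shebang currently resolves to a Python 2 interpreter:

```python
#!/usr/bin/env python3
# The rest of processEimage.py would be unchanged; previously the first line read
# "#!/usr/bin/env python", which picks up Python 2 on the JupyterLab images.
```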

@SimonKrughoff (Contributor)

  1. I think running on medium should be fine. The cluster is fairly large, so I don't think we'll run out of CPUs. I think we will learn a lot about how k8s packs jobs onto nodes and how to configure things correctly.
  2. Diffing notebooks is a known issue. There is a package called nbdime that looks promising, but it is not integrated into the GitHub UI (see the sketch after this list).
  3. The Python 2 problem should just be fixed. I'll try to get to that tomorrow and then push it to the fork.
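
A minimal sketch of how nbdime could be used locally until something like it lands in the GitHub UI; the notebook filenames are placeholders, and the Python API shown is an assumption to check against the nbdime docs (the nbdiff command-line tool does the same job):

```python
# Hypothetical: content-aware diff of two versions of a notebook with nbdime.
import nbformat
import nbdime

old = nbformat.read("ProcessEimage_old.ipynb", as_version=4)  # placeholder filenames
new = nbformat.read("ProcessEimage_new.ipynb", as_version=4)

diff = nbdime.diff_notebooks(old, new)  # assumed API entry point
print(len(diff), "top-level changes between the two notebooks")
```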

@SimonKrughoff (Contributor)

I never got to item 3 above, but it is on the branch that Heather Kelly is about to merge.

@drphilmarshall added the project label on Jul 11, 2018
@kadrlica (Contributor, Author) commented Aug 10, 2018

ProcessEimage is now broken with the following error:

---------------------------------------------------------------------------
ParentsMismatch                           Traceback (most recent call last)
/opt/lsst/software/stack/stack/miniconda3-4.3.21-10a4fa6/Linux64/daf_persistence/16.0/python/lsst/daf/persistence/butler.py in _setAndVerifyParentsLists(self, repoDataList)
    925                     try:
--> 926                         repoData.cfg.extendParents(parents)
    927                     except ParentsMismatch as e:

/opt/lsst/software/stack/stack/miniconda3-4.3.21-10a4fa6/Linux64/daf_persistence/16.0/python/lsst/daf/persistence/repositoryCfg.py in extendParents(self, newParents)
    181                                   "existing parents list in this RepositoryCfg: {}").format(
--> 182                                   newParents, self._parents))
    183 

ParentsMismatch: The beginning of the passed-in parents list: [RepositoryCfg(root='../../../../../DMStackClub/ImageProcessing/DATA', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None), RepositoryCfg(root='../..', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None)] does not match the existing parents list in this RepositoryCfg: [RepositoryCfg(root='/home/kadrlica/notebooks/DMStackClub/ImageProcessing/DATA', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None)]
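
For what it's worth, one workaround sometimes used for this kind of Gen2 Butler ParentsMismatch is to construct the Butler with absolute paths (or to clear the stale repositoryCfg.yaml in the output repo) so the parents recorded on disk agree between runs. A hedged sketch, with placeholder paths rather than the actual notebook layout:

```python
# Hypothetical workaround sketch: use absolute repo paths so the parents recorded
# in the output repositoryCfg.yaml match on re-runs. Paths are placeholders.
import os
from lsst.daf.persistence import Butler

input_repo = os.path.abspath("DATA")                     # assumed input repository
output_repo = os.path.abspath("DATA/rerun/eimage_test")  # assumed output repository

butler = Butler(inputs=input_repo, outputs=output_repo)
```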

@kadrlica (Contributor, Author)

I've decided to give up on ProcessEimage and move to ProcessCcd with the HSC data.
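
A minimal sketch of what the ProcessCcd-on-HSC path could look like from a notebook, following the same Gen2 pattern as above; the repository path, rerun name, and data ID are placeholders in the style of the pipelines tutorial, not confirmed values:

```python
# Hypothetical: run the Gen2 ProcessCcdTask on an HSC repo from inside a notebook.
from lsst.pipe.tasks.processCcd import ProcessCcdTask

ProcessCcdTask.parseAndRun(args=[
    "DATA",                            # placeholder path to the HSC Butler repo
    "--rerun", "processCcdOutputs",    # placeholder rerun name
    "--id", "visit=903334", "ccd=16",  # example HSC data ID (assumed)
])
```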

@kadrlica (Contributor, Author)

Simon suggests that some modifications of the configuration parameters could be useful (a sketch follows the list below):

  • Change the binning of the background model
  • Change the PSF model
  • Change the source detection threshold
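
A hedged sketch of the kinds of overrides those bullets point at, using the ProcessCcdTask config tree; the field names and values shown are assumptions to verify against the installed Stack version:

```python
# Hypothetical config overrides for the three suggestions above. Field names and
# values are illustrative only.
from lsst.pipe.tasks.processCcd import ProcessCcdTask

config = ProcessCcdTask.ConfigClass()

# 1. Binning of the background model: larger bins give a smoother background.
config.charImage.background.binSize = 256

# 2. PSF model: select which PSF determiner image characterization uses.
config.charImage.measurePsf.psfDeterminer.name = "pca"  # assumed registry choice

# 3. Source detection threshold (in sigma) for the calibration detection pass.
config.calibrate.detection.thresholdValue = 10.0
```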

He also suggests looking at Jim Bosch's Leon notebook:

@kadrlica (Contributor, Author)

He also suggests Jonathan's documentation on configuration:
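
In the same spirit, configuration overrides can also be passed through the command-line-style interface rather than set in Python; a sketch, with the repo, rerun name, data ID, and override values assumed for illustration:

```python
# Hypothetical: the same kind of overrides expressed as command-line-style config
# options handed to parseAndRun (equivalent to processCcd.py ... -c key=value).
from lsst.pipe.tasks.processCcd import ProcessCcdTask

ProcessCcdTask.parseAndRun(args=[
    "DATA", "--rerun", "configTest",     # placeholder repo and rerun
    "--id", "visit=903334", "ccd=16",    # example data ID (assumed)
    "-c", "charImage.background.binSize=256",
          "calibrate.detection.thresholdValue=10.0",
])
```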

@drphilmarshall added the Image Processing label on Sep 27, 2018