Using IPython notebooks under version control

Posted 2019-01-02 21:13

What is a good strategy for keeping IPython notebooks under version control?

The notebook format is quite amenable to version control: if one wants to version control the notebook together with its outputs, this works quite well. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:

  • allows me to choose between including or excluding output,
  • prevents me from accidentally committing output if I do not want it,
  • allows me to keep output in my local version,
  • allows me to see when I have changes in the inputs using my version control system (i.e. if I only version control the inputs but my local file has outputs, I would like to be able to see whether the inputs have changed and therefore require a commit; a plain version control status command will always register a difference, since the local file has outputs),
  • allows me to update my working notebook (which contains the output) from an updated clean notebook. (update)
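The "have the inputs changed?" requirement can be sketched as a comparison that ignores everything except cell types and sources. This is a minimal illustration, assuming the nbformat v4 JSON layout (a top-level "cells" list); the function names input_view and inputs_changed are hypothetical and not part of IPython or any VCS.

```python
import json


def input_view(nb):
    """Reduce a notebook dict to the parts that count as input."""
    view = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # The notebook JSON may store source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        view.append((cell.get("cell_type"), source))
    return view


def inputs_changed(nb_a, nb_b):
    """True if the two notebooks differ in cell types or cell sources."""
    return input_view(nb_a) != input_view(nb_b)


def load_notebook(path):
    """Load a notebook file as a plain dict (no nbformat dependency)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Comparing the local working file against the committed clean file with inputs_changed(load_notebook(a), load_notebook(b)) then reports a difference only when a commit is actually needed.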

As mentioned, if I choose to include the outputs (which is desirable when using nbviewer for example), then everything is fine. The problem is when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but frequently I encounter the following issues:

  1. I accidentally commit a version with the output, thereby polluting my repository.
  2. I clear output to use version control, but would really rather keep the output in my local copy (sometimes it takes a while to reproduce for example).
  3. Some of the scripts that strip output change the format slightly compared to the Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
  4. When pulling changes to a clean version of the file, I need to find some way of incorporating those changes in my working notebook without having to rerun everything. (update)
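Issue 1 (accidental commits with output) could be guarded against with a check that a VCS hook runs before each commit. The sketch below only detects output; it assumes nbformat v4 notebooks, and the names cells_with_output and check_files are hypothetical. How it gets wired into git or mercurial is left open.

```python
import json
import sys


def cells_with_output(nb):
    """Return indices of code cells that still carry outputs or execution counts."""
    dirty = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs") or cell.get("execution_count") is not None:
            dirty.append(index)
    return dirty


def check_files(paths):
    """Return 1 if any notebook in paths contains output, otherwise 0."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            nb = json.load(fh)
        dirty = cells_with_output(nb)
        if dirty:
            print("%s: output present in cells %s" % (path, dirty), file=sys.stderr)
            status = 1
    return status
```

A hook script could call sys.exit(check_files(sys.argv[1:])) so that a non-zero status aborts the commit.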

I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.

This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.

Update: I have been playing with my modified notebook version which optionally saves a .clean version with every save using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:

  1. This is not yet a standard solution (it requires a modification of the IPython source). Is there a way of achieving this behaviour with a simple extension? It needs some sort of on-save hook.
  2. A problem I have with the current workflow is pulling changes. These will come in to the .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.
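One possible way to integrate pulled changes without re-executing everything is to copy outputs from the working notebook into the updated clean notebook wherever a code cell's source is unchanged. The following is only a sketch under that matching assumption (nbformat v4 layout, hypothetical name restore_outputs); it is not something IPython provides.

```python
import copy


def _source_text(cell):
    """Return a cell's source as one string (it may be stored as a list of lines)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def restore_outputs(clean_nb, working_nb):
    """Return a copy of clean_nb with outputs taken from identical working_nb cells."""
    # Index the working notebook's code cells by source text. Cells with the
    # same source are kept in order and consumed one at a time.
    cache = {}
    for cell in working_nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cache.setdefault(_source_text(cell), []).append(cell)

    merged = copy.deepcopy(clean_nb)
    for cell in merged.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        matches = cache.get(_source_text(cell))
        if matches:
            old = matches.pop(0)
            cell["outputs"] = copy.deepcopy(old.get("outputs", []))
            cell["execution_count"] = old.get("execution_count")
    return merged
```

Cells whose source changed keep whatever the clean notebook had (normally no output) and need to be re-run. Restored outputs can also be stale if an earlier cell they depend on was modified, so this trades correctness guarantees for convenience.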

Notes

Removing (stripping) Output

  • When the notebook is running, one can use the Cell/All Output/Clear menu option for removing the output.
  • There are some scripts for removing output, such as nbstripout.py, which removes the output but does not produce the same result as the notebook interface. This was eventually included in the ipython/nbconvert repo, but that has been closed with a note that the changes are now included in ipython/ipython, although the corresponding functionality seems not to have been included yet. (update) That being said, Gregory Crosswhite's solution shows that this is pretty easy to do, even without invoking ipython/nbconvert, so this approach is probably workable if it can be properly hooked in. (Attaching it to each version control system, however, does not seem like a good idea; this should somehow hook in to the notebook mechanism.)
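To illustrate how little is involved in stripping, here is a sketch that works on the raw JSON with only the standard library. It assumes nbformat v4 (a top-level "cells" list); the names strip_output and strip_text are hypothetical. The JSON formatting options are chosen to resemble what nbformat writes, but that is an assumption, and byte-identical results with the Cell/All Output/Clear menu option are not guaranteed.

```python
import json


def strip_output(nb):
    """Remove outputs and execution counts from all code cells, in place."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def strip_text(raw_json):
    """Take notebook JSON text and return the stripped notebook as JSON text."""
    nb = strip_output(json.loads(raw_json))
    # Stable key order and indentation keep diffs small between runs.
    return json.dumps(nb, indent=1, ensure_ascii=False, sort_keys=True) + "\n"
```

Used as a filter, this would amount to sys.stdout.write(strip_text(sys.stdin.read())), which is the shape that both a VCS filter and a notebook save hook could share.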


17 answers
我又没胸盯我看什么
Reply #2 · 2019-01-02 21:53

(2017-02)

strategies

  • on_commit():
    • strip the output > name.ipynb (nbstripout)
    • strip the output > name.clean.ipynb (nbstripout)
    • always nbconvert to python: name.ipynb.py (nbconvert)
    • always convert to markdown: name.ipynb.md (nbconvert, ipymd)
  • vcs.configure():
    • git difftool, mergetool: nbdiff and nbmerge from nbdime

妹纸别胸我
Reply #3 · 2019-01-02 21:53

To follow up on the excellent script by Pietro Battiston, if you get a Unicode parsing error like this:

Traceback (most recent call last):
  File "/Users/kwisatz/bin/ipynb_output_filter.py", line 33, in <module>
    write(json_in, sys.stdout, NO_CONVERT)
  File "/Users/kwisatz/anaconda/lib/python2.7/site-packages/IPython/nbformat/__init__.py", line 161, in write
    fp.write(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 11549: ordinal not in range(128)

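You can add the following at the beginning of the script (this works on Python 2 only):

import sys
reload(sys)  # restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')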
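On Python 3 (an assumption, since the traceback above comes from Python 2.7) the same kind of error can be avoided without touching the interpreter's default encoding, by encoding the JSON explicitly and writing bytes. A minimal sketch, with hypothetical helper names encode_notebook and write_notebook:

```python
import json


def encode_notebook(nb):
    """Serialize a notebook dict to UTF-8 bytes, keeping non-ASCII characters as-is."""
    text = json.dumps(nb, ensure_ascii=False, indent=1)
    return text.encode("utf-8")


def write_notebook(nb, binary_stream):
    """Write the encoded notebook to a binary stream such as sys.stdout.buffer."""
    binary_stream.write(encode_notebook(nb))
```

Calling write_notebook(nb, sys.stdout.buffer) bypasses the text layer of stdout, so the terminal or pipe encoding no longer matters.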
àī ωǒ Ьīé zǒ
Reply #4 · 2019-01-02 21:53

After digging around, I finally found this relatively simple pre-save hook in the Jupyter docs. It strips the cell output data. You have to paste it into the jupyter_notebook_config.py file (see below for instructions).

def scrub_output_pre_save(model, **kwargs):
    """scrub output before saving notebooks"""
    # only run on notebooks
    if model['type'] != 'notebook':
        return
    # only run on nbformat v4
    if model['content']['nbformat'] != 4:
        return

    for cell in model['content']['cells']:
        if cell['cell_type'] != 'code':
            continue
        cell['outputs'] = []
        cell['execution_count'] = None
        # Added by binaryfunt:
        if 'collapsed' in cell['metadata']:
            cell['metadata'].pop('collapsed', 0)

c.FileContentsManager.pre_save_hook = scrub_output_pre_save

From Rich Signell's answer:

If you aren't sure in which directory to find your jupyter_notebook_config.py file, you can type jupyter --config-dir [into command prompt/terminal], and if you don't find the file there, you can create it by typing jupyter notebook --generate-config.
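If you would rather keep the output in your working notebook, a variation is to leave the saved file alone and write a stripped sibling file after each save, using the post-save hook of the same contents manager. This is a sketch, not something taken from the Jupyter docs: the function name save_clean_copy and the name.clean.ipynb naming are assumptions, and it presumes nbformat v4 notebooks. The post-save model carries no content, so the file is re-read from disk.

```python
import json
import os


def save_clean_copy(model, os_path, contents_manager=None, **kwargs):
    """Post-save hook sketch: write an output-free copy next to each saved notebook."""
    if model.get("type") != "notebook":
        return None
    base, ext = os.path.splitext(os_path)
    if base.endswith(".clean"):
        return None  # do not create name.clean.clean.ipynb
    with open(os_path, encoding="utf-8") as fh:
        nb = json.load(fh)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    clean_path = base + ".clean" + ext
    with open(clean_path, "w", encoding="utf-8") as fh:
        json.dump(nb, fh, indent=1, ensure_ascii=False, sort_keys=True)
        fh.write("\n")
    # Jupyter ignores the return value; it is returned here for convenience.
    return clean_path


# In jupyter_notebook_config.py, where Jupyter defines `c`:
# c.FileContentsManager.post_save_hook = save_clean_copy
```

With this in place, only the .clean.ipynb files would be added to version control, and the working notebooks with output would be ignored.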

给我背影
Reply #5 · 2019-01-02 21:55

I did what Albert & Rich did: don't version .ipynb files (they can contain images, which gets messy). Instead, either always run ipython notebook --script or put c.FileNotebookManager.save_script = True in your config file, so that a (versionable) .py file is always created when you save your notebook.

To regenerate notebooks (after checking out a repo or switching a branch) I put the script py_file_to_notebooks.py in the directory where I store my notebooks.

Now, after checking out a repo, just run python py_file_to_notebooks.py to generate the ipynb files. After switching branch, you may have to run python py_file_to_notebooks.py -ov to overwrite the existing ipynb files.

Just to be on the safe side, it's good to also add *.ipynb to your .gitignore file.

Edit: I no longer do this because (A) you have to regenerate your notebooks from py files every time you checkout a branch and (B) there's other stuff like markdown in notebooks that you lose. I instead strip output from notebooks using a git filter. Discussion on how to do this is here.

见你爱笑
Reply #6 · 2019-01-02 21:55

I've built a Python package that solves this problem:

https://github.com/brookisme/gitnb

It provides a CLI with a git-inspired syntax to track/update/diff notebooks inside your git repo.

Here's an example:

# add a notebook to be tracked
gitnb add SomeNotebook.ipynb

# check the changes before committing
gitnb diff SomeNotebook.ipynb

# commit your changes (to your git repo)
gitnb commit -am "I fixed a bug"

Note that the last step, where I'm using "gitnb commit", commits to your git repo. It's essentially a wrapper for

# get the latest changes from your python notebooks
gitnb update

# commit your changes ** this time with the native git commit **
git commit -am "I fixed a bug"

There are several more methods, and it can be configured to require more or less user input at each stage, but that's the general idea.

胡撸娃i
Reply #7 · 2019-01-02 22:01

I use a very pragmatic approach, which works well for several notebooks at several sites. It even enables me to 'transfer' notebooks around. It works on Windows as well as on Unix/MacOS.
Although it is simple, it solves the problems above.

Concept

Basically, do not track the .ipynb-files, only the corresponding .py-files.
By starting the notebook-server with the --script option, that file is automatically created/saved when the notebook is saved.

Those .py-files contain all the input; non-code content is saved as comments, as are the cell borders. Those files can be read/imported (and dragged) into the notebook-server to (re)create a notebook. Only the output is gone, until it is re-run.
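The idea of a .py file that keeps cell borders and non-code text as comments can be illustrated with a small round trip. This is only a sketch: the marker strings are modelled loosely on the old IPython script export and are an assumption here, the function names are hypothetical, and the notebook layout is assumed to be nbformat v4.

```python
CODE_MARKER = "# <codecell>"
MARKDOWN_MARKER = "# <markdowncell>"


def _text(cell):
    """Return a cell's source as one string (it may be stored as a list of lines)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def notebook_to_script(nb):
    """Export inputs only: code as-is, other cells as comment lines."""
    chunks = []
    for cell in nb.get("cells", []):
        body = _text(cell)
        if cell.get("cell_type") == "code":
            chunks.append(CODE_MARKER + "\n" + body)
        else:
            lines = ["# " + ln if ln else "#" for ln in body.split("\n")]
            chunks.append(MARKDOWN_MARKER + "\n" + "\n".join(lines))
    return "\n\n".join(chunks) + "\n"


def script_to_notebook(script):
    """Rebuild an output-free notebook dict from an exported script."""
    blocks = []  # each entry is [cell_type, list_of_lines]
    for line in script.split("\n"):
        if line == CODE_MARKER:
            blocks.append(["code", []])
        elif line == MARKDOWN_MARKER:
            blocks.append(["markdown", []])
        elif blocks:
            blocks[-1][1].append(line)
        # Text before the first marker is ignored.

    cells = []
    for cell_type, lines in blocks:
        while lines and lines[-1] == "":
            lines.pop()  # drop the blank separator lines between cells
        if cell_type == "markdown":
            lines = [ln[2:] if ln.startswith("# ") else "" for ln in lines]
            cells.append({"cell_type": "markdown", "metadata": {},
                          "source": "\n".join(lines)})
        else:
            cells.append({"cell_type": "code", "metadata": {},
                          "source": "\n".join(lines),
                          "outputs": [], "execution_count": None})
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 2}
```

Known limits of this sketch: trailing blank lines inside a cell are lost, all non-code cells come back as markdown, and a code cell containing a line equal to one of the markers would be split.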

Personally, I use mercurial to version-track the .py files, and use the normal (command-line) commands to add, check in, etc. Most other (D)VCSs will allow this too.

It's simple to track the history now; the .py files are small, textual and simple to diff. Once in a while, we need a clone (just branch; start a 2nd notebook-server there), or an older version (check it out and import it into a notebook-server), etc.

Tips & tricks

  • Add *.ipynb to '.hgignore', so Mercurial knows it can ignore those files
  • Create a (bash) script to start the server (with the --script option) and do version-track it
  • Saving a notebook does save the .py-file, but does not check it in.
    • This is a drawback: one can forget to do that.
    • It's also a feature: it is possible to save a notebook (and continue later) without cluttering the repository history.

Wishes

  • It would be nice to have buttons for check-in/add/etc. in the notebook Dashboard.
  • A checkout to (for example) file@date+rev.py would be helpful. It would be too much work to add that now; maybe I will do so at some point. Until then, I just do it by hand.