Using IPython notebooks under version control

Posted 2019-01-02 21:13

What is a good strategy for keeping IPython notebooks under version control?

The notebook format is quite amenable to version control: if one wants to version control both the notebook and its outputs, this works quite well. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:

  • allows me to choose between including or excluding output,
  • prevents me from accidentally committing output if I do not want it,
  • allows me to keep output in my local version,
  • allows me to see when I have changes in the inputs using my version control system (i.e. if I only version control the inputs but my local file has outputs, I would like to be able to see whether the inputs have changed and therefore require a commit; a plain version control status command will always register a difference because the local file has outputs),
  • allows me to update my working notebook (which contains the output) from an updated clean notebook. (update)
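The "have the inputs changed?" requirement can be sketched as a comparison that ignores outputs entirely. This is only an illustration, not a feature of IPython or of any version control system: the helper names `input_fingerprint` and `inputs_changed` are hypothetical, and the code assumes the nbformat 4 JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`).

```python
import hashlib
import json


def input_fingerprint(notebook):
    """Hash only the inputs (cell type and source) of a parsed notebook dict."""
    inputs = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # A cell's source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        inputs.append([cell.get("cell_type"), source])
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def inputs_changed(working, committed):
    """True if the two notebooks differ in their inputs, ignoring all outputs."""
    return input_fingerprint(working) != input_fingerprint(committed)
```

Both arguments are dictionaries as returned by `json.load` on an .ipynb file, so the working copy (with outputs) can be compared against the committed clean copy.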

As mentioned, if I choose to include the outputs (which is desirable when using nbviewer, for example), then everything is fine. The problem is when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but frequently I encounter the following issues:

  1. I accidentally commit a version with the output, thereby polluting my repository.
  2. I clear output to use version control, but would really rather keep the output in my local copy (sometimes it takes a while to reproduce, for example).
  3. Some of the scripts that strip output change the format slightly compared to the Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
  4. When pulling changes to a clean version of the file, I need to find some way of incorporating those changes in my working notebook without having to rerun everything. (update)

I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.

This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.

Update: I have been playing with my modified notebook version which optionally saves a .clean version with every save using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:

  1. This is not yet a standard solution (it requires a modification of the ipython source). Is there a way of achieving this behaviour with a simple extension? It needs some sort of on-save hook.
  2. A problem I have with the current workflow is pulling changes. These will come in to the .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.
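One possible way to integrate a pulled .clean file into the working version, without re-executing everything, is to copy outputs across for code cells whose source did not change. The sketch below is an assumption-laden illustration rather than an existing IPython feature: `carry_over_outputs` is a hypothetical helper, and the nbformat 4 layout (`cells`, `cell_type`, `source`, `outputs`, `execution_count`) is assumed.

```python
import copy


def _source_text(cell):
    """Return a cell's source as a single string (it may be stored as a list of lines)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def carry_over_outputs(clean, working):
    """Return a copy of `clean` whose unchanged code cells reuse outputs from `working`."""
    # Index the working copy's code cells by source text; the first match wins.
    cached = {}
    for cell in working.get("cells", []):
        if cell.get("cell_type") == "code":
            cached.setdefault(_source_text(cell), cell)

    merged = copy.deepcopy(clean)
    for cell in merged.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        old = cached.get(_source_text(cell))
        if old is not None:
            # Same input as before: keep the locally computed output.
            cell["outputs"] = copy.deepcopy(old.get("outputs", []))
            cell["execution_count"] = old.get("execution_count")
    return merged
```

Matching on source text alone is a deliberate simplification: an unchanged cell that depends on a changed one will keep a stale output, so cells downstream of an edit still need to be re-run by hand.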

Notes

Removing (stripping) Output

  • When the notebook is running, one can use the Cell/All Output/Clear menu option for removing the output.
  • There are some scripts for removing output, such as nbstripout.py, which removes the output but does not produce the same result as using the notebook interface. This was eventually included in the ipython/nbconvert repo, but that was closed with a note that the changes are now included in ipython/ipython; the corresponding functionality seems not to have been included yet. (update) That being said, Gregory Crosswhite's solution shows that this is pretty easy to do, even without invoking ipython/nbconvert, so this approach is probably workable if it can be properly hooked in. (Attaching it to each version control system, however, does not seem like a good idea; this should somehow hook in to the notebook mechanism.)
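As a rough illustration of how little is involved in stripping, the sketch below clears outputs from an already-parsed notebook. It is not the nbstripout.py script or Gregory Crosswhite's code; `strip_outputs` is a hypothetical name, and the field names assume nbformat 4 (older formats nest cells under worksheets and use different keys).

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a parsed notebook with all code-cell outputs removed."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop plots, text, and other results
            cell["execution_count"] = None  # reset the In[n] prompt number
    return stripped
```

Returning a copy leaves the in-memory working notebook untouched, which matches the goal of keeping output locally while committing only inputs.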

17 answers
拼命十四郎i
#2 · 2019-01-02 22:09

Here is a new solution from Cyrille Rossant for IPython 3.0, which persists to markdown files rather than JSON-based ipynb files:

https://github.com/rossant/ipymd

哑剧真动
#3 · 2019-01-02 22:12

This jupyter extension enables users to push jupyter notebooks directly to github.

Please look here

https://github.com/sat28/githubcommit

哑剧真动
#4 · 2019-01-02 22:15

Ok, so it looks like the current best solution, as per a discussion here, is to make a git filter to automatically strip output from ipynb files on commit.

Here's what I did to get it working (copied from that discussion):

I modified cfriedline's nbstripout file slightly to give an informative error when you can't import the latest IPython: https://github.com/petered/plato/blob/fb2f4e252f50c79768920d0e47b870a8d799e92b/notebooks/config/strip_notebook_output I then added it to my repo, let's say in ./relative/path/to/strip_notebook_output

I also added a .gitattributes file to the root of the repo, containing:

*.ipynb filter=stripoutput

And created a setup_git_filters.sh containing

git config filter.stripoutput.clean "$(git rev-parse --show-toplevel)/relative/path/to/strip_notebook_output" 
git config filter.stripoutput.smudge cat
git config filter.stripoutput.required true

And ran source setup_git_filters.sh. The fancy $(git rev-parse...) thing is to find the local path of your repo on any (Unix) machine.
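For reference, a git clean filter is simply a program that receives the file on standard input and writes the version to be committed on standard output. The following is a minimal sketch of that contract, not the strip_notebook_output script linked above: `clean_filter` is a hypothetical name, nbformat 4 field names are assumed, and the `json.dump` settings are an assumption meant to approximate how notebooks are normally serialized so that diffs stay quiet.

```python
import io
import json


def clean_filter(in_stream, out_stream):
    """Read a notebook from in_stream and write it without outputs to out_stream."""
    notebook = json.load(in_stream)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Assumed serialization settings; adjust if they produce noisy diffs.
    json.dump(notebook, out_stream, indent=1, sort_keys=True, ensure_ascii=False)
    out_stream.write("\n")


# Small in-memory demonstration; a real filter would pass sys.stdin and sys.stdout.
demo_in = io.StringIO(json.dumps({
    "cells": [{"cell_type": "code", "source": "1 + 1",
               "outputs": [{"output_type": "execute_result"}],
               "execution_count": 7}],
    "nbformat": 4,
}))
demo_out = io.StringIO()
clean_filter(demo_in, demo_out)
```

Because the smudge filter is `cat`, the working copy is left alone on checkout; only the staged content passes through the clean step.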

见你爱笑
#5 · 2019-01-02 22:16

We have a collaborative project where the product is Jupyter Notebooks, and we've used an approach for the last six months that is working great: we activate saving the .py files automatically and track both the .ipynb files and the .py files.

That way, if someone wants to view or download the latest notebook, they can do that via github or nbviewer, and if someone wants to see how the notebook code has changed, they can just look at the changes to the .py files.

For Jupyter notebook servers, this can be accomplished by adding the lines

import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save

to the jupyter_notebook_config.py file and restarting the notebook server.

If you aren't sure in which directory to find your jupyter_notebook_config.py file, you can type jupyter --config-dir, and if you don't find the file there, you can create it by typing jupyter notebook --generate-config.

For IPython 3 notebook servers, this can be accomplished by adding the lines

import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['ipython', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save

to the ipython_notebook_config.py file and restarting the notebook server. These lines are from a github issues answer @minrk provided and @dror includes them in his SO answer as well.

For IPython 2 notebook servers, this can be accomplished by starting the server using:

ipython notebook --script

or by adding the line

c.FileNotebookManager.save_script = True

to the ipython_notebook_config.py file and restarting the notebook server.

If you aren't sure in which directory to find your ipython_notebook_config.py file, you can type ipython locate profile default, and if you don't find the file there, you can create it by typing ipython profile create.

Here's our project on github that is using this approach: and here's a github example of exploring recent changes to a notebook.

We've been very happy with this.

家丑人穷心不美
#6 · 2019-01-02 22:17

I just came across "jupytext", which looks like a perfect solution. It generates a .py file from the notebook and then keeps both in sync. You can version control, diff and merge inputs via the .py file without losing the outputs. When you open the notebook, it uses the .py for input cells and the .ipynb for output. And if you want to include the output in git, you can just add the .ipynb.

https://github.com/mwouts/jupytext
