What is a good strategy for keeping IPython notebooks under version control?
The notebook format is quite amenable to version control: if one wants to version control both the notebook and its outputs, this works quite well. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:
- allows me to choose between including or excluding output,
- prevents me from accidentally committing output if I do not want it,
- allows me to keep output in my local version,
- allows me to see when I have changes in the inputs using my version control system (i.e. if I version control only the inputs but my local file has outputs, I would like to be able to see whether the inputs have changed and therefore need a commit; the plain version control status command will always register a difference, since the local file has outputs),
- allows me to update my working notebook (which contains the output) from an updated clean notebook. (update)
As mentioned, if I choose to include the outputs (which is desirable when using nbviewer, for example), then everything is fine. The problem is when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but frequently I encounter the following issues:
- I accidentally commit a version with the output, thereby polluting my repository.
- I clear output to use version control, but would really rather keep the output in my local copy (sometimes it takes a while to reproduce for example).
- Some of the scripts that strip output change the format slightly compared to the Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
- When pulling changes to a clean version of the file, I need to find some way of incorporating those changes in my working notebook without having to rerun everything. (update)
I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.
This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.
Update: I have been playing with my modified notebook version, which optionally saves a .clean version with every save, using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:
- This is not yet a standard solution (it requires a modification of the ipython source). Is there a way of achieving this behaviour with a simple extension? It needs some sort of on-save hook.
- A problem I have with the current workflow is pulling changes. These will come in to the .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.
Notes
Removing (stripping) Output
- When the notebook is running, one can use the Cell/All Output/Clear menu option for removing the output.
- There are some scripts for removing output, such as the script nbstripout.py, which removes the output but does not produce the same result as using the notebook interface. This was eventually included in the ipython/nbconvert repo, but that has been closed with a statement that the changes are now included in ipython/ipython, although the corresponding functionality seems not to have been included yet. (update) That being said, Gregory Crosswhite's solution shows that this is pretty easy to do, even without invoking ipython/nbconvert, so this approach is probably workable if it can be properly hooked in. (Attaching it to each version control system, however, does not seem like a good idea; this should somehow hook in to the notebook mechanism.)
Newsgroups
Issues
- 977: Notebook feature requests (Open).
- 1280: Clear-all on save option (Open). (Follows from this discussion.)
- 3295: autoexported notebooks: only export explicitly marked cells (Closed). Resolved by extension 11 Add writeandexecute magic (Merged).
Pull Requests
- 1621: clear In[] prompt numbers on "Clear All Output" (Merged). (See also 2519 (Merged).)
- 1563: clear_output improvements (Merged).
- 3065: diff-ability of notebooks (Closed).
- 3291: Add the option to skip output cells when saving. (Closed). This seems extremely relevant; however, it was closed with the suggestion to use a "clean/smudge" filter. A relevant question, "What can you use if you want to strip off output before running git diff?", seems not to have been answered.
- 3312: WIP: Notebook save hooks (Closed).
- 3747: ipynb -> ipynb transformer (Closed). This is rebased in 4175.
- 4175: nbconvert: Jinjaless exporter base (Merged).
- 142: Use STDIN in nbstripout if no input is given (Open).
Here is my solution with git. It allows you to just add and commit (and diff) as usual: those operations will not alter your working tree, and at the same time (re)running a notebook will not alter your git history.
Although this can probably be adapted to other VCSs, I know it doesn't satisfy your requirements (at least the VCS agnosticism). Still, it is perfect for me, and although it's nothing particularly brilliant, and many people probably already use it, I couldn't find clear instructions on how to implement it by googling around. So it may be useful to other people.
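- Create the filter script and save it as ~/bin/ipynb_output_filter.py.
- Make it executable (chmod +x ~/bin/ipynb_output_filter.py).
- Create the file ~/.gitattributes, with content that assigns a filter to notebook files (*.ipynb).
- Run the git config commands that register ~/.gitattributes as the global attributes file and ~/bin/ipynb_output_filter.py as the clean command of that filter.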
Done!
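A clean filter for this setup reads the notebook on standard input and writes the stripped version to standard output. The following is only a minimal sketch of such a script, not necessarily the one used in this answer: the names strip_output, filter_stream and main are illustrative, and plain json is used as a stand-in for IPython's own notebook reader.

```python
import json
import sys


def strip_output(nb):
    """Remove outputs and execution counters from a notebook dict, in place."""
    # nbformat 4 keeps cells at the top level; nbformat 3 nests them in worksheets.
    cells = list(nb.get("cells", []))
    for sheet in nb.get("worksheets", []):
        cells.extend(sheet.get("cells", []))
    for cell in cells:
        if cell.get("cell_type") != "code":
            continue
        cell["outputs"] = []
        cell.pop("prompt_number", None)      # nbformat 3 counter
        if "execution_count" in cell:        # nbformat 4 counter
            cell["execution_count"] = None
    return nb


def filter_stream(src, dst):
    """Copy a notebook from src to dst, stripping outputs on the way."""
    text = src.read()
    try:
        nb = json.loads(text)
    except ValueError:
        nb = None
    if not isinstance(nb, dict):
        dst.write(text)  # not a notebook: pass the content through unchanged
        return
    strip_output(nb)
    # Fixed indent and key order keep the diffs of the filtered file stable.
    json.dump(nb, dst, indent=1, sort_keys=True, ensure_ascii=False)
    dst.write("\n")


def main():
    """Entry point for use as a git clean filter (stdin to stdout)."""
    filter_stream(sys.stdin, sys.stdout)
```

To install it as the filter, the file would end with a call to main(). The git side is then presumably along the lines of git config --global core.attributesfile ~/.gitattributes and git config --global filter.NAME.clean ~/bin/ipynb_output_filter.py, where NAME is whatever filter name ~/.gitattributes assigns to *.ipynb files.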
Limitations:
- If you are on branch somebranch and you do git checkout otherbranch; git checkout somebranch, you usually expect the working tree to be unchanged. Here instead you will have lost the output and cell numbering of notebooks whose source differs between the two branches.
- [...] git commit notebook_file.ipynb, although it would at least keep git diff notebook_file.ipynb free from base64 garbage).
My solution reflects the fact that I personally don't like to keep generated stuff versioned. Notice that doing merges involving the output is almost guaranteed to invalidate the output or your productivity or both.
EDIT:
If you do adopt the solution as I suggested it (that is, globally), you will have trouble if, for some git repo, you want to version output. So if you want to disable the output filtering for a specific git repository, simply create inside it a file .git/info/attributes, with **.ipynb filter= as content. Clearly, in the same way it is possible to do the opposite: enable the filtering only for a specific repository.
The code is now maintained in its own git repo.
If the instructions above result in ImportErrors, try adding "ipython" before the path of the script in the filter command.
EDIT: May 2016 (updated February 2017): there are several alternatives to my script - for completeness, here is a list of those I know: nbstripout (other variants), nbstrip, jq.
I have created nbstripout, based on MinRK's gist, which supports both Git and Mercurial (thanks to mforbes). It is intended to be used either standalone on the command line or as a filter, which is easily (un)installed in the current repository via nbstripout install / nbstripout uninstall.
Get it from PyPI, e.g. with pip install nbstripout.
Unfortunately, I do not know much about Mercurial, but I can give you a possible solution that works with Git, in the hopes that you might be able to translate my Git commands into their Mercurial equivalents.
For background, in Git the add command stores the changes that have been made to a file into a staging area. Once you have done this, any subsequent changes to the file are ignored by Git unless you tell it to stage them as well. Hence the idea is a script that, for each of the given files, strips out all of the outputs and prompt_number sections, stages the stripped file, and then restores the original.
NOTE: If running the script gets you an error message like ImportError: No module named IPython.nbformat, then use ipython to run the script instead of python.
Once the script has been run on the files whose changes you wanted to commit, just run git commit.
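The strip, stage, restore sequence described above can be sketched roughly as follows. This is an illustration under assumptions rather than the original script: it parses the notebook with plain json instead of IPython.nbformat, and the names strip_text and stage_stripped, as well as the run parameter, are made up for the example.

```python
import json
import subprocess
from pathlib import Path


def strip_text(text):
    """Return notebook JSON text with outputs and prompt numbers removed."""
    nb = json.loads(text)
    cells = list(nb.get("cells", []))        # nbformat 4 layout
    for sheet in nb.get("worksheets", []):   # nbformat 3 layout
        cells.extend(sheet.get("cells", []))
    for cell in cells:
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell.pop("prompt_number", None)
            if "execution_count" in cell:
                cell["execution_count"] = None
    return json.dumps(nb, indent=1, sort_keys=True) + "\n"


def stage_stripped(path, run=subprocess.check_call):
    """Stage a stripped copy of the notebook, then restore the original file."""
    path = Path(path)
    original = path.read_bytes()
    try:
        # Temporarily overwrite the file with the stripped version...
        path.write_text(strip_text(original.decode("utf-8")), encoding="utf-8")
        # ...and stage exactly that content.
        run(["git", "add", str(path)])
    finally:
        # Put the outputs back even if stripping or staging failed.
        path.write_bytes(original)
```

Calling stage_stripped("notebook_file.ipynb") inside a repository and then running git commit would commit the stripped content while the working copy keeps its outputs. The run parameter exists only so the git call can be replaced when trying the function outside a repository.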
As has been pointed out, the --script option is deprecated in 3.x. The same approach can be used by applying a post-save hook, which is added to ipython_notebook_config.py. The code is taken from #8009.
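A post-save hook of this kind might look like the sketch below. It follows the documented FileContentsManager.post_save_hook signature and is not guaranteed to match the code from #8009 line for line; the ipython nbconvert command is an assumption tied to the IPython 3.x era.

```python
import os
from subprocess import check_call


def post_save(model, os_path, contents_manager):
    """Post-save hook: export every saved notebook to a script next to it."""
    if model["type"] != "notebook":
        return  # only act on notebooks, not on plain files or directories
    directory, filename = os.path.split(os_path)
    # Assumption: the `ipython nbconvert` subcommand of IPython 3.x;
    # newer installations expose the same exporter as `jupyter nbconvert`.
    check_call(
        ["ipython", "nbconvert", "--to", "script", filename],
        cwd=directory or None,
    )


# `c` is the config object that exists only while the notebook server loads
# ipython_notebook_config.py, so the registration is skipped elsewhere.
try:
    c.FileContentsManager.post_save_hook = post_save
except NameError:
    pass
```

With this in place, each save of a notebook would also write a script with the same base name, which can be committed and diffed instead of the .ipynb file.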
I finally found a productive and simple way to make Jupyter and Git play nicely together. I'm still in the first steps, but I already think it is a lot better than all other convoluted solutions.
Visual Studio Code is a cool and open source code editor from Microsoft. It has an excellent Python extension that now allows you to import a Jupyter Notebook as python code.
After you import your notebook to a python file, all the code and markdown will be together in an ordinary python file, with special markers in comments.
Your python file just has the contents of the notebook input cells. The output will be generated in a split window. You have pure code in the notebook, and it doesn't change when you just execute it. No output mingled with your code. No strange, incomprehensible JSON format to analyze in your diffs.
Just pure python code where you can easily identify every single diff.
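For illustration, a file in this form looks roughly like the following. The # %% and # %% [markdown] comment markers are what the VS Code Python extension uses to delimit cells; the cell contents here are made up for the example.

```python
# %% [markdown]
# # Demo analysis
# Markdown cells are stored as ordinary comments.

# %%
values = [1, 2, 3, 4]

# %%
total = sum(values)
mean = total / len(values)
print(mean)
```

Since the file is plain Python, it also runs unchanged as a script, and a change to any cell shows up as an ordinary line diff.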
I don't even need to version my .ipynb files anymore. I can put a *.ipynb line in .gitignore.
Need to generate a notebook to publish or share with someone? No problem, just click the export button in the interactive python window.
I've been using it just for a day, but finally I can happily use Jupyter with Git.
P.S.: VSCode code completion is a lot better than Jupyter.
How about the idea discussed in the post below, where the output of the notebook is kept, with the argument that it might take a long time to generate and that it is handy since GitHub can now render notebooks? Auto-save hooks are added to export a .py file, used for diffs, and an .html file for sharing with team members who do not use notebooks or git.
https://towardsdatascience.com/version-control-for-jupyter-notebook-3e6cef13392d