File-conversion of HTML-published Jupyter Notebook

Published 2019-08-25 19:02

Question:

I have HTML-published renditions of Jupyter Notebooks that I need to convert in bulk back to executable Jupyter .ipynb files. I have found many discussions and approaches for going the other way, publishing from a Jupyter .ipynb file to an HTML file. The "File..." menu in every Jupyter Notebook web client includes a "Download As..." function with HTML as one of many options. However, there is no "Import into Jupyter" or "Import from HTML" function. Am I missing something in this scenario? This is not that uncommon a need.

Short of writing my own webscraper to scrape the HTML-published version of the Jupyter NB, and then programmatically creating the JSON NB structure of an IPython NB file format, is there an easier way to do this?
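For reference, the JSON structure in question is small. The following is a minimal sketch of it, assuming nbformat 4 and using only the fields that appear in the snippet further down; the helper names (code_cell, markdown_cell, make_notebook) are hypothetical and not part of any Jupyter API.

```python
import json


def code_cell(source):
    """Return a minimal nbformat-4 code cell with no outputs (hypothetical helper)."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": source,
    }


def markdown_cell(source):
    """Return a minimal nbformat-4 markdown cell (hypothetical helper)."""
    return {"cell_type": "markdown", "metadata": {}, "source": source}


def make_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {"nbformat": 4, "nbformat_minor": 1, "metadata": {}, "cells": cells}


notebook = make_notebook([
    markdown_cell("# Recovered notebook"),
    code_cell("print('hello')"),
])

# Serialized text that could be saved to a .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Saving notebook_json to a file ending in .ipynb should give something Jupyter can open, though any metadata (kernel, language) would have to be added separately.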

I've tried the following code from "IPython notebook: Convert an HTML notebook to ipynb" with decent results, but it only captures and converts code cells and markdown cells.

from bs4 import BeautifulSoup
import json
import urllib.request
url = 'http://nbviewer.jupyter.org/url/jakevdp.github.com/downloads/notebooks/XKCD_plots.ipynb'
response = urllib.request.urlopen(url)
#  for local html file
# response = open("/Users/note/jupyter/notebook.html")
text = response.read()

soup = BeautifulSoup(text, 'lxml')
# see some of the html
print(soup.div)
dictionary = {'nbformat': 4, 'nbformat_minor': 1, 'cells': [], 'metadata': {}}
# find_all is the current name; findAll is a deprecated alias
for d in soup.find_all("div"):
    if 'class' in d.attrs.keys():
        for clas in d.attrs["class"]:
            if clas in ["text_cell_render", "input_area"]:
                # code cell
                if clas == "input_area":
                    cell = {}
                    cell['metadata'] = {}
                    cell['outputs'] = []
                    cell['source'] = [d.get_text()]
                    cell['execution_count'] = None
                    cell['cell_type'] = 'code'
                    dictionary['cells'].append(cell)

                else:
                    cell = {}
                    cell['metadata'] = {}

                    cell['source'] = [d.decode_contents()]
                    cell['cell_type'] = 'markdown'
                    dictionary['cells'].append(cell)
# write the notebook JSON and make sure the file is closed afterwards
with open('notebook.ipynb', 'w', encoding='utf-8') as f:
    json.dump(dictionary, f)

It doesn't convert the entire notebook (cell outputs, for example, are dropped), nor does it run in batch mode.
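One possible way to get batch behaviour out of the same approach is to wrap the cell-extraction loop in a function and run it over every .html file in a folder. This is only a sketch under a few assumptions: the function names html_to_notebook and convert_directory are made up for illustration, the exported HTML is assumed to use the same "input_area" and "text_cell_render" div classes as above, and outputs are still not recovered (they would reappear on re-running the cells).

```python
import json
from pathlib import Path


def html_to_notebook(html_text):
    """Rebuild a notebook dict from exported HTML (code and markdown cells only)."""
    # Imported lazily so the batch driver below can be used with any converter.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_text, "html.parser")
    notebook = {"nbformat": 4, "nbformat_minor": 1, "metadata": {}, "cells": []}
    for div in soup.find_all("div"):
        classes = div.get("class") or []
        if "input_area" in classes:
            notebook["cells"].append({
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                # Trim the blank lines that surround the highlighted source.
                "source": [div.get_text().strip("\n")],
            })
        elif "text_cell_render" in classes:
            notebook["cells"].append({
                "cell_type": "markdown",
                "metadata": {},
                # Keep the rendered HTML; markdown cells accept inline HTML.
                "source": [div.decode_contents()],
            })
    return notebook


def convert_directory(src_dir, dst_dir, convert=html_to_notebook):
    """Convert every *.html file in src_dir to a .ipynb file in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(src.glob("*.html")):
        notebook = convert(html_path.read_text(encoding="utf-8"))
        out_path = dst / (html_path.stem + ".ipynb")
        out_path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling convert_directory("exported_html", "recovered_notebooks") (both folder names are placeholders) would write one .ipynb per HTML file and return the list of paths written. The convert parameter is there so a different extraction routine can be swapped in if the HTML was produced by an export template with other class names.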