How to use sys.path_hooks for customized loading o

2020-06-18 09:52发布

问题:

I hope the following question is not too long. But otherwise I cannot explain by problem and what I want:

Learned from How to use importlib to import modules from arbitrary sources? (my question of yesterday) I have written a specfic loader for a new file type (.xxx). (In fact the xxx is an encrypted version of a pyc to protect code from being stolen).

I would like just to add an import hook for the new file type "xxx" without affecting the other types (.py, .pyc, .pyd) in any way.

Now, the loader is ModuleLoader, inheriting from mportlib.machinery.SourcelessFileLoader.

Using sys.path_hooks the loader shall be added as a hook:

myFinder = importlib.machinery.FileFinder
loader_details = (ModuleLoader, ['.xxx'])
sys.path_hooks.append(myFinder.path_hook(loader_details))

Note: This is activated once by calling modloader.activateLoader()

Upon loading a module named test (which is a test.xxx) I get:

>>> import modloader
>>> modloader.activateLoader()
>>> import test
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'test'
>>>

However, when I delete content of sys.path_hooks before adding the hook:

sys.path_hooks = []
sys.path.insert(0, '.') # current directory
sys.path_hooks.append(myFinder.path_hook(loader_details))

it works:

>>> modloader.activateLoader()
>>> import test
using xxx class

in xxxLoader exec_module
in xxxLoader get_code: .\test.xxx
ANALYZING ...

GENERATE CODE OBJECT ...

  2           0 LOAD_CONST               0
              3 LOAD_CONST               1 ('foo2')
              6 MAKE_FUNCTION            0
              9 STORE_NAME               0 (foo2)
             12 LOAD_CONST               2 (None)
             15 RETURN_VALUE
>>>>>> test
<module 'test' from '.\\test.xxx'>

The module is imported correctly after conversion of the files content to a code object.

However I cannot load the same module from a package: import pack.test

Note: __init__.py is of course as an empty file in pack directory.

>>> import pack.test
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 2218, in _find_and_load_unlocked
AttributeError: 'module' object has no attribute '__path__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'pack.test'; 'pack' is not a package
>>>

Not enough, I cannot load plain *.py modules from that package anymore: I get the same error as above:

>>> import pack.testpy
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 2218, in _find_and_load_unlocked
AttributeError: 'module' object has no attribute '__path__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'pack.testpy'; 'pack' is not a package
>>>

For my understanding sys.path_hooks is traversed until the last entry is tried. So why is the first variant (without deleting sys.path_hooks) not recognizing the new extension "xxx" and the second variant (deleting sys.path_hooks) do? It looks like the machinery is throwing an exception rather than traversing further to the next entry, when an entry of sys.path_hooks is not able to recognize "xxx".

And why is the second version working for py, pyc and xxx modules in the current directory, but not working in the package pack? I would expect that py and pyc is not even working in the current dir, because sys.path_hooks contains only a hook for "xxx"...

回答1:

The short answer is that the default PathFinder in sys.meta_path isn't meant to have new file extensions and importers added in the same paths it already supports. But there's still hope!

Quick Breakdown

sys.path_hooks is consumed by the importlib._bootstrap_external.PathFinder class.

When an import happens, each entry in sys.meta_path is asked to find a matching spec for the requested module. The PathFinder in particular will then take the contents of sys.path and pass it to the factory functions in sys.path_hooks. Each factory function has a chance to either raise an ImportError (basically the factory saying "nope, I don't support this path entry") or return a finder instance for that path. The first successfully returned finder is then cached in sys.path_importer_cache. From then on PathFinder will only ask those cached finder instances if they can provide the requested module.

If you look at the contents of sys.path_importer_cache, you'll see all of the directory entries from sys.path have been mapped to FileFinder instances. Non-directory entries (zip files, etc) will be mapped to other finders.

Thus, if you append a new factory created via FileFinder.path_hook to sys.path_hooks, your factory will only be invoked if the previous FileFinder hook didn't accept the path. This is unlikely, since FileFinder will work on any existing directory.

Alternatively, if you insert your new factory to sys.path_hooks ahead of the existing factories, the default hook will only be used if your new factory doesn't accept the path. And again, since FileFinder is so liberal with what it will accept, this would lead to only your loader being used, as you've already observed.

Making it Work

So you can either try to adjust that existing factory to also support your file extension and importer (which is difficult as the importers and extension string tuples are held in a closure), or do what I ended up doing, which is add a new meta path finder.

So eg. from my own project,


import sys

from importlib.abc import FileLoader
from importlib.machinery import FileFinder, PathFinder
from os import getcwd
from os.path import basename

from sibilant.module import prep_module, exec_module


SOURCE_SUFFIXES = [".lspy", ".sibilant"]


_path_importer_cache = {}
_path_hooks = []


class SibilantPathFinder(PathFinder):
    """
    An overridden PathFinder which will hunt for sibilant files in
    sys.path. Uses storage in this module to avoid conflicts with the
    original PathFinder
    """


    @classmethod
    def invalidate_caches(cls):
        for finder in _path_importer_cache.values():
            if hasattr(finder, 'invalidate_caches'):
                finder.invalidate_caches()


    @classmethod
    def _path_hooks(cls, path):
        for hook in _path_hooks:
            try:
                return hook(path)
            except ImportError:
                continue
        else:
            return None


    @classmethod
    def _path_importer_cache(cls, path):
        if path == '':
            try:
                path = getcwd()
            except FileNotFoundError:
                # Don't cache the failure as the cwd can easily change to
                # a valid directory later on.
                return None
        try:
            finder = _path_importer_cache[path]
        except KeyError:
            finder = cls._path_hooks(path)
            _path_importer_cache[path] = finder
        return finder


class SibilantSourceFileLoader(FileLoader):


    def create_module(self, spec):
        return None


    def get_source(self, fullname):
        return self.get_data(self.get_filename(fullname)).decode("utf8")


    def exec_module(self, module):
        name = module.__name__
        source = self.get_source(name)
        filename = basename(self.get_filename(name))

        prep_module(module)
        exec_module(module, source, filename=filename)


def _get_lspy_file_loader():
    return (SibilantSourceFileLoader, SOURCE_SUFFIXES)


def _get_lspy_path_hook():
    return FileFinder.path_hook(_get_lspy_file_loader())


def _install():
    done = False

    def install():
        nonlocal done
        if not done:
            _path_hooks.append(_get_lspy_path_hook())
            sys.meta_path.append(SibilantPathFinder)
            done = True

    return install


_install = _install()
_install()

The SibilantPathFinder overrides PathFinder and replaces only those methods which reference sys.path_hook and sys.path_importer_cache with similar implementations which instead look in a _path_hook and _path_importer_cache which are local to this module.

During import, the existing PathFinder will try to find a matching module. If it cannot, then my injected SibilantPathFinder will re-traverse the sys.path and try to find a match with one of my own file extensions.

Figuring More Out

I ended up delving into the source for the _bootstrap_external module https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap_external.py

The _install function and the PathFinder.find_spec method are the best starting points to seeing why things work the way they do.



回答2:

@obriencj's analysis of the situation is correct. But I came up with a different solution to this problem that doesn't require putting anything in sys.meta_path. Instead, it installs a special hook in sys.path_hooks that acts almost as a sort of middle-ware between the PathFinder in sys.meta_path, and the hooks in sys.path_hooks where, rather than just using the first hook that says "I can handle this path!" it tries all matching hooks in order, until it finds one that actually returns a useful ModuleSpec from its find_spec method:

@PathEntryFinder.register
class MetaFileFinder:
    """
    A 'middleware', if you will, between the PathFinder sys.meta_path hook,
    and sys.path_hooks hooks--particularly FileFinder.

    The hook returned by FileFinder.path_hook is rather 'promiscuous' in that
    it will handle *any* directory.  So if one wants to insert another
    FileFinder.path_hook into sys.path_hooks, that will totally take over
    importing for any directory, and previous path hooks will be ignored.

    This class provides its own sys.path_hooks hook as follows: If inserted
    on sys.path_hooks (it should be inserted early so that it can supersede
    anything else).  Its find_spec method then calls each hook on
    sys.path_hooks after itself and, for each hook that can handle the given
    sys.path entry, it calls the hook to create a finder, and calls that
    finder's find_spec.  So each sys.path_hooks entry is tried until a spec is
    found or all finders are exhausted.
    """

    class hook:
        """
        Use this little internal class rather than a function with a closure
        or a classmethod or anything like that so that it's easier to
        identify our hook and skip over it while processing sys.path_hooks.
        """

        def __init__(self, basepath=None):
            self.basepath = os.path.abspath(basepath)

        def __call__(self, path):
            if not os.path.isdir(path):
                raise ImportError('only directories are supported', path=path)
            elif not self.handles(path):
                raise ImportError(
                    'only directories under {} are supported'.format(
                        self.basepath), path=path)

            return MetaFileFinder(path)

        def handles(self, path):
            """
            Return whether this hook will handle the given path, depending on
            what its basepath is.
            """

            path = os.path.abspath(path)

            return (self.basepath is None or
                    os.path.commonpath([self.basepath, path]) == self.basepath)

    def __init__(self, path):
        self.path = path
        self._finder_cache = {}

    def __repr__(self):
        return '{}({!r})'.format(self.__class__.__name__, self.path)

    def find_spec(self, fullname, target=None):
        if not sys.path_hooks:
            return None

        last = len(sys.path_hooks) - 1

        for idx, hook in enumerate(sys.path_hooks):
            if isinstance(hook, self.__class__.hook):
                continue

            finder = None
            try:
                if hook in self._finder_cache:
                    finder = self._finder_cache[hook]
                    if finder is None:
                        # We've tried this finder before and got an ImportError
                        continue
            except TypeError:
                # The hook is unhashable
                pass

            if finder is None:
                try:
                    finder = hook(self.path)
                except ImportError:
                    pass

            try:
                self._finder_cache[hook] = finder
            except TypeError:
                # The hook is unhashable for some reason so we don't bother
                # caching it
                pass

            if finder is not None:
                spec = finder.find_spec(fullname, target)
                if (spec is not None and
                        (spec.loader is not None or idx == last)):
                    # If no __init__.<suffix> was found by any Finder,
                    # we may be importing a namespace package (which
                    # FileFinder.find_spec returns in this case).  But we
                    # only want to return the namespace ModuleSpec if we've
                    # exhausted every other finder first.
                    return spec

        # Module spec not found through any of the finders
        return None

    def invalidate_caches(self):
        for finder in self._finder_cache.values():
            finder.invalidate_caches()

    @classmethod
    def install(cls, basepath=None):
        """
        Install the MetaFileFinder in the front sys.path_hooks, so that
        it can support any existing sys.path_hooks and any that might
        be appended later.

        If given, only support paths under and including basepath.  In this
        case it's not necessary to invalidate the entire
        sys.path_importer_cache, but only any existing entries under basepath.
        """

        if basepath is not None:
            basepath = os.path.abspath(basepath)

        hook = cls.hook(basepath)
        sys.path_hooks.insert(0, hook)
        if basepath is None:
            sys.path_importer_cache.clear()
        else:
            for path in list(sys.path_importer_cache):
                if hook.handles(path):
                    del sys.path_importer_cache[path]

This is still, depressing, far more complication than should be necessary. I feel like on Python 2, before the import system rewrite, it was much simpler to do this since less of the support for the built-in module types (.py, etc.) was built on top of the import hooks themselves, so it was harder to break importing normal modules by adding hooks to import new modules types. I'm going to start a discussion on python-ideas to see if there's any way we can't improve this situation.