I have a RAR file and a ZIP file. Within these two there is a folder. Inside the folder there are several 7-zip (.7z) files. Inside every 7z there are multiple files with the same extension, but whose names vary.
RAR or ZIP file
|___folder
    |_____Multiple 7z
          |_____Multiple files with same extension and different name
I want to extract just the ones I need from thousands of files...
I need those files whose names include a certain substring. For example, if the name of a compressed file includes '[!]', '(U)' or '(J)', that is the criterion that marks the file for extraction.
I can extract the folder without problem so I have this structure:
folder
|_____Multiple 7z
      |_____Multiple files with same extension and different name
I'm in a Windows environment but I have Cygwin installed. How can I extract the files I need painlessly? Maybe using a single command line.
Please, help me with this one. THANKS!
UPDATE Thanks to everyone for helping me out. There are some more specifications to improve the question:
- The inner 7z files and their respective files inside them can have spaces in their names.
- There are 7z files with just one file inside of them that doesn't meet the given criteria. Thus, being the only possible file, they have to be extracted too.
Thanks to everyone. The bash solution was the one that helped me out. I wasn't able to test the Python 3 solutions because I had problems trying to install libraries using `pip`. I don't use Python so I'll have to study and overcome the errors I face with these solutions. For now, I've found a suitable answer. Thanks to everyone.
This is a more or less final version after several tries. The previous one was not useful, so I removed it rather than appending to it. Read to the end, since not everything here may be needed for the final solution.
On to the topic. I would use Python. For a one-time task it can be overkill, but otherwise it lets you log every step for later investigation, use regular expressions, and orchestrate commands by providing their input and processing their output, every time. All of that is quite easy in Python, provided you have it installed.
Now I'll describe how to get the environment configured. Not every step is mandatory, but I went through several while trying to install things, and a description of the process may be useful in itself.
I have MinGW, the 32-bit version. It is not mandatory for extracting 7zip archives, however. Once it is installed, go to `C:\MinGW\bin` and run `mingw-get.exe`:
- Basic Setup: I have `msys-base` installed (right click, mark for installation, then Apply changes from the Installation menu). That way I have bash, sed, grep, and many more.
- All Packages: there is `mingw32-libarchive` with `dll` as its class. Since the Python `libarchive` package is just a wrapper, you need this dll to actually have a binary to wrap.
Examples are for Python 3. I'm using the 32-bit version. You can fetch it from their home page. I installed it in the default directory, which is an odd location, so my advice is to install it in the root of your disk, as with MinGW.
Other things: ConEmu is much better than the default console.
Installing packages in Python.
`pip` is used for that. From your console go to the Python home directory; there is a `Scripts` subdirectory there. For me it is: `c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\Scripts`. You can search with, for instance, `pip search archive`, and install with `pip install libarchive-c`:
After `cd ..` call `python`, and the new library can be used / imported:
So it fails. I tried to fix that, but failed with this:
I tried the `set` command to provide the information directly, but that failed too... So I moved to `pylzma`, for which MinGW is not needed. The `pip` install failed:
It failed again, but that is an easy one: I installed Visual Studio Build Tools 2015, and then it worked. I have `sevenzip` installed, so I created a sample archive. So finally I can start Python and do:
And I got an empty list. Looking closer gives a better understanding: empty files are not considered by `pylzma`, just to make you aware of that. After putting one character into each of my sample files, the last line gives:
So... the rest is a piece of cake. And actually this is part of the original post:
As a side note: Anaconda is a great tool, but a full install takes 500+ MB, so that is way too much.
Also let me share the wmctrl.py tool, from my GitHub:
That way you can orchestrate different commands, here `wmctrl`. The output can be captured in a form that allows further data processing.
This solution is based on bash, grep and awk; it works on Cygwin and on Ubuntu.
Since you have the requirement to search for `(X) [!].ext` files first and, if there are no such files, then look for `(X).ext` files, I don't think it is possible to write a single expression to handle this logic.
The solution should have some if/else conditional logic to test the list of files inside the archive and decide which files to extract.
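To make the if/else idea concrete, here is a minimal Python sketch of the decision step only. It is not the bash script from this answer. The function name `choose_files` and the tag constants are hypothetical, and the tiering is one reading of the requirements: names containing `[!]` win, otherwise names containing `(U)` or `(J)`, otherwise the lone file of a single-file archive (the rule from the question's update).

```python
from typing import List, Sequence

PREFERRED_TAG = "[!]"          # marker named in the question
REGION_TAGS = ("(U)", "(J)")   # region tags named in the question


def choose_files(names: Sequence[str]) -> List[str]:
    """Pick the files to extract from one archive's file list."""
    # Tier 1: any name carrying the [!] marker.
    preferred = [n for n in names if PREFERRED_TAG in n]
    if preferred:
        return preferred
    # Tier 2: no [!] file, so fall back to region-tagged names.
    tagged = [n for n in names if any(t in n for t in REGION_TAGS)]
    if tagged:
        return tagged
    # Tier 3: a single-file archive is extracted even without a match.
    if len(names) == 1:
        return list(names)
    return []
```

Plain substring tests are used instead of a regular expression or glob because `[`, `!`, `(` and `)` are all special characters in those syntaxes and would need escaping.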
Here is the initial structure inside the zip/rar archive I tested my script on (I made a script to prepare this structure):
The output is this:
And this is the script to do the extraction:
The basic idea here is to go over the 7zip archives and get the list of files for each of them using the `7z l` command (list of files).
The output of the command is quite verbose, so we use `awk` to clean it up and get the list of file names.
After that we filter this list using `grep` to get either a list of `[!]` files or a list of `(X)` files. Then we just pass this list to 7zip to extract the files we need.
What about using this command line:
Where :
The -y option is for forcing overwriting in case you have the same filename in different archives.
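As an illustration of where `-y` fits in a list-then-extract flow, here is a hedged Python sketch. It assumes the default table layout printed by `7z l` (a header row, a line of dashes, one row per entry, and a second line of dashes), and the helper names `parse_7z_listing` and `build_extract_command` are hypothetical. It is not the command line from this answer.

```python
from typing import List


def parse_7z_listing(listing: str) -> List[str]:
    """Return the entry names from the table printed by `7z l`."""
    names: List[str] = []
    name_col = None  # column where the Name field starts
    for line in listing.splitlines():
        if line.startswith("-----"):
            if name_col is not None:
                break  # second separator: end of the table
            # The last run of dashes sits under the Name column.
            name_col = line.rstrip().rindex(" ") + 1
        elif name_col is not None and len(line) > name_col:
            # Slice by column so names containing spaces stay intact.
            # Directory rows, if any, would be returned as well.
            names.append(line[name_col:])
    return names


def build_extract_command(archive: str, names: List[str], out_dir: str) -> List[str]:
    """Build a `7z e` argument list; -y answers yes to overwrite prompts."""
    return ["7z", "e", "-y", "-o" + out_dir, archive, *names]
```

The listing text would come from something like `subprocess.run(["7z", "l", archive], capture_output=True, text=True).stdout`, and the built list can be passed to `subprocess.run(cmd, check=True)`. Passing an argument list rather than a shell string avoids quoting problems with spaces in archive and file names.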
You state in the question's bounty footer that it is OK to use Linux. Also, I don't use Windows, sorry about that. I am using Python 3, and you have to be in a Linux environment (I will try to test this on Windows as soon as I can).
Archive structure
Extracted structure
Here is how I did it.
Above the main program, I've got all the required functions ready. I didn't use all of them, but I kept them in case you need them.
I used several Python libraries with `python3`, but you only have to install libarchive and rarfile using `pip`; the others are built-in libraries.
And here is a copy of my source tree
Console output
This is the console output when you run this python file,
Issues
The only issue I have faced so far is that some temporary files are generated at the program root. It doesn't affect the program in any way, but I'll try to fix that.
edit
You have to run
to install the actual `libarchive` program. The Python library is just a wrapper around it. Take a look at the official documentation.