Renaming HTML files using <title> tags

I'm relatively new to programming. I have a folder, with subfolders, containing several thousand HTML files that are generically named (e.g. 1006.htm, 1007.htm), and I would like to rename them using the <title> tag from within each file.

For example, if file 1006.htm contains <title>Page Title</title>, I would like to rename it Page Title.htm. Ideally spaces are replaced with dashes.

I've been working in the shell with a bash script, with no luck. How do I do this with either bash or Python?

This is what I have so far:

#!/usr/bin/env bash
FILES=/Users/Ben/unzipped/*
for f in $FILES
do
   if [ ${FILES: -4} == ".htm" ]
      then
    awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' $FILES
   fi
done

I've also tried

#!/usr/bin/env bash
for f in *.html;
   do
   title=$( grep -oP '(?<=<title>).*(?=<\/title>)' "$f" )
   mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}".html   
done

But I get an error from the terminal explaining how to use grep...

Tags: python html bash scrape renaming

3 Answers

做个烂人

Reply #2 · 2020-04-02 01:18

You want to use an HTML parser (like lxml.html) to parse your HTML files. Once you've got that, retrieving the title is one line (probably page.findtext(".//title"), since <title> is a tag and not an id).

Translating that to a file name and renaming the document should be trivial.
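
As a rough sketch of those two steps (get the title, turn it into a file name), here is one way it could look. To keep it runnable without third-party packages it swaps in the standard library's html.parser for lxml, and the names TitleParser, extract_title and title_to_filename are made up for illustration. The set of characters kept in the file name is an assumption too, borrowed from the pattern in the question.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TITLE> matches too.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the page title with whitespace collapsed, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def title_to_filename(title, ext=".htm"):
    """Spaces become dashes; anything other than letters, digits,
    '.', '_' and '-' is dropped. Returns '' for an unusable title."""
    name = re.sub(r"[^A-Za-z0-9._-]", "", title.replace(" ", "-"))
    return name + ext if name else ""
```

With the example from the question, title_to_filename(extract_title(text)) would give Page-Title.htm for a file containing <title>Page Title</title>. Returning an empty string for a missing title lets the caller skip the file instead of renaming it to just ".htm".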


姐就是有狂的资本

Reply #3 · 2020-04-02 01:18

Here is a Python script I just wrote:

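import os
import re

from lxml import etree


class MyClass(object):
    def __init__(self, dirname=''):
        self.dirname   = dirname
        # Bytes pattern, because the files are read in binary mode.
        self.exp_title = br"<title>(.*?)</title>"
        self.re_title  = re.compile(self.exp_title, re.IGNORECASE | re.DOTALL)

    def rename(self):
        # os.listdir does not descend into subfolders.
        for afile in os.listdir(self.dirname):
            originfile = os.path.join(self.dirname, afile)
            # Test the joined path; a bare name would be resolved against
            # the current working directory instead of self.dirname.
            if os.path.isfile(originfile):
                with open(originfile, 'rb') as fp:
                    contents = fp.read()
                try:
                    html  = etree.HTML(contents)
                    title = html.xpath("//title")[0].text
                except Exception:
                    # Fall back to the regex when lxml cannot parse the file.
                    try:
                        title = self.re_title.findall(contents)[0].decode('utf-8', 'replace')
                    except Exception:
                        title = ''

                # Spaces become dashes; other unsafe characters are dropped.
                name = re.sub(r'[^A-Za-z0-9._-]', '', (title or '').strip().replace(' ', '-'))
                if name:
                    # Keep the original extension (.htm or .html).
                    newfile = os.path.join(self.dirname, name + os.path.splitext(afile)[1])
                    os.rename(originfile, newfile)


>>> test = MyClass('/path/to/your/dir')
>>> test.rename()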
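
Since the question mentions subfolders and several thousand files, it may be worth building a list of planned renames first and only then touching the disk. Below is a hedged sketch of that idea, not part of the script above: plan_renames and slug_from_html are hypothetical names, the regex is a compact stand-in for a real HTML parser, and the numeric suffix for duplicate titles is just one possible policy.

```python
import os
import re

# Compact stand-in for a real HTML parser; good enough for simple pages.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def slug_from_html(raw):
    """Return a dash-separated slug built from the <title> in raw bytes, or ''."""
    match = TITLE_RE.search(raw)
    if not match:
        return ""
    title = match.group(1).decode("utf-8", "replace")
    words = re.sub(r"[^A-Za-z0-9._\s-]", "", title).split()
    return "-".join(words)


def plan_renames(top):
    """Walk top recursively and return a list of (old_path, new_path) pairs.

    Nothing is renamed here. Name clashes inside one folder get a numeric
    suffix; names are compared case-insensitively to be safe on file
    systems that ignore case.
    """
    plan = []
    for root, _dirs, files in os.walk(top):
        taken = {f.lower() for f in files}   # names already in this folder
        for name in sorted(files):
            ext = os.path.splitext(name)[1]
            if ext.lower() not in (".htm", ".html"):
                continue
            with open(os.path.join(root, name), "rb") as fp:
                slug = slug_from_html(fp.read())
            if not slug or slug + ext == name:
                continue                     # no usable title, or already named
            candidate, n = slug + ext, 1
            while candidate.lower() in taken:
                n += 1
                candidate = "%s-%d%s" % (slug, n, ext)
            taken.add(candidate.lower())
            plan.append((os.path.join(root, name), os.path.join(root, candidate)))
    return plan
```

You could print the pairs to check them, then apply them with a loop such as: for old, new in plan_renames('/path/to/your/dir'): os.rename(old, new).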


[This account has been banned]

Reply #4 · 2020-04-02 01:22

Use awk instead of grep in your bash script and it should work:

#!/bin/bash   
for f in *.html;
   do
   title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' "$f" )
   mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}".html   
done

Don't forget to change the shebang on the first line ;)

EDIT: full answer with all the modifications:

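#!/bin/bash
# IGNORECASE and RS="^$" (read the whole file as one record) need GNU awk.
find . -type f \( -name '*.htm' -o -name '*.html' \) -print0 |
while IFS= read -r -d '' f
   do
   title=$( gawk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS="^$"} {print $2}' "$f" )
   # Skip files without a title; keep each file in its own folder and keep its extension.
   [ -n "$title" ] && mv -n "$f" "${f%/*}/${title//[ ]/-}.${f##*.}"
done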

