Read HTML with BeautifulSoup and find specific data

Posted 2019-08-27 04:02

I asked a similar question before, but I need something different from what I got there.

I have HTML data; the part I need is shown below.

I already have the rcpNo value, but eleId varies from 1 to 33, and offset and length follow no regular pattern. All three are numeric, sometimes with a different number of digits.

I need to read rcpNo, eleId, offset, length and dtd.

(dtd is 'dart3.xsd' here, but I have only tried this one HTML page, so other pages may have a different dtd value. That is why I want to read it from the HTML.)

# This is part of the HTML
# viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)


treeNode1.appendChild(treeNode2);

    treeNode2 = new Tree.TreeNode({
        text: "4. The number of stocks",
        id: "7",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
        }
    });
    cnt++;

Similar data is repeated, so here is another part of the HTML:

treeNode2 = new Tree.TreeNode({
        text: "1. Summary information",
        id: "12",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '12', '189335', '18247', 'dart3.xsd');}
        }
    });
    cnt++;

    treeNode1.appendChild(treeNode2);

    treeNode2 = new Tree.TreeNode({
        text: "2. Linked finance state",
        id: "13",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
        }
    });
    cnt++;

treeNode1.appendChild(treeNode2);

    treeNode2 = new Tree.TreeNode({
        text: "3. Comment for linked finance state",
        id: "14",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '14', '284697', '372938', 'dart3.xsd');}
        }
    });
    cnt++;

As you can see above, text and id change from node to node. I want to read all of the dcmNo, eleId, offset, length and dtd information, especially for a specific id and text.
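To make the goal concrete, here is a minimal sketch of pairing each node's text and id with its viewDoc arguments using only the `re` module. The names `NODE_RE` and `parse_nodes` are hypothetical, the sample string is copied from the excerpt above, and the pattern assumes every node lists `text`, then `id`, then a `viewDoc(...)` call in that order; a node without a viewDoc call would break that assumption.

```python
import re

# Sample copied from the excerpt above; in practice this would be the page source.
js = """
treeNode2 = new Tree.TreeNode({
    text: "1. Summary information",
    id: "12",
    cls: "text",
    listeners: {
        click: function() {viewDoc('20180515000480', '6177478', '12', '189335', '18247', 'dart3.xsd');}
    }
});
cnt++;

treeNode2 = new Tree.TreeNode({
    text: "2. Linked finance state",
    id: "13",
    cls: "text",
    listeners: {
        click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
    }
});
cnt++;
"""

# Hypothetical pattern: text, then id, then the nearest following viewDoc(...) call.
NODE_RE = re.compile(
    r'text:\s*"(?P<text>[^"]*)",\s*'
    r'id:\s*"(?P<id>\d+)",'
    r".*?viewDoc\((?P<args>[^)]*)\)",
    re.DOTALL,
)


def parse_nodes(source):
    """Return one dict per tree node with its text, id and viewDoc arguments."""
    nodes = []
    for m in NODE_RE.finditer(source):
        # Split the raw argument list into its single-quoted values.
        args = re.findall(r"'([^']*)'", m.group("args"))
        nodes.append({"text": m.group("text"), "id": m.group("id"), "args": args})
    return nodes


nodes = parse_nodes(js)
for node in nodes:
    print(node["id"], node["text"], node["args"][1:])
```

Keeping text and id next to the arguments makes it possible to select a row by section title instead of by position.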

I tried the following:

>>> string = "{viewDoc('20180515000480', '6177478', '6', '58846', '899', 'dart3.xsd');}"
>>> pattern = re.compile(r'viewDoc\(\'\d+\', \'(\d+)\', \'(\d+)\', \'(\d+)\', \'(\d+)\', \'(\d+)\' .+\)', re.MULTILINE | re.DOTALL)

and with BeautifulSoup:

>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find_all(string = pattern)

This command matches the whole HTML text, so I cannot pick out the data. It does not work: it returns the first text in the HTML, which I do not need.
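One likely reason the pattern finds nothing useful: it expects six numeric arguments followed by a space, while the call has five numeric arguments followed by 'dart3.xsd'. Also, as far as I understand, find_all(string=pattern) returns the whole matching text node rather than the captured groups. Below is a hedged sketch of an adjusted pattern applied to the same sample string with `re` directly; `VIEWDOC_RE` is a hypothetical name, and the pattern assumes the first five arguments are always digits.

```python
import re

sample = "{viewDoc('20180515000480', '6177478', '6', '58846', '899', 'dart3.xsd');}"

# Five quoted numeric arguments, then one quoted non-numeric argument (dtd).
VIEWDOC_RE = re.compile(
    r"viewDoc\('(\d+)',\s*'(\d+)',\s*'(\d+)',\s*'(\d+)',\s*'(\d+)',\s*'([^']+)'\)"
)

match = VIEWDOC_RE.search(sample)
rcp_no, dcm_no, ele_id, offset, length, dtd = match.groups()
print(dcm_no, ele_id, offset, length, dtd)
```

Using `\s*` after each comma keeps the pattern working if the spacing between arguments differs on another page.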

Edit

This is how I get the HTML from the URL:

from bs4 import BeautifulSoup
import pandas as pd  # needed for pd.DataFrame below
import requests
import re

# API_KEY and company_code are defined elsewhere
url = "http://dart.fss.or.kr/api/search.json?auth="+API_KEY \
  +"&crp_cd="+company_code + "&page_set=100" \
  +"&start_dt=19990101&bsn_tp=A001&bsn_tp=A002&bsn_tp=A003"

json_data = requests.get(url).json()
reports = json_data['list']  # renamed from `list` to avoid shadowing the built-in

data = pd.DataFrame.from_dict(reports)

print(data['rcp_no'][0])

# Fetch the report page for the first receipt number
url2 = "http://dart.fss.or.kr/dsaf001/main.do?rcpNo="+data['rcp_no'][0]

temp = requests.get(url2)

html = temp.text

soup = BeautifulSoup(html, "html.parser")

The HTML example above is part of the output of print(soup). As I said, the same format is repeated many times in the HTML, and I want to read specific lines. For example, if I find the lines below, I want to get their data:

# viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)

viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd')

viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd')

like ['6177478', '7', '59749', '7130', 'dart3.xsd'], ['6177478', '13', '207823', '76870', 'dart3.xsd'], that is, the number and text data (dcmNo, eleId, offset, length and dtd).
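The desired output could be produced along these lines. This is only a sketch: `find_view_docs`, `CALL_RE` and `ARG_RE` are hypothetical names, the sample page text is made up from the two calls above, and the function works on the raw page string (for example `temp.text`) because the calls live inside JavaScript rather than in HTML tags.

```python
import re

CALL_RE = re.compile(r"viewDoc\(([^)]*)\)")  # raw argument list of each call
ARG_RE = re.compile(r"'([^']*)'")            # single-quoted values inside it


def find_view_docs(html, ele_ids=None):
    """Return [dcmNo, eleId, offset, length, dtd] for each viewDoc call.

    If ele_ids is given, only calls whose eleId is in that set are kept.
    """
    rows = []
    for call in CALL_RE.finditer(html):
        args = ARG_RE.findall(call.group(1))
        if len(args) != 6:
            continue  # e.g. the function definition, which has no quoted values
        rcp_no, dcm_no, ele_id, offset, length, dtd = args
        if ele_ids is None or ele_id in ele_ids:
            rows.append([dcm_no, ele_id, offset, length, dtd])
    return rows


# Made-up page text for illustration, built from the calls shown above.
html = """
function viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd) {}
click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
click: function() {viewDoc('20180515000480', '6177478', '12', '189335', '18247', 'dart3.xsd');}
click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
"""

rows = find_view_docs(html, ele_ids={"7", "13"})
print(rows)
# [['6177478', '7', '59749', '7130', 'dart3.xsd'], ['6177478', '13', '207823', '76870', 'dart3.xsd']]
```

Calling `find_view_docs(html)` without `ele_ids` would return every call on the page, which suits the case where eleId runs from 1 to 33.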
