How might one go about implementing a forward inde

I am looking to implement a simple forward indexer in PHP. Yes I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.

Let us make a few basic assumptions:

The entire Interweb consists of about five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID). No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.
The result of our awesome PHP-based forward indexing algorithm should be along the lines of:

UID1 -> index.html -> helen,she,was,champion,with,freckles

UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep

UID2 -> blah.html -> next,week,on,badgerwatch

UID2 -> gah.txt -> one,one,and,one,is,not,numberwang

Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization/word boundary disambiguation/part-of-speech-tagging. Of course, I do realise this is wishful thinking, and will therefore humbly accept any worthy attempt at parsing said imaginary documents by:

Extracting the real textual content stuff within the document as a list of words in the order in which they are presented.
All the while, ignoring any garbage such as <script> and <html> tags to compute a list of UIDs (which could be, for instance, a domain) followed by document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
Bear in mind that a solution that can build the list of words WHILE reading the document is cooler than one which needs to read in the whole document first.

At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of 'print' statements will suffice.

Thanks in advance, hope this was clear enough.

Tags: php parsing indexing

2 Answers

地球回转人心会变

#2 · 2020-07-05 06:39

I don't think I'm totally clear on what you're trying to do, but you can get a simple result fairly easily:

Run the page through Tidy (a good introduction) to make sure it's going to have valid HTML.
Throw away everything before (and including) <body>.
Step through the document one character at a time.
1. If the character is a '<', don't do anything with the following characters until you see a '>' (skips HTML)
2. If the character is a "word character" (alphanumeric, hyphen, possibly more) append it to the "current word".
3. If the character is a "non-word character" (punctuation, space, possibly more), add the "current word" to the word list in the forward index, and clear the "current word".
Do the above until you hit </body>.

That's really about it, you might have to add in some exceptions for handling things like <script> tags (you don't want to consider javascript to be words that should be indexed), but that should give you a basic forward index.
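The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `StreamingTokenizer` and its methods are hypothetical, it assumes PHP 7.4+ and ASCII text, it skips the Tidy step, and it approximates "throw away everything before <body>" by ignoring the contents of <head>. Words are lowercased to match the output format in the question. State is kept between calls to `feed()`, so the word list is built while the document is still being read.

```php
<?php
// Hypothetical helper class; the names are illustrative only.
final class StreamingTokenizer
{
    private bool $inTag = false;        // true between '<' and '>'
    private string $tagText = '';       // text of the tag being read
    private ?string $skipUntil = null;  // 'script', 'style' or 'head' while inside one
    private string $word = '';          // the "current word"
    private array $words = [];          // words seen so far, in document order

    // Feed the next chunk of the document; a word may span two chunks.
    public function feed(string $chunk): void
    {
        $n = strlen($chunk);
        for ($i = 0; $i < $n; $i++) {
            $c = $chunk[$i];
            if ($this->inTag) {
                if ($c === '>') {
                    $this->inTag = false;
                    $this->closeTag();
                } elseif ($c === '<' && $this->skipUntil !== null) {
                    $this->tagText = '';   // a stray '<' inside a script: start over
                } else {
                    $this->tagText .= $c;
                }
            } elseif ($c === '<') {
                $this->flush();            // a tag also ends the current word
                $this->inTag = true;
                $this->tagText = '';
            } elseif ($this->skipUntil !== null) {
                continue;                  // inside <script>, <style> or <head>: ignore
            } elseif (ctype_alnum($c) || $c === '-') {
                $this->word .= strtolower($c);   // "word character"
            } else {
                $this->flush();                  // "non-word character"
            }
        }
    }

    // Call once after the last chunk to collect the final word list.
    public function finish(): array
    {
        $this->flush();
        return $this->words;
    }

    private function flush(): void
    {
        if ($this->word !== '') {
            $this->words[] = $this->word;
            $this->word = '';
        }
    }

    private function closeTag(): void
    {
        $tag = strtolower(trim($this->tagText));
        if ($this->skipUntil !== null) {
            if ($tag === '/' . $this->skipUntil) {
                $this->skipUntil = null;   // matching close tag: resume indexing
            }
        } elseif (preg_match('/^(script|style|head)\b/', $tag, $m)) {
            $this->skipUntil = $m[1];      // start ignoring until the close tag
        }
    }
}

// Usage: the chunks stand in for successive fread() calls on a document.
$t = new StreamingTokenizer();
$chunks = [
    '<html><head><title>x</title></head><body><p>Helen, she was cham',
    'pion</p><script>var a = 1;</script> with freckles</body></html>',
];
foreach ($chunks as $chunk) {
    $t->feed($chunk);
}
echo 'UID1 -> index.html -> ' . implode(',', $t->finish()) . "\n";
```

Known gaps in this sketch: HTML comments containing '>', entities such as `&amp;` (which would be indexed as "amp"), and non-ASCII letters are not handled.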


家丑人穷心不美

#3 · 2020-07-05 06:48

Take a look at

http://simplehtmldom.sourceforge.net/

You do something like

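include 'simple_html_dom.php';              // the library file from the site above
$p = file_get_html("http://www.page.com");  // fetch and parse the page
echo $p->find("body", 0)->plaintext;        // index 0 returns the first match, not an array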

And that will give you all the text. Want to iterate over just the links?

foreach ($p->find("a") as $link)
{
    echo $link->innertext;
}

It is very useful and powerful. Check it out.
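Once the plain text is in hand, turning it into forward-index lines is a small step. The following is only a sketch under stated assumptions: `wordsFromPlainText()` is a hypothetical helper, not part of the library, the sample strings are made up to match the question's example output, and only ASCII letters, digits and hyphens are treated as word characters.

```php
<?php
// Hypothetical helper: split plain text into lowercase words, in document order.
function wordsFromPlainText(string $text): array
{
    // Any run of characters other than ASCII letters, digits and hyphens
    // is treated as a separator; empty pieces are dropped.
    return preg_split('/[^A-Za-z0-9-]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Made-up sample input: UID => [document name => plain text].
$documents = [
    'UID2' => [
        'blah.html' => 'Next week, on Badgerwatch!',
        'gah.txt'   => 'One, one and one is not Numberwang.',
    ],
];

// Print one forward-index line per document.
foreach ($documents as $uid => $docs) {
    foreach ($docs as $name => $text) {
        echo "$uid -> $name -> " . implode(',', wordsFromPlainText($text)) . "\n";
    }
}
```

Note that this route needs the whole document in memory before any words are produced, so it does not meet the "while reading" wish from the question.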
