Extract text from external URL

Posted 2019-05-27 01:56

I am building a "share a link" feature like Facebook's. Currently I parse meta tags to get the keywords, description, etc., but how do I handle pages like http://en.wikipedia.org/wiki/Wikipedia? There is no meta description on that page, yet Facebook still fetches the following description:

Wikipedia ( /ˌwɪkɪˈpiːdi.ə/ or /ˌwɪkiˈpiːdi.ə/ WIK-i-PEE-dee-ə) is a free,[3] web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Its 17 million articles (over 3.4 million in English) have been written collaboratively by volunteers around the

How can I extract such a description when no meta description tag is found on the page?

4 Answers
闹够了就滚 · Answer #2 · 2019-05-27 02:34

Download the page and parse out whatever you need from it:

    System.Net.WebClient client = new System.Net.WebClient();
    String url = "http://en.wikipedia.org/wiki/Wikipedia";
    String pageHTMLSource = client.DownloadString(url);
    // Parse pageHTMLSource
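For the parsing step, an HTML parser such as HtmlAgilityPack (my choice for illustration; the answer itself does not name a library) is less brittle than regexes. A minimal sketch that continues the snippet above and reads the meta description, falling back to the Open Graph description:

    // Minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced.
    using System;
    using System.Net;
    using HtmlAgilityPack;

    class MetaExtractor
    {
        static void Main()
        {
            var client = new WebClient();
            string url = "http://en.wikipedia.org/wiki/Wikipedia";
            string pageHTMLSource = client.DownloadString(url);

            var doc = new HtmlDocument();
            doc.LoadHtml(pageHTMLSource);

            // Try <meta name="description"> first, then <meta property="og:description">.
            var node = doc.DocumentNode.SelectSingleNode("//meta[@name='description']")
                    ?? doc.DocumentNode.SelectSingleNode("//meta[@property='og:description']");

            string description = node?.GetAttributeValue("content", null);
            Console.WriteLine(description ?? "No meta description found - fall back to page text.");
        }
    }

For the Wikipedia page in the question neither tag may be present, which is exactly when the fallbacks discussed in the other answers come into play.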
我命由我不由天 · Answer #3 · 2019-05-27 02:36

Amazon faces a similar problem and has a fairly novel solution. Obviously it's not perfect, but by marrying it to the idea that Bing uses, I'd bet you could get some pretty solid and interesting keyword tags auto-generated to go with the inherently more suspect description. So it'd look like:

* Description from meta
* Interesting sentences according to Bing/Google
* STP as tags, with hover-over for context

I think that, in all likelihood, this is like nuking a fly; it would oversolve your problem to a ridiculous degree.
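Purely to illustrate the "interesting sentences plus keyword tags" idea (this is not Amazon's or Bing's actual algorithm, just a naive sketch): count word frequencies, treat the most frequent long words as tags, and pick the sentence that covers the most of them as the description.

    // Naive sketch only: frequent long words become "tags", and the sentence
    // that mentions the most of them becomes the candidate description.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class NaiveSummarizer
    {
        public static (string Description, List<string> Tags) Summarize(string plainText, int tagCount = 5)
        {
            // Count words of four letters or more (no stemming or stop-word list here).
            var tags = Regex.Matches(plainText.ToLowerInvariant(), @"[a-z]{4,}")
                            .Cast<Match>()
                            .GroupBy(m => m.Value)
                            .OrderByDescending(g => g.Count())
                            .Take(tagCount)
                            .Select(g => g.Key)
                            .ToList();

            // Very rough sentence split, then score each sentence by tag coverage.
            var sentences = Regex.Split(plainText, @"(?<=[.!?])\s+");
            string description = sentences
                .OrderByDescending(s => tags.Count(t => s.ToLowerInvariant().Contains(t)))
                .FirstOrDefault() ?? "";

            return (description.Trim(), tags);
        }
    }

A real implementation would add stop-word filtering, stemming, and position weighting; without them this will happily pick navigation or boilerplate text on many pages.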

Answer #4 · 2019-05-27 02:36

Looks like they generate the description the same way Bing does, which might be difficult to re-create:

How does Bing generate a description of my Web site?

The way you design your Web page content has the greatest impact on your Web page description. As MSNBot crawls your Web site, it analyzes the content on indexed Web pages and generates keywords to associate with each Web page. MSNBot extracts Web page content that is most relevant to the keywords, and constructs the Web site description that appears in search results. The Web page content is typically sentence segments that contain keywords or information in the description tag. The Web page title and URL are also extracted and appear in the search results.

If you change the contents of a Web page, your Web page description might change the next time the Bing index is updated. To influence your Web site description, make sure that your Web pages effectively deliver the information you want in the search results. Webmaster Center recommends the following strategies when you design your content:

* Place descriptive content near the top of each Web page.
* Make sure that each Web page has a clear topic and purpose.
* Create unique <title> tag content for each page.
* Add a Web site description <meta> tag to describe the purpose of each page on your site. For example:

> <META NAME="Description" CONTENT="Sample text - describe your ...">

http://www.bing.com/toolbox/support/faqs.aspx

One option would be to hit Bing and try to fetch the description from there.
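As a hedged sketch of that last suggestion: download a Bing results page for the URL and pull out the first snippet. This is screen scraping, so it may violate Bing's terms, and the b_caption selector below is only my guess at Bing's result markup, which will break whenever the page changes; the supported route would be Bing's search API.

    // Hedged sketch: fetch a Bing results page for the URL and pull the first snippet.
    // The XPath assumes snippets live in <div class="b_caption"><p>...</p>, which is an
    // assumption about Bing's current markup, not a documented interface.
    using System;
    using System.Net;
    using HtmlAgilityPack;

    class BingSnippetFetcher
    {
        static string FetchSnippet(string pageUrl)
        {
            var client = new WebClient();
            string searchUrl = "http://www.bing.com/search?q=" + Uri.EscapeDataString(pageUrl);
            string html = client.DownloadString(searchUrl);

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Assumed selector for the first result's snippet text.
            var node = doc.DocumentNode.SelectSingleNode("//div[@class='b_caption']//p");
            return node != null ? HtmlEntity.DeEntitize(node.InnerText).Trim() : null;
        }
    }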

乱世女痞 · Answer #5 · 2019-05-27 02:38

If you want to create a program that gives you a good description of an arbitrary website, you would need nothing less than a full-fledged AI, one that could possibly even pass a Turing test. So, short answer: you can't.

If you are willing to pay human intelligence to write a summary of a webpage for you, Google for "microjobs". You can create an automated job description like "Write a two-sentence summary of webpage XY" and put a few cents of value behind it.

Of course you could try to find the first paragraph of text and take the first N sentences out of it, but that will fail on a lot of websites.
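For completeness, a sketch of that first-paragraph fallback, again assuming HtmlAgilityPack; the length threshold and the sentence split below are arbitrary choices, and as the answer says, it will misfire on pages whose first paragraph is boilerplate.

    // Fallback sketch: take the first reasonably long <p>, then its first N sentences.
    using System;
    using System.Linq;
    using System.Net;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    class FirstParagraphFallback
    {
        static string Describe(string url, int sentenceCount = 2)
        {
            string html = new WebClient().DownloadString(url);
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // First paragraph with at least 80 characters of text (arbitrary threshold).
            string paragraph = doc.DocumentNode
                .SelectNodes("//p")
                ?.Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
                .FirstOrDefault(t => t.Length > 80);

            if (paragraph == null) return null;

            // Very rough sentence split; abbreviations like "e.g." will trip it up.
            var sentences = Regex.Split(paragraph, @"(?<=[.!?])\s+");
            return string.Join(" ", sentences.Take(sentenceCount));
        }
    }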
