I am building a "share a link" feature like Facebook's. Currently I parse meta tags to get keywords, descriptions, etc., but how should I handle pages like http://en.wikipedia.org/wiki/Wikipedia? There is no meta description on this page, yet Facebook still fetches the following description: Wikipedia ( /ˌwɪkɪˈpiːdi.ə/ or /ˌwɪkiˈpiːdi.ə/ WIK-i-PEE-dee-ə) is a free,[3]web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Its 17 million articles (over 3.4 million in English) have been written collaboratively by volunteers around the
How can I extract such a description when the page has no meta description tag?
Download the page and parse it to extract whatever you need.
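The code that originally accompanied this answer is not available, so here is a minimal sketch of the download-and-parse step using only the Python standard library. The helper names (`fetch_html`, `meta_description`), the User-Agent string, and the choice to accept both `description` and `og:description` are assumptions for illustration, not anything Facebook is known to do.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url):
    """Download a page and decode it (hypothetical helper, needs network)."""
    # Some sites reject requests without a User-Agent; this value is arbitrary.
    req = Request(url, headers={"User-Agent": "link-preview-sketch/0.1"})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class MetaDescriptionParser(HTMLParser):
    """Remembers the first description-like <meta> tag in a page."""

    # Which meta names/properties count as a description is an assumption.
    DESCRIPTION_KEYS = {"description", "og:description"}

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr = dict(attrs)
        key = (attr.get("name") or attr.get("property") or "").lower()
        content = (attr.get("content") or "").strip()
        if key in self.DESCRIPTION_KEYS and content:
            self.description = content


def meta_description(html):
    """Return the meta description found in an HTML string, or None."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

Calling `meta_description(fetch_html(url))` returns `None` for pages that have no such tag, which is the case the question asks about, so a fallback is still needed there.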
Amazon faces a similar problem, and has a fairly novel solution. Obviously, it's not perfect, but by marrying it to the idea that Bing uses, I'd bet you could get some pretty solid and interesting keyword tags auto-generated to go with the inherently more suspect description.
So it'd look like:
Description from meta
Interesting sentences according to Bing/Google
STP as tags, with hover-over for context.
I think that, in all likelihood, this is like nuking a fly.
It'd oversolve your problem to a ridiculous degree.
It looks like they generate the description the same way Bing does, which might be difficult to re-create:
http://www.bing.com/toolbox/support/faqs.aspx
One option would be to hit Bing and try to fetch the description from there.
If you want to create a program that gives you a good description of an arbitrary website, you'll need nothing less than a full-fledged AI, one that could possibly even pass a Turing test. So, short answer: you can't.
If you are willing to pay a human to write a summary of a webpage for you, search Google for "microjobs". You can create an automated job description like "Write a two sentence summary about webpage XY" and put a few cents behind it.
Of course you could try to find the first paragraph of text and take the first N sentences out of it, but that will fail on a lot of websites.
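For what it's worth, the first-paragraph fallback can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the names `FirstParagraphParser` and `first_sentences`, the 80-character minimum, the decision to drop `<sup>` footnote markers, and the naive sentence splitter are all my own choices, and, as said above, it will fail on a lot of websites.

```python
import re
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Captures the text of the first <p> with at least min_chars of text."""

    # Text inside these tags is ignored; <sup> drops footnote marks like [3].
    SKIP = {"script", "style", "sup"}

    def __init__(self, min_chars=80):
        super().__init__()
        self.min_chars = min_chars  # arbitrary threshold to skip stub paragraphs
        self.result = None
        self._in_p = False
        self._skip_depth = 0
        self._chunks = []

    def _finish_paragraph(self):
        # Collapse whitespace and keep the paragraph only if it is long enough.
        text = " ".join("".join(self._chunks).split())
        self._in_p, self._chunks = False, []
        if len(text) >= self.min_chars:
            self.result = text

    def handle_starttag(self, tag, attrs):
        if self.result is not None:
            return
        if tag == "p":
            if self._in_p:  # previous <p> had no closing tag
                self._finish_paragraph()
            if self.result is None:
                self._in_p, self._chunks = True, []
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if self.result is not None:
            return
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            self._finish_paragraph()

    def handle_data(self, data):
        if self._in_p and not self._skip_depth and self.result is None:
            self._chunks.append(data)


def first_sentences(html, n=2, min_chars=80):
    """Return the first n sentences of the first substantial paragraph."""
    parser = FirstParagraphParser(min_chars)
    parser.feed(html)
    if parser.result is None:
        return None
    # Naive split on ., ! or ? followed by whitespace; abbreviations break it.
    sentences = re.split(r"(?<=[.!?])\s+", parser.result)
    return " ".join(sentences[:n])
```

A reasonable way to use it is as a fallback only: try the meta description first, and call `first_sentences` when that comes back empty. Pages whose first long `<p>` is a cookie notice, a navigation blurb, or a comment will still produce a poor description.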