I know that Google’s search algorithm is mainly based on pagerank. However, it also does analysis and uses the structure of the document H1
, H2
, title
and other HTML tags to enhance the search results.
What is the name of this technique "using the document structure to enhance the search results"?
And are there any academic papers to help me study this area?
The fact that Google is taking the HTML structure into account is well covered in SEO articles however I could not find it in the academic papers.
Like cletus said follow the google guidelines.
I did a few tests came to the conclusion that title, image alt and h tags the most important. Also worth to mention is google adsense. I had the feeling if you implement these, the rank of your site increase.
I have found this paper:
A New Study on Using HTML Structures to Improve Retrieval
however it is an old paper 1999,
still looking for more recent papers.
To keep it painfully simple. Make your information architecture logical. If the most important elements for user comprehension are highlighted with headings and grouped logically, then the document is easier to interpret using information processing algorithms. Magically, it will also be easier for users to interpret. Remember the search engine algorithms were written by people trying to interpret language.
The Basic Process is: Write well structured HTML - using header tags to indicate the most critical elements on the page. Use logical tags based on the structure of your information. Lists for lists, headers for major topics.
Supply relevant alt tags and names for any visual elements, and then use simple css to arrange these elements.
If the site works well for users and contains relevant information, you don't risk becoming a black listed spammer, and search engine algorithms will favor your page.
I really enjoyed the book Transcending CSS for a clean explanation of properly structured HTML.
You can also try searching the 'Computer Science' section of arXiv: http://arxiv.org for "search engine" and the various terms that others have suggested.
It contains many academic papers, all freely available... hopefully some of them will be relevant to your research. (Of course the caveat of validating any paper's content applies.)
I would also suggest looking at Microformats and RDF's. Both are used to enhance searching. These are mostly search engine agnostic, but there are some specific things as well. For google specific guidelines for HTML content read this link.
I think it's called "Semantic Markup"
A more practical article here http://robertnyman.com/2007/10/29/explaining-semantic-mark-up/