I am using Rails 4 and Kramdown, but I believe that this question extends to any (web-) programming language with Markdown support.
I am making a blogging website. On the overview page, I want to show the start of each of the articles.
As an article can be very long, I only want to show the first part.
A bare-bones idea would be to just truncate the article after N characters. A slightly better idea would be to truncate the article after N words.
Of course, when dealing with a document that contains additional markup, such as markdown, this can and will break stuff, so another solution is needed.
How can I show only the first, say, 100 words of a Markdown document without breaking the Markdown markup?
Now, let's assume you take the first 20 chars. You would get:
and 100 chars gives you:
While those char lengths are arbitrary and probably not lengths you would use, the point is that each of them breaks the Markdown syntax. A better approach would be to parse the document to HTML, then break out the beginning of the HTML document.
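To make the breakage concrete, here is a minimal Python sketch of naive truncation. The sample string and the `naive_truncate` helper are hypothetical, made up for illustration; this is not the sample document discussed in this answer.

```python
# Hypothetical sample document, invented for this sketch.
SAMPLE = "Read the *full* story on [my blog](http://example.com/posts/1) today."


def naive_truncate(markdown: str, limit: int) -> str:
    """Cut the raw Markdown source after `limit` characters."""
    return markdown[:limit]


# The cut lands inside the emphasis span, leaving a lone "*".
print(naive_truncate(SAMPLE, 12))

# The cut lands inside the link URL, leaving "(" without its ")".
print(naive_truncate(SAMPLE, 40))
```

In both cases the slice is syntactically broken Markdown, so a renderer would show stray punctuation instead of emphasis or a link.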
Of course, you would probably want to use an HTML document model of some sort rather than splitting on raw char length, for the same reasons. Why not simply take the first paragraph? If the paragraph is too long, break on the Nth char, but only counting the chars in the body text, not the chars which make up the HTML markup. How to do that would depend on which tool/library you are using to handle the HTML, and this is not the place to make tool recommendations (and I'm not very familiar with Ruby/Rails - more of a Python guy).
Note that the second example I give above breaks the Markdown in the middle of a URL for a link. If you first convert the Markdown to HTML and break only counting text chars, then the URL will remain intact even if the link text (label) gets truncated. Although, in that case, it might be better to truncate the text after the end of the link. That depends on how complicated you want to make your code.
A natural next step is to ask why not do all that with the Markdown text instead of converting the entire document to HTML first? You could, but then you would be re-implementing your own Markdown parser... unless you happen to use a Markdown parser which gives you access to the internals (through some plug-in API) or outputs a parse tree. If you are using a parser which returns a parse tree, you could truncate the parse tree, then pass it on to the renderer. Short of that, using parsed HTML is probably the best option.
Either way, let's work through an example. The HTML for the above example would look something like this:
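As a rough illustration of counting only text chars, here is a sketch that uses nothing but Python's standard-library `html.parser`. The class name `TextBudgetTruncator`, the function `truncate_html`, and the sample HTML are all hypothetical, and the sketch assumes reasonably well-formed HTML such as a Markdown renderer would emit.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TextBudgetTruncator(HTMLParser):
    """Copy HTML through until `limit` characters of text have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []        # pieces of output HTML
        self.open_tags = []  # tags opened but not yet closed
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        # Re-emit the tag exactly as written, attributes (URLs) included.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        # Only text counts against the budget (whitespace included).
        piece = data[: self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True


def truncate_html(html, limit):
    parser = TextBudgetTruncator(limit)
    parser.feed(html)
    parser.close()
    # Close whatever was still open at the cut, innermost first.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


html_doc = '<p>See <a href="http://example.com/a/long/url">my blog</a> for more.</p>'
print(truncate_html(html_doc, 6))
```

With a budget of 6 text chars the cut falls inside the link label, yet the `href` survives untouched and both the `a` and `p` elements are closed. Extending the cut to the end of the link, as suggested above, would be a further refinement.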
Now, let's represent that document as some sort of pseudo document object (using JSON):
Now, just loop through that document (and its children), only counting chars for the "text" field of "text" types until you reach your maximum. Then truncate any additional elements after that in the document. When the document is rendered (using a proper HTML renderer), all the HTML elements will be properly closed. Obviously, the exact process would depend on what sort of document object the document is contained in (which may depend on the HTML parser and/or Markdown parser you are using).
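The loop described above might look like the following Python sketch. The node schema (dicts with `type`, `tag`, `attrs`, `children`, and `text` keys) is an assumption made for illustration and will differ from whatever your parser actually returns; `truncate_tree` and `render` are hypothetical helpers.

```python
from html import escape


def truncate_tree(node, budget):
    """Return (truncated copy of `node` or None, remaining text budget)."""
    if node["type"] == "text":
        if budget <= 0:
            return None, 0
        kept = node["text"][:budget]
        return {"type": "text", "text": kept}, budget - len(kept)

    children = []
    for child in node.get("children", []):
        if budget <= 0:
            break  # drop every element after the cut
        new_child, budget = truncate_tree(child, budget)
        if new_child is not None:
            children.append(new_child)
    copy = {key: value for key, value in node.items() if key != "children"}
    copy["children"] = children
    return copy, budget


def render(node):
    """Serialize the tree; every element gets its closing tag."""
    if node["type"] == "text":
        return escape(node["text"], quote=False)
    attrs = "".join(
        f' {name}="{escape(value)}"' for name, value in node.get("attrs", {}).items()
    )
    inner = "".join(render(child) for child in node["children"])
    return f"<{node['tag']}{attrs}>{inner}</{node['tag']}>"


doc = {
    "type": "element",
    "tag": "p",
    "children": [
        {"type": "text", "text": "See "},
        {
            "type": "element",
            "tag": "a",
            "attrs": {"href": "http://example.com/"},
            "children": [{"type": "text", "text": "my blog"}],
        },
        {"type": "text", "text": " for more."},
    ],
}

truncated, _ = truncate_tree(doc, 6)
print(render(truncated))
```

Because the truncation happens on the tree and not on the serialized string, the renderer never sees a half-open element. The function builds a copy, so the full document is still available for the article page.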
In any event, the document truncated to 20 chars would result in this:
Which would render as:
Note that the text only ("Let's use this sampl") counts as 20 chars. While the above examples use chars, you could certainly use the same principles and count words instead.
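For word counting, only the step that consumes a text run needs to change. A possible Python sketch of that step follows; `take_words` is a hypothetical helper, and it assumes a "word" is simply a run of non-whitespace characters.

```python
import re


def take_words(text, max_words):
    """Return (prefix of `text` holding at most `max_words` words, words used)."""
    end = 0
    used = 0
    for match in re.finditer(r"\S+", text):
        if used == max_words:
            break
        end = match.end()
        used += 1
    if used < max_words:
        # The whole run fits; keep it intact, trailing whitespace included.
        return text, used
    return text[:end], used


print(take_words("See my blog for more.", 3))
```

A caller would subtract the returned word count from its remaining budget before moving on to the next text run. One caveat: a word split across elements (for example by emphasis in the middle of a word) would be counted once per run.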