I'm trying to split a paragraph into groups of sentences such that each group stays under N characters. If a single sentence is longer than N, it should be split into chunks, using punctuation marks or spaces as separators.
E.g., if N = 50, then the following string
"Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."
would become
["Lorem ipsum, consectetur elit. Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin,", "sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]
Are there any Rails gems that could help me achieve this? I looked at html_slicer, but I'm not sure it can handle the example above.
What you are after involves two non-trivial tasks:
- splitting a string into sentences, and
- word-wrapping each sentence with extra care for punctuation.
I think the first one is not easy to implement from scratch, so your best bet may be a natural language processing library, provided that your "third-party language processing service" doesn't already offer such a feature. I don't know of any Rails gem that meets your requirement.
Here is a toy example of splitting a string into sentences using stanford-core-nlp:
require 'stanford-core-nlp'
text = "Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)
a = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(a)
sentences = a.get(:sentences).to_a.map(&:to_s) # Map with to_s to get an array of sentence strings.
# => ["Lorem ipsum, consectetur elit.", "Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin, sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel tortor."]
The second problem is similar to word-wrapping. If it were exactly a word-wrapping problem, it could be solved easily with an existing implementation such as ActionView::Helpers::TextHelper#word_wrap.
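To illustrate what plain word-wrapping would give you, here is a minimal plain-Ruby sketch. It is not the Rails helper itself, just a regex approximation of the same idea; the name wrap_words is made up for this example, and it assumes no single word is longer than the width.

```ruby
# Hypothetical helper: break a string into lines of at most `width`
# characters, splitting only at whitespace.
# Assumption: no single word is longer than `width`.
def wrap_words(text, width)
  # Take up to `width` characters that are followed by whitespace or the
  # end of the string, and end the line there.
  text.gsub(/(.{1,#{width}})(\s+|$)/, "\\1\n").split("\n")
end

sentence = "Fusce urna libero blandit eu aliquet ac rutrum vel tortor."
lines = wrap_words(sentence, 50)
# => ["Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]
```

Note that this only looks at spaces, so it has no notion of preferring a break after a comma.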
However, there is an extra requirement concerning punctuation. I don't know of any existing implementation that does exactly what you want, so you may have to come up with your own solution.
My only idea is to first word-wrap each sentence, then split each line at punctuation marks, and finally join the pieces back together while respecting the length limit. I am not sure whether this would work, though.
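As a rough sketch of how the pieces could fit together, here is a plain-Ruby variation of that idea. It differs from the order described above: it splits a long sentence at punctuation first and word-wraps only the pieces that are still too long. All method names are made up for illustration, the sentence splitter is a naive regex standing in for a real NLP library (it will mis-split abbreviations such as "e.g."), and the limit is treated as inclusive, as in the question's example.

```ruby
# Greedily join pieces with single spaces so that no joined line exceeds
# `limit`. A piece that is already longer than `limit` ends up alone.
def pack(pieces, limit)
  pieces.each_with_object([]) do |piece, lines|
    if !lines.empty? && lines.last.length + 1 + piece.length <= limit
      lines[-1] = "#{lines.last} #{piece}"
    else
      lines << piece
    end
  end
end

# Split one over-long sentence: prefer breaks after , ; : and fall back
# to breaking at spaces for clauses that are still too long.
# Assumption: no single word is longer than `limit`.
def split_long_sentence(sentence, limit)
  clauses = sentence.split(/(?<=[,;:])\s+/)
  pack(clauses, limit).flat_map do |line|
    line.length <= limit ? [line] : pack(line.split(/\s+/), limit)
  end
end

def chunk_paragraph(text, limit)
  # Naive sentence splitting: break after ., ! or ? followed by whitespace.
  sentences = text.strip.split(/(?<=[.!?])\s+/)
  # Group consecutive sentences by whether they fit within the limit.
  sentences.chunk { |s| s.length > limit }.flat_map do |too_long, run|
    if too_long
      run.flat_map { |s| split_long_sentence(s, limit) }
    else
      pack(run, limit)
    end
  end
end

text = "Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."
chunks = chunk_paragraph(text, 50)
```

For the sample paragraph and a limit of 50, this yields the same seven strings as the expected array in the question. Pieces of an over-long sentence are never merged with neighbouring sentences, which is what the expected output suggests; if you want them merged, run pack once more over the final array.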