I have a section of a book, complete with punctuation, line breaks etc., and I want to be able to extract the first n words from the text and divide them into 5 parts. Regex mystifies me. This is what I am trying. It creates an array with only index 0, holding all the input text:
public static String getNumberWords2(String s, int nWords){
String[] m = s.split("([a-zA-Z_0-9]+\b.*?)", (nWords / 5));
return "Part One: \n" + m[1] + "\n\n" +
"Part Two: \n" + m[2] + "\n\n" +
"Part Three: \n" + m[3] + "\n\n" +
"Part Four: \n" + m[4] + "\n\n" +
"Part Five: \n" + m[5];
}
Thanks!
I have a really really ugly solution:
I'm just going to guess what you need here; hopefully this is close:
This produces:
Another possible interpretation
This uses java.util.Scanner:
This prints the first 23 words of text:
Or if 7:
Or if 3:
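For reference, a minimal sketch of the java.util.Scanner idea, not the answer's exact code. The class and method names (ScannerWords, firstWords) are made up for illustration, and it assumes a "word" is any whitespace-delimited token, so punctuation stays attached to the word it touches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerWords {
    /** Returns up to n whitespace-delimited tokens from the start of text. */
    public static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        // Scanner's default delimiter is whitespace, which includes line breaks.
        try (Scanner sc = new Scanner(text)) {
            while (words.size() < n && sc.hasNext()) {
                words.add(sc.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(firstWords("It was the best of times,\nit was the worst of times.", 7));
    }
}
```

If the text has fewer than n words, the list is simply shorter.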
I think the simplest, and most efficient way, is to simply repeatedly find a "word":
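A minimal sketch of that loop, assuming a "word" is a run of regex word characters (\w+). The names FirstWords and firstWords are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstWords {
    // \w+ is the regex notion of a word: letters, digits and underscore.
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Returns at most n words from the start of text, in order. */
    public static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        // Stop as soon as n words are collected or the text runs out.
        while (words.size() < n && m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [It, was, the, best, of]
        System.out.println(firstWords("It was the best of times, it was the worst of times.", 5));
    }
}
```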
You can vary the definition of "word" by modifying the regex. What I wrote just uses regex's notion of word characters, and I wonder if it might be more appropriate than what you're trying to do. But it won't for instance include quote characters, which you may need to allow within a word.
(See below the break for the next go at this. Leaving this top part here because of thought process...)
Based on my reading of the split() javadoc, I think I know what's going on. You want to split the string based on whitespace, up to n times.
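A sketch of that split, under the assumption that words are separated by runs of whitespace and that the input is not blank. The names SplitWords and firstWords are hypothetical. Note that the limit argument of split() caps the number of array elements, so a limit of n + 1 is needed to get n words plus the leftover text.

```java
import java.util.Arrays;

public class SplitWords {
    /** Returns up to n whitespace-separated words from the start of s. */
    public static String[] firstWords(String s, int n) {
        // trim() avoids an empty first element when s starts with whitespace.
        // A limit of n + 1 yields at most n words plus one trailing element
        // holding the unsplit remainder of the text.
        String[] parts = s.trim().split("\\s+", n + 1);
        // Drop the remainder element (if any) and keep only the words.
        return Arrays.copyOf(parts, Math.min(n, parts.length));
    }

    public static void main(String[] args) {
        // prints [the, quick, brown]
        System.out.println(Arrays.toString(firstWords("  the quick  brown\nfox jumps", 3)));
    }
}
```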
Then stitch them back together with token whitespace if you must:
Finally, chop that into five equal strings:
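One way the stitch-and-chop steps might look, as a rough sketch. The class name ChopFive is made up, and the sketch assumes the pieces are cut by character count, so a cut can land in the middle of a word; the last piece takes whatever integer division leaves over.

```java
import java.util.Arrays;

public class ChopFive {
    /** Joins words with single spaces, then cuts the result into 5 pieces by character count. */
    public static String[] chop(String[] words) {
        String joined = String.join(" ", words);
        int size = joined.length() / 5; // integer division drops the remainder here
        String[] pieces = new String[5];
        for (int i = 0; i < 4; i++) {
            pieces[i] = joined.substring(i * size, (i + 1) * size);
        }
        // The last piece takes everything that is left, including the remainder.
        pieces[4] = joined.substring(4 * size);
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chop(new String[] {"ab", "cd", "ef", "gh"})));
    }
}
```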
It's late at night for me, so you might want to check that one yourself for correctness. I think I got it somewhere in the area code of correct.
OK, here's try number 3. Having run it through a debugger, I can verify that the only problem left is the integer math of slicing strings that aren't factors of 5 into five pieces, and how best to deal with the remaining characters.
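One possible way to deal with the leftover characters, offered as a sketch rather than the code from this attempt: give one extra character to each of the first length % parts pieces, so no two pieces differ in length by more than one. EvenSlices and slice are illustrative names.

```java
import java.util.Arrays;

public class EvenSlices {
    /** Cuts s into `parts` consecutive pieces whose lengths differ by at most one. */
    public static String[] slice(String s, int parts) {
        int base = s.length() / parts;  // minimum length of every piece
        int extra = s.length() % parts; // this many pieces get one more character
        String[] out = new String[parts];
        int start = 0;
        for (int i = 0; i < parts; i++) {
            int end = start + base + (i < extra ? 1 : 0);
            out[i] = s.substring(start, end);
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // 13 characters into 5 pieces: prints [abc, def, ghi, jk, lm]
        System.out.println(Arrays.toString(slice("abcdefghijklm", 5)));
    }
}
```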
It ain't pretty, but it works.
Important notes:
There is a better alternative made just for this: java.text.BreakIterator. That would be the most correct way to parse for words in Java.
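A short sketch of how BreakIterator could be used for this. The class and method names are hypothetical, and the filter is an assumption: the word instance reports boundaries around spaces and punctuation as well, so the sketch keeps only tokens that start with a letter or digit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorWords {
    /** Returns up to n words from the start of text, using locale-aware word boundaries. */
    public static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE && words.size() < n) {
            String token = text.substring(start, end);
            // Skip the whitespace and punctuation tokens between words.
            if (Character.isLetterOrDigit(token.charAt(0))) {
                words.add(token);
            }
            start = end;
            end = it.next();
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(firstWords("Hello, world! Foo bar.", 3));
    }
}
```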