This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : )
I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.
Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?
Thank you in advance.
PN
P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd get around this problem.
read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.
To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. In effect, a line of text is not defined by the width of your software window, but can run over many visual rows. In other words, a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs:
Note that you can scroll long text to the left here in Stack Overflow. That seventh line is longer than this column is wide.
As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, readLines() added a backslash in front of each quotation mark. Since R holds the individual lines in quotation marks, it needs to distinguish these from those that are part of the original text. Therefore, it "escapes" the original quotation marks (the backslashes appear only in the printed display, not in the stored text). Read about escaping on Wikipedia.
readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning does nothing but hide the warning message.
If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():
Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
I would strongly advise you to write your text in a .txt file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or printed, and that extra content will be read by R too. You can try and see what you get, but for good results you should either save your file as a .txt file from Word or compose it in a text editor.
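To make the readLines() points above concrete, here is a minimal sketch. The sample text and the names path, book and book_scan are made up for illustration, and a temporary file stands in for your book.

```r
# Hypothetical sample: two "paragraphs", the second containing quotation marks.
# The file deliberately ends without a trailing newline.
path <- tempfile(fileext = ".txt")
cat('Call me Ishmael. Some years ago I went to sea.\nHe said, "Mr. Smith is late."',
    file = path)

# One vector element per line of the file. warn = FALSE only hides the
# "incomplete final line" warning; the result is the same either way.
book <- readLines(path, warn = FALSE)

length(book)        # number of lines read
print(book[2])      # print() displays the quotation marks escaped with backslashes
cat(book[2], "\n")  # cat() displays the text as it is stored

# scan() can read the same file; sep = "\n" makes each line one element.
book_scan <- scan(path, what = character(), sep = "\n", quiet = TRUE)
```

Because the result is stored in book, you can keep working with it instead of only seeing it in the console.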
You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:
Note how entering Return does not cause R to execute the command before I closed the string with "). R just replies with +, telling me that I can continue to edit. I did not type in those plus signs. Try it. Note also that now the newlines are part of your string of text. (I'm on a Mac, so my newline is \n.)
If you input your text manually, I would load the whole text as one string into a vector:
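A minimal sketch of this, assuming some made-up sample text and the hypothetical name book_text:

```r
# Hypothetical sample text typed over several lines. The line breaks
# inside the quotes become "\n" characters within the single string.
book_text <- "Call me Ishmael.
Some years ago I went to sea.
It was a damp November."

length(book_text)  # a character vector with one element
nchar(book_text)   # number of characters, newlines included
cat(book_text)     # displays the text with its line breaks
```

In a script the string simply spans several lines; the + prompts appear only when you type it into the interactive console.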
You could load different chapters into different elements of this vector:
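For instance, with two made-up chapters (the name chapters and the text are placeholders):

```r
# Each chapter is one element of the character vector.
chapters <- c("This is chapter one. It is short.",
              "This is chapter two! Is it longer? Not really.")

length(chapters)  # number of chapters
chapters[2]       # the second chapter
```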
For better reference, you can name the elements:
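A small sketch of naming, again with placeholder text and the hypothetical names ch1 and ch2:

```r
chapters <- c("This is chapter one. It is short.",
              "This is chapter two! Is it longer? Not really.")

# Attach a name to each element, then look elements up by name.
names(chapters) <- c("ch1", "ch2")
chapters["ch2"]

# The names can also be given when the vector is created.
chapters_alt <- c(ch1 = "This is chapter one. It is short.",
                  ch2 = "This is chapter two! Is it longer? Not really.")
```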
Now you can split the elements of any of these vectors:
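A sketch of such a split, using placeholder text. The pattern "[.!?] ?" is an assumption chosen to match the description that follows (one of three punctuation marks, then an optional space); it is not necessarily the exact pattern originally used.

```r
chapters <- c(ch1 = "This is chapter one. It is short.",
              ch2 = "This is chapter two! Is it longer? Not really.")

# Split at ".", "!" or "?" followed by an optional space.
# The result is a list with one character vector per chapter.
# Note that the punctuation itself is removed by the split.
sentences <- strsplit(chapters, "[.!?] ?")
sentences
```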
Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by an optional space (if you don't define a space here, the resulting "sentences" will be preceded by a space).
sentences now contains:
You can access the individual sentences by indexing:
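A sketch of the indexing, assuming placeholder chapters and an assumed pattern "[.!?] ?":

```r
sentences <- strsplit(
  c(ch1 = "This is chapter one. It is short.",
    ch2 = "This is chapter two! Is it longer? Not really."),
  "[.!?] ?"
)

sentences[["ch2"]]     # all sentences of the element named "ch2"
sentences[["ch2"]][2]  # its second sentence: "Is it longer"
sentences[[1]][1]      # first sentence of the first element, by position
```

Double brackets pick one element of the list; the single bracket after it picks a sentence within that element.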
R will be unable to know that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
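For the curious, one possible shape of such an exception is a negative lookbehind, sketched below. The list of abbreviations (Mr, Mrs, Dr) and the sample text are assumptions for illustration. The pattern will still go wrong in other cases, for example when a sentence really does end in "Dr.".

```r
x <- "Mr. Smith arrived late. Dr. Jones was angry! Why?"

# (?<!Mr) means "not preceded by Mr"; lookbehinds need perl = TRUE.
# Only punctuation that does not follow one of the listed
# abbreviations is treated as a sentence boundary.
pattern <- "(?<!Mr)(?<!Mrs)(?<!Dr)[.!?] ?"
parts <- strsplit(x, pattern, perl = TRUE)[[1]]
parts
```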
How you would tell R how to recognize subjects or objects, I have no idea.