I'm trying to get information from a table on the internet, as shown below. I'm using a Jupyter notebook with Python 2.7. I want to load this information into Python's pandas module as a DataFrame. But when I copy the table together with the table headings and then use the read_clipboard command, I see the error shown below the table link. Without the table headings there is no problem. How can I get the data from the internet with the table headings?
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn
df1 = pd.read_clipboard()
df1
The table that I want to import as a DataFrame:
CParserError Traceback (most recent call last)
<ipython-input-4-151d7223d8dc> in <module>()
----> 1 df1 = pd.read_clipboard()
2 df1
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\clipboard.pyc in read_clipboard(**kwargs)
49 kwargs['sep'] = '\s+'
50
---> 51 return read_table(StringIO(text), **kwargs)
52
53
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
283 return parser
284
--> 285 return parser.read()
286
287 _parser_defaults = {
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
745 raise ValueError('skip_footer not supported for iteration')
746
--> 747 ret = self._engine.read(nrows)
748
749 if self.options.get('as_recarray'):
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
1195 def read(self, nrows=None):
1196 try:
-> 1197 data = self._reader.read(nrows)
1198 except StopIteration:
1199 if self._first_chunk:
pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7988)()
pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8244)()
pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8970)()
pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)()
pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 2
There is a CSV you can use on the page with all the data, which read_csv can parse easily. If you want certain time periods, you just have to change the params in the URL, i.e.
s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=1980
If we change 1980 to 2015, we get only the data starting from 2015.
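A sketch of that approach. The ichart.finance.yahoo.com endpoint shown is the one those params belong to, but it has since been retired, so the download is shown for illustration and the parsing step runs on an inline stand-in CSV with made-up numbers:

```python
import pandas as pd
try:
    from io import StringIO       # Python 3
except ImportError:
    from StringIO import StringIO  # Python 2

# The CSV download behind the page; c=2015 instead of c=1980 narrows the
# start year. (Endpoint retired since; URL shown for illustration only.)
url = ("http://ichart.finance.yahoo.com/table.csv"
       "?s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=2015")

# Offline stand-in for the downloaded CSV (illustrative values) so the
# parsing step is runnable; with a live endpoint it would be read_csv(url).
sample = u"""Date,Open,High,Low,Close,Volume,Adj Close
2016-02-16,95.02,96.85,94.61,96.64,49057900,96.64
2016-02-12,94.19,94.50,93.01,93.99,40351400,93.99
"""

df = pd.read_csv(StringIO(sample))
print(df.shape)  # header row parsed as column names, including "Adj Close"
```

Note that because the download is a comma-separated file, the "Adj Close" heading parses as a single column, which is what went wrong with the whitespace-split clipboard text.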
Consider using an HTML web scraper like Python's lxml module: parse the page with lxml.html and use an XPath expression to scrape the HTML table data, then migrate it to a pandas DataFrame. While there are automated options like pandas.read_html(), this approach gives more control over nuances in the HTML content, like the Feb 4 column span. Below, an XPath expression selects <td> cells by position in the table using bracket predicates, []:
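A minimal sketch of that approach, using an inline HTML snippet with hypothetical values in place of the fetched page (in practice the page source would come from e.g. urllib2 or requests):

```python
import pandas as pd
from lxml import html

# Hypothetical table standing in for the fetched page source.
page = u"""
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>Feb 16, 2016</td><td>96.64</td></tr>
  <tr><td>Feb 12, 2016</td><td>93.99</td></tr>
</table>
"""

tree = html.fromstring(page)

# Bracketed positions pick the first and second <td> of each data row,
# which is how cells that span columns can be handled explicitly.
dates = tree.xpath('//table/tr/td[1]/text()')
closes = tree.xpath('//table/tr/td[2]/text()')

df = pd.DataFrame({'Date': dates,
                   'Close': [float(c) for c in closes]},
                  columns=['Date', 'Close'])
print(df)
```

The XPath predicates (td[1], td[2]) are 1-indexed positions within each row, so a spanned or extra cell can be skipped simply by choosing which positions to select.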