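How can I use pandas to parse CSV that has already been loaded from elsewhere?

I am downloading a web page that contains some TSV-formatted data. The TSV data is surrounded by HTML that I don't want.

I download the page's HTML and use BeautifulSoup to strip out the data I want. At that point, however, the TSV data is already in memory.

How do I get this in-memory TSV data into pandas? Every method I can find seems to want to read from a file or a URI, not from data I have already scraped. I don't want to download the text, write it to a file, and then read it back in.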

#!/usr/bin/env python2 

import pandas as p
from BeautifulSoup import BeautifulSoup 
import urllib2 

def main(): 
    url = "URL" 
    html = urllib2.urlopen(url) 
    soup = BeautifulSoup(html) 
    # pre is the tag that the data is within 
    tab_sepd_vals = soup.pre.string 

    # LOAD_CSV is a placeholder: which pandas call should go here?
    data = p.LOAD_CSV(tab_sepd_vals)
    process(data)


2013-10-24 Squidly

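Can you read it directly with `pandas.read_html`? http://pandas.pydata.org/pandas-docs/dev/io.html#html – joris

No, because pandas.read_html depends on bs4, and I'm using python2 – Squidly

If you put the text/string version of your data into a StringIO.StringIO (or io.StringIO in Python 3.X), you can pass that object to the pandas parser. So your code becomes: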

#!/usr/bin/env python2 

import pandas as p 
from BeautifulSoup import BeautifulSoup 
import urllib2 
import StringIO 

def main(): 
    url = "URL" 
    html = urllib2.urlopen(url) 
    soup = BeautifulSoup(html) 
    # pre is the tag that the data is within 
    tab_sepd_vals = soup.pre.string 

    # make the StringIO object 
    tsv = StringIO.StringIO(tab_sepd_vals) 

    # something like this 
    data = p.read_csv(tsv, sep='\t') 

    # then what you had 
    process(data)
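
For Python 3, the same idea can be sketched without the scraping step. This is only a minimal illustration: the string `tab_sepd_vals` below is made-up sample data standing in for the text pulled out of the `pre` tag, and its column names are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for the text extracted from the <pre> tag.
tab_sepd_vals = "name\tcount\nalpha\t1\nbeta\t2\n"

# Wrap the string in a file-like object so read_csv can consume it.
tsv = io.StringIO(tab_sepd_vals)

# sep='\t' tells the parser the fields are tab-separated.
data = pd.read_csv(tsv, sep="\t")
print(data.shape)
```

Nothing is written to disk here; the parser reads straight from the in-memory buffer.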


2013-10-24 15:49:59

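I hadn't come across StringIO before, so for anyone as curious as I was: http://docs.python.org/2/library/stringio.html - it lets a string be used where a file is expected. – Squidly

Methods like read_csv do two things: they parse the CSV and they construct a DataFrame object. So in your case, you might want to construct the DataFrame directly: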
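
As a rough sketch of what "used where a file is expected" means, here is the Python 3 equivalent, `io.StringIO`, responding to the usual file methods. The sample text is invented for illustration.

```python
import io

# A made-up two-line TSV string wrapped in an in-memory text buffer.
buf = io.StringIO("a\tb\n1\t2\n")

# The buffer supports the same reading methods as an open text file.
first_line = buf.readline()
rest = buf.read()

print(repr(first_line))
print(repr(rest))
```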

>>> import pandas as pd 
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]]) 
>>> print(df) 
   0  1
0  a  1
1  b  2
2  c  3

The constructor accepts a variety of data structures.
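
To apply this to tab-separated text, the parsing has to be done separately before the constructor is called. One possible sketch, assuming the standard-library `csv` module does the splitting and using invented sample data with a header row:

```python
import csv

import pandas as pd

# Hypothetical stand-in for the scraped tab-separated text.
tab_sepd_vals = "name\tcount\nalpha\t1\nbeta\t2\n"

# csv.reader accepts any iterable of lines, so no file object is needed.
reader = csv.reader(tab_sepd_vals.splitlines(), delimiter="\t")
header = next(reader)   # first row holds the column names
rows = list(reader)     # remaining rows are lists of strings

df = pd.DataFrame(rows, columns=header)
print(df.dtypes)
```

Note that every value stays a string with this approach, whereas read_csv infers column types while parsing.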


2013-10-24 15:17:21 miku

But I also want a CSV parser, not just the IO component. – Squidly
