Python regular expressions and pandas

I have some text in HTML that I later want to convert into a pandas DataFrame.

The text looks like this:

<tr> 
    <td -some attributes- >Val1</td> 
    <td -some attributes- >Val2</td> 
    <td -some attributes- >Val3</td> 
</tr> 
<tr> 
    <td -some attributes- >Val4</td> 
    <td -some attributes- >Val5</td> 
    <td -some attributes- >Val6</td> 
</tr>

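I have the regex `<td.*>(.*)</td>`, but it does not capture each value separately; it catches almost all of the text...

After collecting all the matches, I put them into a DataFrame.

So why doesn't this regex capture the values the way it should?

Source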

2017-05-10 TheDaJon

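I'd suggest BeautifulSoup instead of a regex: https://pypi.python.org/pypi/beautifulsoup4 ..... Also, show the actual code you tried to use. – depperm

It may be that you are looking at each line, one at a time, while a value spans multiple lines, and so on, which gives a completely different result. I second the previous comment: use BeautifulSoup to parse the HTML. – JohanL

Give some examples of tags it fails to capture. –

You could try something like this instead of a regex (just a suggestion):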
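To make the comments above concrete, here is a minimal sketch of how the greedy `.*` in the question's pattern behaves on different inputs. The input strings are made-up samples modeled on the question's snippet, and the lazy variant `<td.*?>(.*?)</td>` is shown only for comparison, not as a recommendation over a real HTML parser.

```python
import re

# The pattern from the question: both ".*" parts are greedy.
greedy = re.compile(r"<td.*>(.*)</td>")
# Same pattern, but "." is also allowed to match newlines.
greedy_dotall = re.compile(r"<td.*>(.*)</td>", re.DOTALL)
# Lazy variant, for comparison only.
lazy = re.compile(r"<td.*?>(.*?)</td>")

# Hypothetical input with two cells on a single line.
one_line = '<td class="a">Val1</td><td class="b">Val2</td>'
# Hypothetical input with one cell per line, as in the question.
multi_line = (
    "<tr>\n"
    "    <td -some attributes- >Val1</td>\n"
    "    <td -some attributes- >Val2</td>\n"
    "</tr>"
)

# Greedy "<td.*>" runs to the last usable ">" on the line, swallowing Val1.
same_line_greedy = greedy.findall(one_line)         # ['Val2']
same_line_lazy = lazy.findall(one_line)             # ['Val1', 'Val2']

# Without DOTALL, "." stops at each newline, so one cell per line still works.
per_line_greedy = greedy.findall(multi_line)        # ['Val1', 'Val2']
# With DOTALL, the greedy match spans lines and only the last cell survives.
whole_text_greedy = greedy_dotall.findall(multi_line)  # ['Val2']

print(same_line_greedy, same_line_lazy, per_line_greedy, whole_text_greedy)
```

Whether the original pattern "works" therefore depends on how the cells are laid out and on the flags used, which is why the actual code and failing input matter.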

import pandas as pd
movies_table = pd.read_html("xxx.y.com")  # returns a list of DataFrames, one per <table>
movies = movies_table[0]  # select the correct table from the list of tables

This worked for me; the snippet is offered as a sample to adapt.
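As a rough sketch of the same idea applied to the question's fragment rather than a URL: `pd.read_html` looks for `<table>` elements, so the rows are wrapped in a `<table>` tag here, which is an assumption about the real page. The variable names are illustrative, the `-some attributes-` placeholders are dropped, and `read_html` needs an HTML parser backend such as `lxml` (or `bs4` with `html5lib`) to be installed.

```python
import io

import pandas as pd

# The question's rows, wrapped in a <table> (assumed; read_html needs one).
html = """
<table>
<tr>
    <td>Val1</td>
    <td>Val2</td>
    <td>Val3</td>
</tr>
<tr>
    <td>Val4</td>
    <td>Val5</td>
    <td>Val6</td>
</tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(io.StringIO(html))
df_from_html = tables[0]

print(df_from_html)
```

Wrapping the string in `io.StringIO` avoids passing literal HTML directly, which newer pandas versions warn about.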

Source

2017-05-10 14:34:18

If you (really) want to use a regex, you can do the following:

content = """\
<tr> 
    <td -some attributes- >Val1</td> 
    <td -some attributes- >Val2</td> 
    <td -some attributes- >Val3</td> 
</tr> 
<tr> 
    <td -some attributes- >Val4</td> 
    <td -some attributes- >Val5</td> 
    <td -some attributes- >Val6</td> 
</tr>""" 

import re 

td_regex = re.compile(r"<td[^>]+>" # <td> tag 
         r"((?:(?!</td>).)+)") # <td> content 

print(td_regex.findall(content))

You will get:

['Val1', 'Val2', 'Val3', 'Val4', 'Val5', 'Val6']
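Since the question's goal is a DataFrame, one possible way to keep the row structure is to match each `<tr>` first and then the cells inside it. This is a sketch under assumptions: the patterns `row_regex` and `cell_regex` and the column names are illustrative and not part of the answer above, and it presumes well-formed, non-nested rows.

```python
import re

import pandas as pd

content = """\
<tr>
    <td -some attributes- >Val1</td>
    <td -some attributes- >Val2</td>
    <td -some attributes- >Val3</td>
</tr>
<tr>
    <td -some attributes- >Val4</td>
    <td -some attributes- >Val5</td>
    <td -some attributes- >Val6</td>
</tr>"""

# Lazy matches with DOTALL so a row or a cell may span several lines.
row_regex = re.compile(r"<tr[^>]*>(.*?)</tr>", re.DOTALL)
cell_regex = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)

# One inner list per <tr>, one string per <td>.
rows = [cell_regex.findall(row_html) for row_html in row_regex.findall(content)]

# Column names are hypothetical placeholders.
df = pd.DataFrame(rows, columns=["col1", "col2", "col3"])

print(df)
```

Grouping by row before building the frame avoids having to reshape the flat list of six values afterwards.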

Source

2017-05-10 14:35:24
