2013-06-21

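I want to search for a string in a web scrape and count how many times it occurs, but only between two markers, x and y, in the scraped text. In other words, I want to limit the region of text that Python searches.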

Can anyone tell me the simplest way to count the occurrences of SEA BASS between MAIN FISHERMAN and SECONDARY FISHERMAN in the sample web scrape below?

<p style="color: #555555; 
    font-family: Arial,Helvetica,sans-serif; 
    font-size: 12px; 
    line-height: 18px;">June 21, 2013 By FISH PPL Admin </small> 

</div> 
<!-- Post Body Copy --> 

<div class="post-bodycopy clearfix"><p>MAIN FISHERMAN &#8211; </p> 
<p><strong>CHAMP</strong> &#8211; Pedro 00777<br /> 
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br /> 
MULTI – LOCATION2 &#8211; 7:30 &#8211; COD (3 LBS 13/8)<br /> 
LURE – LOCATION5 &#8211; 3:20 &#8211; RUDD (2 LBS 6/1)</p> 
<p>JOE BLOGGS <a href="url">url</a><br /> 
BAIT &#8211; LOCATION4 &#8211; 4:45 &#8211; ROACH (5 LBS 3/1)<br /> 
MULTI – LOCATION2 &#8211; 5:50 &#8211; PERCH (3 LBS 6/1)<br /> 
LURE – LOCATION1 &#8211; 3:45 &#8211; PIKE (2 LBS 5/1) </p> 
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br /> 
MULTI – LOCATION1 &#8211; 3:45 &#8211; JUST THE JUDGE (3 LBS 3/1)<br /> 
LURE – LOCATION3 &#8211; 8:25 &#8211; SCHOOL FEES (2 LBS 7/1)</p> 
<div class="post-bodycopy clearfix"><p>SECONDARY FISHERMAN &#8211; </p> 
<p><strong>SPOON &#8211; <a href="url">url</a></strong><br /> 
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br /> 
MULTI – LOCATION2 &#8211; 7:30 &#8211; COD (3 LBS 7/4)<br /> 
LURE – LOCATION1 &#8211; 4:25 &#8211; TROUT (2 LBS 5/1)</p> 

I tried to achieve this with the code below, but to no avail.

html = website.read() 

pattern_to_exclude_unwanted_data = re.compile('MAIN FISHERMAN(.*)SECONDARY FISHERMAN') 

excluding_unwanted_data = re.findall(pattern_to_exclude_unwanted_data, html) 

print excluding_unwanted_data("SEA BASS") 

Answers


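Do it in two steps:

  1. Extract the substring between MAIN FISHERMAN and SECONDARY FISHERMAN.
  2. Count the occurrences of SEA BASS in that substring.

Like this: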

relevant = re.search(r"MAIN FISHERMAN(.*)SECONDARY FISHERMAN", html, re.DOTALL).group(1) 
found = relevant.count("SEA BASS") 
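
For anyone who wants to try the two steps end to end, here is a minimal, self-contained sketch. The `html` string is a shortened stand-in for the scraped page, and `count_between` is a hypothetical helper name, not part of any library. It uses a non-greedy group so the match stops at the first closing marker, and returns 0 when the markers are missing instead of raising on `None`.

```python
import re

# Shortened stand-in for the scraped page source (assumption, not the real page)
html = """<p>MAIN FISHERMAN &#8211; </p>
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br />
MULTI &#8211; LOCATION2 &#8211; 7:30 &#8211; COD (3 LBS 13/8)<br />
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br />
<p>SECONDARY FISHERMAN &#8211; </p>
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br />
"""


def count_between(text, start, end, needle):
    """Count `needle` in the text between `start` and the next `end` marker."""
    # re.DOTALL lets "." match newlines, so the span can cover many lines;
    # "(.*?)" is non-greedy, so it stops at the first `end` after `start`.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return 0  # one or both markers not found
    return match.group(1).count(needle)


print(count_between(html, "MAIN FISHERMAN", "SECONDARY FISHERMAN", "SEA BASS"))  # 2
```

Without `re.DOTALL` the pattern would only match when both markers sit on the same line, which is probably why the original `re.compile` attempt found nothing.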

Oops, yes, I hadn't thought of DOTALL, and the "group" thing was just sloppiness on my part. Thanks! – alexis


Pseudocode (untested):

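count = 0 
enabled = False 
for line in file:  # file: any iterable of lines, e.g. html.splitlines() 
    if 'MAIN FISHERMAN' in line: 
        enabled = True   # opening marker: start counting 
    elif enabled and 'SEA BASS' in line: 
        count += 1       # counts lines containing SEA BASS, not occurrences 
    elif 'SECONDARY FISHERMAN' in line: 
        enabled = False  # closing marker: stop counting 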

If you want to count 'SEA BASS' inside the <div> elements, using 'MAIN FISHERMAN' and 'SECONDARY FISHERMAN' as markers:

import re 
from bs4 import BeautifulSoup # $ pip install beautifulsoup4 

soup = BeautifulSoup(html) 
inbetween = False 
count = 0 
for div in soup.find_all('div', ["post-bodycopy", "clearfix"]): 
    if not inbetween: 
        inbetween = div.find(text=re.compile('MAIN FISHERMAN'))  # check start 
    else:  # inbetween 
        inbetween = not div.find(text=re.compile('SECONDARY FISHERMAN'))  # end 
    if inbetween: 
        count += len(div.find_all(text=re.compile('SEA BASS'))) 

print(count)