
I am currently trying to apply a regular expression to filter certain links out of a list of links. When I use the regex to filter the list, I get "TypeError: expected string or buffer".

I have tried several approaches, but I always end up with this error:

Traceback (most recent call last): 
    File "/Users/User/Documents/pyp/pushbullet_updater/DoDa/test.py", line 20, in <module> 
    print(get_chapter_links(links)) 
    File "/Users/User/Documents/pyp/pushbullet_updater/DoDa/test.py", line 15, in get_chapter_links 
    match = re.findall(r"https://bluesilvertranslations\.wordpress\.com/\d{4}/\d{2}/\d{2}/douluo-dalu-\d{1,3}-\s*/", link) 
    File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/re.py", line 210, in findall 
    return _compile(pattern, flags).findall(string) 
TypeError: expected string or buffer 

What am I doing wrong?

The code is below:

import requests
from bs4 import BeautifulSoup
import re

# Gets chapter links
def get_chapter_links(index_url):
    r = requests.get(index_url)
    soup = BeautifulSoup(r.content, 'lxml')
    links = soup.find_all('a')
    url_list = []
    for url in links:
        url_list.append(url.get('href'))

    for link in url_list:  # Iterates through every line and looks for a match:
        match = re.findall(r"https://bluesilvertranslations\.wordpress\.com/\d{4}/\d{2}/\d{2}/douluo-dalu-\d{1,3}-\s*/", link)
    return match

links = 'https://bluesilvertranslations.wordpress.com/chapter-list/'

print(get_chapter_links(links))

Answer


From the re documentation:

re.findall(pattern, string, flags=0) 
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match. 

New in version 1.5.2. 

Changed in version 2.4: Added the optional flags argument. 
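A minimal sketch of that behavior (the sample string below is made up just for illustration):

import re

text = "douluo-dalu-12 and douluo-dalu-345"

# No groups: the list holds the full matched substrings
print(re.findall(r"douluo-dalu-\d{1,3}", text))      # ['douluo-dalu-12', 'douluo-dalu-345']

# One group: the list holds only the group's text
print(re.findall(r"douluo-dalu-(\d{1,3})", text))    # ['12', '345']

# Two or more groups: each match becomes a tuple
print(re.findall(r"(douluo-dalu)-(\d{1,3})", text))  # [('douluo-dalu', '12'), ('douluo-dalu', '345')]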

Note:

  • The first argument should be a pattern and the second should be a string. In your code, link can be None for any <a> tag that has no href attribute, which is what triggers the TypeError (see the sketch after this note).
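To see why the original code hits the TypeError: BeautifulSoup's tag.get('href') returns None when an <a> tag has no href attribute, and passing None as the string argument to re.findall raises exactly this error. A small sketch (the pattern is shortened here just for illustration):

import re

pattern = r"douluo-dalu-\d{1,3}"

href = None  # what url.get('href') returns when an <a> tag has no href
try:
    re.findall(pattern, href)
except TypeError as exc:
    print(exc)  # "expected string or buffer" (exact wording varies by Python version)

# Guarding against None avoids the error:
print(re.findall(pattern, href) if href else [])  # []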

Modified code:

import requests
from bs4 import BeautifulSoup
import re

# Gets chapter links
def get_chapter_links(index_url):
    r = requests.get(index_url)
    soup = BeautifulSoup(r.content, 'lxml')
    links = soup.find_all('a')
    url_list = []
    for url in links:
        url_list.append(url.get('href'))
    match = []  # Create a list and append the matched links to it
    for link in url_list:  # Iterates through every line and looks for a match:
        if link:  # I added this check because link can be None when an <a> tag has no href
            match += re.findall(r"https://bluesilvertranslations\.wordpress\.com/\d{4}/\d{2}/\d{2}/douluo-dalu-\d{1,3}-.*/", link)  # I changed the regex slightly since yours did not match
    return match

links = 'https://bluesilvertranslations.wordpress.com/chapter-list/'

print(get_chapter_links(links))
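For comparison, here is a slightly more compact variant of the same idea (my own sketch, not part of the original answer): compile the pattern once and filter the hrefs with a list comprehension, which also skips None values. It returns the whole href rather than the substring matched by findall, which comes to the same thing here since the pattern covers the full URL.

import re
import requests
from bs4 import BeautifulSoup

CHAPTER_RE = re.compile(
    r"https://bluesilvertranslations\.wordpress\.com/\d{4}/\d{2}/\d{2}/douluo-dalu-\d{1,3}-.*/"
)

def get_chapter_links(index_url):
    soup = BeautifulSoup(requests.get(index_url).content, 'lxml')
    hrefs = (a.get('href') for a in soup.find_all('a'))
    # Keep only non-None hrefs that match the chapter pattern
    return [href for href in hrefs if href and CHAPTER_RE.match(href)]

print(get_chapter_links('https://bluesilvertranslations.wordpress.com/chapter-list/'))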