
I am new to Python generators and want to nest them, i.e. have generator A depend on the output of generator B (B generates file paths, A parses the documents). However, only the first file is read; the nested generator is not triggered correctly.

Below is a minimal sample (it uses the TREC8all data):

import itertools
import spacy
from bs4 import BeautifulSoup
import os

nlp = spacy.load('en')  # not shown in the original snippet; assumed spaCy model


def iter_all_files(p):
    for root, dirs, files in os.walk(p):
        for file in files:
            if not file.startswith('.'):
                print('using: ' + str(os.path.join(root, file)))
                yield os.path.join(root, file)


def gen_items(path):
    path = next(path)
    text_file = open(path, 'r').read()
    soup = BeautifulSoup(text_file, 'html.parser')
    for doc in soup.find_all("doc"):
        strdoc = doc.docno.string.strip()
        text_only = str(doc.find_all("text")[0])
        yield (strdoc, text_only)


file_counter = 0 
g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 
    file_counter += 1 
file_counter 

This outputs only:

using: data/TREC8all/Adhoc/fbis/fb396002 
Out[10]: 
33 

but there certainly are more files to parse, as the following shows:

g = iter_all_files("data/TREC8all/Adhoc")
file_counter = 0
item_counter = 0
for file in g:
    file_counter += 1
    # print(file)
    for item in gen_items(g):
        item_counter += 1

print(item_counter)
file_counter

which returns around 2000 files, like:

using: data/TREC8all/Adhoc/fbis/fb396002 
using: data/TREC8all/Adhoc/fbis/fb396003 
using: data/TREC8all/Adhoc/fbis/fb396004 
using: data/TREC8all/Adhoc/fbis/fb396005 
using: data/TREC8all/Adhoc/fbis/fb396006 
using: data/TREC8all/Adhoc/fbis/fb396007 
using: data/TREC8all/Adhoc/fbis/fb396008 
using: data/TREC8all/Adhoc/fbis/fb396009 
using: data/TREC8all/Adhoc/fbis/fb396010 
using: data/TREC8all/Adhoc/fbis/fb396011 
using: data/TREC8all/Adhoc/fbis/fb396012 
using: data/TREC8all/Adhoc/fbis/fb396013 

So clearly my

g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 

is not consuming the nested generators in the right way.

EDIT

Nesting with an outer for loop seems to work, but it is not nice. Is there a better way to formulate it?

g = iter_all_files("data/TREC8all/Adhoc")
for file in g:
    file_counter += 1
    # print(file)
    # for item in gen_items(g):
    gen1, gen2 = itertools.tee(gen_items(g))
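
For reference, a minimal sketch of what that outer-loop version might look like when completed; this is an assumption, not the original code. The single path is re-wrapped with iter() so that the next() call inside gen_items sees exactly that file, and the tee/pipe part is kept as in the question (nlp is assumed to be a loaded spaCy model):

g = iter_all_files("data/TREC8all/Adhoc")
file_counter = 0
for file_path in g:
    file_counter += 1
    # wrap the single path so gen_items' next() call consumes exactly this file
    gen1, gen2 = itertools.tee(gen_items(iter([file_path])))
    ids = (id_ for (id_, text) in gen1)
    texts = (text for (id_, text) in gen2)
    for id_, doc in zip(ids, nlp.pipe(texts, batch_size=50, n_threads=4)):
        pass  # process each (id_, doc) pair here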

'path = next(path)' - why are you only taking a single item from the iterator? Do you only intend to use the first one? – user2357112

Answers


but only the first file is read

Well, you only told Python to read one file:

def gen_items(path): 
    path = next(path) 
    ... 

If you want to go over all the files, you need a loop:

def gen_items(paths):
    for path in paths:
        ...
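
Applied to the snippet from the question, a minimal sketch of that looped version could look like this (the parsing body is unchanged from the question):

def gen_items(paths):
    # iterate over every path instead of consuming only the first one
    for path in paths:
        text_file = open(path, 'r').read()
        soup = BeautifulSoup(text_file, 'html.parser')
        for doc in soup.find_all("doc"):
            strdoc = doc.docno.string.strip()
            text_only = str(doc.find_all("text")[0])
            yield (strdoc, text_only)

With this change, gen_items(iter_all_files(...)) keeps yielding (id, text) pairs until every file has been parsed, so the tee/zip pipeline from the question consumes all files instead of just the first.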

So there is no more elegant way to nest the generators? –


@GeorgHeiler: In case you didn't notice it the first time, the loop I'm telling you to use goes inside 'gen_items'. If you want one generator to process the items of another generator, it needs a loop. If you want to take a function that handles a single item and apply it to the items produced by 'iter_all_files', you want 'map'. – user2357112
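
A rough sketch of that map-based alternative, using a hypothetical single-file helper parse_file (not part of the original code) together with itertools.chain.from_iterable to flatten the per-file results into one stream:

from itertools import chain

from bs4 import BeautifulSoup

def parse_file(path):
    # hypothetical helper: parse one file and return its (docno, text) pairs
    soup = BeautifulSoup(open(path, 'r').read(), 'html.parser')
    return [(doc.docno.string.strip(), str(doc.find_all("text")[0]))
            for doc in soup.find_all("doc")]

# apply the single-file function to every path, then flatten into one item stream
items = chain.from_iterable(map(parse_file, iter_all_files("data/TREC8all/Adhoc")))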


Reviewing the code, I don't know what "nlp.pipe" does; try it like this:

#docs = nlp.pipe(texts, batch_size=50, n_threads=4) 
for id_, doc in zip(ids, texts): 
    file_counter += 1 
file_counter 

Check "file_counter" and you will see the error.


Good idea. file_counter is still only 33. –