
I am new to Python generators and want to nest them, i.e. have generator A depend on the output of generator B (B generates file paths, A parses the documents). However, only the first file is read; the nested generator is not triggered correctly.

Below is a minimal sample (it uses the TREC8all data):

import itertools
import spacy
from bs4 import BeautifulSoup
import os

nlp = spacy.load('en')  # not shown in the original snippet; assumed spaCy model


def iter_all_files(p):
    for root, dirs, files in os.walk(p):
        for file in files:
            if not file.startswith('.'):
                print('using: ' + str(os.path.join(root, file)))
                yield os.path.join(root, file)


def gen_items(path):
    path = next(path)
    text_file = open(path, 'r').read()
    soup = BeautifulSoup(text_file, 'html.parser')
    for doc in soup.find_all("doc"):
        strdoc = doc.docno.string.strip()
        text_only = str(doc.find_all("text")[0])
        yield (strdoc, text_only)


file_counter = 0 
g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 
    file_counter += 1 
file_counter 

This outputs only:

using: data/TREC8all/Adhoc/fbis/fb396002 
Out[10]: 
33 

but there certainly are more files to parse, as the following shows:

g = iter_all_files("data/TREC8all/Adhoc")
file_counter = 0
item_counter = 0
for file in g:
    file_counter += 1
    # print(file)
    for item in gen_items(g):
        item_counter += 1

print(item_counter)
file_counter

which returns around 2000 files, like:

using: data/TREC8all/Adhoc/fbis/fb396002 
using: data/TREC8all/Adhoc/fbis/fb396003 
using: data/TREC8all/Adhoc/fbis/fb396004 
using: data/TREC8all/Adhoc/fbis/fb396005 
using: data/TREC8all/Adhoc/fbis/fb396006 
using: data/TREC8all/Adhoc/fbis/fb396007 
using: data/TREC8all/Adhoc/fbis/fb396008 
using: data/TREC8all/Adhoc/fbis/fb396009 
using: data/TREC8all/Adhoc/fbis/fb396010 
using: data/TREC8all/Adhoc/fbis/fb396011 
using: data/TREC8all/Adhoc/fbis/fb396012 
using: data/TREC8all/Adhoc/fbis/fb396013 

So clearly my

g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 

is not consuming the nested generators in the right way.

EDIT

Nesting with an outer for loop seems to work, but it is not nice. Is there a better way to formulate it?

g = iter_all_files("data/TREC8all/Adhoc")
for file in g:
    file_counter += 1
    # print(file)
    # for item in gen_items(g):
    gen1, gen2 = itertools.tee(gen_items(g))
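
For reference, a minimal sketch of what that outer-loop version might look like when completed; this is an assumption, not the original code. The single path is re-wrapped with iter() so that the next() call inside gen_items sees exactly that file, and the tee/pipe part is kept as in the question (nlp is assumed to be a loaded spaCy model):

g = iter_all_files("data/TREC8all/Adhoc")
file_counter = 0
for file_path in g:
    file_counter += 1
    # wrap the single path so gen_items' next() call consumes exactly this file
    gen1, gen2 = itertools.tee(gen_items(iter([file_path])))
    ids = (id_ for (id_, text) in gen1)
    texts = (text for (id_, text) in gen2)
    for id_, doc in zip(ids, nlp.pipe(texts, batch_size=50, n_threads=4)):
        pass  # process each (id_, doc) pair here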

'path = next(path)' - why are you only taking a single item from the iterator? Do you only intend to use the first one? – user2357112

Answers


but only the first file is read

Well, you only told Python to read one file:

def gen_items(path): 
    path = next(path) 
    ... 

If you want to go over all the files, you need a loop:

def gen_items(paths):
    for path in paths:
        ...
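
Applied to the snippet from the question, a minimal sketch of that looped version could look like this (the parsing body is unchanged from the question):

def gen_items(paths):
    # iterate over every path instead of consuming only the first one
    for path in paths:
        text_file = open(path, 'r').read()
        soup = BeautifulSoup(text_file, 'html.parser')
        for doc in soup.find_all("doc"):
            strdoc = doc.docno.string.strip()
            text_only = str(doc.find_all("text")[0])
            yield (strdoc, text_only)

With this change, gen_items(iter_all_files(...)) keeps yielding (id, text) pairs until every file has been parsed, so the tee/zip pipeline from the question consumes all files instead of just the first.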

So there is no more elegant way to nest the generators? –


@GeorgHeiler: In case you didn't notice it the first time, the loop I'm telling you to use goes inside 'gen_items'. If you want one generator to process the items of another generator, it needs a loop. If you want to take a function that handles a single item and apply it to the items produced by 'iter_all_files', you want 'map'. – user2357112
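
A rough sketch of that map-based alternative, using a hypothetical single-file helper parse_file (not part of the original code) together with itertools.chain.from_iterable to flatten the per-file results into one stream:

from itertools import chain

from bs4 import BeautifulSoup

def parse_file(path):
    # hypothetical helper: parse one file and return its (docno, text) pairs
    soup = BeautifulSoup(open(path, 'r').read(), 'html.parser')
    return [(doc.docno.string.strip(), str(doc.find_all("text")[0]))
            for doc in soup.find_all("doc")]

# apply the single-file function to every path, then flatten into one item stream
items = chain.from_iterable(map(parse_file, iter_all_files("data/TREC8all/Adhoc")))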


Reviewing the code, I don't know what "nlp.pipe" does; try it like this:

#docs = nlp.pipe(texts, batch_size=50, n_threads=4) 
for id_, doc in zip(ids, texts): 
    file_counter += 1 
file_counter 

Check "file_counter" and you will see the error.


Good idea. file_counter is still only 33. –