Why is this Python method leaking memory?

This method iterates over a list of terms from the database, checks whether each term appears in the text passed as an argument, and if so, replaces it with a link to the search page with the term as the parameter.

The number of terms is high (about 100,000), so the process is pretty slow, but that's fine since it runs as a cron job. However, it makes the script's memory consumption skyrocket and I can't find out why:
import re

from django.db import models


class SearchedTerm(models.Model):
[...]
@classmethod
def add_search_links_to_text(cls, string, count=3, queryset=None):
    """
    Take a list of all searched terms and look for them in the
    text. If they exist, turn them into links to the search
    page.

    This process is limited to `count` replacements maximum.

    WARNING: because the sites have different URL schemas, we don't
    provide direct links; we inject the {% url %} tag instead,
    so it must be rendered before display. You can use the `eval`
    tag from `libs` for this. Since the sites have different
    namespaces as well, we insert a generic 'namespace' and delegate
    to the template to replace it with the proper one.

    If you have a batch process to run, you can pass a queryset
    that will be used instead of fetching all searched terms on
    each call.
    """
    found = 0
    terms = queryset or cls.on_site.all()
    # To avoid replacing duplicate searched terms twice, keep a set of
    # already linkified content. It is seeded with the words we are going
    # to insert along with the link, so they won't match on later passes.
    processed = set((u'video', u'streaming', u'title',
                     u'search', u'namespace', u'href',
                     u'url'))
    for term in terms:
        text = term.text.lower()
        # Skip small words, and do a quick containment check to avoid
        # the cost of the full regex matching below.
        if len(text) < 3 or text not in string:
            continue
        if found and cls._is_processed(text, processed):
            continue
        # Match the searched word with accents, in any case.
        # Ensure it is not part of a larger word by requiring a
        # 'non-letter' character (or the string boundary) on both ends.
        # The term is escaped so regex metacharacters in it are literal.
        pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % re.escape(text),
                             re.UNICODE | re.IGNORECASE)
        if re.search(pattern, string):
            found += 1
            # Create the link string and replace the word in the
            # description. Back references (\1, \2, etc.) preserve the
            # original formatting. Raw unicode strings (ur"..." notation)
            # avoid problems with accents and escaping.
            query = '-'.join(term.text.split())
            url = ur'{%% url namespace:static-search "%s" %%}' % query
            replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url
            string = re.sub(pattern, replace_with, string)
            processed.add(text)
            # honor the `count` limit promised in the docstring
            if found >= count:
                break
    return string
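As an aside, and purely as an assumption on my part rather than a confirmed diagnosis: `queryset or cls.on_site.all()` evaluates the queryset for its truth value, and iterating a Django QuerySet caches every one of the ~100,000 result objects on the QuerySet itself. If a reference to that queryset outlives the call, the cache survives with it. A minimal sketch of the streaming alternative, using Django's `QuerySet.iterator()`:

# Sketch only: .iterator() streams rows from the database without
# populating the QuerySet's internal result cache, so each
# SearchedTerm instance can be freed as soon as the loop moves on.
terms = cls.on_site.all() if queryset is None else queryset
for term in terms.iterator():
    # ... same matching and substitution body as above ...
    pass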
You will probably want to see this code as well:
class SearchedTerm(models.Model):
[...]
@classmethod
def _is_processed(cls, text, processed):
    """
    Check if the text is part of an already processed string.

    We don't just use `in` on the set; we also check `in` against
    each string of the set, so that a term which is a substring of an
    already processed term is skipped (replacing it would destroy the
    inserted tags).

    This is mainly a utility function, so you probably won't use
    it directly.
    """
    if text in processed:
        return True
    return any(text in string for string in processed)
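For illustration only (these values are made up, not from the original data), the semantics of that check are:

# Hypothetical demonstration of the substring check above:
processed = set([u'streaming', u'video'])
print(u'video' in processed)                 # True: exact member of the set
print(any(u'ream' in s for s in processed))  # True: substring of u'streaming'
print(any(u'audio' in s for s in processed)) # False: substring of nothing here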
I really only have references to two objects that could be the suspects here: `terms` and `processed`. But I can't see any reason for them not to be garbage collected.
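A quick way to test the garbage-collection theory directly (my own diagnostic sketch, not part of the original code) is to count live objects across calls with the standard `gc` module; if the count keeps climbing after a forced collection, something really is holding references:

import gc

# `some_text` stands in for a real description; it is hypothetical.
some_text = u'watch this video about streaming'

gc.collect()
before = len(gc.get_objects())
SearchedTerm.add_search_links_to_text(some_text)
gc.collect()
after = len(gc.get_objects())
print(after - before)  # should stay near 0 if everything is collectable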
EDIT:
I guess I should mention that this method is called from inside a Django model method itself. I don't know if it's relevant, but here is the code:
class Video(models.Model):
[...]
def update_html_description(self, links=3, queryset=None):
    """
    Take a list of all searched terms and look for them in the
    description. If they exist, turn them into links to the search
    engine. Put the result into `html_description`.

    This uses `add_search_links_to_text` and therefore has the same
    limitations.

    It DOESN'T call save().
    """
    queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
    text = self.description or self.title
    self.html_description = SearchedTerm.add_search_links_to_text(text,
                                                                  links,
                                                                  queryset)
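One well-known cause of exactly this symptom in long-running Django scripts, offered here as a hypothesis rather than a confirmed diagnosis: when `settings.DEBUG` is `True`, Django appends every executed SQL query to `django.db.connection.queries`, and that list grows for the lifetime of the process. A cron job issuing queries for every video would then accumulate memory on every call. A minimal sketch of the check and the standard mitigation:

from django import db
from django.conf import settings

# With DEBUG on, Django keeps every executed query in memory.
print(settings.DEBUG, len(db.connection.queries))

# Standard mitigations for batch jobs: run with DEBUG = False, or
# flush the in-memory query log periodically:
db.reset_queries()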
I can imagine that the automatic Python regex caching eats up some memory. But it should do so only once, whereas the memory consumption goes up on every call of update_html_description.
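For what it's worth, the regex-cache theory is easy to rule out: the `re` module's internal cache of compiled patterns is bounded (a fixed maximum of 100 entries in Python 2) and can be emptied explicitly. A one-line sketch of the experiment:

import re

# If memory still climbs after clearing the pattern cache on every
# call, the cache is not the culprit.
re.purge()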
The problem is not just that it consumes a lot of memory; the problem is that it does not release it: every call takes about 3% of the RAM, eventually filling it up and crashing the script with "cannot allocate memory".
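To put numbers on the growth, a measurement sketch of my own (Linux assumed, where `ru_maxrss` is reported in kilobytes):

import resource

def rss_kb():
    # Peak resident set size of the current process, in KB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for video in Video.objects.all().iterator():
    video.update_html_description()
    print(rss_kb())  # a steadily increasing series confirms the growth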
'Leaking memory in a garbage-collected language like Python is nearly impossible. Strictly speaking, a memory leak is memory that no variable references any more. In C++ it can happen if you allocate memory in a class but don't declare a destructor. What you have here is just high memory consumption.' –

':-) OK. Still, I get higher and higher memory consumption after each call. But since it's a method, and since I don't keep any reference to anything after it's done, why would something still be consuming memory?' –

'I updated the question about this.' –