你在PF做法將是無可救藥緩慢,即使os.path.walk
沒有采取任何時候都。在所有現存路徑中進行包含3個無界閉包的正則表達式匹配會在那裏殺死你。下面是從Kernighan and Pike,我引用的代碼,這是該任務的正確算法:
/* spname: return correctly spelled filename */
/*
* spname(oldname, newname) char *oldname, *newname;
* returns -1 if no reasonable match to oldname,
* 0 if exact match,
* 1 if corrected.
* stores corrected name in newname.
*/
#include <sys/types.h>
#include <sys/dir.h>
spname(oldname, newname)
char *oldname, *newname;
{
char *p, guess[DIRSIZ+1], best[DIRSIZ+1];
char *new = newname, *old = oldname;
for (;;) {
while (*old == '/') /* skip slashes */
*new++ = *old++;
*new = '\0';
if (*old == '\0') /* exact or corrected */
return strcmp(oldname,newname) != 0;
p = guess; /* copy next component into guess */
for (; *old != '/' && *old != '\0'; old++)
if (p < guess+DIRSIZ)
*p++ = *old;
*p = '\0';
if (mindist(newname, guess, best) >= 3)
return -1; /* hopeless */
for (p = best; *new = *p++;) /* add to end */
new++; /* of newname */
}
}
mindist(dir, guess, best) /* search dir for guess */
char *dir, *guess, *best;
{
/* set best, return distance 0..3 */
int d, nd, fd;
struct {
ino_t ino;
char name[DIRSIZ+1]; /* 1 more than in dir.h */
} nbuf;
nbuf.name[DIRSIZ] = '\0'; /* +1 for terminal '\0' */
if (dir[0] == '\0') /* current directory */
dir = ".";
d = 3; /* minimum distance */
if ((fd=open(dir, 0)) == -1)
return d;
while (read(fd,(char *) &nbuf,sizeof(struct direct)) > 0)
if (nbuf.ino) {
nd = spdist(nbuf.name, guess);
if (nd <= d && nd != 3) {
strcpy(best, nbuf.name);
d = nd;
if (d == 0) /* exact match */
break;
}
}
close(fd);
return d;
}
/* spdist: return distance between two names */
/*
* very rough spelling metric:
* 0 if the strings are identical
* 1 if two chars are transposed
* 2 if one char wrong, added or deleted
* 3 otherwise
*/
#define EQ(s,t) (strcmp(s,t) == 0)
spdist(s, t)
char *s, *t;
{
while (*s++ == *t)
if (*t++ == '\0')
return 0; /* exact match */
if (*--s) {
if (*t) {
if (s[1] && t[1] && *s == t[1]
&& *t == s[1] && EQ(s+2, t+2))
return 1; /* transposition */
if (EQ(s+1, t+1))
return 2; /* 1 char mismatch */
}
if (EQ(s+1, t))
return 2; /* extra character */
}
if (*t && EQ(s, t+1))
return 2; /* missing character */
return 3;
}
注意:此代碼ANSI C,ISO C或POSIX任何事情之前是書面的方式是即使一個想象讀取目錄文件raw。代碼的方法比所有的指針索引要有用得多。
來源
2010-04-23 00:11:07
msw
這似乎是一個非常奇怪的事情來優化。你在做什麼,這是一個性能至關重要的操作?你分析和研究你的算法,並確定這是問題?即使像Twisted這樣的「大」目錄樹也不大。 – 2010-04-22 23:34:20
@Mike:背景:我試圖爲cd,pushd等做一個聰明的替代品,並依靠傳遞路徑(更少,更多,貓,vim)的其他命令的包裝。該程序偶爾會使用文件遍歷來做到這一點(例如:鍵入「cd sr grai pat lxml」會自動轉換爲「cd src/pypm/grail/patches/lxml」)。如果這個'cd'替代品需要運行半秒鐘,我將不會滿意。 http://github.com/srid/pf – 2010-04-22 23:47:45
然後我們回到算法;路徑修正最好採用分段而不是「整棵樹並與之相匹配」,因爲在每個路徑組件上設置的搜索都很小。 Kernighan和Pike在Unix編程環境中有一個經典的交互路徑校正器 - 尋找spdist()http://netlib.bell-labs.com/cm/cs/upe/ – msw 2010-04-22 23:53:13