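I need to take two sentences at a time and decide whether they are similar, both syntactically and semantically. How can I compute the similarity (syntactic and semantic) between two sentences?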
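INPUT1: Obama signs the law. Obama signs a new law.
INPUT2: The bus stops here. The vehicle stops here.
INPUT3: Fire in New York. New York is burnt down.
INPUT4: Fire in New York. 50 died in the New York fire.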
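I don't want to use an ontology tree as the solution. I wrote code that computes the Levenshtein distance (LD) between the two sentences and then decides whether the second sentence:
- can be ignored (INPUT1 and INPUT2),
- should replace the first sentence (INPUT3), or
- should be stored along with the first sentence (INPUT4).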
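I'm not happy with the code, because LD only works at the syntactic level (what other methods are there?). How can semantics be taken into account (for example, that a bus is a kind of vehicle)?
The code is given here: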
%# Once the difference is computed, a decision is made on whether the new
%# event (string 2) is ignored, replaces the existing event (string 1), or is
%# stored separately. The higher the LD metric, the greater the difference
%# between the two strings. A low difference indicates identical or similar
%# events, whereas a high difference indicates that the new event is a
%# fresh event.
%#.........................................................................
%# Calculating the LD between two strings of events.
%#.........................................................................
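L1=length(str1)+1;
L2=length(str2)+1;
L=zeros(L1,L2); %# DP table: L(i,j) is the LD between the first i-1 chars of str1 and the first j-1 chars of str2.
g=+1; %# cost of a gap (insertion or deletion)
m=+0; %# match is cheaper, we seek to minimize
d=+1; %# not-a-match (substitution) is more costly.
%# boundary conditions: distance from the empty string is the prefix length
L(:,1)=((0:L1-1)*g)';
L(1,:)=(0:L2-1)*g;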
%# Calculating required edits.
for idx=2:L1
    for idy=2:L2
        if(str1(idx-1)==str2(idy-1))
            score=m;
        else
            score=d;
        end
        m1=L(idx-1,idy-1) + score; %# match or substitution
        m2=L(idx-1,idy) + g;       %# deletion from str1
        m3=L(idx,idy-1) + g;       %# insertion into str1
        L(idx,idy)=min(m1,min(m2,m3)); %# keep the cheapest edit path.
    end
end
%# The LD between two strings.
D=L(L1,L2);
%#....................................................................
%# Making decision on what to do with the new event (string 2).
%#...................................................................
if (D<=4) %# Distance is so small that string 2 seems identical to string 1.
    store=str1; %# Hence string 2 is ignored. String 1 remains stored.
elseif (D>=5 && D<=15) %# Distance is too large for the strings to be identical,
    %# but not large enough to make string 2 a separate event.
    store=str2; %# String 2 is somewhat similar to string 1.
    %# So, string 1 is replaced with string 2 and stored.
else
    %# For all other distances, string 2 is stored along with string 1.
    store={str1; str2};
end
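For comparison with the character-level LD above, here is a minimal sketch of one other syntax-level measure, word-level Jaccard overlap, applied to the INPUT2 pair. The variable names (`s1`, `s2`, `w1`, `w2`, `jaccard`) and the crude tokenization (lowercase, strip punctuation, split on whitespace) are assumptions made for illustration, not part of the code above.

```matlab
% Word-level Jaccard similarity: |shared words| / |all distinct words|.
% Illustrative sketch only; tokenization is deliberately simplistic.
s1 = 'The bus stops here.';
s2 = 'The vehicle stops here.';

% Lowercase, remove punctuation, split on whitespace, keep distinct words.
w1 = unique(strsplit(lower(regexprep(s1, '[^\w\s]', ''))));
w2 = unique(strsplit(lower(regexprep(s2, '[^\w\s]', ''))));

% Ratio of shared words to all distinct words, in [0, 1].
jaccard = numel(intersect(w1, w2)) / numel(union(w1, w2));
```

Three of the five distinct words are shared, so `jaccard` comes out as 0.6. Like LD, this is still purely syntactic: "bus" and "vehicle" count as unrelated words, so it does not address the semantic part of the problem.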
Any help is appreciated.
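"Semantically". There is no simple textbook algorithm for that. Natural language (English in particular) is a very complex and capricious beast. – 2010-09-07 22:16:49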
@Amro: Does the '#' make them gray because that is how comments look here on SO? – Lazer 2010-09-14 08:41:33
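@Lazer: Yes, it's easier on the eyes.. I wish StackOverflow had a feature for marking code blocks with their language, like: '
...
', so that they are highlighted correctly for that specific language – Amro 2010-09-14 15:54:46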