2016-08-24

Building a Zipf distribution with matplotlib, fitted line

I have a list of paragraphs, and I want to plot a Zipf distribution over their combined text.

My code is below:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

# targeted_paragraphs is a list of paragraph strings defined elsewhere
combined = " ".join(targeted_paragraphs)
# Count tokens over the whole text (looping over the joined string
# would iterate character by character)
frequency = Counter(combined.split())
counts = np.array(list(frequency.values()))
tokens = list(frequency.keys())

ranks = np.arange(1, len(counts) + 1)
indices = np.argsort(-counts)  # positions sorted by descending count
frequencies = counts[indices]
plt.loglog(ranks, frequencies, marker=".")
plt.title("Zipf plot for Combined Article Paragraphs")
plt.xlabel("Frequency Rank of Token")
plt.ylabel("Absolute Frequency of Token")
plt.grid(True)
# Label a log-spaced selection of points with their tokens
for n in np.logspace(-0.5, np.log10(len(counts) - 1), 20).astype(int):
    plt.text(ranks[n], frequencies[n], " " + tokens[indices[n]],
             verticalalignment="bottom",
             horizontalalignment="left")
plt.show()

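Purpose: I am trying to draw a "fitted line" on this graph and assign its values to a variable, but I do not know how to add it. Any help with both of these issues would be much appreciated.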

Answer


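I know this question was asked a while ago. However, I came across a possible solution to this problem on the scipy site.
I thought I would post it here in case anyone else needs it.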

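I didn't have the paragraph data, so here is a quickly made-up dict called frequency whose values stand in for the occurrence counts.

We then take its values and convert them to a numpy array. Next we define the Zipf distribution parameter, which has to be > 1.

Finally, we display the histogram of the samples, along with the probability density function.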
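For reference, the requirement that the parameter be greater than 1 comes from the normalization of the Zipf distribution. Writing the parameter as a (as in the code), the probability of rank k is:

```latex
p(k) = \frac{k^{-a}}{\zeta(a)}, \qquad
\zeta(a) = \sum_{n=1}^{\infty} n^{-a}, \qquad k = 1, 2, 3, \dots
```

The sum defining \zeta(a) converges only for a > 1. Note that scipy's `special.zetac(a)` returns \zeta(a) - 1, not \zeta(a).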

Working code:

import random
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Generate a sample dict with random values to simulate paragraph data
frequency = {}
for i in range(50):
    frequency[i] = random.randint(1, 50)

counts = list(frequency.values())
tokens = list(frequency.keys())


# Convert the counts to a numpy array
s = np.array(counts)

# Define the zipf distribution parameter. Has to be > 1
a = 2.

# Display the histogram of the samples,
# along with the probability density function
# (density=True replaces the removed normed=True argument)
count, bins, ignored = plt.hist(s, 50, density=True)
plt.title("Zipf plot for Combined Article Paragraphs")
x = np.arange(1., 50.)
plt.xlabel("Frequency Rank of Token")
# Normalize by zeta(a); zetac(a) would give zeta(a) - 1
y = x**(-a) / special.zeta(a, 1)
plt.ylabel("Absolute Frequency of Token")
# Rescale the curve so its peak is 1 before plotting
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
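
Note that the code above overlays a Zipf curve with a fixed a = 2 rather than fitting anything to the data. If an actual fitted line is wanted, one possible approach (an assumption on my part, not something from the scipy page) is an ordinary least-squares fit in log-log space with np.polyfit. The function names rank_frequencies, fit_zipf_line and plot_with_fit below are made up for this sketch:

```python
from collections import Counter

import numpy as np


def rank_frequencies(text):
    """Return token counts sorted from most to least frequent."""
    counts = np.array(list(Counter(text.split()).values()), dtype=float)
    return np.sort(counts)[::-1]


def fit_zipf_line(frequencies):
    """Fit log10(frequency) = slope * log10(rank) + intercept.

    Returns the slope, the intercept, and the fitted frequency
    at every rank (the "fitted line" as an array).
    """
    ranks = np.arange(1, len(frequencies) + 1)
    log_ranks = np.log10(ranks)
    # polyfit with degree 1 returns [slope, intercept]
    slope, intercept = np.polyfit(log_ranks, np.log10(frequencies), 1)
    fitted = 10 ** (intercept + slope * log_ranks)
    return slope, intercept, fitted


def plot_with_fit(frequencies):
    """Draw the rank-frequency points and the fitted line on log-log axes."""
    import matplotlib.pyplot as plt  # imported here so the fit works without it

    ranks = np.arange(1, len(frequencies) + 1)
    slope, intercept, fitted = fit_zipf_line(frequencies)
    plt.loglog(ranks, frequencies, marker=".", linestyle="none", label="observed")
    plt.loglog(ranks, fitted, color="r", label="fit, slope = %.2f" % slope)
    plt.xlabel("Frequency Rank of Token")
    plt.ylabel("Absolute Frequency of Token")
    plt.legend()
    plt.show()
```

With the question's data this would be used roughly as `frequencies = rank_frequencies(" ".join(targeted_paragraphs))` followed by `slope, intercept, fitted = fit_zipf_line(frequencies)`, which leaves the fitted line in the variable `fitted`. A least-squares fit on log-log data gives every rank equal weight, so the long tail of rare tokens can dominate the slope; treat the result as a rough estimate.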
