How to uniformly sample a large graph?

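I have a large graph with about 4M nodes. The graph comes as two files: one contains the node names, the other contains the edges (one edge per line). I want to sample the graph's nodes uniformly and obtain a sample that is 15% of the whole graph. Given the size of the graph, what is the best (or any feasible) way to generate such a sample?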

2014-01-21 H.Z.

Do you want to sample only the nodes, or the subgraph defined by those nodes (nodes + corresponding edges)?

Actually the subgraph, i.e. the graph formed by the sampled nodes.

Use this Java code to select a random 15% of the vertices:

import java.io.IOException; 
import java.util.*; 

import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 

public class RandomSample {

    // Map-only job: each input line (one node name) is emitted with probability 0.15.
    public static class Map extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Independent coin flip per node, so the sample holds
            // approximately (not exactly) 15% of the nodes.
            if (Math.random() < 0.15) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "randomsample");
        job.setJarByClass(RandomSample.class);
        job.setMapperClass(Map.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // No reducers: mapper output is written directly to the output files.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

and use this bash script to run it:

#!/bin/bash
# Usage: ./random_sample.sh <HDFS input path containing the node file>
echo "Running job"
hadoop jar RandomSample.jar RandomSample "$1" tmp
echo "Copying result to local path (RandomSample)"
hadoop fs -getmerge tmp RandomSample
echo "Cleaning up"
hadoop fs -rm -r tmp

For example, if we name the script random_sample.sh, then to select 15% from the folder /example/, just run

./random_sample.sh /example/

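Then you can run a simple grep over the second file to select only the edges that contain the randomly selected vertices. Since you want the subgraph formed by the sampled nodes, keep an edge only if both of its endpoints were sampled.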
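If a plain grep is too loose (it can match an edge with only one sampled endpoint, or a node name that is a substring of another), the edge filter can be sketched as a small standalone Java program. This is only a sketch under assumptions not stated in the question: each edge line is taken to be two node names separated by whitespace, node names contain no whitespace, and the class name InducedEdgeFilter and its three command-line arguments are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: keeps only the edges whose two endpoints were both sampled.
class InducedEdgeFilter {

    // Assumed edge format: "source target", separated by whitespace.
    static boolean keepEdge(String edgeLine, Set<String> sampled) {
        String[] parts = edgeLine.trim().split("\\s+");
        return parts.length >= 2
                && sampled.contains(parts[0])
                && sampled.contains(parts[1]);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: sampled node names, one per line (e.g. the RandomSample file)
        // args[1]: full edge file, one edge per line
        // args[2]: output file for the edges of the sampled subgraph
        Set<String> sampled = new HashSet<>();
        try (BufferedReader nodes =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = nodes.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    sampled.add(name);
                }
            }
        }

        // Stream the edge file so it never has to fit in memory.
        try (BufferedReader edges =
                 Files.newBufferedReader(Paths.get(args[1]), StandardCharsets.UTF_8);
             BufferedWriter out =
                 Files.newBufferedWriter(Paths.get(args[2]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = edges.readLine()) != null) {
                if (keepEdge(line, sampled)) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

Only the sampled node names are held in memory (roughly 600,000 strings for 15% of 4M nodes), while the edge file is read line by line, so this should run comfortably on a single machine.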


2014-12-24 11:38:38
