For a domain containing many items, you should be able to get a highly accurate list of attributes by random sampling. Here is some C#-ish pseudocode:
// Pseudocode: run the count query and read the result
int domainCount = "select count(*) from Person";
// Average number of items to skip between samples (at least 1)
int avgSkipCount = Math.Max(1, domainCount / 2500);
int processedCount = 0;
string nextToken = null;
var attributeNames = new HashSet<string>();
var random = new Random();
do
{
    // Skip a random number of items; the limit must be at least 1
    int nextSkipCount = random.Next(1, avgSkipCount * 2 + 1);
    var countRequest = new SelectRequest
    {
        NextToken = nextToken,
        SelectExpression = "select count(*) from Person limit " + nextSkipCount
    };
    var countResponse = SimpleDb.Select(countRequest);
    nextToken = countResponse.NextToken;
    processedCount += countResponse.Count;
    if (nextToken == null) break; // reached the end of the domain

    // Fetch the single item at the new position and record its attribute names
    var getRequest = new SelectRequest
    {
        NextToken = nextToken,
        SelectExpression = "select * from Person limit 1"
    };
    var getResponse = SimpleDb.Select(getRequest);
    nextToken = getResponse.NextToken;
    processedCount++;
    attributeNames.UnionWith(getResponse.AttributeNames);
} while (nextToken != null && domainCount > processedCount);
This relies on the fact that you can use the NextToken returned from a SELECT COUNT(*) query to skip over records in SimpleDB. Mocky wrote an excellent explanation of how to accomplish this. I have also explained how to accomplish efficient paging like this with Simple Savant.
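The skip-and-sample loop can be tried out without touching SimpleDB. Below is a minimal Python sketch that simulates it against an in-memory list. `FakeDomain`, `select_count`, `select_one` and `sample_attribute_names` are hypothetical names made up for illustration, not part of any AWS SDK, and a plain integer offset stands in for the opaque NextToken.

```python
import random


class FakeDomain:
    """In-memory stand-in for a SimpleDB domain (hypothetical, illustration only)."""

    def __init__(self, items):
        self.items = items  # list of dicts: attribute name -> value

    def select_count(self, limit, next_token=None):
        """Mimic 'select count(*) ... limit N': count up to `limit` items
        starting at the token position and return (count, next_token)."""
        start = next_token or 0
        end = min(start + limit, len(self.items))
        token = end if end < len(self.items) else None
        return end - start, token

    def select_one(self, next_token=None):
        """Mimic 'select * ... limit 1': return (item, next_token)."""
        start = next_token or 0
        if start >= len(self.items):
            return None, None
        token = start + 1 if start + 1 < len(self.items) else None
        return self.items[start], token


def sample_attribute_names(domain, domain_count, sample_size=2500, rng=None):
    """Collect attribute names from items picked at random intervals."""
    rng = rng or random.Random()
    avg_skip = max(1, domain_count // sample_size)
    names = set()
    token = None
    while True:
        # Skip a random number of items using the count query's token.
        skip = rng.randint(1, avg_skip * 2)
        _, token = domain.select_count(skip, token)
        if token is None:  # ran off the end of the domain
            break
        # Read the single item at the new position.
        item, token = domain.select_one(token)
        names.update(item)  # adds the item's attribute names (dict keys)
        if token is None:
            break
    return names
```

Against a real domain the two `select_*` methods would be replaced by actual Select calls, and the token would be whatever string the service returns rather than an offset.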
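This should give you about 99% accuracy for most datasets, which should be good enough for most practical applications. Statistical theory says that a sample size of 2500 gives effectively the same accuracy for a dataset of any size, so this approach scales even to domains with millions of items.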
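One way to sanity-check the claim that the sample size matters far more than the dataset size is the standard margin-of-error formula for a sampled proportion, with the finite population correction. The sketch below assumes simple random sampling and the worst-case proportion of 0.5; `margin_of_error` is a helper name chosen here, and z = 1.96 corresponds to a 95% confidence level.

```python
import math


def margin_of_error(sample_size, population_size, z=1.96, proportion=0.5):
    """Margin of error for an estimated proportion, with the finite
    population correction applied. Assumes simple random sampling."""
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    correction = math.sqrt(
        (population_size - sample_size) / (population_size - 1)
    )
    return z * standard_error * correction


# Same sample size, very different domain sizes.
for population in (10_000, 1_000_000, 100_000_000):
    print(population, round(margin_of_error(2500, population), 4))
```

For all three domain sizes the margin stays just under 0.02, that is, about two percentage points, and it barely moves between one million and one hundred million items.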
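This is obviously not ideal, since it still requires a large number of queries (two per sampled item), but if the attributes vary in only a limited number of ways across your dataset, you should be able to do the same thing with a much smaller sample size.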
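To put a rough number on "a much smaller sample size": if every attribute you care about appears on at least a fraction p of the items, the chance that n independent random samples all miss it is (1 - p)^n. The sketch below solves that for n. It assumes independent, uniformly random samples, which the skip-based walk only approximates, and `required_sample_size` is a hypothetical helper name.

```python
import math


def required_sample_size(min_frequency, confidence=0.99):
    """Smallest n such that an attribute present on at least `min_frequency`
    of the items is seen at least once with probability >= `confidence`.
    Assumes independent, uniformly random samples."""
    if not 0 < min_frequency < 1 or not 0 < confidence < 1:
        raise ValueError("min_frequency and confidence must be in (0, 1)")
    # Solve (1 - min_frequency) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - min_frequency))


print(required_sample_size(0.05))  # attributes on at least 5% of items
print(required_sample_size(0.01))  # attributes on at least 1% of items
```

Under these assumptions, a few hundred samples are enough when no attribute is rarer than about 1% of items, while a sample in the region of 2,500 is only needed to catch attributes that appear on roughly 0.2% of items.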