Creating a data frame from a CSV file with embedded lists

I'm still fairly new to R and may have completely muddled the concept of data frames.

But I have a CSV file in the following format:

ID;Year;Title;Authors;Keywords;

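where Authors and Keywords should each be a list of strings. For example:

1;2013;Towards Dynamic Non-obtrusive Health Monitoring Based on SOA and Cloud;Mohammed Serhani, Abdelghani Benharret, Erlabi Badidi;E-health, Diseases, Monitoring, Prevention, SOA, Cloud, Platform, m-tech;

Is there a way to read this csv file into R so that the Authors and Keywords columns of the data frame are built as lists of lists? Does this require me to format the csv file in a specific way?

Reading it with the following options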

articles <- read.csv(file="ls.csv",head=TRUE,sep=";",stringsAsFactors=F)

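gives me the Authors column as a list of character values. What I want to achieve, though, is a list of strings for each field in the Authors column.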


2013-05-22 tschmitty

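Are you saying that your file contains five variables (ID, Year, Title, Authors, Keywords) separated by semicolons? Then, by definition, it is not a csv file! Remember that csv stands for comma-separated values. Whoever named it that got it wrong.

You can read data with an arbitrary delimiter using read.table: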

articles <- read.table("ls.csv", header=TRUE, sep=";", stringsAsFactors=FALSE)


2013-05-22 09:00:37

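In many countries the comma is used as the decimal separator, so the semicolon is used as the column separator in csv files (and yes, they are still called csv files). 'read.table' works, but there is also 'read.csv2' for these files.

@JanvanderLaan - 'In many countries...' As far as I know, only the Dutch use this convention, which is one of the reasons I *hate* working with the Dutch version of Excel, especially when collaborating with people who have the international version. +1 for mentioning 'read.csv2'! – nluigi

@nluigi There are many more countries that use the comma as the decimal separator (though China and India, for instance, use the period). See https://en.wikipedia.org/wiki/Decimal_mark#Countries_using_Arabic_numerals_with_decimal_comma. I don't know what spreadsheets do in those countries. But I agree that it is annoying that Excel's behaviour depends on the locale.

As Hong Ooi pointed out, your fields are separated by ";" rather than ",". The function read.csv has the default sep="," while read.csv2 has the default sep=";". If I understand correctly, the entries inside your Authors and Keywords fields are separated by "," and you want to split them apart.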
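To make the read.csv2 route concrete, here is a minimal sketch. The inline `txt` string is hypothetical sample data in the question's format, and `trimws` assumes R 3.2.0 or later. It only illustrates one possible way to read the file and then split the comma-separated entries.

```r
# Hypothetical sample data in the same format as the question's file
txt <- "ID;Year;Title;Authors;Keywords;
2;1234;Title2;Author1, Author2;Key1, Key2, Key3;
3;5678;Title3;Author3, Author4, Author5;Key1, Key2, Key4;"

# read.csv2 uses sep = ";" by default; text= reads from a string instead of a file
articles <- read.csv2(text = txt, stringsAsFactors = FALSE)

# The trailing ";" on each line may yield an extra empty column;
# selecting the wanted columns by name drops it either way
articles <- articles[, c("ID", "Year", "Title", "Authors", "Keywords")]

# Each Authors entry is still a single string: split on commas, trim whitespace
authors <- lapply(strsplit(articles$Authors, ","), trimws)
```

Here `authors` is a plain list with one character vector per row, which can then be stored or processed separately from the data frame.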

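I don't think data.frame() will give you list-typed entries in the Authors and Keywords columns directly: if you pass a list to data.frame(), it is split into its component columns (a list is only kept as a single column if it is protected, for example with I()). In your case the split fails, because the number of authors and/or keywords differs from row to row: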

# Works 
data.frame(a=list(first=1:3, second=letters[1:3]), b=list(first=4:6, second=LETTERS[1:3])) 
#   a.first a.second b.first b.second
# 1       1        a       4        A
# 2       2        b       5        B
# 3       3        c       6        C

# Does not work 
data.frame(a=list(first=1:3, second=letters[1:2]), b=list(first=4:6, second=LETTERS[1:6])) 
#Error in data.frame(first = 1:3, second = c("a", "b"), check.names = FALSE, : 
# arguments imply differing number of rows: 3, 2
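For completeness, base R can hold vectors of differing length in a data frame if the column is a list column. The sketch below uses a hypothetical two-row table and shows two ways of creating such a column. It is an illustration under those assumptions, not necessarily what the asker's real data needs.

```r
# Hypothetical miniature of the articles table
articles <- data.frame(ID = 1:2,
                       Authors = c("Author1, Author2",
                                   "Author3, Author4, Author5"),
                       stringsAsFactors = FALSE)

# Split each string into a character vector; lengths differ per row
author_list <- lapply(strsplit(articles$Authors, ","), trimws)

# Assigning a list with one element per row creates a list column
articles$Authors <- author_list

# I() protects a list from being split when it is passed to data.frame()
articles2 <- data.frame(ID = 1:2, Authors = I(author_list))
```

After this, `articles$Authors[[2]]` returns the character vector of authors for the second row.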

But since lists can contain lists, you could try breaking the data down in the following way. Contents of 'example.txt':

ID;Year;Title;Authors;Keywords; 
1;2013;Towards Dynamic Non-obtrusive Health Monitoring Based on SOA and Cloud;Mohammed Serhani, Abdelghani Benharret, Erlabi Badidi;E-health, Diseases, Monitoring, Prevention, SOA, Cloud, Platform, m-tech; 
2;1234;Title2;Author1, Author2;Key1, Key2, Key3; 
3;5678;Title3;Author3, Author4, Author5;Key1, Key2, Key4;

Here is an example of how to do it:

x <- scan("example.txt", what="", sep="\n", strip.white=TRUE) 
y <- strsplit(x, ";") 
# Leave out the header 
dat <- y[-1] 

# Apply a function to every element of the top-level list
dat <- lapply(dat, function(fields) {
    # Split every field on commas (this separates the authors and keywords)
    ret <- strsplit(fields, ",")
    # Remove leading and trailing whitespace
    ret <- lapply(ret, function(z) gsub("(^ +)|( +$)", "", z))
    # Name the fields after the header line
    names(ret) <- unlist(y[1])
    ret
})

Output:

> str(dat) 
List of 3 
$ :List of 5 
    ..$ ID  : chr "1" 
    ..$ Year : chr "2013" 
    ..$ Title : chr "Towards Dynamic Non-obtrusive Health Monitoring Based on SOA and Cloud" 
    ..$ Authors : chr [1:3] "Mohammed Serhani" "Abdelghani Benharret" "Erlabi Badidi" 
    ..$ Keywords: chr [1:8] "E-health" "Diseases" "Monitoring" "Prevention" ... 
$ :List of 5 
    ..$ ID  : chr "2" 
    ..$ Year : chr "1234" 
    ..$ Title : chr "Title2" 
    ..$ Authors : chr [1:2] "Author1" "Author2" 
    ..$ Keywords: chr [1:3] "Key1" "Key2" "Key3" 
$ :List of 5 
    ..$ ID  : chr "3" 
    ..$ Year : chr "5678" 
    ..$ Title : chr "Title3" 
    ..$ Authors : chr [1:3] "Author3" "Author4" "Author5" 
    ..$ Keywords: chr [1:3] "Key1" "Key2" "Key4" 

# Keywords of first item 
> dat[[1]]$Keywords 
[1] "E-health" "Diseases" "Monitoring" "Prevention" "SOA"  
[6] "Cloud"  "Platform" "m-tech" 

# Title of second item 
> dat[[2]][[3]] 
[1] "Title2" 

# Traveling inside the list of lists, accessing the very last data element 
> lastitem <- length(dat) 
> lastfield <- length(dat[[lastitem]]) 
> lastkey <- length(dat[[lastitem]][[lastfield]]) 
> dat[[lastitem]][[lastfield]][[lastkey]] 
[1] "Key4"

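Note that lists of lists can be an inefficient way to store data in R, so if you have a lot of data you may want to move to a more efficient approach, such as a relational database structure in which the access key is your ID, assuming it is unique.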
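One way to approximate such a relational layout without leaving R is a long "link table" with one row per ID and keyword pair. The sketch below is only an illustration: `items` is a hypothetical nested list in the same shape as `dat`, and the column names are made up for the example.

```r
# Hypothetical nested structure in the same shape as `dat`
items <- list(
  list(ID = "2", Keywords = c("Key1", "Key2", "Key3")),
  list(ID = "3", Keywords = c("Key1", "Key2", "Key4"))
)

# Build one row per (ID, Keyword) pair; ID is recycled for each keyword
keywords <- do.call(rbind, lapply(items, function(item) {
  data.frame(ID = item$ID, Keyword = item$Keywords, stringsAsFactors = FALSE)
}))

# Look up every ID that carries a given keyword
ids_with_key4 <- keywords$ID[keywords$Keyword == "Key4"]
```

An analogous table for authors, joined on ID with merge(), would play the role of a second relation.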


2013-05-22 10:05:45
