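Smart data file reading: the R package vroom
While tinkering with Shiny recently I came across a very handy package for reading data. These are my notes for future reference.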
1. Automatic delimiter detection
vroom guesses the delimiter automatically, so csv and tsv files are both read with the same call, vroom("xxx.csv").
library(vroom)
data <- vroom("flights.tsv")
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
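This prints a long message describing the type of each column. If you don't need it, you can turn it off by passing the column specification through col_types: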
s <- spec(data)
data <- vroom("flights.tsv", col_types = s)
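To try the delimiter detection without the flights data, here is a minimal sketch. The data frame and the temporary file paths are made up for illustration; only vroom() and spec() come from the package.

```r
library(vroom)

# Hypothetical sample data, written once as csv and once as tsv.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(1.5, 2.5, 3.5))
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(df, csv_path, row.names = FALSE)
write.table(df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# The same call reads both files; the delimiter is guessed from the content.
from_csv <- vroom(csv_path)
from_tsv <- vroom(tsv_path)

# Reuse the guessed specification so the second read prints no type message.
s <- spec(from_csv)
quiet <- vroom(csv_path, col_types = s)
```

If the guess ever goes wrong, the delimiter can also be given explicitly with the delim argument.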
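2. Reading multiple files at once
Batch reading is one of vroom's highlights: pass a vector of file paths and the files are combined into a single data frame.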
files <- fs::dir_ls(glob = "flights_*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv
#> flights_YV.tsv
data <- vroom(files)
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
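When several files are combined it is often useful to know which file each row came from. The sketch below assumes a made-up set of small per-carrier files in a temporary directory and uses vroom's id argument to add a column holding the source file name; base list.files() is used here in place of fs::dir_ls() only to keep the example dependency-free.

```r
library(vroom)

# Hypothetical data split into one tsv file per carrier.
demo_dir <- tempfile("flights_demo_")
dir.create(demo_dir)
df <- data.frame(carrier = c("AA", "AA", "UA", "DL"), flight = c(1, 2, 3, 4))
for (cr in unique(df$carrier)) {
  part <- df[df$carrier == cr, ]
  out <- file.path(demo_dir, paste0("flights_", cr, ".tsv"))
  write.table(part, out, sep = "\t", row.names = FALSE, quote = FALSE)
}

files <- list.files(demo_dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE)

# Read all files in one call; `id` adds a column with the originating path.
combined <- vroom(files, id = "source",
                  col_types = c(carrier = "c", flight = "d"))
```

All files need to share the same column layout for this to work.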
3. Reading and writing compressed files
- vroom_write() can write compressed files directly
vroom_write(flights, "flights.tsv.gz")
# Check file sizes to show file is compressed
fs::file_size(c("flights.tsv", "flights.tsv.gz"))
#> 29.62M 7.87M
# Read the file back in
data <- vroom("flights.tsv.gz")
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
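The same round trip can be tried on generated data. This is a sketch with a made-up, highly repetitive data frame, so the exact size ratio will differ from the flights numbers above; the compression is chosen from the .gz file extension.

```r
library(vroom)

# Hypothetical, repetitive data so that compression has something to gain.
big <- data.frame(
  id = seq_len(10000),
  group = rep(c("a", "b", "c", "d"), 2500),
  value = rep(c(1.5, 2.5), 5000)
)
plain_path <- tempfile(fileext = ".tsv")
gz_path <- tempfile(fileext = ".tsv.gz")

vroom_write(big, plain_path)  # plain tab-separated file
vroom_write(big, gz_path)     # gzip-compressed because of the extension

# Compare sizes in bytes, then read the compressed file back.
sizes <- file.size(c(plain_path, gz_path))
back <- vroom(gz_path, col_types = c(id = "i", group = "c", value = "d"))
```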
4. Reading files from a URL
file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
data <- vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
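5. Reading from and writing to pipe connections
This one feels a bit like magic: for simple filtering jobs like this it can stand in for a Perl script.
- Extract the United Airlines rows (lines containing the word UA)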
# Return only flights on United Airlines
data <- vroom(pipe("grep -w UA flights.tsv"), col_names = names(flights))
#> Observations: 58,665
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
- Or choose the compression tool, for example pigz, when writing a compressed file
bench::workout({
vroom_write(flights, "flights.tsv.gz")
vroom_write(flights, pipe("pigz > flights.tsv.gz"))
})
#> # A tibble: 2 x 3
#> exprs process real
#>
#> 1 vroom_write(flights, "flights.tsv.gz") 3.5s 2.69s
#> 2 vroom_write(flights, pipe("pigz > flights.tsv.gz")) 1.54s 975.09ms
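One detail of the grep example is easy to miss: grep also filters out the header line, because the header does not contain the word UA, which is why col_names has to be passed by hand. A small sketch on made-up data, assuming a Unix-like system where grep is on the PATH:

```r
library(vroom)

# Hypothetical miniature flights table written to a temporary tsv file.
flights_small <- data.frame(
  carrier = c("UA", "AA", "UA", "DL"),
  flight = c(101, 202, 303, 404),
  origin = c("EWR", "JFK", "LGA", "JFK")
)
path <- tempfile(fileext = ".tsv")
vroom_write(flights_small, path)

# grep keeps only lines containing the whole word UA; the header is dropped,
# so the column names are supplied explicitly.
cmd <- paste("grep -w UA", shQuote(path))
ua <- vroom(pipe(cmd),
            delim = "\t",
            col_names = names(flights_small),
            col_types = c(carrier = "c", flight = "d", origin = "c"))
```

Note that grep matches anywhere on the line, so a row with UA in some other column would be kept as well.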
6. Selecting columns
- Keep only the specified columns
data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
- Drop the specified columns
data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
- Rename a column
data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data
#> # A tibble: 336,776 x 19
#> plane year month day dep_time sched_dep_time dep_delay arr_time
#>
#> 1 N142… 2013 1 1 517 515 2 830
#> 2 N242… 2013 1 1 533 529 4 850
#> 3 N619… 2013 1 1 542 540 2 923
#> 4 N804… 2013 1 1 544 545 -1 1004
#> 5 N668… 2013 1 1 554 600 -6 812
#> 6 N394… 2013 1 1 554 558 -4 740
#> 7 N516… 2013 1 1 555 600 -5 913
#> 8 N829… 2013 1 1 557 600 -3 709
#> 9 N593… 2013 1 1 557 600 -3 838
#> 10 N3AL… 2013 1 1 558 600 -2 753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
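The three col_select forms can be compared side by side on a tiny file. The data frame below is made up for illustration; the col_select expressions follow the same patterns as the calls above.

```r
library(vroom)

# Hypothetical four-column table written to a temporary tsv file.
df <- data.frame(
  year = c(2013, 2013),
  flight = c(1545, 1714),
  tailnum = c("N14228", "N24211"),
  dest = c("IAH", "IAH")
)
path <- tempfile(fileext = ".tsv")
vroom_write(df, path)

# Keep two columns, drop two columns, and rename one column while reading.
kept <- vroom(path, col_select = c(year, tailnum))
dropped <- vroom(path, col_select = c(-flight, -dest))
renamed <- vroom(path, col_select = list(plane = tailnum, everything()))

names(kept)
names(dropped)
names(renamed)
```

Selecting at read time means the unwanted columns never have to be parsed, which is cheaper than reading everything and dropping columns afterwards.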
7. Specifying column types
In most cases vroom guesses the column types correctly, but it occasionally gets one wrong, and then you can specify the type by hand. You can also fix the types afterwards with dplyr, though that is a little more work.
Type reference; the quoted character is the abbreviation used in col_types.
- col_logical() 'l', containing only T, F, TRUE, FALSE, 1 or 0.
- col_integer() 'i', integer values.
- col_double() 'd', floating point values.
- col_number() 'n', numbers containing the grouping_mark.
- col_date(format = "") 'D', with the locale's date_format.
- col_time(format = "") 't', with the locale's time_format.
- col_datetime(format = "") 'T', ISO8601 date times.
- col_factor(levels, ordered) 'f', a fixed set of values.
- col_character() 'c', everything else.
- col_skip() '_' or '-', don't import this column.
- col_guess() '?', parse using the "best" type based on the input.
Examples:
# read the 'year' column as an integer
data <- vroom("flights.tsv", col_types = c(year = "i"))
# also skip reading the 'time_hour' column
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_"))
# also read the carrier as a factor
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_", carrier = "f"))
# the same specification, written with column-type functions instead of abbreviations
data <- vroom("flights.tsv",
  col_types = list(year = col_integer(), time_hour = col_skip(), carrier = col_factor())
)
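To see what the abbreviations actually do, the sketch below reads a made-up three-column file twice: once with types set at read time, and once with guessed types that are converted afterwards. The after-the-fact conversion uses base R here rather than dplyr, purely to keep the example free of extra dependencies.

```r
library(vroom)

# Hypothetical sample file with a year, a carrier code and a timestamp.
df <- data.frame(
  year = c(2013, 2013, 2014),
  carrier = c("UA", "AA", "UA"),
  time_hour = c("2013-01-01 05:00:00", "2013-01-01 06:00:00",
                "2014-01-01 05:00:00")
)
path <- tempfile(fileext = ".tsv")
vroom_write(df, path)

# Option 1: set the types while reading; time_hour is skipped entirely.
typed <- vroom(path, col_types = c(year = "i", time_hour = "_", carrier = "f"))
sapply(typed, class)

# Option 2: let vroom guess, then convert the columns afterwards.
guessed <- vroom(path)
guessed$year <- as.integer(guessed$year)
guessed$carrier <- factor(guessed$carrier)
```

Setting the types up front keeps the conversion in one place and also silences the column-type message for the columns you name.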
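8. Read speed
In a word: fast! It is a good fit for machine learning work, where data sets routinely run to several gigabytes.
The figure below compares the time each package takes to read and write a 1.55 GB data set.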
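For a rough check on your own machine, a timing sketch like the following can be used. The generated data set is made up and far smaller than 1.55 GB, and the comparison is only against base read.delim(), so the numbers will not match a published benchmark. Keep in mind that vroom reads lazily by default, so part of the parsing cost is deferred until the columns are actually used.

```r
library(vroom)

# Hypothetical test data: 200,000 rows with an id, a letter group and a value.
n <- 200000
big <- data.frame(
  id = seq_len(n),
  group = sample(letters, n, replace = TRUE),
  value = runif(n)
)
path <- tempfile(fileext = ".tsv")
vroom_write(big, path)

# Time base R and vroom on the same file.
t_base <- system.time(base_df <- read.delim(path, stringsAsFactors = FALSE))
t_vroom <- system.time(
  vroom_df <- vroom(path, col_types = c(id = "i", group = "c", value = "d"))
)

# Show user, system and elapsed seconds for both readers.
rbind(read.delim = t_base[1:3], vroom = t_vroom[1:3])
```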