
How to find out (approximately) how much memory an object occupies

If the object is called mediadto:

print(object.size(mediadto), units = "auto")

 

(fixme: this is a summary still to be organized)

 

2.7 Working With Very Large Data Files

R objects can be as big as our physical computer memory (and operating system) will allow, but R is not designed for very large datasets. This means that extremely large objects can slow everything down tremendously and consume RAM greedily. The read.table() family of routines assumes that we are not working with very large data sets, so it is not careful to conserve memory¹. It loads everything at once and probably makes at least one copy of it. A better way to work with huge datasets is to read the file a line (or group of lines) at a time. We do this using connections. A connection is an R object that points to a file or other input/output stream. Each time we read from a connection, the position in the file from which we read moves forward.
Before we can use a connection, we must create it using file() if our data source is a file, or url() for an online source (there are other types of connections too). Then we use open() to open it. Now we can read one or many lines of text using readLines(), read fields of data using scan(), or write data using cat() or one of the write.table() family of routines. When we are done, we close the connection using close().
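A minimal sketch of that workflow (the file name and contents here are invented for illustration; we write a small demo file first so the example is self-contained):

```r
# Write a tiny demo file so the example can run on its own
writeLines(c("alice,30", "bob,25"), "demo.txt")

con <- file("demo.txt")         # create the connection (not yet open)
open(con, open = "r")           # open it for reading
first <- readLines(con, n = 1)  # read one line; the file position advances
second <- readLines(con, n = 1) # the next read continues where we left off
close(con)                      # release the connection when done
```

Each readLines() call picks up where the previous one stopped, which is what makes line-at-a-time processing of huge files possible.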

2.7.1 Reading fields of data using scan()

Reading fields of data from a huge file is a very common task, so we give it special attention. The most important argument to scan() is what, which specifies what type of data to read in. If the file contains columns of data, what should be a list, with each member of the list representing a column of data. For example, if the file contains a name followed by a comma separator and an age, we could read a single line using

> a <- scan(f,what=list(name="",age=0),sep=",",nlines=1)

where f is an open connection. Now a is a list with fields name and age. Example 13.4 shows how to read from a large data file.

If we try to scan when the connection has reached the end of the file, scan() returns an empty list. We can check its length() in order to terminate our loop.

Frequently we know how many fields are in each line and we want to make sure scan() gets all of them, filling the missing ones with NA. To do this we specify fill=T. Notice that in many cases scan() will fill empty fields with NA anyway.

Unfortunately, scan() returns an error if it tries to read a line and the data it finds is not what it is expecting. For example, if the string "UNK" appeared under the age column in the above example, we would get an error. If there are only a few possible exceptions, they can be passed to scan() as na.strings. Otherwise we need to read the data in as strings and then convert to numeric or other types using as.numeric() or some other tool.
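A short sketch of the na.strings approach, using the name/age layout from the example above (the file name and contents are invented):

```r
# Demo file where one age field holds the sentinel string "UNK"
writeLines(c("alice,30", "bob,UNK"), "ages.txt")

f <- file("ages.txt")
open(f, open = "r")
# Without na.strings = "UNK", scan() would stop with an error on line 2;
# with it, the offending field is silently converted to NA
a <- scan(f, what = list(name = "", age = 0), sep = ",",
          nlines = 2, na.strings = "UNK", quiet = TRUE)
close(f)
```

After the read, a$age is c(30, NA) rather than an error.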

Notice that reading one line at a time is not the fastest way to do things. R can comfortably read 100, 1000, or more lines at a time. Increasing how many lines are read per iteration can speed up large reads considerably. With large files, we could read lines 1000 at a time, transform them, and then write 1000 at a time to another open connection, thereby keeping system memory free. If all of the data is of the same type and belongs in the same object (a 2000x2000 numeric matrix, for example), we can use scan() without the nlines argument and get tremendously faster reads. The resulting vector needs only to be converted to type matrix.
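The homogeneous-file case can be sketched as follows (a hypothetical 3x3 file stands in for the 2000x2000 one):

```r
# Write a small all-numeric demo file, one matrix row per line
m0 <- matrix(1:9, nrow = 3)
write(t(m0), file = "mat.txt", ncolumns = 3)  # write() emits by column, so transpose

v <- scan("mat.txt", quiet = TRUE)      # one fast read: no nlines, no what-list
m <- matrix(v, nrow = 3, byrow = TRUE)  # reshape the flat vector into the matrix
```

Because scan() here returns a single numeric vector, it skips the per-line type dispatch and is much faster than looping.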

2.7.2 Utilizing Unix Tools

If you are using R on a linux/unix machine, you can use various unix utilities (like grep and awk) to read only the columns and rows of your file that you want. The utility grep filters out rows that do or do not contain a specified pattern. The programming language awk is a record-oriented tool that can pull out and manipulate columns as well as rows based on a number of criteria.
¹According to the help file for read.table(), you can improve memory usage by informing read.table() of the number of rows using the nrows parameter. On unix/linux you can obtain the number of rows in a text file using "wc -l".

Some of these tools are useful within R as well. For example, we can preallocate our dataframes according to the number of records (rows) we will be reading in. To find out how large a dataframe to allocate for the calls in the above example, we could use

> howmany <- as.numeric(system("grep -c ',C,' file.dat", intern = TRUE))

Since allocating and reallocating memory is one of the time-consuming parts of the scan() loop, we can save a lot of time and trouble this way. To determine just the number of rows, we can use the utility wc.

> totalrows <- as.numeric(strsplit(system("wc -l Week.txt", intern = T), split = " ")[[1]][1])

Here system() returns the number of rows, but with the file name as well; strsplit() breaks the output into words, and we then convert the first word to a number.
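The parsing step can be tried on its own, without running wc (the string below imitates typical "wc -l" output; note that this split-on-space approach assumes no leading whitespace in the output):

```r
# What system("wc -l Week.txt", intern = TRUE) might return (invented count)
out <- "1500000 Week.txt"

words <- strsplit(out, split = " ")[[1]]  # c("1500000", "Week.txt")
totalrows <- as.numeric(words[1])         # keep only the count
```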
The bottom line is that we should use the right tool for the job. Unix utilities like grep, awk, and wc can be quick and dirty ways to save a lot of work in R.

2.7.3 Using Disk instead of RAM

Unfortunately, R uses only system RAM by default. So if the dataset we are loading is very large, it is likely that our operating system will not be able to allocate a large enough chunk of system RAM for it, resulting in termination of our program with a message that R could not allocate a vector of that size. Although R has no built-in disk cache system, there is a package called filehash that allows us to store our variables on disk instead of in system RAM. Clearly this will be slower, but at least our programs will run as long as we have sufficient disk space and our file size does not exceed the limits of our operating system. And sometimes that makes all the difference.

Instead of reading a le into memory, we read into a database and load that database as an environment

> dumpDF(read.table("large.txt", header=T), dbName="mydb")
> myenv<-db2env(db="mydb")

Now we can operate on the data within its environment using with()

> with(myenv, z <- y + x)

Actually, I haven't spent much time with this package, so I'm not familiar with its nuances, but it seems very promising.

2.7.4 Using RSQLite

SEE: RSQLite

Here is the example it refers to:

13.4 Extracting Info From a Large File

Let 301328226.csv be a large file (my test case was about 80 megabytes with 1.5 million lines). We want to extract the lines corresponding to put options and save information on price, strike price, date, and maturity date. The first few lines are as follows (data has been altered to protect the innocent):

date,exdate,cp_flag,strike_price,best_bid,best_offer,volume,impl_volatility,optionid,cfadj,ss_flag
04JAN1997,20JAN1997,C,500000,215.125,216.125,0,,12225289,1,0
04JAN1997,20JAN1997,P,500000,0,0.0625,0,,11080707,1,0
04JAN1997,20JAN1997,C,400000,115.375,116.375,0,,11858328,1,0
Reading this file on my (relatively slow) computer is all but impossible using read.csv().
 

> LENGTH<-600000
> myformat<-list(date="",exdate="",cp="",strike=0,bid=0,ask=0,
+ volume=0,impvolat=0,id=0,cjadj=0,ss=0)
> date=character(LENGTH)
> exdate=character(LENGTH)
> price=numeric(LENGTH)
> strike=numeric(LENGTH)
> f<-file("301328226.csv")
> open(f,open="r")
> titles<-readLines(f,n=1) # skip the first line
> i<-1
> repeat{
+ b<-scan(f,what=myformat,sep=",",nlines=1,quiet=T)
+ if (length(b$date)==0) break
+ if (b$cp=="P"){
+ date[i]<-b$date
+ exdate[i]<-b$exdate
+ price[i]<-(b$bid+b$ask)/2
+ strike[i]<-b$strike
+ i<-i+1
+ }
+ }
> close(f)

This read took about 5 minutes. Notice that I created the vectors ahead of time in order to avoid reallocating them at every read. I had previously determined that there were fewer than 600000 puts in the file. The variable i tells me how many were actually used. If there were more than 600000, the program would still run, but it would reallocate the vectors at every iteration (which is very slow).
This could probably have been sped up considerably by reading many rows at a time, and memory could have been saved by converting the date strings to dates using as.Date() at each iteration (see section 2.4).
I welcome suggestions on improvements to this example.

