samedi 27 juin 2015

Reading very large fixed(ish) width format txt files from SQL into R data.tables or likewise

I'm trying to read in (and eventually merge/link/manipulate) a series of large (~300M) and very large (~4G) fixed width files for eventual regressions, visualizations, etc., and am hitting some snags.

First, the format of the files themselves is odd - I'm guessing something SQL-y. The file format is referenced here: http://ift.tt/1NnUwWE . It's fixed width, but the last column seems to (sometimes?) cut off with an \r\n before the full fixed width is experienced for that column. For reading it in I've tried laf_open_fwf and data.table::fread, but they both seem to get confused. A sample file and the associated non-XML format descriptor is here. I can't even get the thing to read in properly with that goofy last column.

Second, file size is an issue. I know I have a lot of table manipulation to do, so I'm tempted to look at data.table... but I also believe data.table stores the entire object in RAM, and that's going to be problematic. LaF or ffdf or sqlite seem like options, though I'm new to them, and would need to cope with this file format issue first.

Some questions get at this general idea, suggesting LaF, ffbase or data.table are below...

Reading big data with fixed width

Quickly reading very large tables as dataframes in R

Speed up import of fixed width format table in R

... but none seems to (1) deal with this odd fixed-width-ish format or (2) move data eventually into data.tables, which seems like I'd like to try first. I thought about trying to open and rewrite them as well-formatted CSVs so data.table could handle them (my goofy hack through data.frames and back to csv feels ridiculous and unscalable, below). And the CSV export demonstates how confused the file gets, since the laf reader is strictly going by field length instead of adjusting based on where the /r/n is...

Currently I'm trying something like the below for starters. Help, if possible?

searchbasis.laf = laf_open_fwf("SEARCHBASIS.txt",
                               column_widths = c(12, 12, 12, 12, 10), 
                               column_names = c("SearchBasisID", "SearchID", "PersonID", "StopID", "Basis"),
                               column_types = rep("string",5),
                               trim = T)
searchbasis.dt = as.data.table(as.data.frame(laf_to_ffdf(searchbasis.laf)))
write.csv(searchbasis.dt, file="SEARCHBASIS.csv")

Aucun commentaire:

Enregistrer un commentaire