Working with large datasets

Dear All,

I have been working on a data analysis routine which reads data from *.csv files, perform several statistical calculations on it and generates some useful easy to understand scientific outputs. One part of the routine handles 3D data containing many zeros and/or NANs. At the end I replace zeros with NANs and exclude NANs by using WaveTransform function. However the data is so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousends of entries. Could someone have a look at the code and give some programming clues which may solve this issue? I know it seems complicated and confusing. I would be appreciate for any helpful comment.

Cheers,
Emre

 // This part chooses csv files one by one and calls another function to analyse them.
Function MyFunctionforFIA()
    if(DataFolderExists("root:RDDataFolder")==1)
        SetDataFolder root:RDDataFolder
    else
        print "Error! Please first load some data or press reset to delete prevoius data"
        abort
    endif
   
    // Create sizeBinList and use in AnalyseWIBSFiles()
    // make sizeBinList
    Make size_bins1 = {0.5,0.5314,0.5647,0.6001,0.6377,0.6776,0.7201,0.7652,0.8132,0.8642,0.9183,0.9759,1.0371,1.102}
    Make size_bins2 = {1.1711,1.2445,1.3225,1.4054,1.4935,1.5871,1.6866,1.7923,1.9046,2.024,2.1508,2.2856,2.4288,2.5811}
    Make size_bins3 = {2.7824,2.9147,3.0974,3.2915,3.4978,3.7171,3.95,4.1976,4.4607,4.7402,5.0373,5.353,5.6885,6.0451}
    Make size_bins4 = {6.4239,6.8265,7.2544,7.709,8.1922,8.7056,9.2512,9.8311,10.4472,11.102,11.7978,12.5372,13.3229,14.1579,15.0453,15.9882,16.9903,18.0551,20}
    Variable ssb1, ssb2, ssb3, ssb4, ssizebinlist
    ssb1 = numpnts(size_bins1)
    ssb2 = numpnts(size_bins2)
    ssb3 = numpnts(size_bins3)
    ssb4 = numpnts(size_bins4)
    ssizebinlist = ssb1+ssb2+ssb3+ssb4
    Make/O/N=(ssizebinlist) SizeBinList
    Concatenate/NP/O {size_bins1,size_bins2,size_bins3,size_bins4}, SizeBinList
    KillWaves size_bins1, size_bins2, size_bins3, size_bins4
   
    SetDataFolder root:RDDataFolder
    // Choose one file to analyse
    Variable k
    String DataFolderList = ""
    for( k = 0; k < CountObjects("",4); k += 1)
        DataFolderList += GetIndexedObjName("", 4, k)+":;"
    endfor
   
    // Calculate time required to analyse one file
    Variable refnumTimer, timeElapsed, timeasprogress
    // Read strings from list and set data folder accordingly
    Variable numDataFolders = ItemsInList(DataFolderList)
    Variable i
    NVAR perc = root:perc
    Variable progress = 0
    ControlUpdate/W=toolKIT/A
   
    for(i=0; i<numDataFolders; i +=1)
        refnumTimer = startMSTimer
        String nameCurDF = "root:RDDataFolder:" + StringFromList(i, DataFolderList)
        SetDataFolder nameCurDF
        AnalyseWIBSFluorescence(i)      // Main function which performs analysis
        SetDataFolder nameCurDF
        timeElapsed = stopMSTimer(refnumTimer)
        timeasprogress = timeElapsed/1000000        // elapsed time as seconds
        //print "Last file took " + num2str(timeasprogress) + " seconds"        // We can see how long it takes to analyse a file
        progress += 1
        perc = (progress / numDataFolders) * 100
        ControlUpdate/W=toolKIT progress
        // change progress color to green.
        ValDisplay progress win=toolKIT, highColor= (26112,0,10240)
        KillDataFolder/Z nameCurDF 
    endfor
   
    // make destination waves
    SetDataFolder root:RDDataFolder:
    Make/O/N=(0,ssizebinlist-1,3) SizeResFluW
    AppendWaves("", "", 0)
    //MoveResultingData()
    NewDataFolder/O/S root:FLData
    if(!waveexists(root:FLData:SizeResFluW))
        MoveWave root:RDDataFolder:SizeResFluW, root:FLData:
        if(!waveexists(root:FLData:SizeBinList))
            MoveWave root:RDDataFolder:SizeBinList, root:FLData:
        endif
    endif
End

// This part deals with time binning and decides on selection criteria. Output is used by another function to perform size binning
Function AnalyseWIBSFluorescence(whichData)
    Variable whichData
    Variable i, j
   
    DFREF homeDFR = GetDataFolderDfr()
    // wave references to global waves
    Wave dateTimeWave = homeDFR:DateTimeWave
    Wave sizeBinList = root:RDDataFolder:sizeBinList
    Wave particleSizes = homeDFR:ParSize
    Wave paramSet0 = homeDFR:Pwr_280
    Wave paramSet1 = homeDFR:Pwr_370
    Wave paramSet2 = homeDFR:FL1_280
    Wave paramSet3 = homeDFR:FL2_280
    Wave paramSet4 = homeDFR:FL2_370
   
    NVAR Threshold1 = root:FTDataFolder:Threshold1
    NVAR Threshold2 = root:FTDataFolder:Threshold2
    NVAR Threshold3 = root:FTDataFolder:Threshold3
    String suff
    // print Threshold1, Threshold2, Threshold3         for debugging only
    // each type is potentially composed of criteria from multiple parameters. List the waves containing those parameters in ";" strings
    // here we'll use 5 parameters to define each of my made-up types
    Make /FREE /T /N=3 typeParamNames
    typeParamNames[0] = NameOfWave(paramSet0)+";"+NameOfWave(paramSet2) // FL1_280 Type (Intensity[FL1_280] > Threshold && Intensity[Power_280 > 200])
    typeParamNames[1] = NameOfWave(paramSet0)+";"+NameOfWave(paramSet3) // FL2_280 Type (Intensity[FL2_280] > Threshold && Intensity[Power_280 > 200])
    typeParamNames[2] = NameOfWave(paramSet1)+";"+NameOfWave(paramSet4) // FL2_370 Type (Intensity[FL2_370] > Threshold && Intensity[Power_370 > 200])
   
    // set the minimum and maximum values - if there are different numbers of criteria for different types set the column size to the biggest (in terms of # of criteria) type
    Make /FREE /D /N=(3, 3) minCriteria
    Make /FREE /D /N=(3, 3) maxCriteria
    // set a bunch of arbitrary minimum criteria according to the parameter combinations specified in typeParamNames   
    // FL1_280 criteria
    minCriteria[0][0] = MinFailPower
    maxCriteria[0][0] = inf
    minCriteria[0][1] = Threshold1
    maxCriteria[0][1] = inf
    minCriteria[0][2] = minParSize
    maxCriteria[0][2] = inf
    // FL2_280 criteria
    minCriteria[1][0] = MinFailPower
    maxCriteria[1][0] = inf
    minCriteria[1][1] = Threshold2
    maxCriteria[1][1] = inf
    minCriteria[1][2] = minParSize
    maxCriteria[1][2] = inf
    // FL2_370 criteria
    minCriteria[2][0] = MinFailPower
    maxCriteria[2][0] = inf
    minCriteria[2][1] = Threshold3
    maxCriteria[2][1] = inf
    minCriteria[2][2] = minParSize
    maxCriteria[2][2] = inf

    // make the final output wave
    Variable nRanges = dimSize(DateTimeWave,0)
    Make /O/D/N=(nRanges, numpnts(sizeBinList)-1,3) homeDFR:SizeResFluW
    Wave outWave = homeDFR:SizeResFluW
   
    Make /FREE /N=(3) currMins // set size to the # of criteria in the largest type
    Make /FREE /N=(3) currMaxs // set size to the # of criteria in the largest type
   
    // build the output wave
    // sortARange is the function which takes several inputs and gives one output which will be analysed to find
    // particle counts, distributions, ratios, etc.
    for(i=0; i<nRanges; i +=1)
        for (j=0; j<3; j+=1)
            currMins = minCriteria[j][p]
            currMaxs = maxCriteria[j][p]
            fluoreSizeBin(sizeBinList, particleSizes, currMins, currMaxs, typeParamNames[j], outWave, i, j)
        endfor
    endfor
    // Move resulting wave and kill data folder
    SetDataFolder root:RDDataFolder
    suff = "DF_" + num2str(whichData)
    suff = CleanupName(suff, 0)
    NewDataFolder /O /S $suff
    MoveWave homeDFR:SizeResFluW $suff
End

// This part deals with size binning according to the information generated by the Function above
Function fluoreSizeBin(binBoundsWave, parSizeWave, criteriaMin, criteriaMax, dataWaveNames, destinationWave, rowIn, layIn)
    Wave binBoundsWave, parSizeWave
    Wave criteriaMin, criteriaMax           // wave of min and max values.  must be same length as each other and of dataWaveNames
    String dataWaveNames
    Wave destinationWave
    Variable rowIn, layIn
   
    Variable nWaves = ItemsInList(dataWaveNames)
    Variable j
   
    Variable satisfied = 1         
   
    for(j=0; j<nWaves; j +=1)
        Wave currDataWave = $(StringFromList(j, dataWaveNames))
   
        if (currDataWave[rowIn] < criteriaMin[layIn] || currDataWave[rowIn] > criteriaMax[layIn])
            satisfied = 0
            //print currDataWave[i]
        endif
    endfor
   
    if (satisfied)
        FindLevel /Q/P binBoundsWave, parSizeWave[rowIn]
        if (!V_flag)
            destinationWave[rowIn][floor(V_LevelX)][layIn] = currDataWave[rowIn]
        endif
    endif
End

// This part appends all resulting data into one single wave. Output is a 3D wave which may contain millions of data points.
// Apparently this is where I got stuck.
Function AppendFluoreWaves(parentDF, DFpattern, killChildDFs, [sortedAppend, sortOptions, modelDF])
    String parentDF, DFpattern
    Variable killChildDFs
    Variable sortedAppend
    Variable sortOptions
    String modelDF
   
// get list of childDFs matching DFpattern
        if( strlen(parentDF) == 0 )
            parentDF = GetDataFolder(1)
        endif
    String childDFs = ListMatch( ChildDFList(parentDF), DFpattern+"*" )
        if( sortedAppend )      // default is 0 = false
        childDFs = SortList(childDFs, ";", sortOptions) // default is 0 = ascending case-sensitive
        endif                                           // alphabetic ASCII sort
    Variable numChildren = ItemsInList(childDFs)        // count the "good" children
        if( numChildren == 0 )                      // ...none -- abort
        Abort "Please first load some data to analyse"
        endif
// build WList from waves in modelDF or in parentDF
    string WList, saveDF = GetDataFolder(1)
    if( !ParamIsDefault(modelDF) )  // use modelDF if specified
        SetDataFolder modelDF
        WList = WaveList("*", ";", "")
    else
        SetDataFolder parentDF  // default to parentDF
        WList = WaveList("*", ";", "")
    endif
    variable numWaves = ItemsInList(WList)      // count waves to append to
    if( numWaves == 0 )                     // ...none -- abort
        Abort "AppendWave found no waves to append to in data folder \""+GetDataFolder(1)+"\""
    endif
 
    variable k, m, childDataAppended
    string childDF, currentWave, sourceWave, destWave
// cycle through childDFs
    for( k = 0; k < numChildren; k += 1 )
        childDF = StringFromList(k, childDFs)
        SetDataFolder childDF
        childDataAppended = 0   // don't want to kill childDFs that contained no data to append
// ... cycle through WList
        for( m = 0; m < numWaves; m += 1 )
            currentWave = StringFromList(m, WList)
            sourceWave = childDF+currentWave
            Variable sizeofSourceWave = dimSize($sourceWave,0)
            if( exists(sourceWave) == 1 )   // not all waves in child data folder may be relevant
                destWave = parentDF+currentWave
                   
                    //if(dimSize($sourceWave,0) > 1)
                        //if(dimSize($sourceWave,2) > 0)
                            Concatenate/NP=0 sourceWave+";", $destWave
                        //endif
                    //endif
                childDataAppended = 1
            endif
        endfor
       
            KillDataFolder childDF
       
    endfor
    SetDataFolder saveDF
End

// There is also one another function which deals with statistical analysis by using the main output wave (3D Wave)
tooprock wrote:
Dear All,
However the data is so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousends of entries.


Does that mean you get an "out of memory" error message or what are the symptoms?
How large are those 3D waves? Are they double precision? Could they be in single precision?
Does the routine work when you use a much smaller subset of the data?

The 3-D matrices are created in fluoreSizeBin(...). Why again do you want to turn them back in to 1-D waves with AppendFluoreWaves(...)?

Where are you replacing 0's by NaN's ... I don't see that in your code? Have you thought about first transforming the 0's to NaN's in the CSV files using a text-editor?

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
Hello,

First of all thanks for comments. I will try to explain why and how I do those calculations. Firstly, the routine works great for small datasets.

1) I get an "out of memory error" when I run the routine with large datasets.

2) I can't change zeros with NANs in the csv files because those zeros and/or NANs are output of several calculations. It means, original data may or may not contain zeros and/ot NANs. However this is not directly related to the output data which is a product of several csv files.

3) I don't convert them from 3-D matrices into 1D waves. I just append seperate outputs (one resulting wave from each csv file) into a single wave. However I have to decompose 3-D wave to 1D waves to be able to use WaveStats or StatsQuantiles or mean functions. They work only for 1D waves.

4) Only date time waves have to be in double precision but the others can have single precision.

5) I replace zeros with NANs in a separate function which I didn't put here. This function handles only statistical analysis.

6) With "large" I mean a 3-D matrice consists of almost 10 Million rows, 60 Columns and 3 Layers. There are also single waves having almost same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.

So, you suggest using single precision waves instead of double precision to save some memory.

Some more information concerning how the entire toolkit runs:

i) User chooses one or more csv data. These data are read and saved into separate data folders. In each data folder, data points are read from csv files and saved in separate waves ( each column or let's say parameter as new wave).
ii) Routine works for each data folder one by one and the results are written in current data folders to prevent owerwriting.
iii) After that, resulting waves from separate data folders are appended into single waves and the remainder (data folders containing row files) is deleted.

Cheers,
Emre
tooprock wrote:
Hello,
So, you suggest using single precision waves instead of double precision to save some memory.


Well, this produces a wave of 360 MB

Make/N=(500000, 60, 3) single


and this of 720 MB

Make/D/N=(500000, 60, 3) double


Depending and what the data are and the precision you need that could be a possibility. Maybe even integer waves... it really depends on what you are working on. Otherwise have a look at

displayhelptopic "Memory management"


tooprock wrote:
6) With "large" I mean a 3-D matrice consists of almost 10 Million rows, 60 Columns and 3 Layers. There are also single waves having almost same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.

That's 1.8e9 data points. If it is single precision data, that's 7.2e9 bytes, which is more than twice the address space available to a 32-bit program. And your sequence of steps produces several waves and copies of waves. Within the 3-GByte address space available to a 32-bit program on 32-bit Windows, you also have to load Igor itself, have memory for graphs, etc.

If you have a 64-bit Windows, you have 4 GBytes available, which still isn't enough. But if you have 64-bit Windows you could use Igor64. Go to http://www.wavemetrics.net/ and look for the link to download the Igor64 installer. Be sure to read the notes about installing Igor64.

John Weeks
WaveMetrics, Inc.
support@wavemetrics.com
Using 64-bit is also not a good solution because after some point the process slows down and takes very long to analyse even a small dataset. So, I wrote another function to work out the data in a different way. I am not really sure if this is statistically meaningful but it works pretty good. Thanks for all suggessions.

Cheers,
Emre
I sometimes work on big datasets and in order to get statistics from those datasets I normally chunk the reading.

For example, say you are interested in the mean and standard deviation of the data:


1) open the file using the IGOR open operation.
2) read in e.g. 10000 lines of data and parse into values
3) For the variables you are interested in calculate the number of values (should be 10000, but if you are discounting NaN/Inf it may be smaller), the sum of values and the sum of the squared values. From the 2nd iteration onwards you need to accumulate those sums.
4) goto 2) until you've read the data
5) calculate the statistics you want from the sum of the values and the sum of the squared values.

The point is that you never need to have the whole dataset in memory, just a subset.
..doh, Andy beat me to it, was going to offer the suggestion of parsing the main file into smaller subsets and combining the results. Also, Imagestats has the same kind of functionality as wavestats and it will work on individual planes in 3D waves, so you can avoid some of the extra computations there, perhaps.
Correct me please if I'm wrong. ImageStats works for multidimensional waves but doesn't calculate V_avg, V_min, etc. for each column but for entire matrice. Therefore I decompose a matrice to several 1D waves (one wave for one column) and use WaveStats to find statistical parameters for that single column and save them in a separate wave. However, using a small subset of the data may help solving the problem. Thanks all.
tooprock wrote:
Correct me please if I'm wrong. ImageStats works for multidimensional waves but doesn't calculate V_avg, V_min, etc. for each column but for entire matrice. Therefore I decompose a matrice to several 1D waves (one wave for one column) and use WaveStats to find statistical parameters for that single column and save them in a separate wave. However, using a small subset of the data may help solving the problem. Thanks all.


ImageStats is not ideal for per column calculations. Depending on the stats you need it may be far more efficient to use WaveStats with /R=[...] and possibly with /M=1.

A.G.
WaveMetrics, Inc.
ImageStats [/M=val /P=planeNumber ]/G={startP, endP, startQ, endQ } imageWave

lets you pick a plane and then restrict your region of interest to a subset of that plane, all the typical values are returned as with wavestats for this ROI, and you can give it /M=1 if you are only interested in V_avg the local max and min. I just mentioned it because you can use it straight forward rather than recopying large sections of your data out, which seemed to be one of your problems regarding memory usage.
daggaz wrote:
ImageStats [/M=val /P=planeNumber ]/G={startP, endP, startQ, endQ } imageWave

lets you pick a plane and then restrict your region of interest to a subset of that plane, all the typical values are returned as with wavestats for this ROI, and you can give it /M=1 if you are only interested in V_avg the local max and min. I just mentioned it because you can use it straight forward rather than recopying large sections of your data out, which seemed to be one of your problems regarding memory usage.


I am aware of this possibility (I wrote it). I'm actually impressed that you found this relatively obscure feature. I still think that WaveStats/M=1 would execute faster. FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.

A.G.
WaveMetrics, Inc.
Igor wrote:
FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.


So using the /G flag wont scan the pixels more than once or? Seems good. I havent used it on enormous data sets, but on large tiff files it seems to execute as quickly as wavestats, and at least now he wouldnt have to recopy the data into a suitable 1D format for wavestats..

At any rate, didnt realize this was obscure. Wavestats is for 1D and you get sent right to Imagestats in the documentation for higher dimensionalities, and this command is right in the middle and seems to emulate wavestats which is probably why a lot of users got sent there in the first place, trying to use wavestates on the wrong kind of wave. Glad you put it in there!
daggaz wrote:
At any rate, didnt realize this was obscure.


You can only find this feature if you read the documentation -- it is not available as a menu or dialog selection. Even if you come across this flag's documentation it is not immediately obvious why it is more efficient than using an ROI.