object name length limits, metaData, hashing, and searching

Hi Jamie,

there's nothing preventing you from creating a relational database in Igor. But I think this is stretching Igor's limits a bit too much, and that you'll end up with lots of work in exchange for poor to mediocre performance.

If I were you I'd start by looking into leaving the relational database functionality to a ... relational database. Have you considered using SQL.xop in combination with something like SQLite?

March 1, 2012 at 04:12 am - Permalink

aclight

I second 741's suggestion of using the SQL XOP to do this kind of thing. But if you are not familiar with databases and/or the SQL XOP, the learning curve could be steep. And you would need to get your information into the database in the first place.

Several years ago I wrote a package that I used for collecting patch clamp data from cells. Originally, for each trace I recorded, I saved stimulus and recording parameters as KEY:value; pairs in the wave note. The problem with doing this is that a lot of the bytes in the wave note (the KEY strings) are redundant, and my files were getting too large. So I later rewrote the program to create a 2D wave in which each column was a parameter I wanted to save and each row corresponded to a single trace. I then wrote a query engine that used the Extract operation with the data in this 2D wave (and actually several other similar 2D waves that contained other information). This worked well and was plenty fast for my needs but my queries couldn't get too complicated or they would exceed the maximum length of a command line or line in a procedure file (400 characters).

An alternative would be to scrap your hash table completely and go back to just searching through each wave's wave note--but with a twist.

Here is an example function:

Function test(numWavesToGenerate)

    Variable numWavesToGenerate

    // Generate a bunch of waves in a

    // new data folder.

    NewDataFolder/O/S test

    WAVE/WAVE generatedWaves = generateWaves(numWavesToGenerate)

    // Start timing.

    Variable start = StopMsTimer(-2)

    // Instead of searching through a wave note,

    // we'll just calculate the total number of points

    // in all waves in the current data folder.

    Variable totalNumPoints = 0

    //////////////////////////////

    // Slow way

    //////////////////////////////

    // Iterate through all the waves in the current

    // data folder and get the number of points in the

    // wave.

    Variable numWaves = CountObjects("", 1);

    Variable n

    String thisWaveName

    // Start timing the first loop.

    Variable firstLoopStart = StopMsTimer(-2)

    For (n=0; n < numWaves; n+=1)

        thisWaveName = GetIndexedObjName("", 1, n)

        if (strlen(thisWaveName) == 0)

            break

        endif

        WAVE thisWave = $(thisWaveName)

        totalNumPoints += numpnts(thisWave)

    EndFor

    // Stop timing the first loop.

    Variable firstLoopStop = StopMsTimer(-2)

    print "Total Num Points of first loop:", totalNumPoints

    //////////////////////////////////

    // Fast way

    //////////////////////////////////

    // Iterate through all waves in generatedWaves.

    // Since this wave is a wave of wave references,

    // we don't have to look up each wave in the current

    // data folder by name, which gets slower the more

    // waves a data folder contains.

    totalNumPoints = 0

    // Start timing the second loop.

    Variable secondLoopStart = StopMsTimer(-2)

    For (n=0; n < numWaves; n+=1)

        WAVE thisWave = generatedWaves[n]

        totalNumPoints += numpnts(thisWave)

    EndFor

    // Stop timing the second loop.

    Variable secondLoopStop = StopMsTimer(-2)

    // Stop timing.

    Variable stop = StopMsTimer(-2)

    print "Total Num points of second loop:", totalNumPoints

    // Go back to the root data folder

    SetDataFolder root:

    // Kill the data folder containing the waves.

    KillDataFolder root:test

    // Print timing results.

    Variable total = (stop - start)/1000

    Variable first = (firstLoopStop - firstLoopStart)/1000

    Variable second = (secondLoopStop - secondLoopStart)/1000

    printf "%d waves (ms):: Overall: %g\t1st loop: %g\t2nd loop: %g\r", numWavesToGenerate, total, first, second

End

// Generates numToGenerate waves in the

// current data folder.

Function/WAVE generateWaves(numToGenerate)

    Variable numToGenerate

    Variable n

    String thisWaveName

    // Create a free wave to hold wave references to the

    // generated waves.

    Make/O/N=(numToGenerate)/WAVE/FREE waveOfWaves

    For (n=0; n < numToGenerate; n+=1)

        sprintf thisWaveName, "wave%.8d", n

        Make/O $thisWaveName /WAVE=thisWave

        waveOfWaves[n] = thisWave

    EndFor

    return waveOfWaves

End

The first loop iterates through all of the waves in the current data folder using GetIndexedObjName(). This is relatively slow because waves in a data folder are stored in a linked list. This is not a problem when there are not very many waves in a data folder, but when you have thousands of waves in the data folder, the time to iterate through all waves rapidly increases.

The second method uses the wave of wave references returned by the generateWaves() function and iterates through that wave. In this case the time to iterate through the waves is roughly linear, and it is much faster even for fewer waves because there is no need to lookup the waves from the linked list.

On a development build of Igor 6, here are the timings I get:


•test(5000)
  Total Num Points of first loop:  640000
  Total Num points of second loop:  640000
  5000 waves (ms):: Overall: 681.901	1st loop: 679.458	2nd loop: 1.19863
•test(10000)
  Total Num Points of first loop:  1.28e+06
  Total Num points of second loop:  1.28e+06
  10000 waves (ms):: Overall: 2719	1st loop: 2715.07	2nd loop: 2.54419
•test(20000)
  Total Num Points of first loop:  2.56e+06
  Total Num points of second loop:  2.56e+06
  20000 waves (ms):: Overall: 20047	1st loop: 20037.8	2nd loop: 5.97675

Note that in the real world, you would need to generate a wave of WAVE references and store that wave somewhere. Generating that wave the first time could be slow if there are lots of waves, but you only need to generate it once (unless you are adding/deleting waves in the data folder of interest).

Depending on how you are doing the search, it might even be possible to do the search using multiple threads, which could speed things up even further.

March 1, 2012 at 09:14 am - Permalink

jamie

Dr. Jamie Boyd, Ph.D.
UBC In Vivo Imaging Core

Thanks for the ideas.

Adam, it was a bit of a surprise to me to see that GetIndexedObjName(sourceFolderStr, objectType, ii) runs in linear time, traversing a linked list. I wrote functions that trawl through all the waves in a folder with getIndexedObjectName, doing matching based on wave names, or wave notes, and indeed, they are big O of n^2.

There must be some way to maintain a sorted structure of pointers to waves so you could get the nth wave without having to traverse a whole list. That's where you idea of maintaining your own database of wavereferences is interesting. You could have several lists of references to the same wave, sorted by different metadata. That is, pair each waveReference wave with a wave of corresponding metadata, and sort both waves by the metadata wave. You could find all the data waves corresponding to a range of values for some metadata quite quickly then, by binary search on the metadata wave. But as you point out, you would have to make and maintain the list of wave references.

By contrast, the WaveList(matchStr, separatorStr, optionsStr ) function runs in linear time; it can do the check for the wavename matching on each wave as it traverses the list. I use a function (below) that duplicates (and extends to other Igor objects) the WaveList function. Another advantage is that you don't have to change datafolders to use GetIndexedObjName as with WaveList. I didn't realize what a performance hit that was (n vs n^2).

function wavelistTest (nT)

    variable nT

    variable ii, iii

    variable timer, eTIme, nELs

    newdatafolder/o/s root:wlt

    killwaves/a/z

    make/o/n= (nT/10) rTWave, rElWave, rTwave1, rElwave1

    rTWave=0

    relwave =0

    display rTWave vs rElWave

    appendtograph/r rTwave1 vs rElwave1

    modifygraph mode=2, lsize=2

    modifygraph rgb (rTwave1) = (0,0,65535)

    doupdate

    for (ii=0; ii < nT; ii+=1)

        for (iii=0; iii < 3; iii += 1)

            make/o/n = (100) $"W_xxx" + num2str (3*ii+iii)

            make/o/n = (100) $"W_xxx" + num2str (3*ii+iii)

            make/o/n = (100) $"W_yy" + num2str (3*ii+iii)

        endfor

        if (mod (ii, 10) ==0)

        timer = startmstimer

        nELs= itemsinlist (wavelist ("*yy*", ";", ""), ";")

        eTIme = stopMSTimer(timer )

        rTWave [ii/10] = eTIme

        rElWave [ii/10] = nEls

        timer = startmstimer

        nELs= itemsinlist (ListObjects("root:wlt", 1, "*yy*", 0, ""), ";") // uses getIndexedObject in a loop

        eTIme = stopMSTimer(timer )

        rTWave1 [ii/10] = eTIme

        rElWave1 [ii/10] = nEls

        endif

        if (mod (ii, 100) ==0)

        doupdate

        endif

    endfor

end

//******************************************************************************************************

Function/s ListObjects(sourceFolderStr, objectType, matchStr, givepath, erstring)

    string sourceFolderStr // can be either ":" or "" to specify the current data folder. You can also use a full or partial data folder path.

    variable objectType / 1 = waves, 2= numeric variables, 3= string variables, 4 = data folders

    string matchstr // limit list of objects with this wildcard-enhanced string. use "*"  to get all objects    

    variable givepath // use 1 to get full or relative filepaths depending on sourceFolderStr. use 0 to get just the objects with no path.

    string erstring // a string to return if no objects are found

    string objName =""  // Will contain all the objects in the folder, one by one as we iterate through the index

    string liststr = ""   // will contain the list of objects found and matched 

    variable numObjs //  the number of objects in the folder under consideration

    variable ii  // used to iterate through the list of objects

    if (!(datafolderexists (sourceFolderStr)))

        return "\\M1(" + sourceFolderStr + " does not exist." // why the "\\M1(" ? It is a formatting code for disabling choices in pop-up menus

    else

        numObjs =  CountObjects(sourceFolderStr, objectType)

        for (ii = 0; ii < numObjs; ii += 1)

            objName = GetIndexedObjName(sourceFolderStr, objectType, ii)

            if (!(stringmatch(objname, matchStr )))

                continue

            endif

            if (givepath & 1)

                if ((cmpstr (sourceFolderStr [strlen(sourceFolderStr)-1], ":")) == 0)

                    objName = sourceFolderstr + objname

                else

                    objName = sourceFolderstr + ":" + objname

                endif

            endif

            liststr += objname + SelectString(((objectType == 4) && (!(givePath &2))), ";", ":;")

        endfor

        if ((strlen (liststr)) < 2)

            liststr = erstring

        endif

        return liststr

    endif

end

Thinking about the idea of making a wave of wavereferences for indexing: Another simpler to set up and maintain solution might be to have a dictionary containing all the metadata keys mapped to individual letters, and, for each key, another dictionary of all the values for that key mapped to an integer. This mapping would provide short letter/number codes that could be used in wavenames. So the keys would be A,B,C, etc. and the values would be 1,2,3 etc, Wave named would look like A12B31C6. With 32 character length, you could have 7 keys, each of which could contain a maximum of 2^4 values, and still have 2 characters left over. Selecting all the waves that contained a particular value of a particular key would be a matter of finding the corresponding letter code for the key, the integer code for the value, and using the wavelist function, which at least gives us linear time, (and does it in compiled code, not Igor script).

741, the problem with using a database is it requires getting the data into a database, serving the database, and connecting to it. Also, a lot of our data are long waveform traces, or multiframe images. Not the stuff you want to be passing back and forth over a network. That being said, a lot of the analysis is done after some processing, extracting relative fluorescence changes from ROIS, eg. It may make sense to put those results into a database, and then use queries to easily look at results across multiple experiments.

While the topic of databases and metadata-rich scientific data, especially images is at hand, as anyone looked at aOMERO? http://www.openmicroscopy.org/site

OMERO is client-server software for visualisation, management and analysis of biological microscope images.

Looks kind of cool, and it has open source libraries that it would be possible to call from an Igor XOP. There is already Matlab access for it.

jamie

March 2, 2012 at 02:39 pm - Permalink