|
Post by Clive L on Jun 29, 2021 11:02:12 GMT -5
The version I'm using is a couple of releases back (2.3.21026), so not sure if this is identified in later versions, but it appears that SPFLite has some performance issues when large directories are in use. I have just tried listing a directory containing around 11,600 members. This took many seconds to display (it did come up eventually). Having used members within the list, I then filtered the list, and at this point it issued a couple of crash files (attached):
|
|
|
Post by Clive L on Jun 29, 2021 13:34:52 GMT -5
Well ok, rename "bug" to "issue" if you prefer. I called it a bug, because some number of releases ago I used to get almost identical symptoms with the internal loop message, which was then followed by an actual crash (this was attempting "Find Files" after a long period with many files open - an issue that was subsequently fixed). The difference now is that okaying the pop-up means that execution continues, whereas previously it used to crash. My bad. Perhaps the file name ought not to contain the word "crash"?
Regarding your point about Notepad, I just tried it and the folder appeared in the Open dialogue almost instantaneously.
|
|
|
Post by George on Jun 29, 2021 14:12:09 GMT -5
Clive: Does your FM layout contain any of the extended Properties? If so, this requires quite literally an open of every file to fetch the Metadata.
Frankly I doubt you do, so I will try and setup some test folders to check this out.
But Robert is quite right 11K+ files must be a chore to manage.
George
|
|
|
Post by George on Jun 29, 2021 14:54:37 GMT -5
Robert: Without writing a pile of custom code, reading folders in 'chunks' or some other threaded process is non-trivial. All the API functions (whether Windows or PB) work the same, like - Issue the Start search parameters, then Do while Not-Done, Loop.
So tell me, how do you 'thread' that? The FM screen display wants to display a list, not some moving target of a list which is being dynamically added to by some other thread.
No, threading won't help here without some kind of major internal re-structuring, and basically that ain't going to happen.
What I have to do is add some timing traces and see just where this current process is spending all its time.
I created a folder with 12,000 files in it, and yes, it's painful to load. So until I figure out just where the time is being wasted, all solutions are off the table, we just don't know yet what we're trying to fix.
George
|
|
|
Post by Clive L on Jun 29, 2021 15:17:48 GMT -5
George: No, these are just a bunch of text files (COBOL programs in fact).
This whole thing is handled much more gracefully since you reworked the code a while back to keep things going after hitting a problem. Just thought it would be worth highlighting. Thankfully I'm not using folders of that size very often.
I'm guessing there's a bit of "re-work" going on each time you switch from an open file back to FM, as each time this happens the screen goes blank while it (re)processes the directory.
|
|
|
Post by George on Jun 29, 2021 15:51:58 GMT -5
Clive: I can see the problem quite easily with my test folder. My problem is simply "where the heck is it spending all that time?"
The PB compiler has a nice PROFILE feature that measures where you 'spend' your time, but right now it seems not to be working 100%, so I'm struggling with that. One way or the other, I'll track this down.
What I'm concerned about is that recent versions have converted the FM 'List' into a list of Objects that hold all the file information. I'm hoping that the overhead is not related to switching to Objects from simple arrays of TYPES. That would be really tough to overcome without dropping the Object structure and going back to TYPE arrays.
We'll see.
George
|
|
|
Post by George on Jun 30, 2021 9:05:29 GMT -5
Robert: a) There is no API which returns the count of files in a folder; if you want the count, you have to read the whole folder anyway:
hFirst = FindFirstFile(Path + "*.*", FD)
IF hFirst = %INVALID_HANDLE_VALUE THEN EXIT METHOD
DO
    INCR count
LOOP WHILE FindNextFile(hFirst, FD)
FindClose hFirst
b) I will simply never attempt this virtualization, multi-thread approach. Period.
c) I'll continue looking at determining where the slowdown is and trying to optimize it.
d) But while I sympathize with Clive, I'm not about to embark on some major revisions for a single user's unusually sized folder. SPFLite is still basically an Editor, not a File Manager.
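Point (a) above holds in any language: there is no cheap "count" call, the whole folder has to be walked entry by entry. A hedged Python sketch of the same pattern (`os.scandir` plays the role of the FindFirstFile/FindNextFile loop; this is illustrative only, not SPFLite code):

```python
import os
import tempfile

def count_entries(path):
    """Count directory entries. There is no 'count' API;
    the whole folder must be walked entry by entry."""
    count = 0
    with os.scandir(path) as it:   # analogous to FindFirstFile
        for _ in it:               # analogous to the FindNextFile loop
            count += 1
    return count

# Demo: a temporary folder with 5 files.
with tempfile.TemporaryDirectory() as d:
    for i in range(5):
        open(os.path.join(d, f"file{i}.txt"), "w").close()
    print(count_entries(d))  # 5
```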
George
|
|
|
Post by George on Jul 1, 2021 10:50:28 GMT -5
Robert: I tried your suggestion of reading with a small test program. The slowdown is not in the reading. It reads the folder, allocates a new FM Data Object for each entry, and stores the basic FD directory data. (The full program also completes a lot of other Object fields with formatted data, which this doesn't.) And this runs basically instantly. And the BigFolder is on a real hard drive, not an SSD. So, I'll just have to continue my search for who/what/where the overhead is coming from. Here's the test program.
George

GLOBAL ghDebug AS DWORD
GLOBAL gFMD() AS iFMData            ' Latest Dir scan results (Object pointers)

FUNCTION PBMAIN () AS LONG
    LOCAL i, j AS LONG, t AS STRING
    LOCAL FD AS DIRDATA
    LOCAL FP AS ASCIIZ * %MAX_PATH
    DIM gFMD(1 TO 100) AS GLOBAL iFMData

    FP = "E:\BigFolder\*.*"
    Debug "Starting"
    t = DIR$(FP, 22 TO FD)
    DO WHILE ISNOTNULL(t)
        INCR i
        IF i > UBOUND(gFMD()) THEN
            REDIM PRESERVE gFMD(1 TO 2 * i) AS GLOBAL iFMData
        END IF
        LET gFMD(i) = CLASS "cFMData"
        gFMD(i).FD = FD
        t = DIR$(NEXT, TO FD)
    LOOP
    Debug "Ended"
    Debug "Read " + FORMAT$(i) + " records"
    Debug "Record " + FORMAT$(i) + " = " + gFMD(i).Filename
    MSGBOX "OK"
END FUNCTION
|
|
|
Post by George on Jul 1, 2021 13:13:47 GMT -5
Robert: StringBuilder doesn't help here. Its purpose is to build ONE string very efficiently from a bunch of individual additions. Like building a text file by joining lines + $CRLF over and over and then writing the entire file as one big 'string'.
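To make the distinction concrete, here is the string-builder idea in Python terms (a sketch for illustration; it has nothing to do with duplicate detection, which is the actual bottleneck being hunted here):

```python
lines = [f"line {i}" for i in range(4)]

# Naive approach: each += copies the whole accumulated string again,
# so building n lines costs O(n^2) character copies.
text = ""
for ln in lines:
    text += ln + "\r\n"

# Builder-style: collect the pieces, join once at the end.
built = "\r\n".join(lines) + "\r\n"

assert text == built
print(built)
```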
No, I've just got to track down the guilty party here.
George
|
|
|
Post by George on Jul 1, 2021 15:53:41 GMT -5
OK, I've identified one routine that is certainly accounting for the vast majority of the overhead.
The routine that reads file folders is driven by a requirement created by FILELISTs, that is, the user can request multiple folder reads, with multiple filemasks, and this can result in duplicate entries in the final displayed list.
So, before adding an entry to the final displayed file list, a call is made to a routine whose sole function is to say YES/NO if the file is already in the list. This is the routine that's creating the extreme overhead.
And as the # of files increases, this routine aggravates things more and more.
And it's pretty hard to optimize: the list is not ordered in any way at this point.
METHOD IsAFDup(fn AS STRING) AS LONG
    '---------- See if a filename is already in the gFMD array
    REGISTER i AS LONG
    FOR i = 1 TO gFMDCtr                  ' Spin through the table
        IF fn = gFMD(i).FullPath THEN METHOD = %True : EXIT METHOD
    NEXT i
END METHOD
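The cost structure of that routine is easy to see in miniature (Python used purely for illustration; names are hypothetical): the dup check is a linear scan over an unordered list, so every file added scans everything added before it, and building a list of n files costs O(n²) comparisons overall.

```python
def is_a_dup(fn, file_list):
    """Linear scan, mirroring IsAFDup: O(len(file_list)) per call."""
    return fn in file_list   # 'in' on a Python list compares every element

file_list = []
names = [f"PROG{i:05}.CBL" for i in range(1000)]
for name in names:
    if not is_a_dup(name, file_list):   # the scan grows with every add
        file_list.append(name)
print(len(file_list))  # 1000
```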
Not sure how to handle this yet, still thinking.
George
|
|
|
Post by George on Jul 2, 2021 9:51:12 GMT -5
Robert: My thoughts right now are to simply eliminate the Dup check if the FILELIST has only 1 entry. Note: A FilePath is simply fudged as a 1 item FILELIST internally, so they all use the same reading logic.
I seriously doubt anyone would ever have a FILELIST containing a request for a huge folder like this. Going to test this out this morning. It's obviously the simplest solution.
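The shape of that fix can be sketched as follows (a hedged Python illustration with hypothetical names, not SPFLite's actual routine): run the duplicate check only when more than one path/mask could contribute the same file.

```python
def read_folders(paths, list_files):
    """Only run the duplicate check when more than one
    path/mask can put the same file into the final list."""
    need_dup_check = len(paths) > 1
    out, seen = [], set()
    for p in paths:
        for fn in list_files(p):
            if need_dup_check:
                if fn in seen:        # dup check, only when it can matter
                    continue
                seen.add(fn)
            out.append(fn)
    return out

# Demo: two overlapping masks need the check; a single path skips it.
fake = {"a": ["x.cbl", "y.cbl"], "b": ["y.cbl", "z.cbl"]}
print(read_folders(["a", "b"], fake.get))  # ['x.cbl', 'y.cbl', 'z.cbl']
print(read_folders(["a"], fake.get))       # ['x.cbl', 'y.cbl']
```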
George
|
|
|
Post by George on Jul 2, 2021 10:28:40 GMT -5
Robert: Clive: OK, I've altered the data reading routine to skip the DUP check when there is only one path being loaded (e.g. a FilePath FM mode, or a FILELIST with only 1 entry)
The improvement is way more than enough to satisfy our needs. I tried with both a hard drive and an SSD drive, barely any difference. It loads and displays the 12,000-item folder in about 2 seconds.
This will be in the next release.
George
|
|
|
Post by Stefan on Jul 3, 2021 11:40:51 GMT -5
George,
Just an idea. I don't know if this helps at all in PB, but I had a similar issue in REXX with a dictionary application: how can I quickly find whether any given word is in the dictionary or not?
I read the dictionary file and built the DICT. array. The point was that the elements were the words, e.g. if the word "crater" is in the dictionary, the variable called "DICT.CRATER" exists. If it doesn't exist, the word isn't in the dictionary. In my case the value of that variable was irrelevant.
Luckily, REXX offers a Method called hasIndex so checking the array is as simple as IF DICT.~hasIndex("CRATER") THEN...
Hope this helps.
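Stefan's REXX trick is hash-based membership: the existence of the key IS the information, and the value is irrelevant. The same idea in Python (illustrative only) replaces the linear scan with an O(1) average lookup:

```python
# Hash-based membership, the Python analogue of REXX's DICT.~hasIndex:
seen = set()
for word in ["crater", "editor", "crater"]:
    key = word.upper()        # REXX stem tails are uppercased too
    if key in seen:           # O(1) average hash lookup, no list scan
        continue              # already in the "dictionary"
    seen.add(key)
print(sorted(seen))  # ['CRATER', 'EDITOR']
```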
|
|
|
Post by George on Jul 3, 2021 12:33:50 GMT -5
Stefan: PowerBasic provides a Collection Object, which performs mostly the same way, except it stores key/value pairs. So that I could say:
CollectionName.Add("crater", 1) to add the name (all names would have a 'value' of 1)
and then ask
IF CollectionName.Contains("crater") THEN ... True if crater exists.
Never used it this way, have to benchmark it for performance, but most PB provided tools are bloody efficient.
I'll try it out just to see. Let you know.
George
[UPDATE] Only did a quick test in a small test program, but the Collection seems pretty fast.
I did a test with my BigFolder (12,000 files)
Using a simple text array, and doing a serial search for DUPS before adding a new entry, it takes around 3100 ms.
Using a PB Collection Object, and doing a Contains search before adding a new entry, it takes around 300 ms.
So the PB code is pretty fast.
But right now I'll stick with just avoiding the DUP search for single path loads. But now knowing the performance of the PB Collection Object it will certainly be considered for other uses.
[/UPDATE]
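The ~10x gap George measured (3100 ms vs 300 ms) is the generic linear-vs-hash difference and can be reproduced in spirit in any language. An illustrative Python version of the same comparison (absolute timings will differ by machine and language):

```python
import time

names = [f"FILE{i:05}.TXT" for i in range(12000)]

# Linear-scan dedup, like the original IsAFDup loop: O(n^2) total.
t0 = time.perf_counter()
lst = []
for n in names:
    if n not in lst:          # scans the whole list each time
        lst.append(n)
linear_ms = (time.perf_counter() - t0) * 1000

# Hash-based dedup, like a Collection/dictionary: O(n) total.
t0 = time.perf_counter()
s = set()
for n in names:
    if n not in s:            # constant-time average lookup
        s.add(n)
hashed_ms = (time.perf_counter() - t0) * 1000

print(f"linear: {linear_ms:.0f} ms, hashed: {hashed_ms:.0f} ms")
```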
|
|
|
Post by George on Jul 3, 2021 14:14:50 GMT -5
Robert: PB has no problem calling external DLL routines, after all, all the Windows API calls are just DLL calls.
I've looked at a bunch of sample code in the PB Forums for a variety of collections, hash trees, etc. (there are tons available). But really, I think using the built-in PB support is fine. Stuffing in a dummy '1' value for the data part of key/data is a nit. In fact some of the Dictionary code I reviewed in the forums still had an associated index/pointer/whatever data portion, even though a dictionary has no requirement for it.
There's even wrappers to utilize the Dictionary object included in the Microsoft Script Runtime (scrrun.dll).
I'll stick with PB (if I ever decide to use a dictionary). It's simple, it works, it's documented, and it seems more than fast enough.
George
|
|