🗿jjabrahams567

Organisation of files

Jul 3, 2018 at 7:10pm

May I have a general question? When writing program which will access many data files to search records in them, is it important to organize the data file to hierarchical structure, so there won't be folder which have hundrets of files, but only tens, or is this hierarchy of folders irrelevant? I know when I would create algorithm to search files, this would be important because the more files are in a folder, the longer it will take to search the folder. However, if I am trying to open certain file directly, you know the path exactly like, this/is/my/crazy/path/to/a/file.txt ... But what does this mean? What does computer do with the folders information? Is there some difference if you have 1000 files in a folder or just 5 files when we try to open path with the exact path? I am using system windows XP.

Last edited on Jul 3, 2018 at 7:28pm

Jul 3, 2018 at 7:31pm

jonnin (11491)

some of this depends on the OS and its filesystem. For example windows has indexed files for searching, which mitigates some of the problems for some searches. I don't know what unix supports anymore, but it does pretty well until folders get pretty big. You can dump into one folder until you see a performance issue while testing, see where the break-even point is on a hierarchy on various target machines.

it depends some on your hardware and how big the files are and how many you really have etc. Searching 100-200 5kb text files won't make that much difference all in one folder or split across several. Its going to finish quickly. One reason to divide them would be multi threaded... can you thread the search and do 4, 8, more sub-folders in parallel? That makes more sense to me than trying to split one big folder 8 ways by hand, by getting a list of files or something.

what exactly a folder means on a filesystem depends on the filesystem, but generally think of it like a (not binary) tree. the root is the whole disk, each node is a folder or a file, leaves are files or empty folders. Knowing the path / folder / etc lets it skip to the file pretty quickly, a few hops and its there (its probably more efficient than this using various tricks, but this is good enough to get an idea of what its doing.) The exact path and name should take you directly there with minimal inefficiencies. Remember too that a file is mapped to chunks on the disk, and may be fragmented or whatever... the disk hardware has to reconcile down to bytes on a magnetic platter or SSD index.

there is nothing stopping YOU from making your own index to search from as well. You can crawl over the files every day and build a shortcut 'search' that gets you the answer you need for any file unchanged since it looked, and it can drop out to look harder at changed files (and reindex them when it does..) or something complicated like this.

Its hard to answer generally ... how many files, how big are they, how fast do you need the answer... what approach will you use (OS doing the work? You doing the work? Threaded? other stuff?).

Jul 3, 2018 at 7:45pm

h4ever (529)

Thank you for your answer. I use Win XP. I think it could be so like you have written, maybe 100 - 200 very small files. In fact I am not sure because I did not finished my app. But I see I'll have to test and see. In the moment no multi-thread, but later it would be interesting to do it. It would make the export super fast (My machine has cores threads, 8 threads).

Yet the final use of the data files would be for translation. Program would translate text using the words saved in small files. So the access would be needed to be fast. The definition of the path depends in first letter value, last letter value and length of the word. I can just guess how many files it would need to open to translate this post... Maybe 120.

Last edited on Jul 3, 2018 at 7:50pm

Jul 4, 2018 at 1:27am

jonnin (11491)

it feels like you may be doing something wrong.
I think every word in English will fit in 2-3 MB of memory in a lookup table.
If each word translated to a similar sized word on the average to the other language, it would all still fit in less than 10 MB and if arranged in a good data structure, it would do its work very quickly. 120 files to do that sounds … terrible, unless you are supporting many, many languages.

Jul 4, 2018 at 10:42am

h4ever (529)

"I think every word in English will fit in 2-3 MB of memory in a lookup table. "
I do not know why you think so. The file information is 4Kb, when it is empty.

It will depend on the translation. This is my source of data:
http://www.definify.com/word/Scout?q=scout
Let's suppose I want to translate the word scout to Mandarin (Chinese) - there is 60 chars, which could take up to 180 bytes for one item. So 186 bytes total.

Well I must to try the performance. I will let you know if it will run slow or fast.

Jul 4, 2018 at 11:15pm

jonnin (11491)

I do not know why you think so. The file information is 4Kb, when it is empty.

-- I looked up the # of words in English & the average length and rounded up. You may not have even close to every word, many of them are not used much.

-- Chinese may not agree with my caveat, where I said if the average target word is about the same as English average. It won't be for everything, of course.

-- My example of 10 MB may or may not be super accurate but even if it took 10 times that, it would still be tiny on a modern PC. The point is I don't think you need hundreds of files here, more like dozens? /shrug, its your project, I am just trying to outguess you :)

Anyway, let us know how it goes for sure.

Jul 6, 2018 at 2:21pm

dhayden (5799)

At work we deal with millions of relatively small (mostly less than 1MB) files in a distributed system that are being created, modified, read and deleted all the time. We've had to deal with this exact problem.

First, recognize that you may be doing premature optimization. I think you'll find that opening a file takes a lot more time than finding it. If you have to open a file for every word, the program will run slowly no matter where you store the files. I think h4ever is right: you're better off storing the data in a single file (or a few large ones). There might be 100,000 English words. If you store 4K for each one then that's 4MB, which is trivial on any computer built in the past 25 years.

Returning to your original question, we found that when a directory contains around 10,000 files, it becomes noticeably slow. We split the files up so that there are 100 per directory. So object 12345678 is stored in directory 12/1234/123456/ and it's name is 12345678. Of course we could have shortened it to 12/34/56/ but kept the longer names because it makes support easier.

Jul 8, 2018 at 5:16pm

h4ever (529)

dhayden:

There is one more aspect. The more items you have in the stand-alone file, the more time it takes to find a single item. Also if you need to make backup of the file after user edited it, then the bigger the file is, the more space you need on drive... So my idea was to make it easy to search the item in the file. If there is for example 20 items, than it is easy to export them all to text file and read them. Whereas if you have 1000 items, it is harder for human to find a single item in the text file... If any problem will happen during my programming, it will affect only this small area of data, which I selected to test. So I have it easy to check the testing file. While if the testing file contains 100 items, it is more likely I may not see error which can happen anywhere in the file. So at least for testing purposes, it can be good to have very small files. But later, one can choose optimal solution.

Jul 8, 2018 at 7:46pm

jonnin (11491)

are you searching the file linearly or reading it all into the program?! If you read it in the program, you can search in O(1) if you want, or whatever other slower algorithm. If reading from the file, you should use binary and fixed width strings and use a better algorithm. You can read a "binary" file full of text in a text editor; all you are doing is padding it with spaces or something so you can use an efficient search.

Is disk space a real or imagined issue? You still seem to be worried about a MB or two in a GB world. 1 GB is 1000 MB. Your data uncompressed (most OS will let you compress a folder if you need to) is a few MB best as I can tell?

Jul 9, 2018 at 3:14pm

h4ever (529)

I've choosen this solution because I did not want to read the hole file. If the file would have 2.500.000 of items, then every time I would want to search word, I would need to read whole header to be able to find the word in the file. It would take some time to find it in the big file.

I also realized that there could be identical items too. E.g.

In my language when we say "dal", it has also several meaning in English depending on its use. E.g. He have, he have given or he had given; or I gave, I have given or I had given or I would like to give, or he would like to give... So in that case I would also need to create about 6 items.

So the file could be bigger than I expected. I don't really see it as a problem if I would need to read 200 small files when they would be well organized. In fact, they would be opens just for few ms.

So I know it looks crazy but the solution is very simple and does not need much code.

Space issue. It is better the dictionary files are smaller because it will be more convenient to users who would like to download. Yet I am using Windows XP so I am limited to 3.5 GB. I still have 2.GB free RAM and 150MB for testing purposes in the Ram disk.

Jul 10, 2018 at 3:37pm

icy1 (596)

Are you trying to conserve RAM for some reason? If lots of searches will be taking place (e.g. some translation-based memory game), it makes more sense to read everything once at startup/initialization, as suggested by jonnin. Once it's in random-access memory, the searches will be fast, e.g. O(1) if exists or O(log n) to determine it doesn't for the already-sorted std::map.

You don't want too much RAM being used? Then consider making a subset of the full dictionary with the most common, e.g. conversational, words used with their translations. But first try with the full dictionary and see if it really is a problem.

As dhayden said, reading files is slow. You really only want to do this once.

Once you've created your dictionary completely (and it will no longer change), you can binary serialize/deserialize with Boost. Might be faster than manually reading it back in.

Side note: while you may use Windows XP, this is not realistic for most users, since even Windows 7 is about to get deprecated "for realz". I still use Win7 at home but will soon be changing that once I get a new processor. If it's a desktop app, you can expect 64-bit for most OS. But really, how much memory space would a map<string, vector<string>>, where most of the value translations have one item, with a couple million items even take up? If https://stackoverflow.com/questions/720507/how-can-i-estimate-memory-usage-of-stdmap is any indication, with 2million items you're looking at maybe 200MB usage.

Last edited on Jul 10, 2018 at 3:49pm

Jul 12, 2018 at 10:29pm

h4ever (529)

I already use ramdisk. I don't see the sense of loading big file into memory and also in the case of many files... I don't like the idea the program would take a lot of space in RAM. I go different way. The program itself is very small about 1MB. The thing which will take the space are the vocabularies. I believe user should choose a language which he want to study. Then he could extract the files from archive to RAM disk (in the case that he wants to perform fast searches). But if he is the lazy user, then he could install the files on harddisk. The choice is important here. Every one can choose whether to do things simple but slow or complicated but fast. I have some ideas how to make everything superfast, but here is not place and time for details.

Jul 13, 2018 at 2:52pm

h4ever (529)

Another case when is it not useful to have big file is when I would need to append unique words into the file. Imagine I have 500.000 or words and I need to go over all the items and compare with a substring... If I want to add 20 unique items, then I need to search 20 times. It much depends whether one wants to add unique items or does not mind the duplicity.

Topic archived. No new replies allowed.

Python

Forum

Organisation of files