python - Recommendation for how to best handle information for random access?


Imagine we have a filesystem tree:

    root/aa/aadata
    root/aa/aafile
    root/aa/aatext
    root/ab/abinput
    root/ab/aboutput
    root/ac/acinput
    ...

In total there are around 10 million files, each around 10 KB in size. It is key-value storage, separated into folders in order to improve speed (the filesystem dies if you put 5 million files in a single folder).

Now I need to:

  1. Archive the tree into a single big file (it must be relatively fast and have a decent compression ratio - thus, 7z is too slow).

  2. Seek within the resulting big file - so when I need the content of "root/ab/aboutput", I should be able to read it quickly.

I won't use Redis because the number of files might increase in the future and there would be no room for them in RAM. On the other hand, we can use SSD-powered servers, so data access is relatively fast (compared to HDD).

Also, it should not require an exotic file system such as squashfs or similar; it should work on ordinary ext3, ext4, or NTFS.

I thought of storing the files as simple zlib-compressed strings, remembering each string's offset, and keeping a map of offsets in RAM. Each time I need a file, I would read its offset from the map and then - using that offset - the actual content. Maybe there is something easier, or something like this that has already been done?
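A minimal sketch of that idea, assuming one big data file plus a pickled offset map (the file layout and the pack/lookup helpers are illustrative, not an existing library):

    import os
    import pickle
    import zlib

    def pack(root, data_path, index_path):
        """Append every file under `root` to one big data file as a
        zlib-compressed blob and remember (offset, length) per key."""
        offsets = {}
        with open(data_path, "wb") as out:
            for dirpath, _, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    key = os.path.relpath(path, root)  # e.g. "ab/aboutput"
                    with open(path, "rb") as f:
                        blob = zlib.compress(f.read())
                    offsets[key] = (out.tell(), len(blob))
                    out.write(blob)
        with open(index_path, "wb") as f:
            pickle.dump(offsets, f)

    def lookup(key, offsets, data_file):
        """Random access: one seek, one read, one decompress."""
        offset, length = offsets[key]
        data_file.seek(offset)
        return zlib.decompress(data_file.read(length))

Load the pickled map once at startup, keep the data file open, and each read is a single seek plus a short read, which is exactly the access pattern an SSD handles well.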

One possible strategy, based on assumptions from the information in the question: use two files, one for the "index" and a second for the actual content. For simplicity, make the second file a set of fixed-size blocks (say, 8196 bytes each). Process the files, reading each one into a programmatic structure that records the file name (key) along with the block number in the second file where its content begins. Write the file content to the second file (compressing it if storage space is at a premium) and save the index information.
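A rough sketch of that write side (the tab-separated index format and the build helper are assumptions made for illustration; the block size is the one suggested above):

    import os
    import zlib

    BLOCK = 8196  # block size suggested above

    def build(root, index_path, content_path):
        """Pack every file under `root` into block-aligned records and
        write one 'key<TAB>block<TAB>length' row per file to the index."""
        with open(index_path, "w") as idx, open(content_path, "wb") as content:
            block_no = 0
            for dirpath, _, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    key = os.path.relpath(path, root)
                    with open(path, "rb") as f:
                        blob = zlib.compress(f.read())  # compress if space is at a premium
                    content.write(blob)
                    pad = -len(blob) % BLOCK      # pad so the next record
                    content.write(b"\0" * pad)    # starts on a block boundary
                    idx.write("%s\t%d\t%d\n" % (key, block_no, len(blob)))
                    block_no += (len(blob) + pad) // BLOCK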

To retrieve, read the index file into a programmatic store such as a binary tree. If search time is a problem, you might hash the keys, store the values in a table, and handle collisions by simply taking the next available slot. To retrieve content, find the block number (and length) in the index, then read the content from the second file (expanding it if it was compressed).
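The matching read side, again as a sketch: a Python dict is already a hash table, so it covers the hashing-and-collisions suggestion without extra work (load_index and fetch are illustrative names, and BLOCK must match the value used when building):

    import zlib

    BLOCK = 8196  # must match the build step

    def load_index(index_path):
        """Read the index rows into a dict keyed by relative path."""
        index = {}
        with open(index_path) as f:
            for line in f:
                key, block, length = line.rstrip("\n").split("\t")
                index[key] = (int(block), int(length))
        return index

    def fetch(key, index, content_file):
        """Find the block number (and length) in the index, then seek and read."""
        block, length = index[key]
        content_file.seek(block * BLOCK)
        return zlib.decompress(content_file.read(length))

    # usage (keys are paths relative to the packed root):
    # index = load_index("index.txt")
    # with open("content.bin", "rb") as content:
    #     data = fetch("ab/aboutput", index, content)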

