java - How to find all duplicated files in subfolders from a huge set of files/subfolders in the shortest time span?


Situation: I have built an application in Java that runs on a LAN, on a PC used as a server. The PC is dedicated to the application and has the following configuration: Core i7, 8 GB RAM. The app is aimed at:

  • Storing data for a huge file-manipulation project (around 12 million files of 200 KB each) in a database (MySQL).
  • The server PC also acts as the database server, and 20 PCs interact with the database all day long.
  • The project process consists of multiple stages. For each stage there are separate folders, and the software moves files over the network to a separate storage server of 16 TB.
  • The network cable between the server PC and the storage server is a gigabit cable, while the other network cables are normal ones.
  • In this process, 60,000 new files are generated every day and 100,000 records are inserted into the database, and at the end of the day the app sends a mail report to the client.

Problem: Our client has asked us to provide a mechanism to identify files with the same name and submit them separately. We cannot rely on the database records, since at different stages users delete or modify files. On the other hand, we need to think about time: the duplicate check has to be done every day before sending the report. The preferred solution would be Java and/or MySQL based. What I have tried:

  • Searching and indexing the files on the storage server, but I changed my mind when the program had been running for 3 hours and was still going (a minimal sketch of such a scan is shown after this list).
  • The database accepts duplicate names, so I cannot put a unique constraint on the file-name column, nor add a file-count column with a unique constraint on file names, since that slows down data entry (they use batch inserts). I also do not want to create a table that keeps unique file names for this purpose only, since that is redundant.
  • I also tried multi-tasking.
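
A minimal sketch of the kind of scan I mean, just for illustration: the storage path below is a placeholder for the storage-server mount, and everything is grouped in one in-memory map, which may well be part of why it is so slow over the network:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class DuplicateNameScan {
        public static void main(String[] args) throws IOException {
            // Placeholder: mount point / UNC path of the 16 TB storage server.
            Path storageRoot = Paths.get("/mnt/storage");

            // Group every regular file path under its bare file name;
            // any name that maps to more than one path is a duplicate.
            Map<String, List<Path>> byName;
            try (Stream<Path> paths = Files.walk(storageRoot)) {
                byName = paths.filter(Files::isRegularFile)
                              .collect(Collectors.groupingBy(p -> p.getFileName().toString()));
            }

            byName.forEach((name, locations) -> {
                if (locations.size() > 1) {
                    System.out.println(name + " -> " + locations);
                }
            });
        }
    }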

Question: What is the best solution in Java/MySQL to check for files with the same name across a huge number of files/folders, over a busy network, in the minimum amount of time, keeping in mind that the database records are not accurate?

I had a similar situation before; well, not de-duplication, but categorization. There are not many ready-made tools that are free and open source and can take the information into a database. But, after a long hunt, I did find one great, useful tool: DirectoryListPrint.

http://download.cnet.com/directory-list-print-pro/3000-2248_4-10911895.html

The last time I looked, it had a free version that can dump the data in a CSV-like format in its own window or to a CSV file. From there, take it into a simple database such as Microsoft Access or SQL Server or something, and run a query to find the duplicate files. If you have to do this repetitively, use an automation tool such as AutoIt or AutoHotkey to automate the task.
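
As an illustration only (the schema, table, and column names below are placeholders you would create yourself after importing the CSV, not something the tool sets up for you), the duplicate check itself is a simple GROUP BY query, which you could run from Java via JDBC since you prefer a Java/MySQL solution:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class FindDuplicateNames {
        public static void main(String[] args) throws SQLException {
            // Placeholders: a MySQL schema "filedb" with a table
            // file_listing(file_name, full_path) loaded from the CSV dump.
            String url = "jdbc:mysql://localhost:3306/filedb";
            String sql = "SELECT file_name, COUNT(*) AS copies "
                       + "FROM file_listing "
                       + "GROUP BY file_name "
                       + "HAVING COUNT(*) > 1";

            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement ps = con.prepareStatement(sql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("file_name")
                            + " appears " + rs.getInt("copies") + " times");
                }
            }
        }
    }

If the listing table grows to millions of rows, an index on file_name keeps this kind of GROUP BY query fast.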

