Exporting a lot of files at once from M-Files
Background
I've written about how to do a mass file import into M-Files here and here before, but recently I was contacted by a client who had quite the opposite problem - he wanted to export a lot of files out of M-Files.
Getting Info
After a quick Skype call to get to know the client and the details of the project, the following facts were available:
- The data resides in a single M-Files Vault
- It's backed by a whopping 230+ GB SQL Server Database
- Other ways to export (direct access to the SQL Server and manual exporting) had failed
- An export of all document files (.docx, .pdf, ...) is needed
- All properties of the files should be exported to a CSV file
- Time was of the essence (when is it not?)
What didn't work
It seems that the client first tried to export the data directly from the SQL Server, but I heard that this approach failed because they couldn't work out what goes where. From a software engineering perspective, this is fair, as the data storage is an implementation detail that can change at any time (for example by using another database backend).
Next they tried to export the files manually. That's not only slow, but also awfully error-prone. If you're interested in the metadata as well, you really shouldn't go down this route even with a few documents, and in this case we had a few hundred thousand.
So what's the alternative?
You probably know the right approach: a small custom application that uses the M-Files API to access all documents programmatically.
Solution
Armed with this knowledge we can formulate the characteristics of the solution:
- Create the files
- Create metadata
- Reliable
- Fast
- Inexpensive
Considering that the tool needed to be done quickly and that the development cost had to stay low, I decided on writing a commandline application. Another reason was that it did not need to look fancy and would be operated by skilled IT personnel who preferred a simple commandline interface anyway. I would have been happy to create a WPF application like Chronographer, but that would have been overkill.
Exporting the files
As I had prior experience with the M-Files API, it didn't take me long to get the file export running. In the screenshot below you see the results when run against the Sample Vault.
Exporting the Data to CSV
A little more interesting but still straightforward was the CSV export, as you need to know all properties up front to create the CSV header and to put each value in the right column. To do this, I enumerate all classes in the Vault, collect their properties and write them into the header row.
A feature of M-Files is that a document can contain zero or more files. For me this meant that a found document could result in zero exported files (if it didn't hold any) or in more than one file, in which case the CSV export also needed to repeat the properties for each file.
I settled on creating 4 fixed CSV columns followed by the properties of all classes. The 4 columns are:
- FilePath: allows mapping to the exported files
- FileId: the id of the document in the M-Files Vault
- SourceFileId: the id of the binary file document (.docx, .pdf, ...)
- ClassName: the name of the class that the exported file belongs to
Below you'll find a screenshot of the result when run against the Sample Vault.
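The CSV layout described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the dictionary shapes for classes and documents, and the names `build_header`, `rows_for_document` and `write_csv`, are made up for this example and are not part of the M-Files API. It also assumes that a document without files produces no row at all.

```python
import csv

# The four fixed columns, followed later by the properties of all classes.
FIXED_COLUMNS = ["FilePath", "FileId", "SourceFileId", "ClassName"]


def build_header(class_properties):
    """Return the fixed columns followed by every property of every class.

    `class_properties` maps a class name to its property names (hypothetical
    shape). Properties shared by several classes get a single column.
    """
    header = list(FIXED_COLUMNS)
    seen = set(header)
    for props in class_properties.values():
        for name in props:
            if name not in seen:
                seen.add(name)
                header.append(name)
    return header


def rows_for_document(doc):
    """Yield one row per file, repeating the document's properties each time."""
    for f in doc["files"]:
        row = dict(doc["properties"])
        row.update({
            "FilePath": f["path"],
            "FileId": doc["id"],
            "SourceFileId": f["id"],
            "ClassName": doc["class_name"],
        })
        yield row


def write_csv(out, class_properties, documents):
    """Write the header row and one row per exported file to `out`."""
    # restval="" leaves the columns of other classes empty for this row.
    writer = csv.DictWriter(out, fieldnames=build_header(class_properties), restval="")
    writer.writeheader()
    for doc in documents:
        writer.writerows(rows_for_document(doc))
```

Because the header is the union of all class properties, every row only fills the columns of its own class and leaves the rest blank.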
Enumerating the files
An interesting problem was enumerating all the files to export. I settled on creating a search for the documents that skips deleted objects. As the maximum number of search results is capped at 100,000 items, you are not able to fetch all documents with a single search. I solved that by adding an additional search condition that restricts the search to ids within a specified segment, where segment 0 for example covers ids 0-9999 and segment 1 covers ids 10000-19999. By repeating the search and incrementing the segment each time, I was able to traverse the whole vault.
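The segment arithmetic is simple enough to show as a minimal Python sketch. The `search` callable is a stand-in for the real M-Files search (non-deleted documents whose id lies in the given range) and is purely hypothetical. The sketch also assumes that the highest object id (`max_id`) is known up front, which is one possible way to decide when to stop.

```python
SEGMENT_SIZE = 10_000  # ids per segment, matching the 0-9999 / 10000-19999 example


def segment_range(segment, size=SEGMENT_SIZE):
    """Return the inclusive id range of a segment, e.g. segment 1 -> (10000, 19999)."""
    first = segment * size
    return first, first + size - 1


def iter_document_ids(search, max_id, size=SEGMENT_SIZE):
    """Walk the vault segment by segment and yield every id that `search` returns.

    `search(first, last)` is a hypothetical callable standing in for the
    M-Files search restricted to ids in [first, last]. Each call returns at
    most `size` results, which stays well below the search result cap.
    """
    last_segment = max_id // size
    for segment in range(last_segment + 1):
        first, last = segment_range(segment, size)
        yield from search(first, last)
```

Empty segments cost one search each, so gaps in the id space only slow the traversal down and never stop it early.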
Complications
As always, nothing works perfectly on the first try. The CSV export completed successfully, but after the file export had been running for a few hours, M-Files threw an exception with the error message "An SQL update statement yielded a probable lock conflict. Lock request time out period exceeded.", which seemed like an M-Files glitch to me.
Anyway, as we needed to try again and were loath to start the export from the beginning, I added another commandline argument that allowed us to specify a starting segment, so that we could continue the export where it had stopped.
The final commandline interface looked like this:
Additional parameters like the Vault Name and the Credentials are stored in a config file in the same folder and read by the application at startup.
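As a rough illustration of resuming from a given segment, here is a small Python sketch. The flag name `--start-segment` and the functions `parse_args` and `run_export` are invented for this example and do not reproduce the tool's actual interface; `export_segment` is a hypothetical callable that exports all files of one segment.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) --start-segment option; it defaults to 0."""
    parser = argparse.ArgumentParser(description="Segment-wise export (sketch).")
    parser.add_argument("--start-segment", type=int, default=0,
                        help="first segment to export; earlier segments are skipped")
    return parser.parse_args(argv)


def run_export(export_segment, last_segment, start_segment=0):
    """Export segments in order.

    Returns None when every segment succeeded, otherwise the number of the
    segment that failed, so the next run can be started from there.
    """
    for segment in range(start_segment, last_segment + 1):
        try:
            export_segment(segment)
        except Exception as exc:  # e.g. a lock conflict reported by the server
            print(f"segment {segment} failed: {exc}")
            print(f"rerun with --start-segment {segment}")
            return segment
    return None
```

Restarting at the failed segment re-exports at most one segment's worth of files instead of the whole vault.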
The second run completed without any errors: we exported over 200K files and produced a 120MB CSV file. In the end, we had a repeatable process that saved the client a lot of time, money and headaches.
References
- Blog post: M-Files mass file import Part I (http://lostindetails.com/blog/post/"M-Files-mass-file-import-Part-I)
- Blog post: M-Files mass file import Part II (http://lostindetails.com/blog/post/"M-Files-mass-file-import-Part-II)
- Chronographer, a Wpf Application I wrote (http://lostindetails.com/chronographer)