Real Time DMS

In this mini-series of blog posts I want to focus on Papermerge features currently being developed. I will describe in detail the motivation behind those additions, their development status and the roadmap.

The features I am talking about are the following:

  • Real Time (subject of current post)
  • Dual Panel
  • Performance Improvements / Refactoring
  • OCRmyPDF integration

Each of the above-mentioned features will be detailed in a separate blog post. In this article I will write about the most exciting feature: Papermerge is a real time document management system!

What is a Real Time App?

A real time web application, simply put, is a web application which gives you feedback at the moment an event occurs.

Hm… not a very helpful definition. Let me explain by example. When you upload a document, the very first thing Papermerge does is OCR it. Depending on the document, of course, the OCR operation may take a couple of seconds, a couple of minutes or even hours (for documents of ~800 pages). The OCR process runs in the background, and all this time the user has no feedback on the status of OCR! With a real time web application that is no longer the case: Papermerge gives the user visual feedback about the status of OCR. Even more, there is a visual status for the document itself and for each of its individual pages!

OCR Statuses updated in real time

In the illustration above you see all four possible OCR statuses: unknown, pending, processing and complete. The statuses “processing” and “complete” are self-explanatory; “unknown” and “pending” need some clarification. “Pending” means that Papermerge is aware of the OCR task and has scheduled it for this document on a specific worker. In other words, one worker received the OCR task and added it to its “to do list”, so to speak, but that worker is currently busy processing some other document. Finally, “unknown” means that Papermerge has not assigned the task to any worker, because all workers are very busy and their queues are full.
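
The four statuses can be read as stages in the life of an OCR task. Below is a minimal Python sketch of that reading. It is not Papermerge's actual code: the `OcrStatus` enum, the `status_of` helper and its three boolean flags are hypothetical names introduced only for illustration.

```python
from enum import Enum


class OcrStatus(Enum):
    UNKNOWN = "unknown"        # task not handed to any worker (all queues full)
    PENDING = "pending"        # task sits in the queue of a worker that is busy
    PROCESSING = "processing"  # a worker is running OCR right now
    COMPLETE = "complete"      # OCR has finished


def status_of(assigned: bool, started: bool, finished: bool) -> OcrStatus:
    """Map hypothetical task flags to one of the four statuses."""
    if finished:
        return OcrStatus.COMPLETE
    if started:
        return OcrStatus.PROCESSING
    if assigned:
        return OcrStatus.PENDING
    return OcrStatus.UNKNOWN
```

For example, a task that a worker has accepted but not yet started would map to `OcrStatus.PENDING`.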

Every document has one or more pages. When a document’s OCR status is marked as complete, it means all its pages were OCRed. If at least one page of the document is scheduled for processing/OCR, then the whole document is considered scheduled for processing. I won’t go into all the boring details; my main point here is that OCR statuses are tracked both per document and per page. The following picture illustrates each individual page’s OCR status:

OCR statuses per page
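
The roll-up from page statuses to a document status can be sketched in Python as follows. Only two rules come from the description above: a document is complete when all of its pages are complete, and it counts as scheduled (pending) as soon as at least one page is scheduled. The function name `document_status`, the handling of "processing" pages and the "unknown" fallback are assumptions made to keep the example complete, not a description of how Papermerge implements it.

```python
from typing import List


def document_status(page_statuses: List[str]) -> str:
    """Derive a document's OCR status from the statuses of its pages."""
    if not page_statuses:
        return "unknown"  # assumption: no page information yet
    if all(s == "complete" for s in page_statuses):
        return "complete"  # from the post: all pages were OCRed
    if any(s == "pending" for s in page_statuses):
        return "pending"  # from the post: one scheduled page is enough
    if any(s == "processing" for s in page_statuses):
        return "processing"  # assumption: not spelled out in the post
    return "unknown"
```

With this sketch, a three-page document whose pages are "complete", "complete" and "pending" is reported as "pending" until the last page is done.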

All the information described above is much easier to show in a video. Below is a short screencast which I recorded specifically to demonstrate what real time DMS means and what has been developed so far.