Tuesday, February 15, 2011

Applying Business Intelligence to Bug Tracking

Last week I released AnkiDroid 0.5.1, and judging by the Android Market's comments, people seem to love it :-)

For a few releases now, an opt-in feedback mechanism has been sending us a report every time a problem happens. The anonymous reports are automatically scanned by a Google App Engine application to determine whether each one is a new bug or just an additional occurrence of an already known bug. The data can be used in two ways:

1) An online application allows one to browse the reports and bugs, see which bugs happen the most often for a given version, and associate them with issues in the bug tracker.

2) A set of Business Intelligence tools allows one to drill down into the reports in a multidimensional OLAP cube, and generate reports to show any interesting findings. As a quick example, here is the distribution of crashes among Android versions. These tools are built on the open source Pentaho Business Intelligence suite.
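
The "new bug or known bug" check described above can be pictured as grouping reports by a signature. What follows is only a minimal sketch, not the actual App Engine code: the report fields (`stacktrace`, `android_version`) and the signature rule (exception type plus stack frames, ignoring the message) are assumptions made for illustration.

```python
import hashlib
from collections import Counter, defaultdict


def signature(stacktrace):
    """Hypothetical bug signature: exception type plus the 'at ...' frames.

    The exception message is ignored so that two crashes at the same place
    with different messages count as the same bug.
    """
    lines = [line.strip() for line in stacktrace.strip().splitlines()]
    if not lines:
        return hashlib.sha1(b"").hexdigest()
    exception_type = lines[0].split(":", 1)[0]
    frames = [line for line in lines[1:] if line.startswith("at ")]
    key = "\n".join([exception_type] + frames)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def group_reports(reports):
    """Group reports by signature: one key per bug, one list entry per occurrence."""
    bugs = defaultdict(list)
    for report in reports:
        bugs[signature(report["stacktrace"])].append(report)
    return dict(bugs)


def crashes_per_android_version(reports):
    """Count reports per Android version (the distribution mentioned above)."""
    return Counter(report["android_version"] for report in reports)
```

A report whose signature is not yet a key in the grouped result would be a new bug; otherwise it is one more occurrence of a known one. The per-version count is the flat equivalent of slicing the OLAP cube along a single dimension.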

Friday, February 4, 2011

Alfresco: Categories vs. Spaces

Managing huge numbers of documents requires knowing the limits of the ECM software you are using. Here is a study I performed about the limits and best strategies for Alfresco.

Categories vs. Spaces

In Alfresco, documents are usually organized hierarchically in spaces (a kind of folder). But what about using Alfresco's "categories" feature instead of spaces?

Note: For all graphs in this article, horizontal axis = number of documents, vertical axis = time taken in milliseconds

This graph shows the time taken to display a space, based on the number of documents this space contains, and the same for a category. There are no sub-categories or sub-spaces involved.
For the same number of documents, categories display faster than spaces. That is especially true above 100 documents. In a space, time is proportional to the number of documents. In a category, time grows roughly logarithmically.
For huge numbers of documents, categories display in less than 3 seconds, whereas spaces take a very long time and then only show Java out-of-memory errors.
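
One way to check the "proportional" versus "logarithmic" reading of such curves is to fit a straight line against both n and ln(n) and compare the quality of the fits. The sketch below is an assumption about how that could be done, not the method used for this study. The function names are hypothetical, and the measurements have to be supplied by the reader, since the raw numbers are not part of this article.

```python
import math


def r_squared(xs, ys):
    """R^2 of the least-squares line y = a*x + b (squared correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("need at least two distinct x values and y values")
    return (sxy * sxy) / (sxx * syy)


def better_model(doc_counts, times_ms):
    """Compare time ~ n (linear) with time ~ ln(n) (logarithmic).

    Returns the name of the better-fitting model and both R^2 values.
    """
    linear = r_squared(doc_counts, times_ms)
    logarithmic = r_squared([math.log(n) for n in doc_counts], times_ms)
    name = "linear" if linear >= logarithmic else "logarithmic"
    return name, linear, logarithmic
```

Calling `better_model` once with the space measurements and once with the category measurements would show whether the visual impression from the graph holds up numerically.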

Impact of the spaces hierarchy on performance

To measure the impact of hierarchy, a comparison was done between two strategies for organizing files in spaces:
(1) All files in about five spaces.
(2) Each file contained in its own 3 levels of sub-spaces (subspace1/subspace2/subspace3/file).
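
To make the two layouts concrete, here is a small sketch of the paths each strategy produces. The naming scheme (`space0`, `subspace1_7`, and so on) is purely hypothetical, and spreading files round-robin over exactly five spaces is an assumption standing in for "about five spaces".

```python
def flat_path(index, filename, spaces=5):
    """Strategy (1): all files spread over a handful of spaces."""
    return "space{}/{}".format(index % spaces, filename)


def nested_path(index, filename):
    """Strategy (2): each file sits alone under its own 3 levels of sub-spaces."""
    return "subspace1_{0}/subspace2_{0}/subspace3_{0}/{1}".format(index, filename)


def parent_spaces(paths):
    """Return the set of spaces that directly contain a file."""
    return {path.rsplit("/", 1)[0] for path in paths}
```

With 10000 files, strategy (1) puts 2000 files in each space, while strategy (2) creates 10000 leaf spaces holding one file each.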

This graph shows the time taken by Alfresco's explorer to show a category, based on the number of documents that are shown.
Surprisingly, having the files scattered across many different folders is more efficient.
Alfresco seems to have difficulties handling many files in the same space.
This has to be taken into account when analyzing the category performance tests: they use strategy (1), the slower one.

Impact of the number of categories applied

This graph shows the time taken by Alfresco's explorer to show a category, based on the number of documents in this category.
Two tests were done, with different numbers of categories per document.
As one would expect, display is faster when each document has 3 categories than when each document has 10.

Impact of the size of the repository on performance

This graph shows the time taken by Alfresco's explorer to show a category, based on the number of categorized documents in the repository.
Each document has 3 categories randomly selected from a pool of 20 existing categories.
The different curves show different usages of the explorer:
- Navigation in the categories tree view, with subcategories inclusion checked/unchecked.
- With or without 10000 additional uncategorized documents.
- First click or after three clicks (to measure cache performance)
- Search in a root category (Software Document Classification) including subcategories
- Search in a leaf category (Configuration Description)

Response time seems to be proportional to the number of documents in the repository at first, and then becomes more stable after 6000 documents.
Cached requests don't take more than 3 seconds, even with a repository of 160000 documents, which means a result set of 24000 documents.
In contrast, search time grows consistently with the size of the repository.
The light blue curve's values are surprising and might be an artifact of the state of the database during the measurement. Usual values are expected to be closer to the brown curve.
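
The 24000 figure follows from the setup: with 3 categories per document drawn from a pool of 20, each category is expected to hold 3/20 of the repository, and 160000 × 3 / 20 = 24000. The sketch below reproduces that arithmetic and the random assignment. It assumes the three categories of a document are distinct and drawn uniformly, and the function names are hypothetical.

```python
import random

POOL_SIZE = 20               # existing categories
CATEGORIES_PER_DOCUMENT = 3  # categories applied to each document


def assign_categories(num_documents, seed=0):
    """Give each document 3 distinct categories picked uniformly from the pool."""
    rng = random.Random(seed)
    pool = range(POOL_SIZE)
    return [rng.sample(pool, CATEGORIES_PER_DOCUMENT) for _ in range(num_documents)]


def expected_result_set(num_documents):
    """Expected number of documents carrying one given category."""
    return num_documents * CATEGORIES_PER_DOCUMENT // POOL_SIZE


def documents_in_category(assignments, category):
    """Actual number of documents that carry the given category."""
    return sum(1 for categories in assignments if category in categories)
```

Here `expected_result_set(160000)` evaluates to 24000. With random assignment the actual count per category only fluctuates slightly around that expectation.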

Method

Using Google Chromium and its Speed Tracer extension, I measured the time between the DOM click and the end of the processing (excluding repaints that occur after the page is shown completely).

Conditions:
- Alfresco Enterprise 3.2 with heap.maxsizesize = 500MB
- Ubuntu 9.10 (Karmic Koala) with Sun Java HotSpot 1.6.0_15
- Laptop with Intel Core 2 Duo T9600 2.8GHz and 4GB RAM

Notes:
Empty categories are shown in about 600 milliseconds. But in one case, with 10000 categorized documents plus 10000 uncategorized documents, a particular empty category consistently took 6 seconds to load.
Even for a well-defined operation, performance is not very predictable.

Conclusion

Categories display faster than spaces in the Alfresco explorer, especially when they contain large numbers of documents.
On huge repositories, performance is slow at first but improves once requests have been cached: most pages then take less than 3 seconds to load.

Depending on the requirements, the performance of categories might be deemed acceptable: there is no bottleneck or operation that takes more than 10 seconds.

However, some features are not available to someone who would use Alfresco's "Categories" tree view exclusively, for instance:
- No permissions settings based on categories.
- No content rules settings based on categories.
- No "Add content" button when browsing categories.