ZxPowerstore: Item Deduplication

From ZeXtras Suite Wiki

Available since version: 1.3.0
Latest Version: 2.12.2
Released on: January 2nd, 2019

What is Item Deduplication

Item Deduplication is a technique that saves disk space by storing a single copy of an item and referencing it multiple times, instead of storing multiple copies of the same item and referencing each copy only once.

This might seem a minor improvement in theory, but in practice it can make a huge difference. Think of that user, the one who sends unnecessary 15 MB "motivational" or "funny" presentations to a hundred-odd recipients, all in the "To:" field.

Item Deduplication in Zimbra

Item Deduplication is performed by Zimbra at the moment of storing a new item in the Primary Volume.

When a new item is being created, its "message ID" is compared to a list of cached items. In case of a match, a hardlink to the cached message's BLOB is created instead of a whole new BLOB for the message.
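The hardlink mechanism described above can be illustrated with a minimal Python sketch. This is not Zimbra's code: the function name `store_message`, the plain dictionary used as a cache, and the file names are all hypothetical, chosen only to show how a second delivery of the same Message-ID can reuse the existing BLOB on disk.

```python
import os
import tempfile


def store_message(store_dir, cache, message_id, blob_name, content):
    """Store a message BLOB; hardlink to the cached BLOB on a Message-ID match.

    `cache` is a hypothetical dict mapping Message-ID -> path of a stored BLOB.
    Returns (path, was_hardlinked).
    """
    target = os.path.join(store_dir, blob_name)
    cached_path = cache.get(message_id)
    if cached_path is not None and os.path.exists(cached_path):
        # Match: create a second directory entry for the same data on disk.
        os.link(cached_path, target)
        return target, True
    # No match: write a whole new BLOB and remember it for later deliveries.
    with open(target, "wb") as handle:
        handle.write(content)
    cache[message_id] = target
    return target, False


if __name__ == "__main__":
    volume = tempfile.mkdtemp()
    cache = {}
    first, _ = store_message(volume, cache, "<1@example.com>", "1.msg", b"body")
    second, linked = store_message(volume, cache, "<1@example.com>", "2.msg", b"body")
    print(linked, os.path.samefile(first, second), os.stat(first).st_nlink)
```

Both paths point at the same data, so the second copy costs only a directory entry. Hardlinks work only within a single filesystem, which is why this kind of deduplication operates per volume.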

The dedupe cache is managed in Zimbra 8 through the following config attributes:

zimbraPrefDedupeMessagesSentToSelf

Used to set the deduplication behaviour for sent-to-self messages.

<attr id="144" name="zimbraPrefDedupeMessagesSentToSelf" type="enum" value="dedupeNone,secondCopyifOnToOrCC,dedupeAll" cardinality="single" 
optionalIn="account,cos" flags="accountInherited,domainAdminModifiable">
  <defaultCOSValue>dedupeNone</defaultCOSValue>
  <desc>dedupeNone|secondCopyIfOnToOrCC|moveSentMessageToInbox|dedupeAll</desc>
</attr>

zimbraMessageIdDedupeCacheSize

Number of cached Message IDs.

<attr id="334" name="zimbraMessageIdDedupeCacheSize" type="integer" cardinality="single" optionalIn="globalConfig" min="0">
  <globalConfigValue>3000</globalConfigValue>
  <desc>
    Number of Message-Id header values to keep in the LMTP dedupe cache.
    Subsequent attempts to deliver a message with a matching Message-Id
    to the same mailbox will be ignored.  A value of 0 disables deduping.
  </desc>
</attr>

zimbraPrefMessageIdDedupingEnabled

Manage deduplication at Account or COS-level.

<attr id="1198" name="zimbraPrefMessageIdDedupingEnabled" type="boolean" cardinality="single" optionalIn="account,cos" flags="accountInherited"
 since="8.0.0">
  <defaultCOSValue>TRUE</defaultCOSValue>
  <desc>
    Account-level switch that enables message deduping.  See zimbraMessageIdDedupeCacheSize for more details.
  </desc>
</attr>

zimbraMessageIdDedupeCacheTimeout

Timeout for each entry in the dedupe cache.

<attr id="1340" name="zimbraMessageIdDedupeCacheTimeout" type="duration" cardinality="single" optionalIn="globalConfig" since="7.1.4">
  <globalConfigValue>0</globalConfigValue>
  <desc>
    Timeout for a Message-Id entry in the LMTP dedupe cache. A value of 0 indicates no timeout.
    zimbraMessageIdDedupeCacheSize limit is ignored when this is set to a non-zero value.
  </desc>
</attr>

(Older Zimbra versions might use different attributes or lack some of them.)
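To make the interaction between zimbraMessageIdDedupeCacheSize and zimbraMessageIdDedupeCacheTimeout easier to picture, here is a small Python sketch of a cache that follows the attribute descriptions above. It is an illustration only, not Zimbra's implementation: the class name, the oldest-first eviction order, and the choice that a size of 0 disables deduping even when a timeout is set are all assumptions.

```python
import time
from collections import OrderedDict


class MessageIdDedupeCache:
    """Toy model of a Message-Id dedupe cache keyed by (mailbox, Message-Id)."""

    def __init__(self, max_size=3000, timeout_seconds=0, clock=time.monotonic):
        self.max_size = max_size        # 0 disables deduping
        self.timeout = timeout_seconds  # 0 means entries never expire
        self.clock = clock
        self.entries = OrderedDict()    # key -> insertion time, oldest first

    def seen_before(self, mailbox_id, message_id):
        """Return True for a repeated delivery; otherwise record it, return False."""
        if self.max_size == 0:
            # Assumption: size 0 takes precedence over any timeout.
            return False
        now = self.clock()
        if self.timeout > 0:
            # Drop expired entries; insertion order is also time order.
            while self.entries:
                _oldest_key, stamp = next(iter(self.entries.items()))
                if now - stamp < self.timeout:
                    break
                self.entries.popitem(last=False)
        key = (mailbox_id, message_id)
        if key in self.entries:
            return True
        self.entries[key] = now
        # The size limit applies only when no timeout is configured.
        if self.timeout == 0 and len(self.entries) > self.max_size:
            self.entries.popitem(last=False)
        return False


if __name__ == "__main__":
    cache = MessageIdDedupeCache(max_size=3000)
    print(cache.seen_before("mailbox-1", "<a@example.com>"))
    print(cache.seen_before("mailbox-1", "<a@example.com>"))
```

The sketch also shows why the cache is "limited": once an entry has been evicted or has expired, a later copy of the same message is no longer recognized as a duplicate.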

Item Deduplication and ZeXtras Powerstore

The ZeXtras Powerstore module features a "doDeduplicate" operation that parses a target volume to find and deduplicate any duplicated item.

Doing so saves even more disk space: while Zimbra's automatic deduplication is bound to a limited cache, ZeXtras Powerstore's deduplication finds and takes care of multiple copies of the same email regardless of any cache or timing.

Running the "doDeduplicate" operation is also highly suggested after a migration or a large data import in order to optimize your storage usage.
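The idea of scanning a whole volume, rather than relying on a delivery-time cache, can be sketched in Python as follows. This is a simplified illustration and not the actual "doDeduplicate" implementation: the function names are hypothetical, SHA-256 is assumed as the digest, and any directory tree stands in for a volume. Files with the same content digest are replaced by hardlinks to a single copy.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate_tree(root, dry_run=False):
    """Hardlink byte-identical files under `root` to one kept copy.

    Returns (files_relinked, bytes_saved). With dry_run=True nothing is
    modified and the numbers describe what would be done.
    """
    kept = {}  # digest -> path of the copy that is kept
    relinked = 0
    saved = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            keeper = kept.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy seen, or already a hardlink to it
            relinked += 1
            saved += os.path.getsize(path)  # simplification: assumes no other links
            if not dry_run:
                temporary = path + ".dedupe-tmp"
                os.link(keeper, temporary)
                os.replace(temporary, path)  # atomically swap in the hardlink
    return relinked, saved
```

Because the scan compares content rather than cached Message-IDs, it also catches duplicates that arrived long after the original, for example during a migration or bulk import.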

Running a Volume Deduplication

Via the ZeXtras Administration Zimlet

To run a volume deduplication via the ZeXtras Administration Zimlet, click on the "ZxPowerstore" tab, select the volume you wish to deduplicate, and press the "Deduplicate" button.

Via the ZeXtras CLI

zimbra@mailserver:~$ zxsuite powerstore doDeduplicate

command doDeduplicate requires more parameters

Syntax:
   zxsuite powerstore doDeduplicate {volume_name} [attr1 value1 [attr2 value2...]]

PARAMETER LIST

NAME              TYPE           EXPECTED VALUES    DEFAULT
volume_name(M)    String[,..]                       
dry_run(O)        Boolean        true|false         false

(M) == mandatory parameter, (O) == optional parameter

Usage example:

zxsuite powerstore dodeduplicate secondvolume
Starts a deduplication on volume secondvolume

To list all available volumes, you can use the `zxsuite powerstore getAllVolumes` command.
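For scripted runs, the command line can be assembled programmatically. The Python sketch below only builds the argument list from the syntax shown above and does not execute anything. The helper name `build_dedupe_command` is hypothetical, and joining several volume names with commas is an assumption based on the `String[,..]` type in the parameter list.

```python
import shlex


def build_dedupe_command(volume_names, dry_run=False):
    """Build the argv list for `zxsuite powerstore doDeduplicate`."""
    if not volume_names:
        raise ValueError("volume_name is a mandatory parameter")
    argv = ["zxsuite", "powerstore", "doDeduplicate", ",".join(volume_names)]
    if dry_run:
        # Optional parameters are passed as "name value" pairs.
        argv += ["dry_run", "true"]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_dedupe_command(["secondvolume"], dry_run=True)))
```

The resulting list could be handed to `subprocess.run` on a server where `zxsuite` is available to the zimbra user.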


"doDeduplicate" stats

The "doDeduplicate" operation is a valid target for the "monitor" command, meaning that you can watch the command's statistics while it's running through the `zxsuite powerstore monitor [operationID]` command.

Sample Output

Current Pass (Digest Prefix):  63/64
Checked Mailboxes:             148/148
Deduplicated/duplicated Blobs: 64868/137089
Already Deduplicated Blobs:    71178
Skipped Blobs:                 0
Invalid Digests:               0
Total Space Saved:             21.88 GB

  • "Current Pass (Digest Prefix)" - The "doDeduplicate" command analyzes the BLOBs in groups based on the first character of their digest (name).
  • "Checked Mailboxes" - The number of mailboxes analyzed for the current pass.
  • "Deduplicated/duplicated Blobs" - Number of BLOBs deduplicated by the current operation / Number of total duplicated items on the volume.
  • "Already Deduplicated Blobs" - Number of deduplicated BLOBs on the volume (duplicated BLOBs that have been deduplicated by a previous run).
  • "Skipped Blobs" - BLOBs that have not been analyzed, usually because of a read error or missing file.
  • "Invalid Digests" - BLOBs with a bad digest (name different from the actual digest of the file).
  • "Total Space Saved" - Amount of disk space freed by the doDeduplicate operation.


Looking at the sample output above we can see that:

  • The operation is running the second-to-last pass (63 of 64) and has reached the last mailbox (148 of 148).
  • 137089 duplicated BLOBs have been found, 71178 of which have already been deduplicated previously.
  • The current operation deduplicated 64868 BLOBs, for a total disk space saving of 21.88 GB.
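
The same reading of the statistics can be done in a script. The Python sketch below parses the "Label: value" lines of the sample output and derives two figures. The function names are hypothetical. The "remaining duplicates" figure is an inference from the field descriptions above (total duplicated, minus those handled by this run, minus those handled by earlier runs), not a value reported by the tool.

```python
SAMPLE_OUTPUT = """\
Current Pass (Digest Prefix):  63/64
Checked Mailboxes:             148/148
Deduplicated/duplicated Blobs: 64868/137089
Already Deduplicated Blobs:    71178
Skipped Blobs:                 0
Invalid Digests:               0
Total Space Saved:             21.88 GB
"""


def parse_monitor_output(text):
    """Turn 'Label: value' lines into a dict of label -> value strings."""
    stats = {}
    for line in text.splitlines():
        label, separator, value = line.strip().partition(":")
        if separator:
            stats[label.strip()] = value.strip()
    return stats


def summarize(stats):
    """Derive progress figures from the parsed statistics."""
    current_pass, total_passes = map(int, stats["Current Pass (Digest Prefix)"].split("/"))
    deduplicated, duplicated = map(int, stats["Deduplicated/duplicated Blobs"].split("/"))
    already = int(stats["Already Deduplicated Blobs"])
    return {
        "passes_after_current": total_passes - current_pass,
        # Inferred: duplicates neither handled now nor by a previous run.
        "remaining_duplicates": duplicated - deduplicated - already,
    }


if __name__ == "__main__":
    print(summarize(parse_monitor_output(SAMPLE_OUTPUT)))
```

For the sample above, this gives one pass left after the current one and 137089 - 64868 - 71178 = 1043 duplicated BLOBs not yet deduplicated, under the stated interpretation.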