Data Deduplication

Ziya Akkoca
September 05, 2013

Unavoidable Rise of Data

Five or six years ago, a cell phone meant SMS and voice calls, with e-mail reserved for the more luxurious models. As the world has shifted from desktop to mobile life, portable devices have become ever smaller. Although it is labeled “Big Data” and has little to do with corporate “structured” data, “unstructured data” now carries the greater weight and is the data type that drives the data storage industry worldwide. The largest sources of unstructured data are the internet and the mobile world.

Before diving into the technology, deduplication, and compression topics addressed in this article, it is worth sharing some striking figures on the size and growth rate of this data, to give a better sense of how serious the situation is.

Since no meaningful figures were yet available for 2013, it is reasonable to take 2011 and 2012, as of year-end, as the baseline:

29 million mobile users listen to music over public networks.

52% of the content shared on the same networks is video.

1,600 status updates are posted on Facebook every second, 405 of them containing media.

Twitter itself has 13 million users who access it via mobile devices.

Content shared via Instagram, recently acquired by Facebook, has grown 1,900% compared with the previous year.

In the transition from ordinary phones to smartphones, smartphone sales have risen 3,500% and tablet sales 12,100%.

Public institutions and private companies serving the public increasingly use or develop tablet- and smartphone-friendly applications.

The global data growth rate is 18 petabytes per day.

The data created and replicated in 2011 amounted to 1.8 zettabytes (1 zettabyte = 1 billion terabytes).

CERN experiments generate 40 terabytes of data per second, which circulates through European research networks.

A 30-MB file stored in a personal e-mail account can grow to as much as 7 GB of stored data, depending on backup frequency and type.

Corporations must back up their data to meet the requirements of ISO, BS, and DR standards.

Although such data may seem far removed from “personal” and “professional” environments, employees frequently access and save it in corporate environments. The biggest obstacle facing information technology worldwide is storing this uncontrollably growing data and managing the systems that hold it.

IT executives must keep this data available while using their existing capacity effectively. Delivering performance and capacity in the same environment is expensive: supplementary hardware is needed to compensate for the performance gap of high-capacity systems, high-performance systems can sustain that performance only over a limited capacity, and expanding that capacity is possible only with expensive, low-capacity disks.

Because providing performance and capacity simultaneously is so difficult and costly, efficient data storage is the first priority of IT departments. Thin provisioning, snapshot technologies, and storage resource management were developed precisely to address these problems of IT executives and their staff. With them, those responsible can keep the utilization and performance of storage environments at the desired levels.

Data is also growing in professional environments nowadays. Cities are being recorded; fingerprints and personal information travel between computers in public institutions and organizations. All of this data is stored in an institution, and most of the time in more than one. The data itself actually occupies little space, but because it is not deduplicated it places a burden on storage systems. Deduplication and compression technologies were developed to reduce this burden and simplify management.

The Current State of Data

According to IDC reports, both data growth and the total disk capacity being accessed accelerated in 2009, a trend that continues today; the rise expected by 2014 is about 50% compared with today. This data does not consist only of rich content and sharing: a significant portion of it depends on data-centric analytical applications and keeps growing on top of them, and this share will increase further. The concept of rich data is now well established, and the richer the data, the more money it brings its owner. Data processing systems and storage systems are partners in this intelligence.

Growth and structural flexibility are made possible by virtualization. Installations that used to take days have become so easy that they can be completed within a coffee break. Data is created rapidly and copied just as quickly; it is not only stored but also processed and modified, and performance and capacity expansions are needed along the way. Thin provisioning, automated tiering, snapshots, storage virtualization, archiving, deduplication, and compression technologies make it possible to close the gaps in storage systems that struggle to keep up with the rate of data growth. The relevant topic for us here is deduplication, and I will try to give some basic information about it.

Deduplication

Data deduplication has become a desired and important feature of data storage systems over the last few years. Almost every storage system manufacturer, from the largest to the smallest, has announced deduplication-based or deduplication-enabled products in its range. Deduplication owes its current market share and appeal essentially to backup systems: because backup systems copy the same data many times over, its advantage is most obvious in backup operations.

Deduplication of backup data can be described as a key opportunity today. Every backup solution commercially available on the market supports it in some form, whether software-based, as an appliance, or in hardware. Most products even supply the feature as standard, so you do not have to worry about licensing.

So what is this data deduplication, and how does it work? I would like to touch on that briefly. To avoid misunderstandings and stay close to the technology, I will try to use the English terms.

Basically, data deduplication rests on finding and removing repeating patterns in a stream of data. The data is divided into chunks, and a hash is computed for each chunk with the chosen algorithm. If a newly computed hash matches one already stored, the new chunk is not written again; instead, a “pointer” to the existing chunk is recorded in its place. When the data is read back, the pointer resolves to the stored chunk, and the data is reconstructed from there. A minimal sketch of this mechanism follows.
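The following is a minimal, hypothetical Python sketch of this hash-and-pointer scheme, assuming fixed 4-KB chunks and SHA-256 hashes; real products use their own chunking and hashing strategies.

    import hashlib

    class ChunkStore:
        """Toy hash-based deduplication store (illustrative only)."""

        CHUNK_SIZE = 4096  # assumed fixed chunk size

        def __init__(self):
            self.chunks = {}  # hash -> chunk bytes, each stored exactly once
            self.files = {}   # file name -> ordered list of chunk hashes

        def write(self, name, data):
            pointers = []
            for i in range(0, len(data), self.CHUNK_SIZE):
                chunk = data[i:i + self.CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()
                # Store the chunk only if this hash is new; otherwise the
                # existing copy is simply referenced again via its hash.
                self.chunks.setdefault(digest, chunk)
                pointers.append(digest)
            self.files[name] = pointers

        def read(self, name):
            # Follow each pointer back to the single stored chunk.
            return b"".join(self.chunks[h] for h in self.files[name])

Writing the same file twice under two names adds only pointers, no new chunks, which is exactly where backup workloads gain.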

During deduplication, these chunks can be formed at three different levels: file level, block level, and byte level. Each has its own characteristics; a short sketch contrasting the first two follows the descriptions below.

File level: In this type, also called single-instancing, every chunk is in fact a complete file. A hash is computed for the entire file, and the file and all of its copies are written to a single data area. It is the deduplication type with the lowest overhead (the additional load placed on the processor). However, if a deduplicated file is modified, its hash is recalculated and, most probably, an entirely different file is stored.

Block level: Also known as fixed- or variable-block deduplication, this type creates data chunks at the block level. It is a more effective method in terms of stored data and space consumed: thanks to the finer granularity, modified data can be stored without copying the whole file even after small changes. Blocks that remain identical are shared, while only the blocks that differ are stored separately for later use by other data. Block-level deduplication is particularly effective for virtual machine images, where the guest operating systems are large and differ from one another only slightly.

Byte level: On paper this is the most efficient type in terms of space saved, but it is the most expensive in terms of processing power. During the process, in addition to computing a hash, the start and end of each data chunk are recalculated, and each hash is stored individually. It is the most suitable deduplication method for data that lacks a block structure yet repeats constantly and is managed and used at the same time, such as e-mail environments.
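To see why block level wins after a small edit, consider this hypothetical comparison, under the same assumptions as above (4-KB fixed blocks, SHA-256):

    import hashlib, os

    CHUNK = 4096  # assumed fixed block size

    def file_level_cost(versions):
        # Single-instancing: one hash over each whole file.
        unique = {hashlib.sha256(v).hexdigest(): len(v) for v in versions}
        return sum(unique.values())

    def block_level_cost(versions):
        # Fixed-block deduplication: hash every 4-KB block separately.
        unique = {}
        for v in versions:
            for i in range(0, len(v), CHUNK):
                block = v[i:i + CHUNK]
                unique[hashlib.sha256(block).hexdigest()] = len(block)
        return sum(unique.values())

    original = os.urandom(10 * 1024 * 1024)  # a 10-MB file
    edited = bytearray(original)
    edited[0] ^= 0xFF                        # flip a single byte
    versions = [original, bytes(edited)]

    print(file_level_cost(versions))   # ~20 MB: both versions stored in full
    print(block_level_cost(versions))  # ~10 MB + 4 KB: only one block is new

A one-byte edit forces file-level deduplication to keep both full copies, while block level stores just the single changed block.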

Another important feature of deduplication systems is how the data is written to the disks behind the scenes once the hash calculation is done. There are two approaches here: inline and post-process deduplication.

In inline deduplication systems, duplicate data is identified and eliminated before it is written to disk, so writing duplicates to the disk system and the associated disk I/O are avoided entirely. The disadvantage of such systems is that all incoming data is live and must be processed in cache before being written, which demands a large cache and considerable processing power. There is no extra I/O in the background; only the disks receiving the actual writes are kept busy.

In the post-process approach, by contrast, data arriving for storage is written to disk as-is and processed in the background later. It is recommended for institutions that have daily peak hours and do their maintenance and data management at night: data is stored quickly during the day without passing through deduplication, the processing power needed during the day is not spent on it, and the system’s response time stays short. Deduplication of the data written without processing can be left to the system’s own discretion according to storage load, or jobs can be scheduled to run on certain days and hours of the week. The disadvantages of post-process are that it needs more space up front, since the data must be written before it is deduplicated, and that the process itself generates disk I/O load afterwards. The two write paths can be sketched as follows.
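Here is a hypothetical sketch of the two write paths, reusing the hashing idea from above; the function names, the landing area, and the nightly schedule are assumptions for illustration:

    import hashlib

    def inline_write(store, data):
        # Inline: hash and deduplicate in cache BEFORE touching disk.
        # Duplicates never reach the disk, at the cost of doing the
        # hash work on the critical write path.
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # only new data is actually written
        return digest

    def postprocess_write(landing_area, data):
        # Post-process fast path: write raw data immediately, no hashing,
        # so daytime response times stay short.
        landing_area.append(data)

    def postprocess_dedup(landing_area, store):
        # Scheduled background job (e.g. nightly): fold the landing
        # area into the deduplicated store and reclaim the space.
        while landing_area:
            data = landing_area.pop()
            digest = hashlib.sha256(data).hexdigest()
            store.setdefault(digest, data)

The landing area is the extra up-front space that post-process deduplication pays in exchange for its fast daytime writes.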