Data Classification

Data Classification - Why and How?











Introduction

For many companies, classifying their data is a prerequisite for being able to work with it. Any solution worth considering must be sustainable and cannot be too cost- or labor-intensive.

The reasons for classifying data are various:

Business Drivers (external drivers)

  • HGB Trade laws
  • AO Tax codes
  • GoBS Laws for digital accounting systems
  • GDPdU Data accessibility regulations and the verifiability of digital data
  • PCI DSS Payment Card Industry Data Security Standard
  • EUDPD European Union Data Protection Directive
  • HIPAA Health Insurance Portability and Accountability Act
  • SOX Sarbanes-Oxley
  • GLBA Gramm-Leach-Bliley
  • SEC 17a-4 Electronic Books and Records Rule

Internal reasons to adopt data classification:

  • Reduce cost of storage
    - hierarchical storage management (HSM)
    - binary classification criteria (active data vs. stale data)
  • Data protection
    - encryption
    - confidentiality
  • Knowledge capture
  • Storage reclamation
  • Cloud storage decisions
  • Capacity management
  • eDiscovery

Regardless of the reason, the largest obstacle to classifying a company's data is its lack of structure. This is mainly due to the following issues:

  • Amount of data
  • Number of data types
  • Heterogeneous system environments and data systems
  • Exponential growth of data volume
  • Non-standardized, historically grown storage structures
  • Lack of process requirements and process monitoring
  • Lack of user limits

Two trends collide within today’s enterprise. First, there is tremendous and unprecedented growth in unstructured data. Analysts agree that over 80 percent of the content of an enterprise is unstructured and sits outside of a database. Databases come with a multitude of tools that enable the administrator to ascertain their content, but unstructured data presents a completely new and unanswered challenge. It is very difficult to get a holistic picture of the type of data, how old it is, how frequently it is used, who uses the majority of it, and other details that would help to determine how to treat it.

Unfortunately, the counter-trend is that storage budgets and staffing are not keeping pace with the growth in data. Research firms like IDC predict that between 2008 and 2012 the industry will see data volumes almost quintuple, while staffing and budgets stay relatively flat, growing by a factor of only 1.1 over that period. As a result, organizations today are overwhelmed by digital information and tend to default to one simple retention policy: save everything.

However, not all data is created equal, and not all data has the same value to the business. For example, there is often a plethora of MP3 or JPEG files, which may be something as innocuous as an iTunes music library or pictures of the family vacation. These files are obviously not critical, or even relevant, to the business. Yet this data is stored on the most expensive storage platform and is even backed up by the enterprise every day. Then there is inactive or stale data: data that may be six weeks, six months or even six years old, yet is treated exactly the same as the most current and active data in the organization. Conversely, some data requires higher performance than it gets because it has been placed on a slower platform. On top of this growth in data, the need to keep it for longer periods of time and to protect it is also growing alarmingly. Finally, all of this data is backed up and mirrored, with disaster recovery plans built around it.

Consequently, organizations buy more storage than they require, housing extraneous information and employees’ personal files alongside valuable corporate data. Storage volumes are not only larger than they need to be but also more expensive, and they require more administration, so operational costs bloat as well. This creates massive pressure on the IT organization because of the dramatic growth in both the capital outlays and the operational costs associated with storing information.
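The distinction drawn above between active data, stale data and non-business files can be illustrated with a short sketch. The Python example below is only an illustration under stated assumptions and is not part of any product: the extension list, the 180-day staleness threshold and the function names `triage` and `summarize` are hypothetical, and the last-modified time is used as a stand-in for "last used". Real thresholds would be set by the business.

```python
import os
import time
from pathlib import Path

# Hypothetical policy values, chosen only for illustration.
PERSONAL_EXTENSIONS = {".mp3", ".jpg", ".jpeg"}
STALE_AFTER_DAYS = 180
SECONDS_PER_DAY = 86400


def triage(path, now=None):
    """Return 'non-business', 'stale' or 'active' for a single file."""
    now = time.time() if now is None else now
    path = Path(path)
    # File type is checked first: a recent MP3 is still not business data.
    if path.suffix.lower() in PERSONAL_EXTENSIONS:
        return "non-business"
    # Assumption: last-modified time approximates last use.
    age_days = (now - path.stat().st_mtime) / SECONDS_PER_DAY
    return "stale" if age_days > STALE_AFTER_DAYS else "active"


def summarize(root):
    """Walk a directory tree and total the bytes found in each category."""
    totals = {"non-business": 0, "stale": 0, "active": 0}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            totals[triage(full)] += full.stat().st_size
    return totals
```

Calling `summarize` on a file share returns the number of bytes per category, which gives a first rough estimate of how much capacity is occupied by data that does not need the most expensive storage tier.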

 



Goals of Data Classification

  • Availability, integrity and confidentiality for all identified assets
  • Return on investment by implementing controls where they are needed the most
  • Map data protection levels with organizational needs
  • Mitigate threats of unauthorized access and disclosure
  • Comply with legal and regulation requirements

The steps to develop and roll out a data classification program are:

  1. Compile an inventory of all information assets
  2. Define levels of protection for information assets
  3. Define classification criteria
  4. Develop information classification policies
  5. Define information handling and labeling procedures
  6. Assign responsibility for classification to the owner of information
  7. Assign a security classification to all information assets
  8. Classify information according to sensitivity and how much protection is required
  9. Apply the classification system to documents, records, data files, and disks
  10. Develop information handling procedures for each class of information
  11. Develop information labeling procedures for each class of information
  12. Integrate into security awareness and training programs

A data classification policy should cover:

  • Treatment of information as assets of individual business units
  • Designation of business unit managers as information owners
  • Designation of IT as data custodian
  • The data classification scheme
  • Definitions for each classification
  • Criteria for each classification
  • Roles and responsibilities of the classification team



The Solution

dataglobal's dg classification allows administrators to discover data resources within an enterprise, apply uniform classification rules to all unstructured data across the entire enterprise, create and manage a searchable classification matrix of detailed metadata, move data to the corresponding tier, and later filter the metadata to conduct detailed searches.

dg classification – workflow

Today’s organizations may possess millions or billions of files. dg classification has no limit on the number of files it can support, so it can scale across multiple data centers to deal with the current volume as well as expected growth rates. Because of its massively parallel, multithreaded agent architecture, it can perform this scanning at the fastest rate in the industry, and the agents can be upgraded quickly when needed. The agents can quickly report on the file composition throughout the enterprise, enabling organizations to obtain daily reports on their data stores instead of being limited to scans during the small windows when the system is available worldwide. As the only product in the industry able to perform at this level, dg classification gives the organization the real-time intelligence needed to make storage and information decisions easily.

dg classification also interfaces with all platforms within the enterprise environment, offering a truly heterogeneous solution. Because it is a single platform with the multiple tools needed to assess, analyze and act upon the data, there is no need to use another policy manager or data mover and face possible compatibility issues.

dg classification is compatible with all types of unstructured data and can scale to fit almost any size of enterprise, data center or group of data centers, with the ability to scan over half a billion files within an hour. Rules can be modified to meet the needs of any type of business, from legal to medical to insurance. In addition, it can locate and protect specific data involved in litigation: data covered by a search can be frozen to prevent modification or deletion -- even if deletion had previously been approved.

The fully integrated dg product range offers the ability to perform data discovery, classification, search, migration, archiving and file expiration. During the discovery process, dg classification identifies the files and data types available in the IT infrastructure. The classification that takes place works on the discovered data, applying metadata to each file and file type based on a defined set of rules. These rules define whether the data is classified by application, by company group -- such as finance or manufacturing -- by type, by date or a myriad of other categories, though the actual categorizations depend on the specific needs of each particular business. The dg product range provides expiration, custom tasks, and reporting to make the classified data actionable.
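The idea of rules that attach metadata to discovered files can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dg classification rule engine: the `FileInfo` and `Rule` types, the `classify` function and the three sample rules are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass
class FileInfo:
    """The facts a discovery scan might record about one file (hypothetical)."""
    path: str
    owner_group: str      # e.g. "finance" or "manufacturing"
    modified: date


@dataclass
class Rule:
    tag: str                               # metadata key to set
    value: str                             # metadata value to set
    matches: Callable[[FileInfo], bool]    # predicate deciding whether the rule applies


def classify(info: FileInfo, rules: List[Rule]) -> Dict[str, str]:
    """Apply every matching rule; a later rule overwrites an earlier one for the same tag."""
    metadata: Dict[str, str] = {}
    for rule in rules:
        if rule.matches(info):
            metadata[rule.tag] = rule.value
    return metadata


# Sample rules: one by file type, one by company group, one by date.
RULES = [
    Rule("type", "spreadsheet", lambda f: f.path.lower().endswith((".xls", ".xlsx"))),
    Rule("group", "finance", lambda f: f.owner_group == "finance"),
    Rule("age", "pre-2010", lambda f: f.modified < date(2010, 1, 1)),
]

example = FileInfo("/shares/fin/budget.xlsx", "finance", date(2009, 3, 1))
print(classify(example, RULES))
```

Keeping each rule as a small predicate makes it straightforward to add business-specific categories without touching the classification loop itself.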

dg classification – file-servers



Windows FCI Support

Windows Server 2008 R2 - File Classification Infrastructure

dataglobal performs classification at a more comprehensive and detailed level than any other product because of its integration with the Microsoft Windows Server 2008 R2 File Classification Infrastructure (FCI). With dg classification and FCI, the system can discover data, extract classification properties, classify the data, store the classification properties and then apply the policy based on the classification. When dg classification scans the data, it uses a file stream to filter the data into appropriate categories such as user, department, file size, file age or file type. Because FCI enables custom tagging that goes well beyond conventional OS metadata categories, this classification can be more detailed and comprehensive than before, providing a rich end-to-end classification solution. As a result, categorizations can be customized even further to meet the distinct needs of organizations in different industries. The file classifications can then be funneled into the analysis, migration or archival modules of the dg product range to take further action on the data to meet compliance, eDiscovery, storage reclamation and knowledge reuse needs.

dg classification and Microsoft FCI metadata is stored in the alternate data stream of the file, ensuring that the classification stays with the file no matter where it is moved or migrated. As a result, the system does not have to generate an additional database of the classification, which would place another load on the storage resources in the organization. Search capabilities use this metadata to locate files based on criteria that go well beyond conventional metadata such as filenames or creation dates, and can examine files and documents for data patterns to make contextual decisions about the data. Search features are particularly important when data is being classified for archival or compliance purposes.

dataglobal extends FCI-like capabilities to Windows 2000 - 2008 platforms in order to classify existing data as well. In addition, the combined dg classification and Microsoft FCI solution enables all servers to be managed from a central management console rather than on a per-server basis. This enables data classification capabilities to be leveraged on an enterprise scale in physical and virtual environments with multiple clusters. dg classification also enables FCI classifications to be searched and retrieved faster than on any other storage platform, thanks to parallel, simultaneous access through its lightweight agent architecture. At the same time, dg classification substantially reduces the scripting burden of FCI, making it easier to use within a large organization.




Classification Cube

The goal of the framework is to provide an independent approach for data classification that can ultimately fulfill all GRC (Governance, Risk, Compliance) requirements or internal guidelines.

In addition, all documents will be classified according to the following criteria:

  • Business processes (Under which main process can this document be filed?)
  • Retention
  • Security

These three criteria form the axes of the classification cube:

Classification Cube

Characteristics of the axes

Retention:

Taking into consideration any country-, company- or industry-specific requirements, the respective retention times can be derived from the value level. To do so, retention times have to be linked to the respective document type ...

Characteristic / Description

Daily Value
Data is subject to no retention time and can be deleted at any time.
Examples:
  • memos
  • notes for the file
  • copies of meeting minutes
Retain if necessary; the deletion date can be determined internally.

Approval Value
Data is subject to no retention time, or it is not clear whether retention time requirements exist.
Examples:
  • inquiries/offers without a following order
Retain if necessary; the deletion date can be determined internally.

Legal Value
For such data there is a legally required retention time.
Examples:
  • accounting vouchers
  • business reports
  • invoices and expense vouchers
  • employment contracts
The retention period depends on content, national legal requirements, jurisdiction and other regulations.

Archive Value
This data must be permanently retained.
Examples:
  • deeds
  • blueprints
  • company history data
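
The link between document type, retention characteristic and retention time can be sketched as a simple lookup. The Python example below is a minimal illustration under stated assumptions: the function names are hypothetical, and the periods in `LEGAL_YEARS` are placeholders only, not real legal retention periods, which depend on content and jurisdiction as noted above.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Retention(Enum):
    DAILY = "Daily Value"
    APPROVAL = "Approval Value"
    LEGAL = "Legal Value"
    ARCHIVE = "Archive Value"


# Document type -> retention characteristic, following the examples above.
DOCUMENT_TYPES = {
    "memo": Retention.DAILY,
    "offer without order": Retention.APPROVAL,
    "invoice": Retention.LEGAL,
    "employment contract": Retention.LEGAL,
    "deed": Retention.ARCHIVE,
}

# Placeholder periods in years. These are NOT real legal values.
LEGAL_YEARS = {"invoice": 10, "employment contract": 6}


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def earliest_deletion(doc_type: str, created: date) -> Optional[date]:
    """Return the first date on which deletion is permitted, or None if never."""
    level = DOCUMENT_TYPES[doc_type]
    if level is Retention.ARCHIVE:
        return None                      # must be permanently retained
    if level is Retention.LEGAL:
        return add_years(created, LEGAL_YEARS[doc_type])
    # Daily and Approval Value: no retention time, deletion decided internally.
    return created
```

For Daily Value and Approval Value documents the function returns the creation date, meaning that no external rule blocks deletion; the actual deletion date remains an internal decision.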

 

Security:

Characteristic / Description

Public
Data may be sent to a single recipient (e.g. an outgoing invoice) or to any number of recipients (e.g. an advertisement).

Internal
Any employee inside the company is allowed to access this data (e.g. telephone lists, organigrams).

Department Specific
Only members of certain departments are allowed to access this data (e.g. application documents in HR).

Confidential
Data is subject to confidentiality and can be accessed only by a few selected people (e.g. the Coca-Cola recipe).

 

Business processes:

The division of the process axis can be determined based on company or industry features. The following characteristics can be applied to most companies:

  • Finance
  • Purchasing
  • Human Resources
  • Legal
  • IT
  • Sales and Distribution

Depending on the individual business purpose, further characteristics may be required and/or make sense:

  • Research and Development
  • Marketing
  • Production
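
The three axes described above can be modeled as a small data structure in which every document type is assigned to exactly one cell of the cube. The Python sketch below is an illustration under stated assumptions: the class and method names (`Cell`, `ClassificationCube`, `assign`, `lookup`, `in_cell`) are hypothetical, and the two sample assignments are derived from the examples given in the axis descriptions.

```python
from dataclasses import dataclass
from enum import Enum


class Retention(Enum):
    DAILY = "Daily Value"
    APPROVAL = "Approval Value"
    LEGAL = "Legal Value"
    ARCHIVE = "Archive Value"


class Security(Enum):
    PUBLIC = "Public"
    INTERNAL = "Internal"
    DEPARTMENT = "Department Specific"
    CONFIDENTIAL = "Confidential"


@dataclass(frozen=True)
class Cell:
    """One cell of the cube: a value on each of the three axes."""
    process: str
    retention: Retention
    security: Security


class ClassificationCube:
    def __init__(self, processes):
        # The process axis is company-specific, so it is passed in.
        self.processes = set(processes)
        self._cells = {}   # document type -> Cell

    def assign(self, doc_type, process, retention, security):
        if process not in self.processes:
            raise ValueError(f"unknown business process: {process}")
        self._cells[doc_type] = Cell(process, retention, security)

    def lookup(self, doc_type):
        return self._cells[doc_type]

    def in_cell(self, process, retention, security):
        """List all document types assigned to the given cell."""
        target = Cell(process, retention, security)
        return sorted(d for d, c in self._cells.items() if c == target)


PROCESSES = ["Finance", "Purchasing", "Human Resources", "Legal", "IT",
             "Sales and Distribution"]
cube = ClassificationCube(PROCESSES)
cube.assign("outgoing invoice", "Finance", Retention.LEGAL, Security.PUBLIC)
cube.assign("application documents", "Human Resources",
            Retention.LEGAL, Security.DEPARTMENT)
```

Passing the process list into the constructor reflects the point made above that the process axis varies by company, while the retention and security axes are fixed enumerations.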



Examples for the assignment of document types to containers


Example 1 / Business process: Finance
 

Classification Cube Example 1


Public Internal Field/Department Confidentiality
Daily Value
• advertising


• file notes
• meeting minutes (of high-level management)
Approval Value
• outstanding offers

• open unaccepted offers
• statistics regarding strategy implementation, alignment
Legal Value
• business reports

• business reports
• all documents needed for the creation of annual statements
• transfer prices
• tax statements 
Archive Value
• ad-hoc reports
• company history data
• press releases

• business letters (accepted offers, price and order confirmations, delivery notes, project correspondence, invoices, cancellations, contract terminations)
• margin development
• shareholder agreement
• notarial documents 


Example 2 / Business process: Human Resources

Classification Cube Example 2


Public Internal Field/Department Confidentiality
Daily Value




Approval Value
 
• organigrams
• timekeeping
• tax return
Legal Value
   
• application documents
• payroll
• labor contracts
• target achievement meetings
• payroll for high-level management
Archive Value
   
• company retirement commitments
• company retirement commitments for high-level management