“Data: Types, Structures and Storage Systems”
Data is a raw collection of facts, figures and information that can exist in both structured and unstructured formats. A lot of data is being generated from various sources, and this humongous volume is needed for operations like analysis, Machine Learning and optimization that drive predictions and decisions. Data can contain a lot of unwanted information that we may need to eliminate first in order to make good use of it in any business or machine learning model. After data is extracted from its sources, it is sent to data storage systems like Data Lakes, Data Warehouses or Databases. Data Lakes can contain structured, unstructured as well as semi-structured data extracted from sources like CSV or JSON files, whereas Data Warehouses and Databases contain only structured data that has already been transformed before being loaded via data pipelines into the warehouse.
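As a simple illustration of the extraction step, the snippet below reads a structured CSV file and a semi-structured JSON file into Pandas DataFrames (a minimal sketch; the file names sales.csv and events.json are hypothetical):

```python
import pandas as pd

# Extract raw data from common source formats (file names are hypothetical).
sales = pd.read_csv("sales.csv")       # structured, tabular data
events = pd.read_json("events.json")   # semi-structured records

# A quick look at what was ingested, before any transformation.
print(sales.head())
print(events.head())
```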
The data preprocessing or data wrangling part involves cleaning the data to eliminate duplicates and fill missing values (via imputation) with suitable substitutes. After this step, the data is either loaded first and then transformed inside Data Lakes (ELT, or Extract, Load and Transform), or transformed first and then loaded (ETL, or Extract, Transform, Load) into data storage systems like Data Warehouses or Databases. In this way, efficient storage of data is ensured. Data analysis tools like MS Excel, Microsoft Power BI, Tableau and QlikView are considered key to the Business Intelligence and Data Visualization process.
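For example, a few lines of Pandas can perform this cleaning; the sketch below drops duplicate rows and imputes missing values on a small made-up table (column names and values are purely illustrative):

```python
import pandas as pd
import numpy as np

# A toy dataset with a duplicate row and missing values (values are made up).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, np.nan, np.nan, 29, 41],
    "city":        ["Pune", "Delhi", "Delhi", None, "Mumbai"],
})

df = df.drop_duplicates()                        # eliminate duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())   # impute numeric gaps with the mean
df["city"] = df["city"].fillna("Unknown")        # replace missing categories with a placeholder

print(df)
```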
Charts and graphs help Business Intelligence specialists, Data Analysts and Stakeholders interpret their data and generate key insights from the results. The Data Visualization tools generate outputs for presentation which is then deployed as dashboards on platforms for sharing content among the Data Science community. Below is a pictorial representation of the various types of charts and graphs which are recommended for data visualization in purposes based on Comparison, Distribution, Relationship and Composition of the data set:
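For instance, a comparison-style question such as “how did revenue change across quarters?” can be answered with a simple bar chart in Python using matplotlib (a minimal sketch; the quarterly figures are invented for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly revenue figures for a comparison-style chart.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 150, 90, 180]

plt.bar(quarters, revenue)
plt.title("Revenue by Quarter (Comparison)")
plt.xlabel("Quarter")
plt.ylabel("Revenue (in thousands)")
plt.show()
```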
Now that we know what Data Storage Systems like Databases and Data Lakes are, let’s understand Data Structures. Data structures are objects that represent data as a particular arrangement or organization, and they hold data of different data types. A few examples of data structures are Arrays, Lists, Matrices, Vectors, Trees, Linked Lists, Data Frames, Sets, Hash Maps, Dictionaries and so on. Examples of data types are integer, string, character, numeric, double, float etc. Data types specify what kind of data is associated with the data objects and instances storing it. Manipulations and operations can be performed on data stored in data structures and database systems so that data can be filtered, extracted, modified, deleted and analyzed. There are various languages for analyzing data, the important ones being Structured Query Language (SQL), NoSQL query languages, R, Scala, and Python with its libraries like NumPy and Pandas.
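To make this concrete, here is how a few of these data structures look in Python, each holding values of specific data types (a small sketch with made-up values):

```python
import numpy as np
import pandas as pd

# A few common data structures, each holding data of specific data types.
prices = [19.99, 4.50, 7.25]                     # list of floats
point = (3, 4)                                   # tuple of integers
unique_tags = {"sql", "python", "r"}             # set of strings
employee = {"name": "Asha", "age": 30}           # dictionary (hash map)
vector = np.array([1, 2, 3])                     # NumPy array (vector)
frame = pd.DataFrame({"name": ["Asha", "Ravi"],  # Pandas DataFrame (tabular structure)
                      "age": [30, 28]})

# A simple manipulation: filter the DataFrame rows on a condition.
print(frame[frame["age"] > 29])
```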
SQL is used for RDBMS (Relational Database Management Systems) and NoSQL for non-relational databases. Relational database systems store structured data in the form of relations (tables having columns and rows), while non-relational systems store unstructured or semi-structured data in formats like JSON documents. A few examples of RDBMS operating on SQL include MySQL, Oracle Database etc. MongoDB is an example of a non-relational database that uses a NoSQL query model. Apart from this, T-SQL (Transact-SQL) is Microsoft’s extension of SQL, used in SQL Server, that adds procedural constructs and transaction control to standard SQL commands. It finds extensive use in banks, hospitals and retail stores, businesses where transactions are made. Important characteristics of any RDBMS performing transactions are the ACID properties (illustrated with a small sketch after the list below), which are:
* Atomicity: All or nothing. Either every operation within a transaction executes successfully and is committed, or, if a failure occurs during the process, none of its changes are applied.
* Consistency: Ensures that every transaction moves the database from one valid state to another, so all integrity constraints still hold after the transaction completes.
* Isolation: Each transaction executes independently of the others. Multiple users can even run concurrent transactions on the same database, but without affecting the execution of the other transactions taking place.
* Durability: Once a transaction is committed, its changes are permanent and survive system failures; until that point, the database can roll back to the specified save point, ensuring it always remains in a reliable state.
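As promised above, here is a small sketch of such a unit of work, using Python’s built-in sqlite3 module to run SQL commands for a bank-style transfer (the accounts table and its values are hypothetical). Both updates are committed together, and if anything fails midway the whole transaction is rolled back, which is atomicity in action:

```python
import sqlite3

# A minimal bank-style transfer between two accounts (table and values are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 300.0)])
conn.commit()

try:
    # Both updates form one unit of work: debit one account, credit the other.
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()    # both changes persist together
except sqlite3.Error:
    conn.rollback()  # on failure, neither change is applied

print(conn.execute("SELECT * FROM accounts").fetchall())
conn.close()
```

A production system would of course use a full RDBMS like MySQL or Oracle rather than an in-memory SQLite database, but the commit/rollback pattern is the same.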
This was all about how transactions, or units of work, are carried out in any bank using databases. Data is, indeed, the new oil of the century. Good-quality data is something every industry requires in order to carry out its day-to-day activities. Dirty data can be cleaned via different techniques, and the resulting quality data can be used to train machine learning models that achieve business-oriented targets and make predictions. Quality data always leads to apt decisions once meaningful insights are generated from interpreting the patterns hidden in it. Analytics is fundamentally of a few types: Descriptive analytics, Predictive analytics, Prescriptive analytics, Business analytics and Text analytics. All of these require high-quality data that is refined and free of anomalies or outliers. Outlier and anomaly detection is a hot topic for projects and research, incorporating Exploratory Data Analysis (EDA), Machine Learning algorithms and data wrangling methodologies. At times, null or missing values in data have their own significance and there is no need for imputation. Business-oriented ML problems can be solved with Data Analytics, Statistical Modeling and Hypothesis Testing. From a data set, feature extraction can also be carried out for labeling, classification, regression or other analytics tasks.
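As a taste of what such outlier detection can look like during EDA, the sketch below applies the common 1.5 × IQR rule of thumb with Pandas on some made-up daily sales figures (the numbers are purely illustrative):

```python
import pandas as pd

# Hypothetical daily sales figures containing one obvious outlier.
sales = pd.Series([210, 195, 205, 220, 2000, 215, 190])

# Flag values lying outside 1.5 * IQR beyond the quartiles (a common EDA rule of thumb).
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]

print(outliers)  # the 2000 value is flagged as an outlier
```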
Data structures and the containers used to store data arrangements also vary from one programming language to another. Python is recommended for data structures and analytics, SQL is popular for database management and queries, whereas the R language is a versatile choice for graphical analysis, statistical modeling and visualization. I hope my article gave you a gist of Data in a nutshell, and with this, the sheer importance of it in various sectors of industry can be comprehended.
Follow me on LinkedIn to connect on Data Science and Analytics, Software Concepts or academic/industry projects!
Subscribe to my YouTube channel
Thank you!