Regardless of size, every organization manages a sizeable volume of data produced by its many data points and operational procedures. Businesses can sometimes manage these data with Excel sheets, Access databases, or other technologies of a similar nature. It is time to consider big data and analytics, though, when the data no longer fits into such tools and human error rates rise above acceptable levels because of intense manual processing. Before discussing big data management and its terminology, and before asking what the three Vs of big data are, we first need to understand what data and big data actually are.
- Data
Data can be defined as symbols that represent facts and figures in raw form; it is also called unprocessed information. Data and information work together: data consists of raw values that are not directly usable, and only after some processing is applied does it become information, the usable form that can be applied according to the requirement. To answer the question of what the 3 Vs of big data are, we should first know what big data is and what its terminology means. Big data is elaborated in detail as follows.
- Big Data
Big data refers to the collection of a huge amount of raw facts and figures that grows exponentially over time. The software industry provides multiple tools for managing data in various ways, each with its own operations for data management or data warehousing. Which tool to choose depends on the software and on the user's requirements: a basic spreadsheet program saves data in rows and columns with simple operations, whereas Microsoft Excel behaves like spreadsheet software but also provides advanced, modernized features, operations, and functions.
In the case of big data, however, these tools cannot store the data in the traditional way because of the complexity and the huge amount of data flowing in at run time. When the data size is this large, normal tools cannot handle it, which causes further complications. Big data is commonly divided into the following types.
- Structured Big Data
- Semi-Structured Big Data
- Unstructured Big Data
- Structured Big Data
Structured data refers to any data that can be accessed, processed, and stored in a fixed format.
This type of data can be handled easily with existing tools and expertise, because its formatting and sequence are completely managed and consistent, which makes it straightforward for tools to store and extract. The main problem is the quantity, or size, of the data, which is increasing too rapidly.
Examples Of Structured Data:
| Std-id | Std-Name | Gender | Department | Fees | Age |
|--------|-----------------|--------|------------|--------|-----|
| 2365 | Rajesh Kulkarni | Male | Finance | 650000 | 15 |
| 3398 | Pratibha Joshi | Female | Admin | 650000 | 17 |
| 7465 | Shushil Roy | Male | Admin | 500000 | 19 |
| 7500 | Shubhojit Das | Male | Finance | 500000 | 18 |
| 7699 | Priya Sane | Female | Finance | 550000 | 17 |
Structured data is always in a manageable form: in the student table above, every detail sits in its own column and every column is in its normalized form.
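A minimal sketch of how such structured data is typically held in a relational table (the column values come from the student table above; the in-memory SQLite database and the sample query are illustrative assumptions):

```python
import sqlite3

# In-memory database for illustration; in practice this would be a file or a database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE students (
        std_id INTEGER PRIMARY KEY,
        std_name TEXT,
        gender TEXT,
        department TEXT,
        fees INTEGER,
        age INTEGER
    )"""
)
rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000, 15),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000, 17),
    (7465, "Shushil Roy", "Male", "Admin", 500000, 19),
    (7500, "Shubhojit Das", "Male", "Finance", 500000, 18),
    (7699, "Priya Sane", "Female", "Finance", 550000, 17),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?, ?)", rows)

# Because the format is fixed, querying is simple and predictable.
for std_id, name in conn.execute(
    "SELECT std_id, std_name FROM students WHERE department = 'Finance'"
):
    print(std_id, name)
```

Because every record follows the same fixed schema, tools can index, filter, and aggregate this kind of data without any special handling.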
- Semi-Structured Data
Semi-structured data is close to structured data: it requires only some basic operations to become structured data, which can then be extracted easily with the help of tools. Semi-structured data includes characteristics of both kinds of data. The following example shows semi-structured data: records stored without any specified relational schema.
Example of Semi-Structured Data
Personal data stored in an XML file:
<rec><name>SAMR Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Sam R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Loreal Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>45</age></rec>
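A minimal sketch of how such semi-structured records can be turned into structured rows (the rec, name, sex, and age tags come from the example above; wrapping them in a single root element is an assumption so the snippet forms one well-formed document):

```python
import xml.etree.ElementTree as ET

# The records from the example above, wrapped in a root element to form valid XML.
xml_doc = """<people>
  <rec><name>SAMR Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Sam R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_doc)

# Each <rec> becomes a structured row (here a dict) that could be loaded into a table.
rows = [
    {
        "name": rec.findtext("name"),
        "sex": rec.findtext("sex"),
        "age": int(rec.findtext("age")),
    }
    for rec in root.findall("rec")
]
print(rows)
```

The tags make the data self-describing enough that a few lines of parsing are all it takes to make it fully structured.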
- Unstructured Data
Unstructured data is any data whose shape or organization is unknown. It is very complex and massive to handle, and it creates many hurdles for data management tools. Unstructured data is frequently found in heterogeneous data sources that combine simple text files with photos, videos, and other types of data. Organizations nowadays have a lot of data at their disposal, but since this data is in its raw or unstructured form, they are unable to derive value from it.
Example of Unstructured Data
A Google search results page, which mixes text, links, images, and video with no fixed schema, is a typical example.
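As a minimal sketch of how heterogeneous, unstructured sources often look in practice (the directory path is hypothetical), a first step is usually just taking inventory of the raw material:

```python
from collections import Counter
from pathlib import Path

# Hypothetical folder containing a mix of text files, images, videos, logs, etc.
raw_dir = Path("raw_data")

# Tally files by extension; unlike rows in a database table, nothing here shares a schema.
counts = Counter(
    p.suffix.lower() or "<no extension>"
    for p in raw_dir.rglob("*")
    if p.is_file()
)

for ext, n in counts.most_common():
    print(f"{ext}: {n} file(s)")
```

Each of those file types then needs its own specialized processing (text extraction, image analysis, video transcoding) before any value can be derived from it.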
Furthermore, big data can be described by three characteristics, also called the 3 Vs of big data:
- Volume of Big Data
- Velocity of Big Data
- Variety of Big Data
The Volume of Big Data:
Volume is the most frequently cited V of big data because the volumes involved can be enormous; we are referring to data volumes so large they are virtually incomprehensible. Many corporations and industries generate a lot of data, perhaps because they have a lot of customers or because they feed AI systems with data. This includes the intelligent appliances in our homes that are constantly absorbing information from their environment, and services like Uber, which has millions of users at any given moment and adds a tonne of data to the mix.
The volume of data is growing from a variety of sources, including the clinic (imaging files, genomics/proteomics and other "omics" datasets, biosignal datasets from solid and liquid tissue and cellular analysis, and electronic health records), patient sources (such as wearables, biosensors, symptoms, and adverse events), and external sources like insurance claims data and published literature. One entire genome binary alignment map file, for instance, can be 90 gigabytes or more. There are also several major producers of big data volume, some of which are described below.
- Big Data Volume of Facebook
For instance, Facebook stores pictures. That statement doesn't begin to convey the scale until you comprehend that Facebook has more users than China has people, and each of those users has a sizable number of images stored. Around 250 billion pictures are currently being stored on Facebook.
Can you picture it? Seriously, try to mentally process 250 billion images. Facebook also had 2.5 trillion posts as of 2016. It is difficult to even imagine such a large amount.
Therefore, when we talk about volume in the big data era, we are talking about absurdly massive amounts of data, and the collections will only grow larger as time goes on. As an illustration, we now add connected, smart, sensor-based devices to almost every object in the world, such as smart refrigerators, smart air conditioners, smart watches, and so on.
- Big Data Volume of Smart Sensors:
Think about the volume of data being retrieved from each one. In my garage, I have a temperature sensor. With just one sensor taking one measurement per minute, roughly 525,600 data points can be collected annually. Consider a plant with a thousand such sensors; the temperature alone would represent over half a billion data points a year.
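The arithmetic behind those figures is easy to reproduce; here is a quick sketch (the one-reading-per-minute rate and the thousand-sensor plant are the assumptions from the paragraph above):

```python
# One temperature reading per minute from a single sensor.
readings_per_minute = 1
minutes_per_year = 60 * 24 * 365              # 525,600 minutes in a non-leap year

points_per_sensor_per_year = readings_per_minute * minutes_per_year
print(points_per_sensor_per_year)             # 525600

# A plant instrumented with a thousand such sensors.
sensors = 1_000
print(sensors * points_per_sensor_per_year)   # 525,600,000 -- over half a billion points a year
```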
Or take a look at our modern world of connected apps. Everyone has a smartphone in their pocket. Consider a to-do list app as a quick example: vendors increasingly keep app data in the cloud so that customers can access their to-do lists from different devices.
SaaS-based app vendors typically have a lot of data to keep because many apps use a freemium model, where a free version is used as a loss leader for a premium version.
- Google Play Big Data Volume
According to Google Play, the to-do manager I use, Todoist, has about 10 million active installs, and that number does not even include installs on the web and iOS. Each of those users has lists of items, and all of that information must be kept. Todoist definitely lacks the scale of Facebook, yet it still stores a more staggering amount of data than nearly any application did even a decade ago. And of course, there are also all of the internal company data repositories, which cover everything from the energy sector to healthcare to national security.
The Velocity of Big Data:
The speed at which data is generated is referred to as “velocity.” The real potential in the data is determined by how quickly it is generated and processed to satisfy requests.
The speed at which data enters from sources such as business processes, application logs, networks, social media websites, sensors, mobile devices, etc. is referred to as big data velocity. There is an enormous and constant influx of data.
Velocity in the context of big data refers to the rate of data inflow. Using the earlier Facebook example, users post 900 million photos every single day on top of the roughly 250 billion images the social media giant already stores, and this massive volume of data needs to be processed, filed, and retrieved every day. Sensor data is another instance of velocity: as the Internet of Things grows rapidly, there will be an increasing number of connected sensors, effectively resulting in practically constant data transmission.
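To make that rate concrete, here is a quick sketch of the arithmetic (using the 900-million-photos-per-day figure from the paragraph above):

```python
photos_per_day = 900_000_000
seconds_per_day = 24 * 60 * 60                # 86,400 seconds

photos_per_second = photos_per_day / seconds_per_day
print(round(photos_per_second))               # roughly 10,417 photos arriving every second
```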
The speed at which this volume of data is handled is a crucial element, since overall research complexity is increasing and more clinical data points must be processed in the same period of time or less. The goal is to identify efficacy and safety signals more quickly by using varied and vast data inputs. This may allow quicker submissions to regulatory agencies, earlier go/no-go determinations, and, if toxicities are discovered, the ability to limit the application of an efficacy or safety signal to specific comorbidities, drug interactions, or demographics. Further examples of big data velocity include the Twitter feed, data security analysis, and IoT.
- Big Data Velocity: IoT (Internet of Things)
The most rapidly growing area of information technology is the Internet of Things. IoT is made up of two words: "Internet" and "Things". A network is a collection of connected computers that must be able to send and receive data from one another in order to function. The IoT area has gained prominence in recent years through the implementation of a worldwide infrastructure of real, network-based smart sensory devices, accessible anytime and anywhere, to connect anything. It offers an enormous selection of network-based technologies, which can be either wired or wireless depending on the situation. The Internet of Things can be described as a global network in which communication takes place human-to-human, thing-to-thing, thing-to-human, and human-to-thing.
- Twitter Feed
The constant stream of new posts on Twitter is also called "the firehose", because millions of feed items are produced in real time. The velocity on Twitter is much higher than on most other platforms.
- Cyber Security:
With the increase in cyberattacks, cybercrime, and cyber espionage, sinister payloads may be concealed in the flow of data coming over the firewall. That data flow must be inspected and examined for anomalies and patterns of activity that raise red flags in order to prevent compromise. As more and more data is encrypted for security, this is becoming more difficult; at the same time, malicious actors are encrypting packets to conceal their malware payloads.
The Variety of Big Data:
Variety is the third area of big data. In the context of big data, variety means that the data might vary greatly from one application to another, with a large portion of it also being unstructured.
Unlike in the past, the data may not necessarily fit neatly into one database application. Emails provide a good illustration of big data's variety: since each message has a unique destination, time stamp, potential attachments, and unique text, no two messages are ever the same. Emails, like audio recordings, films, and images, are a type of data that tends to be exceedingly diverse and unstructured.
A wider range of data sources is being used for aggregate comparisons across clinical trials as well as independent, cross-comparison analysis, and the inputs include both structured and unstructured data. Reviewing this data with conventional approaches has often become obsolete and ineffective. A good example of the need for efficient, machine-driven systems and technology platforms for both absorbing and processing this volume of data is the more than one million biomedical articles published annually (about two papers per minute).
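The "two papers per minute" figure follows directly from that annual volume; here is a quick sketch of the arithmetic (using the one-million-papers-per-year figure from the paragraph above):

```python
papers_per_year = 1_000_000
minutes_per_year = 60 * 24 * 365              # 525,600 minutes

print(papers_per_year / minutes_per_year)     # roughly 1.9 papers published every minute
```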