
Sunday, June 02, 2024

What exactly is involved in Data Governance?

Data governance is a holistic approach to managing data throughout its lifecycle, from creation to retirement. It encompasses the processes and practices that ensure the availability, quality, security, and proper management of an organization's data, keeping data accurate, secure, and aligned with business goals.

Key activities involved in data governance are:

1. Data Classification and Contextualization:

  • Identify and classify data based on its sensitivity, criticality, and business context.
  • Apply metadata tags to data to enhance visibility and understanding.

2. Data Profiling and Data Mapping:

  • Profile data to understand its characteristics, patterns, and quality.
  • Create data maps to visualize data flows and relationships across systems.

3. Data Lineage:

  • Establish data lineage to track data movement from source to destination.
  • Understand how data is transformed, aggregated, and used within the organization.

4. Metadata Management:

  • Maintain metadata repositories that describe data attributes, definitions, and ownership.
  • Ensure consistent metadata across systems.

5. Data Ownership and Stewardship:

  • Assign data ownership to specific individuals or teams.
  • Define roles and responsibilities for data stewardship.

6. Data Security and Privacy:

  • Implement security measures to protect data from unauthorized access.
  • Comply with privacy regulations (e.g., GDPR, CCPA) by managing access and consent.

7. Data Quality Control:

  • Monitor data quality through data profiling, validation, and cleansing.
  • Address data anomalies and inconsistencies.

8. Data Access and Authorization:

  • Define access controls based on user roles and permissions.
  • Ensure appropriate data access for authorized users (see the SQL sketch after this list).

9. Data Risk Management:

  • Assess data risks (e.g., data breaches, data loss) and develop mitigation strategies.
  • Monitor and respond to data security incidents.

10. Data Sharing and Dissemination:

  • Establish guidelines for sharing data within and outside the organization.
  • Facilitate secure data exchange with partners and stakeholders.

11. Compliance Monitoring and Auditing:

  • Regularly audit data governance processes and policies.
  • Ensure ongoing compliance with data regulations and internal standards.

12. Data Governance Council or Committee:

  • Form a cross-functional group responsible for setting data governance policies and making strategic decisions.
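
As a small illustration of item 8 (Data Access and Authorization), role-based access is often enforced directly in the database. A minimal SQL sketch, with role and table names made up for illustration:

-- Hypothetical role and table names, for illustration only
CREATE ROLE data_analyst;
CREATE ROLE data_steward;
GRANT SELECT ON SalesData TO data_analyst;          -- analysts may only read the data
GRANT SELECT, UPDATE ON SalesData TO data_steward;  -- stewards may also correct records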


Friday, April 26, 2024

SQL: A Practical Introduction for Querying Databases (IBM)

This is a refresher course. I already have a diploma in Oracle RDBMS.

Structured Query Language

  • A language for relational databases
  • Used to query data
Data

  • A collection of facts: words, numbers, pictures, etc.
  • Data needs to be secured and accessible when required.
  • Both can be achieved using a database.
Database
  • A repository of data; a program that stores data.
  • Provides functionality for adding, modifying and querying data.
Different kinds of databases
  • Relational database
    • Data is stored in tabular form - columns & rows
    • Like in a spreadsheet
    • Columns hold properties about each item - such as last name, first name, email address, etc.
    • A table is a collection of related things - for example employees, salaries, etc.
    • In a relational database, you can form relationships between tables.
    • e.g., an Employee table, a Salary table, etc.
    • RDBMS
  • DBMS
    • Database Management System - the set of software tools used to manage the data in a database.
    • A database is a repository of data.
    • The terms database, database server, database system, data server, and DBMS are often used interchangeably.
RDBMS
  • MySQL, DB2, Oracle
SQL
  • Create table
  • Insert data into a table
  • Select statement to see the data in a table
  • Update data in a table
  • Delete data from a table

Types of SQL Statements (one-line examples below)

- Data Definition Language
- Data Manipulation Language
- Data Control Language
- Transaction Control Language
Image courtesy: Geeksforgeeks.com
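
A rough one-line sketch of each statement type (table and user names are made up for illustration):

CREATE TABLE Demo (id INT);          -- DDL: define database objects
INSERT INTO Demo (id) VALUES (1);    -- DML: work with the data
GRANT SELECT ON Demo TO some_user;   -- DCL: control access to the data
COMMIT;                              -- TCL: manage transactions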

The result returned by an SQL query is called a result set (a table of results).

View the table structure of an existing table (sp_help is a SQL Server system stored procedure):

sp_help BOOK;



DML - Read and Modify data

Select Statement
  • SELECT col1, col2, col3 FROM FilmLocations;
  • SELECT * FROM FilmLocations;
  • SELECT COUNT(*) FROM FilmLocations;
  • SELECT DISTINCT Directors FROM FilmLocations;
  • SELECT * FROM FilmLocations LIMIT 25;
  • SELECT DISTINCT Title FROM FilmLocations WHERE ReleaseYear=2015 LIMIT 3 OFFSET 5; -- retrieve the next 3 distinct film titles after skipping the first 5 films released in 2015.
INSERT Statement
  • INSERT INTO TableName (ColumnName1, ColumnName2, ..., ColumnNameN) VALUES (<Value1>, <Value2>, ..., <ValueN>);
  • Inserting 1 row at a time:
    • Insert into AUTHOR (Author_ID, Lastname, Firstname, Email, City, Country) values ('A1', 'Chong', 'Raul', 'rfc@ibm.com', 'Toronto', 'CA');
  • Multiple rows can be inserted at once:
    • Insert into Author (Author_ID, Lastname, Firstname, Email, City, Country)
      values
        ('A1', 'Chong', 'Raul', 'rfc@ibm.com', 'Toronto', 'CA'),
        ('A2', 'Ahuga', 'Rav', 'ra@ibm.com', 'Toronto', 'CA');
  • Insert into Instructor(ins_id, lastname, firstname, city, country) VALUES (8, 'Ryan', 'Steve', 'Barlby', 'GB'), (9, 'Sannareddy', 'Ramesh', 'Hyderabad', 'IN');   

Entity name - Author
Entity Attributes - Author_ID, Country, City, Email, FirstName, LastName.

Update statement
  • Alter data in a table using UPDATE statement.
  • Update TableName SET ColumnName1=Value1 WHERE [condition];
  • Example: Update AUTHOR SET Lastname='KATTA', Firstname='Lakshmi' WHERE AUTHOR_Id='A2';
  • If the WHERE clause is not specified, all rows will be updated.

Delete statement
  • Remove rows from a table using the DELETE statement.
  • Delete from TableName WHERE <condition>;
  • Delete from AUTHOR WHERE Author_ID IN ('A1', 'A2'); -- both rows will be deleted.
  • If the WHERE clause is not specified, all rows will be deleted.


Relational Databases

1. RELATIONAL MODEL



2. E-R MODEL (Entity - Relationship Model)

Book - entity (drawn as rectangles in E-R diagrams)
Attribute - Title, Description, etc. (drawn as ovals in E-R diagrams)


  • Entity Book becomes a table in a database. 
  • Attributes become the columns.


A Primary Key (PK) uniquely identifies each tuple / row in a table.



Foreign Keys are Primary Keys defined in other tables. They create links between tables. In the above example, Author ID is a primary key in the Author table, but in the Author List table it is a Foreign Key linking the Author List table to the Author table.
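
The link can then be followed with a join. A minimal sketch against the AUTHOR and AUTHOR_LIST tables created later in this post:

SELECT A.Firstname, A.Lastname, AL.BOOK_ID, AL.AuthRole
FROM AUTHOR A
JOIN AUTHOR_LIST AL ON AL.AUTHOR_ID = A.AUTHOR_ID;  -- the foreign key in AUTHOR_LIST points back to AUTHOR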

Common Datatypes include Characters (VARCHAR), Numbers, and Date/Times.

DDL VS DML

  • DDL - define, change, and drop database objects (such as tables). 
    • CREATE
    • ALTER
    • TRUNCATE
    • DROP
  • DML - read and modify data in tables. 
    • Also known as CRUD operations.
    • INSERT
    • SELECT
    • UPDATE
    • DELETE
CREATE



Create table AUTHOR (

AUTHOR_ID  Int PRIMARY KEY NOT NULL,
Lastname varchar (30) NOT NULL,
firstname varchar (30) NOT NULL,
email varchar (30),
city varchar (30),
country varchar (30)

)

select * from AUTHOR

Create table BOOK (

BOOK_ID  Int PRIMARY KEY NOT NULL,
Title varchar (30),
Edition Int,
BYear Int,
Price Decimal (5,2),
ISBN varchar (6),
Pages Int,
Aisle Int,
Description varchar (80)
)

Select * from BOOK;


Create table BORROWER (

BORROWER_ID  Int PRIMARY KEY NOT NULL,
Lastname varchar (30) NOT NULL,
firstname varchar (30) NOT NULL,
email varchar (30),
Phone Int,
Address varchar (80),
City varchar (30),
Country varchar (30),
BRDescription varchar (80)

)

select * from BORROWER;


Create table AUTHOR_LIST (
AUTHOR_ID Int Foreign Key REFERENCES AUTHOR(AUTHOR_ID),
BOOK_ID Int Foreign Key REFERENCES BOOK(BOOK_ID),
AuthRole varchar (30)

)

Create table COPY (
COPY_ID Int PRIMARY KEY NOT NULL,
BOOK_ID Int Foreign Key REFERENCES BOOK(BOOK_ID),
STATUS varchar (6)
)

ALTER, DROP, TRUNCATE Tables

  1. Alter
    1. Add / Remove columns
    2. Modify datatype of a col
    3. Add / remove keys
    4. Add / remove constraints
            Alter table <table_name>
            Add <column1> <datatype>,
                <column2> <datatype>;

Add cols

Alter table AUTHOR 
Add Qualification varchar(30);

Delete cols           

Alter table BOOK
drop COLUMN Qual;

Modify datatype of a col

ALTER TABLE AUTHOR
ALTER COLUMN Qualification VARCHAR(4); -- instead of VARCHAR(30); you can also change the column to another datatype (string to int, etc.).
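
The section heading also lists TRUNCATE and DROP; a minimal sketch of both, using the AUTHOR table from the examples above:

TRUNCATE TABLE AUTHOR;   -- removes all rows but keeps the table definition
DROP TABLE AUTHOR;       -- removes the table itself
-- Note: both fail if another table (e.g. AUTHOR_LIST) still references AUTHOR via a foreign key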

Constraints

Create table myKeyConstraints (

sno int PRIMARY KEY NOT NULL,

firstname varchar(30),

lastname varchar(30)

);

sp_help mykeyconstraints; -- Check the constraint name highlighted below.


Drop constraint

alter table myKeyConstraints
drop constraint PK__myKeyCon__DDDF64469E1B9A37;

-- Constraint is gone...


PRIMARY KEY = UNIQUE + NOT NULL
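
In other words, a column declared UNIQUE and NOT NULL behaves like a candidate key. A minimal sketch (table and column names made up):

CREATE TABLE Member (
    member_no INT UNIQUE NOT NULL,   -- effectively a candidate key
    email VARCHAR(60)
);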








Thursday, April 25, 2024

Hadoop Ecosystem

 Courtesy: data-flair.training, bizety

  • Hadoop provides a distributed storage and processing framework.
    • Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. 
    • It includes two main components: 
      • Hadoop Distributed File System (HDFS) for storage and 
      • MapReduce for processing. 
    • Hadoop is designed to store and process massive amounts of data in a fault-tolerant and scalable manner.
  • Hive is a data warehouse infrastructure built on top of Hadoop. It provides a SQL-like interface for querying and analyzing data stored in Hadoop, and is suitable for data warehousing and analytics use cases (see the HiveQL sketch after this list).
  • PySpark enables data processing using the Spark framework with Python.
  • YARN manages cluster resources to support various processing frameworks.
  • HBase provides a scalable NoSQL database solution for real-time access to large datasets.
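
A rough sketch of Hive's SQL-like interface (the table, columns, and HDFS path below are made up for illustration):

CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
    user_id   STRING,
    url       STRING,
    view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';

SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;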

Saturday, July 08, 2023

Dataset vs. Database

 In the context of big data, the terms "dataset" and "database" refer to different concepts and have distinct meanings. 

A dataset refers to a collection of data, while a database is a software system used to store and manage structured data. Datasets can be stored in databases, but databases can contain multiple datasets along with the necessary infrastructure to manage and manipulate the data.

Dataset: A dataset is a collection of related and structured data that is organized for a specific purpose. It represents a single unit of information that can be analyzed and processed. A dataset can consist of various types of data, such as text, numbers, images, or any other form of digital information. In the context of big data, datasets often refer to large and complex collections of data that are generated from various sources.

Datasets in big data are typically used for analysis, machine learning, and other data-driven tasks. They may include structured data (e.g., from relational databases), semi-structured data (e.g., JSON or XML documents), or unstructured data (e.g., text documents, images, videos). Datasets can be stored and accessed in various formats, such as CSV, JSON, Parquet, or databases.

Database: A database, on the other hand, is a software system used to store, manage, and organize structured data. It is a structured collection of data that is organized, indexed, and stored in a manner that allows for efficient retrieval, modification, and querying. Databases provide mechanisms for storing and retrieving data, enforcing data integrity, and supporting data manipulation operations.

Databases in the context of big data can refer to traditional relational databases, such as MySQL, Oracle, or SQL Server, as well as newer types of databases designed for big data processing, like Apache Hadoop, Apache Cassandra, or MongoDB. These big data databases are specifically designed to handle the challenges of storing and processing large volumes of data across distributed systems.

Example:
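
A small, made-up sales dataset (values are for illustration only):

Customer      | Item        | Price   | Date
--------------|-------------|---------|-----------
Alice Brown   | Laptop      | 899.00  | 2023-06-01
Raj Patel     | Headphones  |  49.99  | 2023-06-02
Mei Chen      | Monitor     | 179.50  | 2023-06-03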


In this example, the dataset represents a collection of sales data. Each row corresponds to a separate purchase, and the columns represent different attributes of the purchase, such as the customer's name, the item purchased, the price, and the date. The dataset can be further expanded with more records to include a larger set of sales data.

From the above, it may look like a database table and a dataset are the same thing, but they are not. Both represent structured collections of data, yet a database table is a specific construct within a database management system, while a dataset is a more general term that can encompass different types of structured data, including tables. Datasets tend to be more versatile, portable, and independent, whereas database tables are tightly coupled with the database management system and its rules and constraints. The main differences:

1. Structure: A database table is a specific construct within a database management system (DBMS) that organizes data in rows and columns. Each column represents a specific attribute or field, while each row represents a record or entry in the table. On the other hand, a dataset is a more general term that refers to a collection of related data, which can be organized in various formats and structures, including tables. A dataset can contain multiple tables or other data structures, depending on the context.

2. Scope and Purpose: A database table is primarily used within a database management system to store and manage structured data. It is typically part of a larger database schema that includes multiple tables and relationships between them. The purpose of a database table is to provide a structured storage mechanism for data and enable efficient querying and manipulation operations. A dataset, on the other hand, can have a broader scope and purpose. It can represent a single table or a collection of tables, as well as other types of data such as files, documents, or images. Datasets are often used for analysis, machine learning, or other data-driven tasks, and they may include data from multiple sources or formats.

3. Independence: A database table is tightly linked to a specific database instance and is managed within the database management system. It is subject to the rules and constraints defined by the DBMS, such as data types, integrity constraints, and indexing. In contrast, a dataset can be more independent and portable. It can be stored and accessed in different formats and locations, such as CSV, JSON, Parquet files, or even distributed file systems. Datasets can be shared, transferred, and processed across different systems and tools without being tied to a particular database management system.

Friday, September 17, 2021

Big Data

What is Big Data?

Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.


Characteristics of Big Data
  1. Volume
  2. Velocity
  3. Variety
  4. Variability
  5. Veracity
  6. Visualisation
  7. Value
