Thursday, August 1, 2019

Tabular Data

Tabular data are opposed to relational data, like SQL database. In tabular data, everything is arranged in columns and rows. Every row have the same number of column (lacking information or missing value substituted by "N/A" (also zero values, as SQL NULL value, are not allowed in tabular data structure). The first line of tabular data is most of the time a header, describing the content of each column. The most used format of tabular data in data science is CSV (Comma-Separated Values). Every column is surrounded by a character (a tabulation, a coma ..), delimiting this column from its two neighbors.
The best is to think of tabular data as being "organized by row" where each row corresponds to a unique identifier such as the time a measurement was made (opposite in SQL where keys are used as unique identifier). For example you can store phone-book as tabular data and each row shows persons Name-Surname and Phone Number. To find relations between rows in tabular data you'll need first load all data in memory and only after that can find relations between rows (example: find all persons with numbers starting with +994 which is code of Azerbaijan). If this phone-book will be in relation structure, then one phone-book table:
  1. tabular data:
    • name;surname;address;zip;phone-number
    • name1;surname1;addressX;zipA;phone1,phone2
    • name2;surname2;addressY;zipB;phone1
    • name3;surname3;addressZ;zipA;phone1,phone2
  2. due to First Normal Form (1NF) - no repeating groups ("phone" is group - two columns like "phone1" and "phone2", or one column "phone" with "phone1,phone2" data are not allowed by 1NF). 1NF adds redundant/repeated values to data:
    • name;surname;address;zipCode;phoneNumber
    • name1;surname1;addressX;zipA;phone1
    • name1;surname1;addressX;zipA;phone2
    • name2;surname2;addressY;zipB;phone1
    • name3;surname3;addressZ;zipA;phone1
    • name3;surname3;addressZ;zipA;phone2
  3. due to Second Normal Form (2NF) - 1NF + all the non-key columns are dependent on the table’s primary key, the table serves a single purpose (each column must depend on the primary key and serve to describe what the primary key identifies, if not - move that column into another table). If we add primary-key rowID, then this key will uniquely describe each row having unique number for that person, but person itself is not describes purpose of the primary-key, so we'll move all person related data to the other table. Main idea of the 2NF is to reduce amount of redundant/repeated data. 
      1. We use table to store all person related stuff (name, surname, address, zip-code):
        • personID;name;surname;address;zipZode
        • 100;name1;surname1;addressX;zipA
        • 200;name2;surname2;addressY;zipB
        • 300;name3;surname3;addressZ;zipA
      2. Now our phone-numbers table will be (we must add rowID to uniquely identify each row) and it is in 2NF:
        • rowID;personID;phoneNumber
        • 1;100;phone1
        • 2;100;phone2
        • 3;200;phone1
        • 4;300;phone1
        • 5;300;phone2
  4. due to Third Normal Form (3NF) - 2NF + contains only columns that are non-transitively dependent on the primary key. Non-transitively dependent means non-through dependent. Dependence - age depends on birth-date.  Transitive dependency - we have 3 columns PK, BMI (Body Mass Index) , oWtf (Over Weight True-Flase), here PK helps to find BMI and oWtf, but oWtf also depends on BMI as BMI>25 is overweight, so oWft relies on PK through BMI.  So all columns in table are dependent only on primary-key (2NF) and not on other columns:
    1. Our phone-number table is 3NF:
      • rowID;personID;phoneNumber
      • 1;100;phone1
      • 2;100;phone2
      • 3;200;phone1
      • 4;300;phone1
      • 5;300;phone2
    2. But our person table is only 2NF, each column related to primary-key (PK), bot not 3NF, because we can use PK to find address of  a person and also can use PK to find zip code of the person, but at the same time address depends on zip code, this is transitive dependency:
      1. Move address an zip code to the separate table:
        • addrID;address;zipCode
        • 111;addressX;zipA
        • 222;addressY;zipB
        • 333;addressZ;zipA
      2. Now our person table will be:
        • personID;name;surname;addrID
        • 100;name1;surname1;111
        • 200;name2;surname2;222
        • 300;name3;surname3;333

So we can say that relational data structures:

  1.  are the same like tabular but with applied normalization, so that one table of tabular data becomes several relational tables - relations
  2. allows zero values while tabular data doesn't
  3. to query tabular data you need to load all data into RAM

No comments:

Post a Comment