Creating a Database From Spreadsheets

 This post is inspired by my recent work with a client, where we navigated the transition from bunch of spreadsheets to a database system to accommodate their growing data needs and ambitions. I'll explore the two main approaches to data storage and management: databases and spreadsheets, highlighting their differences and when it's right time to move your data to a database. From this experience I'll also present a migration pipeline, share 8 valuable tips and showcase an example with code similar to what we implemented. For any feedback or inquiries, feel free to reach out!


Relational Data Model
Let’s first look at the most ubiquitous way to organize or store data. It’s the table format and is queried by SQL. This dated way back in the early 70s and still leads the world of databases by far. It relies on the relational data model that defines a relation as a set of tuples. Each tuple represents a row and a collection of rows in the same domain forms a table. Like numbers can be manipulated with operations like adding or multiplying the manipulation between these relations is mathematical operations from set theory. Relational Database Management Systems (RDBMS) builds upon this foundation and also guarantees ACID properties.


ACID properties can be explained simply by banking example:
Atomicity guarantees that a transaction either fully occurs or not. When you send money to your friend, either the full amount is transferred out or none
Consistency ensures validity. When you send money the total amount should not be changed, meaning no amount of money within a system appears or disappears
Isolation means concurrent transactions do not interfere with each other and a process updates the system at a time. Imagine two people making $70 spending on a $100 account simultaneously, isolation ensures only one can make the transaction
Durability is the guarantee that the successful transactions are permanent. If you deposit money to your account just before a power outage in the bank that amount will persist when the system is back up.
Relational databases are the most common way to organize and manipulate data because they provide data integrity with ACID properties and any query relies on the mathematical foundations with decades old optimization efforts. 


Cell Structured Data Model
Spreadsheets are another way to store tabular formatted data but with much more relaxed requirements. The format of a spreadsheet is cell structured data model, each cell is defined by its row and column and attains a value individually. Users can still define a relationship between cells but it’s not enforced. This brings flexibility to users so that the restrictions in RDBMS are gone. It is accessible to most, user friendly environment making spreadsheets the most widely used kind of software for businesses. It is a great tool for leveraging simple calculations and storing small amounts of data (<1M rows). The problem starts when businesses use spreadsheets to integrate into their processes and probably need to limit the data format to ensure integrity.


When to Use Databases Instead of Spreadsheets?
There are several reasons to not use spreadsheets and move to a relational database. I’ll list some of them from my experience:
 Scalability. Data can scale vertically with more rows of data, or horizontally with more related tables, therefore more attributes. Spreadsheets can accommodate certain limits of rows and columns and defining relations among sheets can be frustrating because it’s not designed to have complex relations among the other sheets. Spreadsheets do not scale as well as the volume and the complexity of data rise up. 
 Response Time. More data and more relations among the attributes/cells within the sheets or among other sheets make manipulating data slower. The velocity of analysis can be crucial for many business purposes.
 Automation. Whether it’s AI automation or RPA, data is required to move around by the systems called data pipelining. Spreadsheets are not meant to move data in or out. Some services make this possible but it’s not a viable choice comparing it with an RDBMS. 
 Security. Many spreadsheet software grant access roles to users but comparing it with databases it’s simple like read/edit. Databases grant user-specific roles and they can encrypt sensitive data. These roles can be extended with file transmission protocols enabling a secure sharing environment. 
 Data Integrity. Spreadsheets are flexible enough that a user can make mistakes filling cells with inconsistent values or types. The transactions are not fully ACID compliant in spreadsheets, which may lead to data integrity problems. Of course, if all the above listed are not your concern, you might use spreadsheets in your business applications. In fact, many times spreadsheets are the best way to go. Especially when there should be no preset structure of data, no attribute type, or a very small amount of data to make simple calculations. Cell structured data model brings its advantages focused on more flexible and presentable functionalities, like filling a cell with color for soft deleting.


8 Tips for Migrating from Spreadsheets to Databases
Here are my tips and caveats on migration from spreadsheets to RDBMS:
1- Design the destination schema. ER diagram illustrates the table’s attributes and how the relations among the tables are defined among these attributes. This part is important if there are many sheets that are related to each other
2- Find the data points specific to the cell structured data format and add them as a column if necessary. For example I mention the color filling, my clients fill a row that shouldn’t be used for their application. This is a way to soft delete an element. A simple Excel function can transfer this information into a column, like an isDeleted column with a boolean value
3- Transform data to a tabular structure (normalize) if needed. RDBMS requires data to be in a tabular format but normalizing data brings its own problems. Deciding which normal forms to be loaded into a database requires evaluating the tradeoffs of using NFs. mainly the tradeoff I had was redundancy vs complexity, since there are no storage problems and low macros, data duplication happened to normalize. Be careful with normalizing the merged cells
4- Data types might not always align. First of all spreadsheets do not enforce a common type under a column, so defining a data validation function within a column or just formatting the column might be required. If there are multiple sheets to be merged make sure they’re the same type, otherwise SQL’s union function will raise an error. Also when determining the type of the table when creating from a spreadsheet, make sure to write DDL from scratch, defining the data types directly from the spreadsheet itself
5- Make sure to define primary keys, if there is not a unique value for each column, then it can be added before loading it to a database. I used python’s uuid library to add a unique identifier to each column and make that a primary key for the table
6- Exporting data to csv or other formats might lose some of the information within the data, especially the ones specific to cell structured data format. The example I gave before, filling the cell with color, or strikethrough a text, is out of the limitations of csv format. There are special connectors for extracting all the information within a spreadsheet. I used pygsheets for Google Sheets to transform data in Python. You also need to enable Google’s Sheets API and authorize it to extract the data from the Sheets. 
7- Some databases do not accept special characters. There should be a mapping between special characters to UTF-8. In my case, I map Turkish characters to English ones with a simple dictionary. 
8- Indexing and cleaning could be necessary for the organization. My connector took empty spaces as well, strangely taking different numbers of empty columns from the same structured sheets. I don’t know the exact reason but it could be about the used cells previously. I drop the columns with no values to union all the sheets in the same context. The same structured sheets may also have different starting points, when ingesting it all in a loop I put a try-except-finally statement to overcome this


Data Pipeline Orchestration

There are 30 times more people who know how to use a spreadsheet than to write SQL. It’s surely more common and easier to adapt and that’s why it’s a must-have for most business units. It’s either moving data from a spreadsheet or loading the result of a query to a spreadsheet for business people to work on, the process of moving data from source to destination is required. So far I have written about migrating data from spreadsheets to databases but sometimes we need a constant transfer of data in either way as batches. The process of transferring data from a source to a destination is called data pipelining. In the past cron jobs were state of art for scheduling the transfer of data but as the volume and variety of data increased and storage systems advanced, so we developed more advanced ways to work with them. These modern frameworks are called orchestration tools and they let the user build, schedule, and monitor the data pipelines. Extraction can be from different sources with predefined requirements thus making the data flow more granular. It can be parameterized and configured to reuse. For my use case, I used Prefect as an orchestration tool. It defines each pipelining action as a task and then lets the user manage the sequence by organizing tasks into a workflow. Users can decide how to take action when a task fails, the number of retries, log and handle the errors. 


Example Use Case with Code
Let’s say that we’re building a database of marketing channels. Each channel and its metrics are stored in a separate sheet in Google Sheets and financial tables are stored in Excel files. The aim is to make a data pipeline that takes the local Excel files and Google Sheets from API, and then sends merged data to Bigquery Data Warehouse. 
1- Define extraction tasks from the local file system and GSheets API, and save it to a dataframe

@task
    def extract_from_gsheets(sheet):
        gc = pygsheets.authorize(service_file='cred_sheets.json')
        combined_data = pd.DataFrame()

    data_dict = {}
        for i in range(14):
            sh = gc.open(sheet_name)
        wks = sh[i]
            data = wks.get_as_df(start='A1', include_tailing_empty=False, headers=1).reset_index(drop=True)
            data.name = wks.title  
            data = data.drop(columns=[col for col in data.columns if col == '']) 
            data_dict[wks.title] = data 
        
    combined_data = pd.concat(data_dict.values(), ignore_index=True)

        return combined_data


2- Then transform the data by converting it to a suitable encoding, adding primary keys if they lack, rearrange columns, etc.

@task
    def transform_data(combined_data):
        combined_data['id'] = [uuid.uuid4().hex for _ in range(len(combined_data))]
        cols = list(combined_data.columns)
        cols.insert(0, cols.pop(cols.index('id')))
        combined_data = combined_data.rename(columns=replace_turkish_chars)
    combined_data = combined_data[cols]
        return combined_data 


3- Load it to the DW via an endpoint

@task
    def load_to_postgres(transformed_data):
        engine = create_engine(f'postgresql://{postgres_user}:{postgres_password}@{postgres_host}:{postgres_port}/{postgres_db}')
        transformed_data.to_sql('Tutors', engine, if_exists='append', index=False)


4- Build the data flow by organizing these tasks specifying how frequent the workflow starts or which events triggers the flow
5- This flow can be deployed locally or to Prefect Cloud

@flow(name="ETL - Google Sheets to PostgreSQL", log_prints=True)
    def etl_flow():
    data = extract_from_gsheets(sheet_name)
    transformed_data = transform_data(data)
    load_to_postgres(transformed_data)

    etl_flow()
    if __name__ == "__main__":
        etl_flow.serve(name="my-first-deployment",
                          tags=["onboarding"],
                          interval= 3600)