Working with Data from Different Sources

Eliud Nduati
6 min read · Feb 5, 2023


As a data scientist, you won't always be working with data in CSV format. With so many ways of storing and sharing data from different sources, it is important to know how to load each of them into your data science project.


In this article, we will discuss the different file types that data might come in, how they are different from each other, and different methods for loading data from these different sources.


CSV file

CSV stands for comma-separated values and is a widely used file format for storing data in tabular form. It is a plain text file that uses commas to separate values in rows and columns. CSV files can be easily opened and edited in text editors and spreadsheets. They are commonly used for data exchange between different software applications.

import pandas as pd
# load csv file
df = pd.read_csv('data.csv')
# display the first 5 rows of the dataframe
print(df.head())

In this example, the pd.read_csv() function is used to load the data from the 'data.csv' file into a pandas DataFrame. The head() function is then used to display the first 5 rows of the dataframe.
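Not every CSV-style file actually uses commas: tab- or semicolon-separated files are common, and pd.read_csv() handles them through its sep parameter. A small sketch, using io.StringIO as a stand-in for a real file on disk:

```python
import io
import pandas as pd

# a tab-separated file; io.StringIO stands in for a file on disk
raw = io.StringIO("name\tage\nAlice\t30\nBob\t25\n")

# sep tells pandas which delimiter to split columns on
df = pd.read_csv(raw, sep='\t')
print(df.shape)  # (2, 2)
```

The same parameter works when the file path is passed instead of a buffer.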

Text files

Text files are plain documents that contain unformatted text. They can be opened and edited in any text editor. Because they store data in a human-readable format, they are commonly used for log files, configuration files, and other simple data.

with open('data.txt', 'r') as file:
    data = file.read()
print(data)

In this example, the open() function is used to open the 'data.txt' file in read mode, and the read() function is used to read the entire contents of the file. The print() function is then used to display the contents of the file.
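For large files, read() pulls everything into memory at once, which may not be what you want; iterating over the file object instead yields one line at a time. A self-contained sketch that writes a small throwaway file first (the path and contents are illustrative):

```python
import os
import tempfile

# create a small throwaway file to read from
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as f:
    f.write("first line\nsecond line\n")

# iterate line by line instead of loading the whole file
lines = []
with open(path, 'r') as f:
    for line in f:
        lines.append(line.rstrip('\n'))
print(lines)  # ['first line', 'second line']
```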

ZIP files

ZIP files are compressed files that contain one or more files or directories. They are commonly used for archiving and compressing files to save space. The contents of a ZIP file can be extracted with desktop tools such as WinZip or 7-Zip, or programmatically with Python's built-in zipfile module.

import zipfile
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    zip_ref.extractall()

This code uses the zipfile module to open the 'data.zip' file in read mode and extract all of its contents into the current working directory.
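When the archive holds a single CSV, pandas can also read it directly, without a manual extraction step, via the compression parameter of pd.read_csv(). A self-contained sketch that builds a small ZIP first (the file names are illustrative):

```python
import os
import tempfile
import zipfile
import pandas as pd

tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, 'data.zip')

# build a ZIP archive containing exactly one CSV file
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('data.csv', "a,b\n1,2\n3,4\n")

# pandas reads the archived CSV directly
df = pd.read_csv(zip_path, compression='zip')
print(df.shape)  # (2, 2)
```

This only works when the archive contains a single file; multi-file archives still need zipfile.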

JSON file

JSON stands for JavaScript Object Notation and is a lightweight data-interchange format. It is used for storing and exchanging data on the web. JSON files are text files that use a specific format to store data in a structured way. They can be easily read and parsed by most programming languages.

import json
# load json file
with open('data.json', 'r') as file:
    data = json.load(file)
# display data
print(data)

In this example, the json.load() function is used to parse the JSON data from the 'data.json' file and store it in a Python dictionary. The print() function is then used to display the contents of the dictionary.
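When the JSON is a list of records, pandas can flatten it straight into a DataFrame with pd.json_normalize(), which turns nested keys into dotted column names. A sketch with made-up sample records:

```python
import pandas as pd

# a list of nested records, as json.load() might return
records = [
    {"name": "Alice", "address": {"city": "Nairobi"}},
    {"name": "Bob", "address": {"city": "Mombasa"}},
]

# nested keys become dotted column names like 'address.city'
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['name', 'address.city']
```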

SQL

SQL stands for Structured Query Language and is a domain-specific language for managing and manipulating relational databases. SQL is used to create, modify, and query databases, and data can be loaded from SQL databases by executing queries.

import pandas as pd
import sqlite3
# create a connection to the SQLite database
conn = sqlite3.connect('data.db')
# load data from SQL table
df = pd.read_sql_query("SELECT * FROM data", conn)
# display the first 5 rows of the dataframe
print(df.head())
# close the connection when done
conn.close()

In this example, the sqlite3 module is used to connect to an SQLite database and the pd.read_sql_query() function is used to execute a SELECT statement and load the data into a pandas DataFrame. The head() function is then used to display the first 5 rows of the dataframe.

Web pages from web scraping

Web scraping is the process of automatically extracting data from web pages. Data can be scraped from web pages using libraries such as BeautifulSoup or Scrapy. The data can then be stored in a file format such as CSV or JSON.

import requests
from bs4 import BeautifulSoup
# make a request to the website
response = requests.get('https://www.example.com')
# parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# extract the data you want from the HTML using BeautifulSoup
data = soup.find_all('div', {'class': 'example'})
# print the data
print(data)

In this example, the requests module is used to make a request to a website, and the BeautifulSoup library is used to parse the HTML content. The find_all() function is then used to extract specific elements from the HTML, such as all div elements with the class 'example'. It returns a list of matching elements, which can be processed further, for example by extracting their text and loading it into a pandas DataFrame for analysis.

Excel files

Excel files are spreadsheets that are commonly used for storing and analyzing data. They can be opened and edited using Microsoft Excel or other spreadsheet software. Excel files can also be read and parsed using libraries such as pandas.

import pandas as pd
# load excel file
df = pd.read_excel('data.xlsx')
# display the first 5 rows of the dataframe
print(df.head())

In this example, the pd.read_excel() function is used to load the data from the 'data.xlsx' file into a pandas DataFrame. The head() function is then used to display the first 5 rows of the dataframe.

Pickle file

A pickle file is a file format used for storing Python objects in a binary format. It is used for serializing and deserializing Python objects and can be used for storing data in a format that can be easily loaded back into Python. Note that unpickling can execute arbitrary code, so only load pickle files from sources you trust.

import pickle
# load pickle file
with open('data.pkl', 'rb') as file:
    data = pickle.load(file)
# display data
print(data)

In this example, the pickle.load() function is used to load the data from the 'data.pkl' file. The print() function is then used to display the contents of the file.
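The reverse direction, saving an object, uses pickle.dump() with the file opened in binary write mode ('wb'). A round-trip sketch using a temporary file (the path is illustrative):

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.pkl')

# serialize a Python object to disk
obj = {'scores': [1, 2, 3], 'label': 'test'}
with open(path, 'wb') as f:
    pickle.dump(obj, f)

# load it back and confirm the round trip
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored == obj)  # True
```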

Reading image files

Image files can be read using libraries such as OpenCV or Pillow. They can be used to load and manipulate image data in a data science project.

from PIL import Image
# load image file
img = Image.open('image.jpg')
# display image
img.show()

In this example, the Image.open() function from the PIL library is used to open an image file and the show() function is used to display it.

Reading multiple files at once

Multiple files can be read at once using the glob module or os.listdir(). These can be used to find all files with a specific extension or in a specific directory, which can then be loaded one by one.

import glob
import pandas as pd
# get a list of all csv files in the directory
files = glob.glob('*.csv')
# load data from each file into a list of dataframes
df_list = [pd.read_csv(f) for f in files]
# concatenate the dataframes into a single dataframe
df = pd.concat(df_list)

In this example, the glob.glob() function is used to get a list of all CSV files in the current directory. A list comprehension is then used to load the data from each file into a list of pandas DataFrames. The pd.concat() function is then used to concatenate the dataframes into a single dataframe.
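One practical refinement: pd.concat() keeps each file's original row index, so duplicate index values are common in the combined frame. Passing ignore_index=True renumbers the rows, and adding a column per file records where each row came from. A self-contained sketch that writes two small CSVs into a temporary directory first (file names and contents are made up):

```python
import glob
import os
import tempfile
import pandas as pd

# create two small CSV files to combine
tmp = tempfile.mkdtemp()
for name, body in [('a.csv', "x\n1\n2\n"), ('b.csv', "x\n3\n")]:
    with open(os.path.join(tmp, name), 'w') as f:
        f.write(body)

# read every CSV, tagging each row with its source file
frames = []
for path in sorted(glob.glob(os.path.join(tmp, '*.csv'))):
    frame = pd.read_csv(path)
    frame['source'] = os.path.basename(path)
    frames.append(frame)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
df = pd.concat(frames, ignore_index=True)
print(df['x'].tolist())  # [1, 2, 3]
```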

Loading images in Python

Images can be loaded in Python using libraries such as OpenCV or Pillow. These libraries provide functions for reading and manipulating image data, including loading images from files, resizing images, and converting images to different file formats.

from PIL import Image
import numpy as np
# load image file
img = Image.open('image.jpg')
# convert image to numpy array
img_array = np.array(img)
# display image array
print(img_array)

In this example, the Image.open() function from the PIL library is used to open an image file, and the np.array() function is used to convert the image to a NumPy array. This array can then be used for image processing and manipulation in Python.

Conclusion

It is worth noting that CSV, Excel, and text files are most common in data analysis and visualization, JSON is widely used for transferring data between applications (including data scraped from the web), and SQL databases are used for structured storage and querying. ZIP and pickle files are used for compressing and serializing data, respectively.

It is also important to mention that, depending on the size of the data, it may be more efficient to load it in chunks, especially when working with large files. pandas provides several options to accomplish this, such as the chunksize parameter of pd.read_csv() and pd.read_sql().
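As a sketch of the chunked approach: passing chunksize to pd.read_csv() returns an iterator of DataFrames rather than one large frame, so each piece can be processed and discarded before the next is read (an in-memory buffer stands in for a large file here):

```python
import io
import pandas as pd

# io.StringIO stands in for a large file on disk
raw = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

# chunksize=4 yields DataFrames of up to 4 rows each
total = 0
for chunk in pd.read_csv(raw, chunksize=4):
    total += chunk['value'].sum()
print(total)  # 45
```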
