CSV and Panda

CSV files

CSV stands for “comma-spearated values”, which allows data to be saved in a tabular format. They are commonly used to store spreedsheet data. CSV files are essentially text files and can be opened the same way as text files. However, there are python modules dedicated to handling csv files. Here we mainly focus on the csv module.

One situation to use the csv module is when a data set is too larger to easily work with in Excel. Suppose there are millions of data stored in Excel, inorder to get the data in some lines, we have to scroll down many many time to find it. So, Excel is great fro small data sets, but if they are too large then Excel will be slow and unwieldy and it’s better to use python.

Basic csv file

The CSV file is essentially a file ending with “.csv”. Suppose we have a csv file called “clorox2020.csv“, which contains the data of a stock market of clorox in 2020. The data is in the format:

Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,153.380005,153.410004,151.550003,152.610001,149.262665,1292500
2020-01-03,152.570007,153.880005,151.240005,152.910004,149.556076,809800
2020-01-06,153.070007,153.940002,152.289993,153.369995,150.005981,869300
2020-01-07,153.190002,153.289993,151.509995,151.520004,148.196564,1198200
2020-01-08,151.779999,153.339996,150.949997,151.970001,148.636688,901200
2020-01-09,152.229996,153.009995,151.720001,152.800003,149.448486,504900
......

Notice that the data in the csv file is separated by comma (“,”). The first line, in this case, is a header with a list of the column names in the file. The header is highly recommended but not necessarily, it allows the file to be self-documenting. If there is a header, make sure the header list is delimited in the same way as the rest of the file.

CSV module

In order to read in the csv file to python, we can import and use the csv module. There is actually more than one way to read in a csv file. Here we will use two way: reader() and DictReader().

The reader() function will read in the whole csv file and save it into a iterable reader object. We can iterable it using a loop. The DictReader() function will read in the whole csv file but every row will be stored in a dictionary. The key is their column name, the value is the data of the column.

# reading the cvs file with csv module row by row
import csv

with open("clorox2020.csv", "r") as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        print(row)

Output:

['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
['2020-01-02', '153.380005', '153.410004', '151.550003', '152.610001', '149.262665', '1292500']
['2020-01-03', '152.570007', '153.880005', '151.240005', '152.910004', '149.556076', '809800']
['2020-01-06', '153.070007', '153.940002', '152.289993', '153.369995', '150.005981', '869300']
['2020-01-07', '153.190002', '153.289993', '151.509995', '151.520004', '148.196564', '1198200']
['2020-01-08', '151.779999', '153.339996', '150.949997', '151.970001', '148.636688', '901200']
['2020-01-09', '152.229996', '153.009995', '151.720001', '152.800003', '149.448486', '504900']
['2020-01-10', '152.690002', '154.080002', '152.320007', '153.669998', '150.299408', '523600']
['2020-01-13', '153.770004', '156.259995', '153.350006', '156.199997', '152.773926', '1064900']
......

Notice that in this case, the header will be also read in as a row of data.

# reading the csv file using csv module as a dictionary
import csv

with open("clorox2020.csv", "r") as f:
    csv_dict = csv.DictReader(f)
    for row in csv_dict:
        print(row)

Output:

{'Date': '2020-01-02', 'Open': '153.380005', 'High': '153.410004', 'Low': '151.550003', 'Close': '152.610001', 'Adj Close': '149.262665', 'Volume': '1292500'}
{'Date': '2020-01-03', 'Open': '152.570007', 'High': '153.880005', 'Low': '151.240005', 'Close': '152.910004', 'Adj Close': '149.556076', 'Volume': '809800'}
{'Date': '2020-01-06', 'Open': '153.070007', 'High': '153.940002', 'Low': '152.289993', 'Close': '153.369995', 'Adj Close': '150.005981', 'Volume': '869300'}
{'Date': '2020-01-07', 'Open': '153.190002', 'High': '153.289993', 'Low': '151.509995', 'Close': '151.520004', 'Adj Close': '148.196564', 'Volume': '1198200'}
{'Date': '2020-01-08', 'Open': '151.779999', 'High': '153.339996', 'Low': '150.949997', 'Close': '151.970001', 'Adj Close': '148.636688', 'Volume': '901200'}
......

Pandas module

The best way for dealing with csv data is through the pandas module. The pandas module is a powerful tool, which can do more than just read in the csv file. For example, we can use pandas to search through our data by row or by column. What’s more, pandas can be used to dealing some other files like html, json, xml, etc.

# use pandas to read in csv file
import pandas

data = pandas.read_csv("clorox2020.csv")
print(type(data))
print(data)

Output:

<class 'pandas.core.frame.DataFrame'>
           Date        Open        High  ...       Close   Adj Close   Volume
0    2020-01-02  153.380005  153.410004  ...  152.610001  149.262665  1292500
1    2020-01-03  152.570007  153.880005  ...  152.910004  149.556076   809800
2    2020-01-06  153.070007  153.940002  ...  153.369995  150.005981   869300
3    2020-01-07  153.190002  153.289993  ...  151.520004  148.196564  1198200
4    2020-01-08  151.779999  153.339996  ...  151.970001  148.636688   901200
..          ...         ...         ...  ...         ...         ...      ...
218  2020-11-11  199.270004  203.500000  ...  202.589996  202.589996  1810500
219  2020-11-12  204.619995  207.619995  ...  207.529999  207.529999  1821100
220  2020-11-13  207.899994  208.889999  ...  207.380005  207.380005  1545400
221  2020-11-16  203.000000  207.880005  ...  207.669998  207.669998  1587900
222  2020-11-17  207.000000  209.490005  ...  209.100006  209.100006  1433200

[223 rows x 7 columns]

Notice that the data that read in was stored in the pandas object called “DataFrame”. And when printing the data object, not every row and column is printed, and the data was printed in a human-friendly way.

Instead of showing the whole file, we can make it only show the first n rows using the function head().

import pandas

# show the first 10 rows
data = pandas.read_csv("clorox2020.csv")
print(data.head(10))

Output:

         Date        Open        High  ...       Close   Adj Close   Volume
0  2020-01-02  153.380005  153.410004  ...  152.610001  149.262665  1292500
1  2020-01-03  152.570007  153.880005  ...  152.910004  149.556076   809800
2  2020-01-06  153.070007  153.940002  ...  153.369995  150.005981   869300
3  2020-01-07  153.190002  153.289993  ...  151.520004  148.196564  1198200
4  2020-01-08  151.779999  153.339996  ...  151.970001  148.636688   901200
5  2020-01-09  152.229996  153.009995  ...  152.800003  149.448486   504900
6  2020-01-10  152.690002  154.080002  ...  153.669998  150.299408   523600
7  2020-01-13  153.770004  156.259995  ...  156.199997  152.773926  1064900
8  2020-01-14  155.729996  156.320007  ...  156.250000  152.822815  1166600
9  2020-01-15  156.250000  157.539993  ...  156.669998  153.233612   619200

[10 rows x 7 columns]

The DataFrame object is a dictionary-like object, in which the key is the column name. The data stored under every key is in a list-like object called “Series”. We can access some specific data using keys and indices.

# get the "Close" value of "Date" 2020-01-15 (line 9)
import pandas

data = pandas.read_csv("clorox2020.csv")
print(data["Close"][9])

Output:

156.669998

Leave a Reply

Your email address will not be published. Required fields are marked *