CSV files
CSV stands for “comma-spearated values”, which allows data to be saved in a tabular format. They are commonly used to store spreedsheet data. CSV files are essentially text files and can be opened the same way as text files. However, there are python modules dedicated to handling csv files. Here we mainly focus on the csv module.
One situation to use the csv module is when a data set is too larger to easily work with in Excel. Suppose there are millions of data stored in Excel, inorder to get the data in some lines, we have to scroll down many many time to find it. So, Excel is great fro small data sets, but if they are too large then Excel will be slow and unwieldy and it’s better to use python.
Basic csv file
The CSV file is essentially a file ending with “.csv”. Suppose we have a csv file called “clorox2020.csv“, which contains the data of a stock market of clorox in 2020. The data is in the format:
Date,Open,High,Low,Close,Adj Close,Volume 2020-01-02,153.380005,153.410004,151.550003,152.610001,149.262665,1292500 2020-01-03,152.570007,153.880005,151.240005,152.910004,149.556076,809800 2020-01-06,153.070007,153.940002,152.289993,153.369995,150.005981,869300 2020-01-07,153.190002,153.289993,151.509995,151.520004,148.196564,1198200 2020-01-08,151.779999,153.339996,150.949997,151.970001,148.636688,901200 2020-01-09,152.229996,153.009995,151.720001,152.800003,149.448486,504900 ......
Notice that the data in the csv file is separated by comma (“,”). The first line, in this case, is a header with a list of the column names in the file. The header is highly recommended but not necessarily, it allows the file to be self-documenting. If there is a header, make sure the header list is delimited in the same way as the rest of the file.
CSV module
In order to read in the csv file to python, we can import and use the csv module. There is actually more than one way to read in a csv file. Here we will use two way: reader() and DictReader().
The reader() function will read in the whole csv file and save it into a iterable reader object. We can iterable it using a loop. The DictReader() function will read in the whole csv file but every row will be stored in a dictionary. The key is their column name, the value is the data of the column.
# reading the cvs file with csv module row by row
import csv
with open("clorox2020.csv", "r") as f:
csv_reader = csv.reader(f)
for row in csv_reader:
print(row)Output:
['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'] ['2020-01-02', '153.380005', '153.410004', '151.550003', '152.610001', '149.262665', '1292500'] ['2020-01-03', '152.570007', '153.880005', '151.240005', '152.910004', '149.556076', '809800'] ['2020-01-06', '153.070007', '153.940002', '152.289993', '153.369995', '150.005981', '869300'] ['2020-01-07', '153.190002', '153.289993', '151.509995', '151.520004', '148.196564', '1198200'] ['2020-01-08', '151.779999', '153.339996', '150.949997', '151.970001', '148.636688', '901200'] ['2020-01-09', '152.229996', '153.009995', '151.720001', '152.800003', '149.448486', '504900'] ['2020-01-10', '152.690002', '154.080002', '152.320007', '153.669998', '150.299408', '523600'] ['2020-01-13', '153.770004', '156.259995', '153.350006', '156.199997', '152.773926', '1064900'] ......
Notice that in this case, the header will be also read in as a row of data.
# reading the csv file using csv module as a dictionary
import csv
with open("clorox2020.csv", "r") as f:
csv_dict = csv.DictReader(f)
for row in csv_dict:
print(row)Output:
{'Date': '2020-01-02', 'Open': '153.380005', 'High': '153.410004', 'Low': '151.550003', 'Close': '152.610001', 'Adj Close': '149.262665', 'Volume': '1292500'}
{'Date': '2020-01-03', 'Open': '152.570007', 'High': '153.880005', 'Low': '151.240005', 'Close': '152.910004', 'Adj Close': '149.556076', 'Volume': '809800'}
{'Date': '2020-01-06', 'Open': '153.070007', 'High': '153.940002', 'Low': '152.289993', 'Close': '153.369995', 'Adj Close': '150.005981', 'Volume': '869300'}
{'Date': '2020-01-07', 'Open': '153.190002', 'High': '153.289993', 'Low': '151.509995', 'Close': '151.520004', 'Adj Close': '148.196564', 'Volume': '1198200'}
{'Date': '2020-01-08', 'Open': '151.779999', 'High': '153.339996', 'Low': '150.949997', 'Close': '151.970001', 'Adj Close': '148.636688', 'Volume': '901200'}
......
Pandas module
The best way for dealing with csv data is through the pandas module. The pandas module is a powerful tool, which can do more than just read in the csv file. For example, we can use pandas to search through our data by row or by column. What’s more, pandas can be used to dealing some other files like html, json, xml, etc.
# use pandas to read in csv file
import pandas
data = pandas.read_csv("clorox2020.csv")
print(type(data))
print(data)Output:
<class 'pandas.core.frame.DataFrame'>
Date Open High ... Close Adj Close Volume
0 2020-01-02 153.380005 153.410004 ... 152.610001 149.262665 1292500
1 2020-01-03 152.570007 153.880005 ... 152.910004 149.556076 809800
2 2020-01-06 153.070007 153.940002 ... 153.369995 150.005981 869300
3 2020-01-07 153.190002 153.289993 ... 151.520004 148.196564 1198200
4 2020-01-08 151.779999 153.339996 ... 151.970001 148.636688 901200
.. ... ... ... ... ... ... ...
218 2020-11-11 199.270004 203.500000 ... 202.589996 202.589996 1810500
219 2020-11-12 204.619995 207.619995 ... 207.529999 207.529999 1821100
220 2020-11-13 207.899994 208.889999 ... 207.380005 207.380005 1545400
221 2020-11-16 203.000000 207.880005 ... 207.669998 207.669998 1587900
222 2020-11-17 207.000000 209.490005 ... 209.100006 209.100006 1433200
[223 rows x 7 columns]Notice that the data that read in was stored in the pandas object called “DataFrame”. And when printing the data object, not every row and column is printed, and the data was printed in a human-friendly way.
Instead of showing the whole file, we can make it only show the first n rows using the function head().
import pandas
# show the first 10 rows
data = pandas.read_csv("clorox2020.csv")
print(data.head(10))Output:
Date Open High ... Close Adj Close Volume
0 2020-01-02 153.380005 153.410004 ... 152.610001 149.262665 1292500
1 2020-01-03 152.570007 153.880005 ... 152.910004 149.556076 809800
2 2020-01-06 153.070007 153.940002 ... 153.369995 150.005981 869300
3 2020-01-07 153.190002 153.289993 ... 151.520004 148.196564 1198200
4 2020-01-08 151.779999 153.339996 ... 151.970001 148.636688 901200
5 2020-01-09 152.229996 153.009995 ... 152.800003 149.448486 504900
6 2020-01-10 152.690002 154.080002 ... 153.669998 150.299408 523600
7 2020-01-13 153.770004 156.259995 ... 156.199997 152.773926 1064900
8 2020-01-14 155.729996 156.320007 ... 156.250000 152.822815 1166600
9 2020-01-15 156.250000 157.539993 ... 156.669998 153.233612 619200
[10 rows x 7 columns]The DataFrame object is a dictionary-like object, in which the key is the column name. The data stored under every key is in a list-like object called “Series”. We can access some specific data using keys and indices.
# get the "Close" value of "Date" 2020-01-15 (line 9)
import pandas
data = pandas.read_csv("clorox2020.csv")
print(data["Close"][9])Output:
156.669998
