Comma-Separated Values (CSV, sometimes called character-separated values, since the separator can be a character other than a comma) files store tabular data (numbers and text) in plain text. Plain text means the file is a sequence of characters and contains no data that must be interpreted as binary. A CSV file consists of any number of records separated by some kind of line break; each record consists of fields separated by some other character or string, most commonly a comma or a tab. Usually, all records have exactly the same sequence of fields.
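As a minimal illustration of records and fields, the snippet below parses a tiny made-up CSV string with Python's standard csv module; each line break separates records, and each comma separates fields:

```python
import csv
import io

# A made-up CSV text: one header record plus two data records;
# fields are separated by commas, records by line breaks
raw = "name,age\nalice,30\nbob,25\n"

rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['name', 'age'], ['alice', '30'], ['bob', '25']]
```

Note that every parsed value comes back as a string, which is why numeric columns need an explicit conversion, as discussed below.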
A few practical points about CSV data:

- Values read from a CSV file are strings; numeric fields must be converted to numbers explicitly.
- Data is read row by row.
- Columns are separated by a half-width comma or a tab, most commonly a comma.
- Lines generally do not begin with spaces; the first line holds the column names, there are no spaces around the separators, and there are no blank lines between rows.
- The last point is important: if the data set contains a blank line, or a trailing space at the end of a row, reading the data will usually fail with a "list index out of range" error.
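The sketch below shows how a stray blank line produces exactly that error: csv.reader turns the blank line into an empty row, and indexing into it raises IndexError.

```python
import csv
import io

# A CSV text with a stray blank line between records
raw = "1,2\n\n3,4\n"
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['1', '2'], [], ['3', '4']]

# Indexing into the empty row raises the familiar error
try:
    first = [row[0] for row in rows]
except IndexError as e:
    print(e)  # list index out of range

# Skipping empty rows avoids the problem
first = [row[0] for row in rows if row]
print(first)  # ['1', '3']
```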
1. Writing and reading CSV files using Python I/O
Writing csv files using Python I/O
Here is the code that downloads the "birthweight.dat" low-birth-weight data file from the author's source, processes it, and saves it to a csv file.
```python
import csv
import os
import numpy as np
import requests

# Name of the data file
birth_weight_file = 'birth_weight.csv'

# If birth_weight.csv does not exist in the current folder,
# download the .dat file and generate the csv file
if not os.path.exists(birth_weight_file):
    birthdata_url = 'https://github.com/nfmcclure/tensorflow_cookbook/raw/master/01_Introduction/07_Working_with_Data_Sources/birthweight_data/birthweight.dat'
    birth_file = requests.get(birthdata_url)
    # Windows line breaks are '\r\n', so each line ends with '\r\n';
    # split the downloaded text into lines
    birth_data = birth_file.text.split('\r\n')
    # The first line holds the column headers, separated by tabs
    birth_header = birth_data[0].split('\t')
    # Convert every remaining non-empty line into a list of floats
    birth_data = [[float(x) for x in y.split('\t') if len(x) >= 1]
                  for y in birth_data[1:] if len(y) >= 1]
    # birth_data is a plain list and has no .shape attribute,
    # but np.array converts it to a numpy array whose shape we can inspect
    print(np.array(birth_data).shape)  # (189, 9)
    # newline='' prevents csv.writer from producing blank lines (see below)
    with open(birth_weight_file, "w", newline='') as f:
        writer = csv.writer(f)
        writer.writerow(birth_header)
        writer.writerows(birth_data)
```
A common error: list index out of range
The key point here is with open(birth_weight_file, "w", newline='') as f:. If you open the csv file without the newline='' parameter, i.e. with the statement with open(birth_weight_file, "w") as f:, then the generated table will contain blank lines.
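A small sketch of the difference (the file names are illustrative). On Windows, the text layer translates the '\r\n' record terminators that csv.writer emits a second time, which is what produces the blank lines; newline='' passes them through unchanged.

```python
import csv

rows = [['a', 'b'], ['1', '2']]

# Without newline='', the '\r\n' terminators written by csv.writer are
# translated again by the text layer on Windows, yielding blank lines
with open('no_newline.csv', 'w') as f:
    csv.writer(f).writerows(rows)

# With newline='', the terminators are written through unchanged
with open('with_newline.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```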
Whether you read and write csv data with Python I/O, with any of the other methods below, or download a ready-made csv data set from the Internet, check that there are no trailing spaces at the end of lines and no extra blank lines. This avoids unnecessary errors that can distort the data analysis.
Reading csv files using Python I/O
The Python I/O approach to reading is to create an empty list and append the rows to it in order (similar to a two-dimensional array in C); if needed, np.array(list_name) converts it to a numpy array.
```python
birth_data = []
with open(birth_weight_file) as csvfile:
    csv_reader = csv.reader(csvfile)  # Build a reader over the csv file
    birth_header = next(csv_reader)   # The first row holds the column titles
    for row in csv_reader:            # Collect the remaining rows
        birth_data.append(row)

# csv.reader yields strings, so convert each value to float
birth_data = [[float(x) for x in row] for row in birth_data]
birth_data = np.array(birth_data)     # Convert the lists to numpy arrays
birth_header = np.array(birth_header) # to make the data structure easy to view
print(birth_data.shape)    # (189, 9)
print(birth_header.shape)  # (9,)
```
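As an alternative sketch, csv.DictReader folds the header handling into the reader itself, returning one dict per row keyed by column name (the sample text below stands in for the real file; the column names follow the dataset above):

```python
import csv
import io

# A small sample standing in for the csv file; DictReader automatically
# uses the first row as the field names
sample = "AGE,LWT\n19,91\n33,155\n"
records = [{k: float(v) for k, v in row.items()}
           for row in csv.DictReader(io.StringIO(sample))]
print(records)  # [{'AGE': 19.0, 'LWT': 91.0}, {'AGE': 33.0, 'LWT': 155.0}]
```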
2. Reading CSV files using Pandas
```python
import pandas as pd

csv_data = pd.read_csv('birth_weight.csv')  # Read the data
print(csv_data.shape)  # (189, 9)

N = 5
csv_batch_data = csv_data.tail(N)  # Take the last 5 rows
print(csv_batch_data.shape)  # (5, 9)

# Take columns 3 to 5 of these rows by position (indexes start from 0)
train_batch_data = csv_batch_data.iloc[:, 3:6]
print(train_batch_data)
#      RACE  SMOKE  PTL
# 184   0.0    0.0  0.0
# 185   0.0    0.0  1.0
# 186   0.0    1.0  0.0
# 187   0.0    0.0  0.0
# 188   0.0    0.0  1.0
```
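Selecting by position with iloc works, but pandas also lets you pick columns by name, which is often more readable. A sketch, using a small made-up frame in place of the file (the column names follow the dataset above):

```python
import pandas as pd

# A small made-up frame standing in for the csv file
df = pd.DataFrame({'LOW': [0.0, 1.0], 'AGE': [19.0, 33.0],
                   'RACE': [0.0, 1.0], 'SMOKE': [0.0, 1.0]})

# Select columns by name instead of position
subset = df[['RACE', 'SMOKE']]
print(subset.shape)  # (2, 2)

# .to_numpy() converts the selection to a numpy array
print(subset.to_numpy())
```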
3. Reading CSV files using TensorFlow
```python
# Read csv data using TensorFlow (1.x queue-based input pipeline)
import tensorflow as tf

filename = 'birth_weight.csv'
# Set up a filename queue so that files can be read from folders in batches
file_queue = tf.train.string_input_producer([filename])
# TensorFlow's text line reader, set to skip the header row
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(file_queue)
# Default value (and hence data type) for each column
defaults = [[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]]
# Decode the line we read into the default format we set
LOW, AGE, LWT, RACE, SMOKE, PTL, HT, UI, BWT = tf.decode_csv(value, defaults)
# The middle 7 columns are the training features
vector_example = tf.stack([AGE, LWT, RACE, SMOKE, PTL, HT, UI])
# The BWT value is the training label
vector_label = tf.stack([BWT])
# Add a batch_size dimension and read the data out in batches; batch size,
# whether to read repeatedly, capacity, end-of-queue size, reading threads,
# etc. are all configurable
example_batch, label_batch = tf.train.shuffle_batch(
    [vector_example, vector_label],
    batch_size=10, capacity=100, min_after_dequeue=10)

with tf.Session() as sess:
    coord = tf.train.Coordinator()  # Thread manager
    threads = tf.train.start_queue_runners(coord=coord)
    print(sess.run(tf.shape(example_batch)))  # [10  7]
    print(sess.run(tf.shape(label_batch)))    # [10  1]
    print(sess.run(example_batch)[3])  # [ 19.  91.   0.   1.   1.   0.   1.]
    coord.request_stop()
    coord.join(threads)

'''
Starting and stopping the thread manager is required for every
queue-based I/O operation in TensorFlow:

with tf.Session() as sess:
    coord = tf.train.Coordinator()  # Thread manager
    threads = tf.train.start_queue_runners(coord=coord)
    # Your code here~
    coord.request_stop()
    coord.join(threads)
'''
```