Files

Source: this section is based on both [ThinkCS] and [PythonForBeginners].

About files

While a program is running, its data is stored in random access memory (RAM). RAM is fast and inexpensive, but it is also volatile, which means that when the program ends, or the computer shuts down, data in RAM disappears. To make data available the next time the computer is turned on and the program is started, it has to be written to a non-volatile storage medium, such as a hard drive, USB drive, or CD-RW.

Data on non-volatile storage media is stored in named locations on the media called files. By reading and writing files, programs can save information between program runs.

Working with files is a lot like working with a notebook. To use a notebook, it has to be opened. When done, it has to be closed. While the notebook is open, it can either be read from or written to. In either case, the notebook holder knows where they are. They can read the whole notebook in its natural order or they can skip around.

All of this applies to files as well. To open a file, we specify its name and indicate whether we want to read or write.

Writing our first file

Let's begin with a simple program that writes four lines of text into a file:

file = open("testfile.txt","w")

file.write("Hello World\n")
file.write("This is our new text file\n")
file.write("and this is another line.\n")
file.write("Why? Because we can.\n")

file.close()

Opening a file creates what we call a file handle. In this example, the variable file refers to the new handle object. Our program calls methods on the handle, and this makes changes to the actual file which is usually located on our disk.

On line 1, the open function takes two arguments. The first is the name of the file, and the second is the mode. Mode "w" means that we are opening the file for writing.

With mode "w", if there is no file named testfile.txt on the disk, it will be created. If there already is one, it will be replaced by the file we are writing.

To put data in the file we invoke the write method on the handle, shown in lines 3, 4, 5 and 6 above (counting the blank lines). In bigger programs, lines 3--6 will usually be replaced by a loop that writes many more lines into the file.

Closing the file handle (line 8) tells the system that we are done writing and makes the disk file available for reading by other programs (or by our own program).
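
As a rough sketch of such a loop, the following program writes ten numbered lines instead of four fixed ones. The file name squares.txt and the text of each line are our own choices for illustration. Note that write expects a string and does not add a newline by itself, so we build each line explicitly.

```python
# Write ten lines with a loop instead of ten separate write calls.
# "squares.txt" is a hypothetical file name chosen for this sketch.
file = open("squares.txt", "w")

for n in range(1, 11):
    # write takes a single string, so we convert the numbers
    # and end every line with a newline ourselves
    file.write(str(n) + " squared is " + str(n * n) + "\n")

file.close()
```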

A handle is somewhat like a TV remote control

We're all familiar with a remote control for a TV. We perform operations on the remote control --- switch channels, change the volume, etc. But the real action happens on the TV. So, by simple analogy, we'd call the remote control our handle to the underlying TV.

Sometimes we want to emphasize the difference --- the file handle is not the same as the file, and the remote control is not the same as the TV. But at other times we prefer to treat them as a single mental chunk, or abstraction, and we'll just say "close the file", or "flip the TV channel".

Reading a Text File in Python

There are several ways to read a text file in Python. If you just need to extract a string that contains all characters in the file, you can use the following method:

file.read()

For example, the following Python code would print out the file we have just created on the console.

file = open("testfile.txt", "r")
print(file.read())
file.close()

The output of this command will display all the text inside the file, the same text we told the interpreter to add earlier:

Hello World
This is our new text file
and this is another line.
Why? Because we can.

Another way to read a file is to read a certain number of characters. For example, with the following code the Python interpreter will read the first five characters of text from the file and return it as a string:

file = open("testfile.txt", "r")
print(file.read(5))
file.close()

Notice how we’re using the same file.read() method, only this time we specify the number of characters to process. This time the text displayed will be:

Hello
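
Like the reader of a notebook, the file handle remembers where it is: each call to read continues where the previous one stopped. The sketch below illustrates this under our own assumptions. It first creates a one-line file with the hypothetical name position_demo.txt, and uses repr so that spaces and newlines are visible in the output.

```python
# Create a one-line sample file so this sketch is self-contained.
# "position_demo.txt" is a hypothetical file name.
file = open("position_demo.txt", "w")
file.write("Hello World\n")
file.close()

file = open("position_demo.txt", "r")
first = file.read(5)    # the first five characters
second = file.read(6)   # the next six, starting where we stopped
rest = file.read()      # whatever is left (here, only the newline)
at_end = file.read()    # at the end of the file, read returns ""
file.close()

# repr shows quotes and escape sequences, so nothing is invisible
print(repr(first))
print(repr(second))
print(repr(rest))
print(repr(at_end))
```

An empty string from read is therefore a convenient sign that the end of the file has been reached.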

Finally, if you want to read the file line by line, as opposed to pulling the entire content of the file into a string at once, you can use the readline() method. Why would you want to do this? Let’s say you only want to see the first line of the file, or the third. You would call readline() as many times as needed to reach the data you are looking for. Each call returns a string that contains the next line of the file. For example:

file = open("testfile.txt", "r")
print(file.readline())
print(file.readline())
file.close()

This command would print the first two lines of the file, like so:

Hello World

This is our new text file

Note that an empty line is printed between these two lines. This is because, by default, the print() command always prints a newline after every string. The string that we are printing here, however, ends with a newline itself: this newline was read from the input file, and was not removed by Python.

The additional newline can be avoided as follows. We can tell print to end its output with something other than a newline character, for example the empty string '':

file = open("testfile.txt", "r")
print(file.readline(), end="")
print(file.readline(), end="")
file.close()

Now we get the same result but without empty lines in between:

Hello World
This is our new text file
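
Another option, sketched here as an alternative rather than as the approach this section relies on, is to remove the newline from the string itself with the string method rstrip before printing. The file name strip_demo.txt is hypothetical, and the file is created first so that the example is self-contained.

```python
# Create a two-line sample file (hypothetical name) for this sketch.
file = open("strip_demo.txt", "w")
file.write("Hello World\n")
file.write("This is our new text file\n")
file.close()

file = open("strip_demo.txt", "r")
# rstrip("\n") returns a copy of the string without trailing newlines
first_line = file.readline().rstrip("\n")
second_line = file.readline().rstrip("\n")
file.close()

# print adds its usual newline, so no blank lines appear in between
print(first_line)
print(second_line)
```

This is useful when the line is needed for something other than printing, for example for comparing it with another string.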

Related to the readline() method is the readlines() method, which reads all remaining lines at once:

file = open("testfile.txt", "r")
print(file.readlines())
file.close()

The output you would get is a list containing each line as a separate element:

['Hello World\n', 'This is our new text file\n', 'and this is another line.\n', 'Why? Because we can.\n']

Notice how every line ends with \n, the newline character.

If we now wish to retrieve, for example, the third line of the file, we can use the following code (we use the index 2 instead of 3 since the first element of a list is at position 0):

file = open("testfile.txt", "r")
print(file.readlines()[2])
file.close()

which prints:

and this is another line.

Looping over a file object

Using the readlines() method, we can write code as follows:

file = open("testfile.txt", "r")
for line in file.readlines():
    print(line, end='')
file.close()

While correct, this code is not very memory efficient: it reads the entire file into a list and then traverses that list. To read all the lines of a file in a faster, more memory-efficient way, Python lets us loop over the file object directly, which is both simple and easy to read:

file = open("testfile.txt", "r")
for line in file:
    print(line, end='')
file.close()

In this case, Python will avoid loading the entire file into memory. Note how we again gave print a second argument to avoid undesired newlines. The code above will print:

Hello World
This is our new text file
and this is another line.
Why? Because we can.
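
Because the loop sees one line at a time, it is also a natural place to compute something about the file without storing its lines. The following sketch counts lines and characters. The file name count_demo.txt and the variable names are our own, and the file is created first so that the example can be run on its own.

```python
# Create a small sample file (hypothetical name) for this sketch.
file = open("count_demo.txt", "w")
file.write("Hello World\n")
file.write("This is our new text file\n")
file.write("and this is another line.\n")
file.close()

line_count = 0
char_count = 0

file = open("count_demo.txt", "r")
for line in file:
    line_count += 1
    # do not count the newline at the end of each line
    char_count += len(line.rstrip("\n"))
file.close()

print(line_count, "lines,", char_count, "characters")
```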

Using the write method to append

One thing you’ll notice about the file write method is that it only requires a single parameter, which is the string you want to be written. This method can also be used to add information or content to an existing file. You just need to make sure to open the file in append mode "a" to make sure you append, instead of overwriting the existing file.

file = open("testfile.txt", "a")
file.write("This is a test\n")
file.write("To add more lines.\n")
file.close()

This will amend our current file to include the two new lines of text. If you don't believe it, open the changed file in your text editor, or write a Python code fragment to print its current contents.
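
To see the difference between the two modes side by side, consider the following sketch. It uses a separate, hypothetical file named modes_demo.txt so that testfile.txt is left alone.

```python
# "modes_demo.txt" is a hypothetical file name used only in this sketch.
file = open("modes_demo.txt", "w")
file.write("first version\n")
file.close()

# Opening with "w" again discards the old contents.
file = open("modes_demo.txt", "w")
file.write("second version\n")
file.close()

# Opening with "a" keeps the old contents and adds to the end.
file = open("modes_demo.txt", "a")
file.write("an appended line\n")
file.close()

file = open("modes_demo.txt", "r")
contents = file.read()
file.close()
print(contents, end="")
```

The line "first version" is gone, while "second version" survives the append.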

Closing a File

When you’re done reading or writing a file, it is good practice to call the close() method. By calling this method, you tell the operating system that your program has finished working on the file, and that the file can now be read or written by other programs on your computer. For instance, as long as your program is reading a file, your operating system may decide not to allow other programs to change the file.

While in principle you could keep a file open for the entire execution of the program, it is a matter of good manners towards other programs to close your files when you no longer need access to them. For this reason, we always close our files in our examples.

It’s important to understand that when you use the close() method, any further attempts to use the file object will fail.
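
The following sketch shows what such a failure looks like. It assumes some familiarity with try and except, which this section does not otherwise cover, and uses the hypothetical file name closed_demo.txt. Writing to a closed file raises a ValueError, and the closed attribute of the file object tells us whether close() has been called.

```python
# "closed_demo.txt" is a hypothetical file name for this sketch.
file = open("closed_demo.txt", "w")
file.write("Hello World\n")
file.close()

print(file.closed)   # the closed attribute is True after close()

try:
    file.write("one more line\n")
    failed = False
except ValueError as error:
    # Python refuses to operate on a closed file
    failed = True
    print("Could not write:", error)
```

To continue working with the file, we have to open it again.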

Writing multiple lines at once

You can also use the writelines method to write (or append) multiple lines to a file at once:

file = open("testfile.txt", "a")
lines_of_text = ["One line of text here\n", "and another line here\n", "and yet another here\n", "and so on and so forth\n"]
file.writelines(lines_of_text)
file.close()
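
Despite its name, writelines does not add newline characters: it writes the strings exactly as given, which is why every string in the example above ends with \n. The following sketch, using the hypothetical file name writelines_demo.txt, shows what happens with and without them.

```python
words = ["one", "two", "three"]

# "writelines_demo.txt" is a hypothetical file name for this sketch.
file = open("writelines_demo.txt", "w")
file.writelines(words)   # no newlines are added: writes "onetwothree"
file.write("\n")

# Build a second list in which every string ends with a newline.
lines_of_text = []
for word in words:
    lines_of_text.append(word + "\n")
file.writelines(lines_of_text)
file.close()

file = open("writelines_demo.txt", "r")
result = file.readlines()
file.close()
print(result)
```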

Splitting lines in a text file

Methods on strings are very useful when processing files. As a final example, let’s explore how to split a file into the words it contains. Using the split method of strings discussed earlier, we can write:

file = open("testfile.txt", "r")
data = file.readlines()
file.close()
for line in data:
    words = line.split()
    print(words)

The output for this will be something like (depending on what your testfile currently contains):

['One', 'line', 'of', 'text', 'here']
['and', 'another', 'line', 'here']
['and', 'yet', 'another', 'here']
['and', 'so', 'on', 'and', 'so', 'forth']

The words appear in this form because split returns them as a list. Called without arguments, split separates the line at whitespace, so the newline at the end of each line is discarded as well.
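
Once each line is a list of words, ordinary list operations apply. As a small sketch, the following program counts the words in a file. The file name words_demo.txt is hypothetical, and the file is created first so that the result does not depend on what testfile.txt currently contains.

```python
# Create a two-line sample file (hypothetical name) for this sketch.
file = open("words_demo.txt", "w")
file.write("One line of text here\n")
file.write("and another line here\n")
file.close()

word_count = 0
file = open("words_demo.txt", "r")
for line in file:
    # split() returns the list of words; len gives how many there are
    word_count += len(line.split())
file.close()

print(word_count)
```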

Working with binary files

Files that hold photographs, videos, zip files, executable programs, etc. are called binary files: they're not organized into lines, and cannot be opened with a normal text editor. Python works just as easily with binary files, but when we read from the file we're going to get bytes back rather than a string. Here we'll copy one binary file to another:

f = open("somefile.zip", "rb")
g = open("thecopy.zip", "wb")

while True:
    buf = f.read(1024)
    if len(buf) == 0:
        break
    g.write(buf)

f.close()
g.close()

There are a few new things here. In lines 1 and 2 we added a "b" to the mode to tell Python that the files are binary rather than text files. In line 5, we see read can take an argument which tells it how many bytes to attempt to read from the file. Here we chose to read and write up to 1024 bytes on each iteration of the loop. When we get back an empty buffer from our attempt to read, we know we can break out of the loop and close both the files.

If we set a breakpoint at line 6, (or print type(buf) there) we'll see that the type of buf is bytes. We don't do any detailed work with bytes objects in this textbook.
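
For readers without a debugger at hand, here is a small sketch that makes the same observation. It writes six arbitrary byte values (our own choice) to a hypothetical file named bytes_demo.bin and reads them back in two steps.

```python
# Six arbitrary byte values, chosen only for illustration.
data = bytes([80, 75, 3, 4, 0, 255])

# "bytes_demo.bin" is a hypothetical file name for this sketch.
g = open("bytes_demo.bin", "wb")
g.write(data)
g.close()

f = open("bytes_demo.bin", "rb")
buf = f.read(4)    # attempt to read four bytes
rest = f.read()    # read whatever remains
f.close()

print(type(buf))             # a bytes object, not a string
print(len(buf), len(rest))
print(buf[0])                # indexing bytes gives an integer
```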

Directories

Files on non-volatile storage media are organized by a set of rules known as a file system. File systems are made up of files and directories, which are containers for both files and other directories.

When we create a new file by opening it and writing, the new file goes in the current directory (wherever we were when we ran the program). Similarly, when we open a file for reading, Python looks for it in the current directory.
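
If it is unclear which directory that is, Python can tell us. The sketch below uses the standard os module, which this section does not otherwise discuss, so treat it as an aside: os.getcwd() returns the current directory, and os.path.abspath shows the full name Python would use for a file name given without a directory.

```python
import os

# The directory in which file names without a path are looked up.
current = os.getcwd()

# The full name that "testfile.txt" stands for right now.
# abspath only computes the name; the file does not have to exist.
full_name = os.path.abspath("testfile.txt")

print(current)
print(full_name)
```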

If we want to open a file somewhere else, we have to specify the path to the file, which names the directories (or folders) leading to the file, followed by the name of the file itself:

>>> wordsfile = open("/usr/share/dict/words", "r")
>>> wordlist = wordsfile.readlines()
>>> print(wordlist[:6])
['\n', 'A\n', "A's\n", 'AOL\n', "AOL's\n", 'Aachen\n']

This (Unix) example opens a file named words that resides in a directory named dict, which resides in share, which resides in usr, which resides in the top-level directory of the system, called /. It then reads in each line into a list using readlines, and prints out the first 6 elements from that list.

A Windows path might be "c:/temp/words.txt" or "c:\\temp\\words.txt". Because backslashes are used to escape things like newlines and tabs, we need to write two backslashes in a literal string to get one! So the length of these two strings is the same!
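
A quick check in Python confirms this. The sketch below only builds the strings and does not open any file, so it runs on any system. The last string is a deliberate mistake added for illustration: with a single backslash, \t is read as a tab character.

```python
forward = "c:/temp/words.txt"
escaped = "c:\\temp\\words.txt"

# Each pair of backslashes in the literal is one character in the string.
print(len(forward), len(escaped))
print(escaped)

# A deliberate mistake: "\t" here is a single tab character,
# so this string has six characters instead of seven.
mistaken = "c:\temp"
print(len(mistaken))
```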

We cannot use / or \ as part of a filename; they are reserved as a delimiter between directory and filenames.

The file /usr/share/dict/words should exist on Unix-based systems, and contains a list of words in alphabetical order.

Glossary

delimiter

A sequence of one or more characters used to specify the boundary between separate parts of text.

directory

A named collection of files, also called a folder. Directories can contain files and other directories, which are referred to as subdirectories of the directory that contains them.

file

A named entity, usually stored on a hard drive, floppy disk, or CD-ROM, that contains a stream of characters.

file system

A method for naming, accessing, and organizing files and the data they contain.

handle

An object in our program that is connected to an underlying resource (e.g. a file). The file handle lets our program manipulate/read/write/close the actual file that is on our disk.

mode

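A distinct method of operation within a computer program. Files in Python are opened in one of three basic modes: read ("r"), write ("w"), or append ("a"). Adding "+" to a mode (as in "r+") opens the file for both reading and writing, and adding "b" (as in "rb") treats the file as binary rather than text.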

non-volatile memory

Memory that can maintain its state without power. Hard drives, flash drives, and rewritable compact disks (CD-RW) are each examples of non-volatile memory.

path

A sequence of directory names that specifies the exact location of a file.

text file

A file that contains printable characters organized into lines separated by newline characters.

socket

One end of a connection allowing one to read and write information to or from another computer.

volatile memory

Memory which requires an electrical current to maintain state. The main memory or RAM of a computer is volatile. Information stored in RAM is lost when the computer is turned off.

References

[ThinkCS]

How To Think Like a Computer Scientist --- Learning with Python 3

[PythonForBeginners]

https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python