Read Two Lines at a Time Python
How to extract specific portions of a text file using Python
Updated: 06/30/2020 by Computer Hope
Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.
Make sure you're using Python 3
In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.
For Microsoft Windows, Python 3 can be downloaded from the official Python website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked.
On Linux, you can install Python 3 with your package manager. For example, on Debian or Ubuntu, you can install it with the following command:
sudo apt-get update && sudo apt-get install python3
For macOS, the Python 3 installer can be downloaded from python.org. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:
brew install python3
Running Python
On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.
Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().
Running Python with a file name will interpret that Python program. For instance:
python3 program.py
...runs the program contained in the file program.py.
Okay, how can we use Python to extract text from a text file?
Reading data from a text file
First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Note
In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.
A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.
myfile = open("lorem.txt", "rt")  # open lorem.txt for reading text
contents = myfile.read()          # read the entire file to string
myfile.close()                    # close the file
print(contents)                   # print string contents
Here, myfile is the name we give to our file object.
The "rt" parameter in the open() function means "we're opening this file to read text data."
The hash mark ("#") means that everything after it on that line is a comment, and it's ignored by the Python interpreter.
If you save this program in a file called read.py, you can run it with the following command.
python3 read.py
The command above outputs the contents of lorem.txt:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Using "with open"
It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.
When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.
Using with open...as, we can rewrite our program to look like this:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string
Note
Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.
Example
Save the program as read.py and execute it:
python3 read.py
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
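To see that the with block really does close the file, you can inspect the file object's closed attribute. The sketch below is only an illustration, not part of the program above: it writes a small sample file to a temporary directory so it can run on its own, and the file name sample.txt and its contents are assumptions made for the demo.

```python
import os
import tempfile

# Create a small sample file so the sketch is self-contained.
# (The name "sample.txt" and its contents are made up for this demo.)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wt") as outfile:
    outfile.write("first line\nsecond line\n")

with open(path, "rt") as myfile:
    contents = myfile.read()
    closed_inside = myfile.closed   # still open while we are inside the block

closed_after = myfile.closed        # closed automatically when the block ended

print(contents, end="")
print("Closed inside the block?", closed_inside)
print("Closed after the block?", closed_after)
```

The variable myfile still exists after the block, but any further read on it would raise an error because the underlying file has been closed.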
Reading text files line-by-line
In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.
In almost every case, it's a better idea to read a text file one line at a time.
In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
Example
For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a line break of its own at the end of whatever you've asked it to print.
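Because the file object is an iterator, the same mechanism can hand out two lines per pass instead of one. The sketch below is one possible way to do that, not the only one: the for loop takes a line, and the built-in next() takes the line after it. The helper name read_pairs and the sample file are assumptions made for the demo.

```python
import os
import tempfile

def read_pairs(path):
    """Yield (first, second) for each pair of consecutive lines.

    If the file has an odd number of lines, the last pair is (line, None).
    Lines keep their trailing newline characters.
    """
    with open(path, "rt") as myfile:
        for first in myfile:               # the for loop takes one line...
            second = next(myfile, None)    # ...and next() takes the one after it
            yield first, second

# Demo with a three-line sample file (name and contents are made up).
demo_path = os.path.join(tempfile.mkdtemp(), "pairs.txt")
with open(demo_path, "wt") as outfile:
    outfile.write("line 1\nline 2\nline 3\n")

pairs = list(read_pairs(demo_path))
for first, second in pairs:
    print(repr(first), repr(second))
```

Passing None as the second argument to next() avoids a StopIteration error when the file runs out of lines in the middle of a pair.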
Let's store our lines of text in a variable, specifically a list variable, so we can look at it more closely.
Storing text data in a variable
In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.
Example
mylines = []                             # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text data.
    for myline in myfile:                # For each line, stored as myline,
        mylines.append(myline)           # add its contents to mylines.
print(mylines)                           # Print the list.
The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:
Output:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']
Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.
Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.
Note
If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.
Example
We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:
print(mylines[0])
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Example
Or the third line, by specifying index number 2:
print(mylines[2])
Output:
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
But if we try to access an index for which there is no value, we get an error. Our list has four elements, with indexes 0 through 3, so index 4 is out of range:
Example
print(mylines[4])
Output:
Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
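If a program cannot know in advance how many lines a file has, it can check the list length with len() or catch the IndexError. The sketch below shows both under stated assumptions: the list is a stand-in for lines read from a file, and get_line is a hypothetical helper name, not a built-in.

```python
# Stand-in for a list of lines read from a file.
mylines = ["first", "second", "third", "fourth"]

def get_line(lines, index):
    """Return lines[index], or None if there is no element at that index."""
    try:
        return lines[index]
    except IndexError:
        return None

count = len(mylines)             # number of elements in the list
last = get_line(mylines, 3)      # index 3 is the last valid index here
missing = get_line(mylines, 4)   # out of range, so None instead of an exception

print(count)
print(last)
print(missing)
```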
Example
A list object is iterable, so to print every element of the list, we can iterate over it with for...in:
mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text.
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element)                       # print it.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.
We can change this default behavior by specifying an end parameter in our print() call:
print(element, end='')
By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.
Example
Our revised program looks like this:
mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open file lorem.txt
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element, end='')               # print it without extra newlines.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.
How to strip newlines
To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.
Tip
This process is sometimes also called "trimming."
Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.
If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string containing the set of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
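One detail worth seeing in action is that rstrip() treats chars as a set of individual characters, not as a suffix to match. The short sketch below uses made-up strings to illustrate this; the expected results are shown in the comments.

```python
print("123abc".rstrip("bc"))         # 123a
print("123abc".rstrip("cb"))         # 123a (the order of chars does not matter)
print("123abcbc".rstrip("bc"))       # 123a (every trailing b or c is removed)
print("line\n\n".rstrip("\n"))       # line (all trailing newlines are removed)
print(repr("  text \n".rstrip()))    # '  text' (no argument: trailing whitespace)
```

Called with no argument, rstrip() removes all trailing whitespace, including spaces and tabs, which may or may not be what you want when only the newline should go.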
Tip
When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted, enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double-quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.
The statement string.rstrip('\n') strips any trailing newline characters from the right side of string. The following version of our program strips the newlines when each line is read from the text file:
mylines = []                                 # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.
for element in mylines:                      # For each element in the list,
    print(element)                           # print it.
The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
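The same read-and-strip step can also be written more compactly with a list comprehension, and the newlines can be put back with join(). The sketch below is one way to do it, using a made-up sample file so that it runs on its own; the file name and contents are assumptions.

```python
import os
import tempfile

# Sample file for the demo (name and contents are made up).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wt") as outfile:
    outfile.write("alpha\nbeta\ngamma\n")

# Read every line and strip its trailing newline in one expression.
with open(path, "rt") as myfile:
    mylines = [myline.rstrip("\n") for myline in myfile]

# Put the newlines back, for example before writing the text to disk again.
rebuilt = "\n".join(mylines) + "\n"

print(mylines)
print(repr(rebuilt))
```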
Now, let's search the lines in the list for a specific substring.
Searching text for a substring
Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.
The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.
Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.
In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string, it returns -1 to indicate nothing was found.
Example
print(mylines[0].find("e"))
Output:
3
The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)
The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. The stop index is exclusive. For example, string.find("abc", 10, 20) searches for the substring "abc", but only within indexes 10 through 19, that is, from the 11th to the 20th character. If stop is not specified, find() starts at index start, and stops at the end of the string.
Example
For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.
print(mylines[0].find("e", 4))
Output:
24
In other words, starting at the fifth character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").
Example
To start searching at index 10, and stop at index 30, this time in the third line of the file:
print(mylines[2].find("e", 10, 30))
Output:
28
(The first "e" in "Maecenas").
If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:
print(mylines[0].find("e", 25, 30))
Output:
-1
There were no "e" occurrences between indices 25 and 30.
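The start and stop behavior of find() can be checked directly against a slice of the string. The sketch below uses the first line of the Lorem Ipsum text as a literal, so it does not need lorem.txt; the variable name line is an assumption made for the demo.

```python
line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

first = line.find("e")             # search the whole string
later = line.find("e", 4)          # first "e" at or after index 4
ranged = line.find("e", 25, 30)    # search indexes 25 through 29 only
searched = line[25:30]             # the slice that the ranged search covered
absent = line.find("xyz")          # substring that does not occur at all

print(first)            # 3
print(later)            # 24
print(ranged)           # -1
print(repr(searched))   # 't, co'
print(absent)           # -1
```

Printing the slice shows why the ranged search fails: the five characters it covers contain no "e".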
Finding all occurrences of a substring
But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.
In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string. Specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index exceeds the length of the string, we stop.
# Build array of lines from file, strip newlines
mylines = []                                 # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.

# Locate and print all occurrences of letter "e"
substr = "e"                                 # substring to search for.
for line in mylines:                         # string to be searched
    index = 0                                # current index: character being compared
    prev = 0                                 # previous index: last character compared
    while index < len(line):                 # While index has not exceeded string length,
        index = line.find(substr, index)     # set index to first occurrence of "e"
        if index == -1:                      # If nothing was found,
            break                            # exit the while loop.
        print(" " * (index - prev) + substr, end='')  # print spaces from previous
                                             # match, then the substring.
        prev = index + len(substr)           # remember this position for next loop.
        index += len(substr)                 # increment the index by the length of substr.
                                             # (Repeat until index > line length)
    print('\n' + line)                       # Print the original string under the e's
Output:
   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
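When the positions themselves are needed rather than a printed diagram, the same loop can be wrapped in a function that returns a list of indexes. This is a minimal sketch of that idea; find_all is a hypothetical helper name, not a built-in string method, and it reports non-overlapping matches only.

```python
def find_all(text, substr):
    """Return the index of every non-overlapping occurrence of substr in text."""
    if not substr:
        raise ValueError("substr must not be empty")
    positions = []
    index = text.find(substr)
    while index != -1:
        positions.append(index)
        # Resume the search just past the match we have just recorded.
        index = text.find(substr, index + len(substr))
    return positions

line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(find_all(line, "e"))      # [3, 24, 32, 35, 51]
print(find_all(line, "or"))     # [1, 15]
print(find_all(line, "zzz"))    # []
```

The check for an empty substring is there because find("") always succeeds without advancing, which would otherwise make the loop run forever.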
Incorporating regular expressions
For complex searches, use regular expressions.
The Python regular expressions module is called re. To use it in your program, import the module before you use it:
import re
The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.
For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?
Character sequence | Meaning |
---|---|
\b | A word boundary. It matches the empty string at the beginning or end of a word, that is, between a word character and a non-word character (or the start or end of the string). "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, or an underscore ("_"). |
d | Lowercase letter d. |
\w* | \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters. |
r | Lowercase letter r. |
\b | Word boundary. |
So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.
pattern = re.compile(r"\bd\w*r\b")
Note
The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
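The effect of the r prefix is easy to check. In an ordinary string literal, \b is a single backspace character; in a raw string it stays as a backslash followed by the letter b, which is what the re module needs to see. The sketch below illustrates this with a simpler pattern chosen for the demo.

```python
import re

plain_len = len("\b")     # 1: a single backspace character
raw_len = len(r"\b")      # 2: a backslash followed by the letter b

text = "Good morning, doctor."
without_raw = re.search("\bdoctor\b", text)   # looks for backspace characters
with_raw = re.search(r"\bdoctor\b", text)     # looks for word boundaries

print(plain_len, raw_len)
print(without_raw)          # None: the text contains no backspace characters
print(with_raw.group())     # doctor
```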
Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".
import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")    # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:  # Search for the pattern. If found,
    print("Found it.")
Output:
Found it.
To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:
import re
text = "Hello, DoctoR."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")
Output:
Found it.
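A match object carries more than a yes-or-no answer. The sketch below shows, with a made-up sentence, how group() and span() report what was matched and where, and how findall() returns every match in one call. All three are standard methods of the re module; the sample text is an assumption.

```python
import re

pattern = re.compile(r"\bd\w*r\b", re.IGNORECASE)
text = "The Doctor gave a dour look to the destroyer, dr"

match = pattern.search(text)        # first match only
if match is not None:
    print(match.group())            # Doctor
    print(match.span())             # (4, 10)

all_matches = pattern.findall(text)
print(all_matches)                  # ['Doctor', 'dour', 'destroyer', 'dr']
```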
Putting it all together
So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.
Print all lines containing substring
The program beneath reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.
Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() statement, we construct an output string by joining several strings with the + operator.
errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)
Input (stored in logfile.txt):
This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!
Output:
Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
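The same program can be written a little more compactly with two built-in features: enumerate(), which supplies the line number, and the in operator, which tests for a substring without comparing against -1. The sketch below is an alternative under those assumptions, not a replacement for the version above; it creates its own sample log file in a temporary directory so it can run on its own.

```python
import os
import tempfile

# Sample log file for the demo (written to a temporary directory).
log_path = os.path.join(tempfile.mkdtemp(), "logfile.txt")
with open(log_path, "wt") as outfile:
    outfile.write(
        "This is line 1\n"
        "This is line 2\n"
        "Line 3 has an error!\n"
        "This is line 4\n"
        "Line 5 also has an ERROR!\n"
    )

substr = "error"
errors = []
with open(log_path, "rt") as myfile:
    # enumerate() counts the lines for us, starting at 1.
    for linenum, line in enumerate(myfile, start=1):
        if substr in line.lower():                 # case-insensitive test
            errors.append("Line " + str(linenum) + ": " + line.rstrip("\n"))

for err in errors:
    print(err)
```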
Extract all lines containing substring, using regex
The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.
import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])
Output:
Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 ERROR: usb 1-1: can't set config, exiting.
Extract all lines containing a phone number
The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". Because search() looks for the pattern anywhere in a line, this regex finds a match in lines containing any of the following phone number notations:
- 123-456-7890
- (123) 456-7890
- 123 456 7890
- 123.456.7890
- +91 (123) 456-7890
import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line " + str(err[0]) + ": " + err[1])
Output:
Line 3: My phone number is 731.215.8881.
Line 7: You can reach Mr. Walters at (212) 558-3131.
Line 12: His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line 14: She can also be contacted at (888) 312.8403, extension 12.
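It can be useful to look at exactly which text the regex matches, not just whether it matches. The sketch below runs the pattern from the program above against the listed notations and prints the matched text: in each case it is only the last seven digits with a leading separator, which is enough to flag the line. The second pattern, full_pattern, is a hypothetical variant written for this illustration, not part of the original program; it assumes an optional country code and a three-digit area code with or without parentheses.

```python
import re

# The pattern used in the program above.
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

# Hypothetical variant that also consumes the country code and area code.
full_pattern = re.compile(
    r"(\+\d{1,2}\s?)?(\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
]

short_matches = [pattern.search(s).group() for s in samples]
full_matches = [full_pattern.search(s).group() for s in samples]

for sample, short, full in zip(samples, short_matches, full_matches):
    print(repr(sample), "->", repr(short), "|", repr(full))
```

If the goal is to extract the whole number rather than just detect the line, a pattern along the lines of the second one is a possible starting point, though real-world phone formats vary more than these five samples.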
Search a dictionary for words
The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.
import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')
Output:
Hope
heliotrope
hope
hornpipe
horoscope
hype
Source: https://www.computerhope.com/issues/ch001721.htm