A file must always be opened before using it and closed when the program is finished using it.
File Input/Output
File Input/Output (IO) requires 3 steps: open the file, read or write data, and close the file.
Python provides built-in functions and modules to support these operations.
Opening/Closing a File
Reading/Writing Text Files
Reading Line/Lines from a Text File
Writing Line to a Text File
Examples
>>> f = open('test.txt', 'w')
>>> f.write('apple\n')
>>> f.write('orange\n')
>>> f.write('pear\n')
>>> f.close()

>>> f = open('test.txt', 'r')
>>> f.readline()
'apple\n'
>>> f.readlines()
['orange\n', 'pear\n']
>>> f.readline()
''
>>> f.close()

>>> f = open('test.txt', 'r')
>>> f.read()
'apple\norange\npear\n'
>>> f.close()

>>> f = open('test.txt')
>>> line = f.readline()
>>> while line:
...     line = line.rstrip()
...     print(line)
...     line = f.readline()
...
apple
orange
pear
>>> f.close()
Processing Text File Line-by-Line
We can use a with-statement to open a file, which closes the file automatically at the end of the block, together with a for-loop to process it line by line:
with open('path/to/file.txt', 'r') as f:
    for line in f:
        line = line.strip()
        # process the line here
The with-statement is equivalent to the following try-finally statement:
f = open('path/to/file.txt')
try:
    for line in f:
        line = line.strip()
        # process the line here
finally:
    f.close()
Example: Line-by-line File Copy
A script can copy one file into another line by line, prepending each line with its line number.
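The line-by-line copy described above can be sketched as follows. This is a minimal sketch, not the original script: the function name copy_with_line_numbers, the "N: " prefix format, and the temporary in.txt/out.txt files are assumptions made for illustration.

```python
import os
import tempfile


def copy_with_line_numbers(src_path, dest_path):
    """Copy src_path to dest_path line by line, prefixing each line with its number."""
    with open(src_path, 'r') as infile, open(dest_path, 'w') as outfile:
        # enumerate() counts from 1 so that the first line is numbered 1
        for line_number, line in enumerate(infile, start=1):
            line = line.rstrip('\n')   # drop the trailing newline only
            outfile.write('{}: {}\n'.format(line_number, line))


# Demo on throwaway files (hypothetical names) so the sketch runs anywhere
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dest = os.path.join(tmp_dir, 'out.txt')
with open(src, 'w') as f:
    f.write('apple\norange\npear\n')

copy_with_line_numbers(src, dest)
with open(dest) as f:
    copied = f.read()
print(copied)
```

Both files are opened in a single with-statement, so they are closed even if an error occurs midway through the copy.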
Binary File Operations
[TODO] Intro
For example [TODO]
Directory and File Management
In Python, directory and file management are supported by modules such as os, os.path and shutil.
Path Operations Using Module os.path
In Python, a path could refer to a file, a directory, or a symlink (symbolic link).
A path could be absolute (beginning with root) or relative to the current working directory (CWD). The path separator is platform-dependent (Windows uses '\', while Unixes/Mac OS use '/').
Checking Path Existence and Type
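For examples,
>>> import os
>>> os.path.exists('/usr/bin')
True
>>> os.path.isfile('/usr/bin')
False
>>> os.path.isdir('/usr/bin')
True
Forming a New Path
The path separator is platform-dependent (Windows uses '\', while Unixes/Mac OS use '/'). For portability, use os.path.sep and os.path.join() instead of hardcoding the separator.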
For examples,
>>> import os
>>> print(os.path.sep)
/
>>> print(os.path.join(os.path.sep, 'etc', 'apache2', 'httpd.conf'))
/etc/apache2/httpd.conf
>>> print(os.path.join('..', 'apache2', 'httpd.conf'))
../apache2/httpd.conf
Manipulating Directory-name and Filename
For example, to form an absolute path of a file called out.txt in the same directory as in.txt:
os.path.join(os.path.dirname(os.path.abspath('in.txt')), 'out.txt')   # absolute path
os.path.join(os.path.dirname('in.txt'), 'out.txt')                    # relative path
For example, save the following as test_ospath.py:
import os
print('__file__:', __file__)
print('dirname():', os.path.dirname(__file__))
print('abspath():', os.path.abspath(__file__))
print('dirname(abspath()):', os.path.dirname(os.path.abspath(__file__)))
When a module is loaded in Python, __file__ is set to the path of its source file. Run the script in the following ways and compare the output:
$ python3 ./test_ospath.py
$ python3 test_ospath.py
$ python3 ../parent_dir/test_ospath.py
$ python3 /path/to/test_ospath.py
Handling Symlink (Unixes/Mac OS)
For example, save the following as test_realpath.py:
import os
print('__file__:', __file__)
print('abspath():', os.path.abspath(__file__))
print('realpath():', os.path.realpath(__file__))
$ python3 test_realpath.py
# Same output for abspath() and realpath() because there is no symlink
$ ln -s test_realpath.py test_realpath_link.py
$ python3 test_realpath_link.py
# abspath(): /path/to/test_realpath_link.py
# realpath(): /path/to/test_realpath.py (symlink resolved)
Directory & File Management Using Modules os and shutil
The modules os and shutil provide functions for managing directories and files.
Directory Management
File Management
For examples [TODO],
>>> import os
>>> dir(os)
......
>>> help(os)
......
>>> help(os.getcwd)
......
>>> os.getcwd()
... current working directory ...
>>> os.listdir()
... contents of current directory ...
>>> os.chdir('test-python')
>>> exec(open('hello.py').read())
>>> os.system('ls -l')
>>> os.name
'posix'
>>> os.mkdir('sub_dir')                 # create a single directory
>>> os.makedirs('/path/to/sub_dir')     # create intermediate directories as needed
>>> os.remove('filename')
>>> os.rename('oldFile', 'newFile')
List a Directory
For examples,
>>> import os
>>> help(os.listdir)
......
>>> os.listdir()
[..., ..., ...]
>>> for f in sorted(os.listdir('/usr')):
...     print(f)
...
......
>>> for f in sorted(os.listdir('/usr')):
...     print(os.path.join('/usr', f))   # listdir() returns bare names, so join them with the directory
...
......
List a Directory Recursively via os.walk()
For example,
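A minimal sketch of a recursive listing with os.walk(), which yields a (dirpath, dirnames, filenames) tuple for each directory it visits. The temporary directory tree (a.txt and sub/b.txt) is made up for illustration so that the code runs anywhere.

```python
import os
import tempfile

# Build a small throwaway tree (hypothetical names): root/a.txt and root/sub/b.txt
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for rel in ('a.txt', os.path.join('sub', 'b.txt')):
    with open(os.path.join(root, rel), 'w') as f:
        f.write('demo\n')

found = []
# os.walk() visits root and every sub-directory beneath it
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        full_path = os.path.join(dirpath, name)
        found.append(os.path.relpath(full_path, root))   # path relative to root

found.sort()
print(found)
```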
List a Directory Recursively via Module glob (Python 3.5)
[TODO] Intro
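As a hedged sketch of the idea: since Python 3.5, glob.glob() accepts recursive=True, in which case the pattern '**' matches zero or more directory levels. The temporary files below (a.txt, notes.md, sub/b.txt) are assumptions made for illustration.

```python
import glob
import os
import tempfile

# Throwaway tree (hypothetical names): two .txt files at different depths and one .md file
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for rel in ('a.txt', 'notes.md', os.path.join('sub', 'b.txt')):
    with open(os.path.join(root, rel), 'w') as f:
        f.write('demo\n')

# '**' matches any number of directory levels only when recursive=True
pattern = os.path.join(root, '**', '*.txt')
txt_files = sorted(os.path.relpath(p, root)
                   for p in glob.glob(pattern, recursive=True))
print(txt_files)
```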
Copying File
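One common way to copy a file is shutil.copy(), which copies the file content and permission bits and returns the path of the new file; shutil.copy2() additionally tries to preserve metadata such as timestamps. The sketch below uses temporary in.txt/out.txt files, which are assumptions made for illustration.

```python
import os
import shutil
import tempfile

# Create a throwaway source file (hypothetical name)
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
with open(src, 'w') as f:
    f.write('apple\n')

# shutil.copy() returns the path of the newly created file
dest = shutil.copy(src, os.path.join(tmp_dir, 'out.txt'))

with open(dest) as f:
    copied_text = f.read()
print(os.path.basename(dest), repr(copied_text))
```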
Shell Command [TODO]
Environment Variables [TODO]
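fileinput Module
The fileinput module provides support for processing lines of input from one or more files given in the command-line arguments (sys.argv), falling back to standard input when no file is given.
import fileinput

def main():
    lineNumber = 0
    for line in fileinput.input():
        line = line.rstrip()
        lineNumber += 1
        print('{}: {}'.format(lineNumber, line))

if __name__ == '__main__':
    main()
Text Processing
For simple text string operations such as string search and replacement, you can use the built-in string functions (e.g., str.find() and str.replace()). For complex pattern search and replacement, use regular expressions (regex).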
String Operations
The built-in class str provides many member functions for text string manipulation.
Strip whitespaces (blank, tab and newline)
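A short sketch of the three stripping methods of str. The sample string is made up for illustration.

```python
s = '  \thello, world\n'

both = s.strip()     # remove leading and trailing whitespace
left = s.lstrip()    # remove leading whitespace only
right = s.rstrip()   # remove trailing whitespace only

print(repr(both))    # 'hello, world'
print(repr(left))    # 'hello, world\n'
print(repr(right))   # '  \thello, world'
```

Strings are immutable, so each method returns a new string and leaves s unchanged.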
Uppercase/Lowercase
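A short sketch of case conversion and case testing with str methods. The sample string is made up for illustration.

```python
s = 'Hello, World'

upper = s.upper()              # convert all cased characters to uppercase
lower = s.lower()              # convert all cased characters to lowercase
is_upper = upper.isupper()     # True: every cased character is uppercase
is_lower = s.islower()         # False: 'H' and 'W' are uppercase

print(upper, lower, is_upper, is_lower)
```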
Find
For examples,
>>> s = '/test/in.txt'
>>> s.find('in')
6
>>> s[0 : s.find('in')] + 'out.txt'
'/test/out.txt'
Find and Replace
For examples,
>>> s = 'hello hello hello, world'
>>> help(s.replace)
>>> s.replace('ll', '**')
'he**o he**o he**o, world'
>>> s.replace('ll', '**', 2)
'he**o he**o hello, world'
Split into Tokens and Join
For examples,
>>> 'apple, orange, pear'.split()
['apple,', 'orange,', 'pear']
>>> 'apple, orange, pear'.split(', ')
['apple', 'orange', 'pear']
>>> 'apple, orange, pear'.split(', ', maxsplit=1)
['apple', 'orange, pear']
>>> ', '.join(['apple', 'orange, pear'])
'apple, orange, pear'
Regular Expression in Module re
I assume that you are familiar with regex syntax; if not, read an introduction to regex first.
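The re module supports regular expressions (regex).
>>> import re
>>> dir(re)
......
>>> help(re)
......
Backslash (\), Python Raw String r'...' vs Regular String
Regex's syntax uses backslash (\) for metacharacters such as \d (digit), \s (whitespace) and \w (word character), and to escape special regex characters, e.g., \. for a literal dot.
On the other hand, Python's regular strings also use backslash for escape sequences, e.g., \n for newline and \t for tab. To write the regex pattern \d+ (one or more digits) in a Python regular string, you need to write '\\d+', which is cumbersome and error-prone.
Python's solution is using raw string with a prefix r, in the form of r'...', which leaves backslashes uninterpreted, so that you can write r'\d+'. Furthermore, Python denotes parenthesized back references (or capturing groups) as \1, \2, \3, ..., which can be written as raw strings r'\1', r'\2' instead of regular strings '\\1', '\\2'. I suggest that you use raw strings for regex pattern strings and replacement strings.
Compiling (Creating) a Regex Pattern Object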
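The difference between regular and raw strings, and how a raw-string pattern is compiled, can be checked with a short sketch. The pattern \d+ and the sample text 'abc123xyz456' are made up for illustration.

```python
import re

# A regular string needs a doubled backslash; a raw string does not.
same_pattern = ('\\d+' == r'\d+')     # both spell the 3 characters \ d +

# In a raw string, \n is two characters (backslash and 'n'), not a newline.
raw_len = len(r'\n')
regular_len = len('\n')

# Compile the raw-string pattern once, then reuse the pattern object.
digits_pattern = re.compile(r'\d+')
numbers = digits_pattern.findall('abc123xyz456')

print(same_pattern, raw_len, regular_len, numbers)
```

Compiling is optional, but it avoids repeating the pattern string when the same regex is used many times.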
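For examples,
>>> import re
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> type(p1)
<class '_sre.SRE_Pattern'>
(Python 3.7 and later show <class 're.Pattern'> instead.)
Invoking Regex Operations
You can invoke most of the regex functions in two ways:
1. Compile a regex pattern object, and invoke the operations via its member functions, e.g., p1.findall(inStr).
2. Invoke the module-level functions of re directly, passing the pattern as the first argument, e.g., re.findall(pattern, inStr).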
Find using findall()
For examples,
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> p1.findall('123 456')
['123', '456']
>>> p1.findall('abc')
[]
>>> p1.findall('abc123xyz456_7_00')
['123', '456', '7', '0', '0']

>>> re.findall(r'[1-9][0-9]*|0', '123 456')
['123', '456']
>>> re.findall(r'[1-9][0-9]*|0', 'abc')
[]
>>> re.findall(r'[1-9][0-9]*|0', 'abc123xyz456_7_00')
['123', '456', '7', '0', '0']
Replace using sub() and subn()
For examples,
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> p1.sub(r'**', 'abc123xyz456_7_00')
'abc**xyz**_**_****'
>>> p1.subn(r'**', 'abc123xyz456_7_00')
('abc**xyz**_**_****', 5)
>>> p1.sub(r'**', 'abc123xyz456_7_00', count=3)
'abc**xyz**_**_00'

>>> re.sub(r'[1-9][0-9]*|0', r'**', 'abc123xyz456_7_00')
'abc**xyz**_**_****'
>>> re.sub(p1, r'**', 'abc123xyz456_7_00')
'abc**xyz**_**_****'
>>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=3)
('abc**xyz**_**_00', 3)
>>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=10)
('abc**xyz**_**_****', 5)
Notes: For simple string replacement, use str.replace() instead.
Using Parenthesized Back-References \1, \2, ... in Substitution and Pattern
In Python, regex parenthesized back-references (capturing groups) are denoted as \1, \2, and so on.
For examples,
>>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc')
'bbb aaa ccc'
>>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd')
'bbb aaa ddd ccc'
>>> re.subn(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd eee')
('bbb aaa ddd ccc eee', 2)
>>> re.subn(r'(\w+) \1', r'\1', 'hello hello world again again')
('hello world again', 2)
Find using search() and Match Object
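The search() function returns a Match object encapsulating the first match, or None if there is no match. The Match object provides methods such as group(), span(), start() and end() to retrieve the matched text and its position.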
For example,
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> inStr = 'abc123xyz456_7_00'
>>> m = p1.search(inStr)
>>> m
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> m.group()
'123'
>>> m.span()
(3, 6)
>>> m.start()
3
>>> m.end()
6
>>> m = p1.search(inStr, m.end())
>>> m
<_sre.SRE_Match object; span=(9, 12), match='456'>

>>> m = p1.search(inStr)
>>> while m:
...     print(m, m.group())
...     m = p1.search(inStr, m.end())
...
<_sre.SRE_Match object; span=(3, 6), match='123'> 123
<_sre.SRE_Match object; span=(9, 12), match='456'> 456
<_sre.SRE_Match object; span=(13, 14), match='7'> 7
<_sre.SRE_Match object; span=(15, 16), match='0'> 0
<_sre.SRE_Match object; span=(16, 17), match='0'> 0
To retrieve the back-references (or capturing groups) inside the Match object:
>>> p2 = re.compile(r'(A)(\w+)', re.IGNORECASE)   # raw string, so that \w is not treated as a string escape
>>> inStr = 'This is an apple.'
>>> m = p2.search(inStr)
>>> while m:
...     print(m)
...     print(m.group())     # the entire match
...     print(m.groups())    # tuple of all capturing groups
...     for idx in range(1, m.lastindex + 1):
...         print(m.group(idx), end=',')   # each capturing group in turn
...     print()
...     m = p2.search(inStr, m.end())
...
<_sre.SRE_Match object; span=(8, 10), match='an'>
an
('a', 'n')
a,n,
<_sre.SRE_Match object; span=(11, 16), match='apple'>
apple
('a', 'pple')
a,pple,
Find using match() and fullmatch()
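The match() function matches the pattern only at the beginning of the string, while fullmatch() requires the entire string to match. Both return a Match object, or None if there is no match.
For example,
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> m = p1.match('aaa123zzz456')    # no match at the beginning: m is None
>>> m
>>> m = p1.match('123zzz456')
>>> m
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>> m = p1.fullmatch('123456')
>>> m
<_sre.SRE_Match object; span=(0, 6), match='123456'>
>>> m = p1.fullmatch('123456abc')   # trailing 'abc' does not match: m is None
>>> m
Find using finditer()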
The finditer() function is similar to findall(), but returns an iterator of Match objects instead of a list of strings.
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> inStr = 'abc123xyz456_7_00'
>>> p1.findall(inStr)
['123', '456', '7', '0', '0']
>>> for s in p1.findall(inStr):
...     print(s, end=' ')
...
123 456 7 0 0
>>> for m in p1.finditer(inStr):
...     print(m)
...
<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(9, 12), match='456'>
<_sre.SRE_Match object; span=(13, 14), match='7'>
<_sre.SRE_Match object; span=(15, 16), match='0'>
<_sre.SRE_Match object; span=(16, 17), match='0'>
>>> for m in p1.finditer(inStr):
...     print(m.group(), end=' ')
...
123 456 7 0 0
Splitting String into Tokens
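The split() function splits the string at each occurrence of the regex pattern and returns a list of tokens.
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> p1.split('aaa123bbb456ccc')
['aaa', 'bbb', 'ccc']
>>> re.split(r'[1-9][0-9]*|0', 'aaa123bbb456ccc')
['aaa', 'bbb', 'ccc']
Notes: For simple delimiter, use str.split() instead.
Web Scraping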
Web Scraping (or web harvesting or web data extraction) refers to reading the raw HTML page to retrieve desired data. Needless to say, you need to master HTML, CSS and JavaScript. Python supports web scraping via packages requests and BeautifulSoup (bs4).
Install Packages
You could install the relevant packages using pip:
$ pip install requests
$ pip install bs4
Step 0: Inspect the Target Webpage
Step 1: Send an HTTP GET request to the target URL to retrieve the raw HTML page using module requests
>>> import requests
>>> url = "http://your_target_webpage"
>>> response = requests.get(url)
>>> type(response)
<class 'requests.models.Response'>
Step 2: Parse the HTML Text into a Tree-Structure using BeautifulSoup and Search the Desired Data
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(response.text, "html.parser")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
You could write out the selected data to a file:
with open(filename, 'w') as fp:
    for row in rows:
        fp.write(row + '\n')
You could also use the csv module to write out rows of data with column headers:
>>> import csv
>>> with open(filename, 'w', newline='') as fp:
...     writer = csv.DictWriter(fp, ['colHeader1', 'colHeader2', 'colHeader3'])
...     writer.writeheader()
...     for row in rows:
...         writer.writerow(row)
...
Step 3: Download Selected Document Using urllib.request
You may want to download documents such as text files or images.
>>> import urllib.request
>>> downloadUrl = '.....'
>>> file = '......'
>>> urllib.request.urlretrieve(downloadUrl, file)
Step 4: Delay
To avoid spamming a website with download requests (and being flagged as a spammer), you need to pause your code for a while between requests.
>>> import time
>>> time.sleep(1)