I'm working my way through an excellent webscraping tutorial, and I'm on Lesson 10 here:
https://programminghistorian.org/en/lessons/counting-frequencies
When I run the recommended code, which uses functions to remove stop words, I'm still left with a lot of numeric tokens (strings made up of digits) that are counted in the word frequency list. I'd like to get rid of all of these numeric tokens.
Any recommendations on how to use either the re module or some other function/module to delete numeric tokens?
This is my code (I'm working in Thonny 3.2.7 (which uses Python 3.7.7 as the interpreter, and Tk 8.6.8) on Mac Catalina):
# html-to-freq-2.py
import urllib.request, urllib.error, urllib.parse
import obo

url = 'http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33'
response = urllib.request.urlopen(url)
html = response.read()
text = obo.stripTags(html).lower()
fullwordlist = obo.stripNonAlphaNum(text)
wordlist = obo.removeStopwords(fullwordlist, obo.stopwords)
dictionary = obo.wordListToFreqDict(wordlist)
sorteddict = obo.sortFreqDict(dictionary)
for s in sorteddict:
    print(str(s))
Note that the "obo" module is a .py file created with code recommended by the tutorial. That code is:

# For processing obo files (Old Bailey Online). Webscraping.
# Telling the module which tag to look for to start scraping: startLoc =
# And which tag defines the end point, where you want to stop scraping: endLoc =
def stripTags(pageContents):
    pageContents = str(pageContents)
    startLoc = pageContents.find("")
    endLoc = pageContents.rfind("")
    pageContents = pageContents[startLoc:endLoc]
    return pageContents

# Given a list of words, remove any that are in a list of stop words.
def removeStopwords(wordlist, stopwords):
    return [w for w in wordlist if w not in stopwords]

# Setting the list of stop words. This, per ProgrammingHistorian, is supposed to go at the beginning of obo.py
stopwords = ['a', 'about', 'above', 'across', 'after', 'afterwards']
stopwords += ['again', 'against', 'all', 'almost', 'alone', 'along']
stopwords += ['already', 'also', 'although', 'always', 'am', 'among']
stopwords += ['amongst', 'amoungst', 'amount', 'an', 'and', 'another']
stopwords += ['any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere']
stopwords += ['are', 'around', 'as', 'at', 'back', 'be', 'became']
stopwords += ['because', 'become', 'becomes', 'becoming', 'been']
stopwords += ['before', 'beforehand', 'behind', 'being', 'below']
stopwords += ['beside', 'besides', 'between', 'beyond', 'bill', 'both']
stopwords += ['bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant']
stopwords += ['co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de']
stopwords += ['describe', 'detail', 'did', 'do', 'done', 'down', 'due']
stopwords += ['during', 'each', 'eg', 'eight', 'either', 'eleven', 'else']
stopwords += ['elsewhere', 'empty', 'enough', 'etc', 'even', 'ever']
stopwords += ['every', 'everyone', 'everything', 'everywhere', 'except']
stopwords += ['few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first']
stopwords += ['five', 'for', 'former', 'formerly', 'forty', 'found']
stopwords += ['four', 'from', 'front', 'full', 'further', 'get', 'give']
stopwords += ['go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her']
stopwords += ['here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers']
stopwords += ['herself', 'him', 'himself', 'his', 'how', 'however']
stopwords += ['hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed']
stopwords += ['interest', 'into', 'is', 'it', 'its', 'itself', 'keep']
stopwords += ['last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made']
stopwords += ['many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine']
stopwords += ['more', 'moreover', 'most', 'mostly', 'move', 'much']
stopwords += ['must', 'my', 'myself', 'name', 'namely', 'neither', 'never']
stopwords += ['nevertheless', 'next', 'nine', 'no', 'nobody', 'none']
stopwords += ['noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of']
stopwords += ['off', 'often', 'on','once', 'one', 'only', 'onto', 'or']
stopwords += ['other', 'others', 'otherwise', 'our', 'ours', 'ourselves']
stopwords += ['out', 'over', 'own', 'part', 'per', 'perhaps', 'please']
stopwords += ['put', 'rather', 're', 's', 'same', 'see', 'seem', 'seemed']
stopwords += ['seeming', 'seems', 'serious', 'several', 'she', 'should']
stopwords += ['show', 'side', 'since', 'sincere', 'six', 'sixty', 'so']
stopwords += ['some', 'somehow', 'someone', 'something', 'sometime']
stopwords += ['sometimes', 'somewhere', 'still', 'such', 'system', 'take']
stopwords += ['ten', 'than', 'that', 'the', 'their', 'them', 'themselves']
stopwords += ['then', 'thence', 'there', 'thereafter', 'thereby']
stopwords += ['therefore', 'therein', 'thereupon', 'these', 'they']
stopwords += ['thick', 'thin', 'third', 'this', 'those', 'though', 'three']
stopwords += ['three', 'through', 'throughout', 'thru', 'thus', 'to']
stopwords += ['together', 'too', 'top', 'toward', 'towards', 'twelve']
stopwords += ['twenty', 'two', 'un', 'under', 'until', 'up', 'upon']
stopwords += ['us', 'very', 'via', 'was', 'we', 'well', 'were', 'what']
stopwords += ['whatever', 'when', 'whence', 'whenever', 'where']
stopwords += ['whereafter', 'whereas', 'whereby', 'wherein', 'whereupon']
stopwords += ['wherever', 'whether', 'which', 'while', 'whither', 'who']
stopwords += ['whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with']
stopwords += ['within', 'without', 'would', 'yet', 'you', 'your']
stopwords += ['yours', 'yourself', 'yourselves']
stopwords += ['p']
# Stripping out all non-alphanumeric characters, then importing the regex module "re"
# Also using Unicode to make sure all characters from all languages can be scraped by the module (that we're creating)
# And splitting the text
def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

# Given a list of words, return a dictionary of
# word-frequency pairs.
def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(list(zip(wordlist,wordfreq)))

# Sort a dictionary of word-frequency pairs in order of descending frequency.
def sortFreqDict(freqdict):
    aux = [(freqdict[key], key) for key in freqdict]
    aux.sort()
    aux.reverse()
    return aux
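As an aside, the counting step above calls wordlist.count() once per word, which gets slow on long texts. A rough sketch of the same two steps using collections.Counter from the standard library follows; the names wordListToFreqDictFast and sortFreqDictFast are made up here for illustration and are not part of the tutorial's obo.py.

```python
from collections import Counter

def wordListToFreqDictFast(wordlist):
    """Count every word in a single pass (hypothetical alternative)."""
    return dict(Counter(wordlist))

def sortFreqDictFast(freqdict):
    """Return (count, word) pairs in descending order, like sortFreqDict."""
    return sorted(((count, word) for word, count in freqdict.items()),
                  reverse=True)

# Tiny made-up word list, just to show the shape of the result
words = ['guilty', 'prisoner', 'guilty', 'court']
freq = wordListToFreqDictFast(words)
print(sortFreqDictFast(freq))
```

The result has the same (count, word) layout as sortFreqDict, so the printing loop in html-to-freq-2.py would not need to change.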
For those who are, like me, learning the rudiments of webscraping, I want to point out that the first lesson in this awesome tutorial is here:
https://programminghistorian.org/en/lessons/introduction-and-installation
Thanks for any help you can give me.
Joel
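One possible approach to the question above, as a minimal sketch: filter the word list with a regular expression before building the frequency dictionary. The function names removeNumericTokens and removeTokensWithDigits are hypothetical (they are not in the tutorial's obo.py), and which of the two fits depends on whether mixed tokens such as "t17800628" should also go.

```python
import re

def removeNumericTokens(wordlist):
    """Drop tokens made up entirely of digits, e.g. '1780' or '33'."""
    return [w for w in wordlist if not re.fullmatch(r'\d+', w)]

def removeTokensWithDigits(wordlist):
    """Stricter variant: drop any token that contains at least one digit."""
    return [w for w in wordlist if not re.search(r'\d', w)]

# Made-up sample tokens for illustration
sample = ['prisoner', '1780', 't17800628', '33', 'guilty']
print(removeNumericTokens(sample))
print(removeTokensWithDigits(sample))
```

If one of these were added to obo.py, it could be called in html-to-freq-2.py right after obo.removeStopwords and before obo.wordListToFreqDict. For the all-digits case, [w for w in wordlist if not w.isdigit()] would do much the same job without the re module.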
Help me understand Python 3.7.7 basics and the input/output process, how to write pseudocode, and how to write a basic program.
Help me understand chapters 1-13 of my textbook and its program examples.
Here are a few examples; I can always paste more.
def main():
    print("This program converts kilometers to miles.")
    print()
    # float() is safer than eval() for reading a number from the user
    kms = float(input("Enter the distance in kilometers: "))
    miles = kms * 0.62
    print("The distance is", miles, "miles.")

main()

def fibo(n):
    curr, prev = 1, 1
    for i in range(n-2):
        curr, prev = curr+prev, curr
    return curr

def main():
    print("Nth Fibonacci number\n")
    n = int(input("enter n : "))
    print("the fibonacci value is", fibo(n))

main()
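On the pseudocode question, one common way to plan a small program is to write it first as input, process, output, and then translate each line into Python. Here is a sketch based on the kilometers example above; the function name kilometers_to_miles is made up for illustration, and 0.62 is the conversion factor already used in that example.

```python
# Pseudocode (plain-language plan, written before any code):
#   Input:   ask the user for a distance in kilometers
#   Process: miles = kilometers * 0.62
#   Output:  print the distance in miles

def kilometers_to_miles(kms):
    """Process step: convert kilometers to miles."""
    return kms * 0.62

def main():
    kms = float(input("Enter the distance in kilometers: "))  # input
    miles = kilometers_to_miles(kms)                           # process
    print("The distance is", miles, "miles.")                  # output

# Try the process step on its own, without typing anything:
print(kilometers_to_miles(10))
```

Calling main() runs the interactive version. Keeping the calculation in its own function makes it easy to check the math separately from the prompts.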
# single-shot cannonball animation

from math import sqrt, sin, cos, radians, degrees
from graphics import *
from projectile import Projectile
from button import Button

class InputDialog:

    """ A custom window for getting simulation values (angle, velocity,
    and height) from the user."""

    def __init__(self, angle, vel, height):
        """ Build and display the input window """

        self.win = win = GraphWin("Initial Values", 200, 300)
        win.setCoords(0,4.5,4,.5)

        Text(Point(1,1), "Angle").draw(win)
        self.angle = Entry(Point(3,1), 5).draw(win)
        self.angle.setText(str(angle))

        Text(Point(1,2), "Velocity").draw(win)
        self.vel = Entry(Point(3,2), 5).draw(win)
        self.vel.setText(str(vel))

        Text(Point(1,3), "Height").draw(win)
        self.height = Entry(Point(3,3), 5).draw(win)
        self.height.setText(str(height))

        self.fire = Button(win, Point(1,4), 1.25, .5, "Fire!")
        self.fire.activate()

        self.quit = Button(win, Point(3,4), 1.25, .5, "Quit")
        self.quit.activate()

    def getValues(self):
        """ return input values """
        a = float(self.angle.getText())
        v = float(self.vel.getText())
        h = float(self.height.getText())
        return a,v,h

    def interact(self):
        """ wait for user to click Quit or Fire button
        Returns a string indicating which button was clicked
        """
        while True:
            pt = self.win.getMouse()
            if self.quit.clicked(pt):
                return "Quit"
            if self.fire.clicked(pt):
                return "Fire!"

    def close(self):
        """ close the input window """
        self.win.close()

class ShotTracker:

    """ Graphical depiction of a projectile flight using a Circle """

    def __init__(self, win, angle, velocity, height):
        """win is the GraphWin to display the shot, angle, velocity, and
        height are initial projectile parameters.
        """
        self.proj = Projectile(angle, velocity, height)
        self.marker = Circle(Point(0,height), 3)
        self.marker.setFill("red")
        self.marker.setOutline("red")
        self.marker.draw(win)

    def update(self, dt):
        """ Move the shot dt seconds farther along its flight """
        self.proj.update(dt)
        center = self.marker.getCenter()
        dx = self.proj.getX() - center.getX()
        dy = self.proj.getY() - center.getY()
        self.marker.move(dx,dy)

    def getX(self):
        """ return the current x coordinate of the shot's center """
        return self.proj.getX()

    def getY(self):
        """ return the current y coordinate of the shot's center """
        return self.proj.getY()

    def destroy(self):
        """ undraw the shot """
        self.marker.undraw()

def main():

    # create animation window
    win = GraphWin("Projectile Animation", 640, 480, autoflush=False)
    win.setCoords(-10, -10, 210, 155)
    Line(Point(-10,0), Point(210,0)).draw(win)
    for x in range(0, 210, 50):
        Text(Point(x,-5), str(x)).draw(win)
        Line(Point(x,0), Point(x,2)).draw(win)

    angle, vel, height = 45.0, 40.0, 2.0
    inputwin = InputDialog(angle, vel, height)

    # event loop
    while True:
        # interact with the user
        choice = inputwin.interact()

        if choice == "Quit": # loop exit
            break

        # otherwise choice is "Fire!"
        # create a shot and track until it hits ground or leaves window
        angle, vel, height = inputwin.getValues()
        shot = ShotTracker(win, angle, vel, height)
        while 0 <= shot.getY() and -10 < shot.getX() <= 210:
            shot.update(1/50)
            update(50)
        #shot.destroy()

    win.close()

if __name__ == "__main__":
    main()
# c11ex19.py
# Implementation of sets
# Note: Python has a built-in set type, so implementing this is just
#       an interesting programming exercise.

# using dictionaries---efficient, but objects in set must be immutable
class DictSet:

    def __init__(self, elements):
        self.elements = {}
        for e in elements:
            self.elements[e] = True  # actual assigned value irrelevant

    def addElement(self, x):
        self.elements[x] = True

    def deleteElement(self, x):
        if x in self.elements:
            del self.elements[x]

    def member(self, x):
        return x in self.elements

    def intersection(self, set2):
        inter = []
        for x in self.elements:
            if x in set2.elements:
                inter.append(x)
        return DictSet(inter)

    def union(self, set2):
        return DictSet(list(self.elements.keys()) + list(set2.elements.keys()))

    def subtract(self, set2):
        els = []
        for x in self.elements:
            if x not in set2.elements:
                els.append(x)
        return DictSet(els)

    def __str__(self):
        return "".format(list(self.elements.keys()))

# Using lists, less efficient but more general
class ListSet:

    def __init__(self, elements):
        els = []
        for x in elements:
            if x not in els:
                els.append(x)
        self.elements = els

    def addElement(self, x):
        if x not in self.elements:
            self.elements.append(x)

    def deleteElement(self, x):
        try:
            self.elements.remove(x)
        except ValueError:
            # x was not in the set, so there is nothing to delete
            pass

    def member(self, x):
        return x in self.elements

    def intersection(self, set2):
        els = []
        for x in self.elements:
            if x in set2.elements:
                els.append(x)
        return ListSet(els)

    def union(self,set2):
        return ListSet(self.elements + set2.elements)

    def subtract(self, set2):
        els = []
        for x in self.elements:
            if x not in set2.elements:
                els.append(x)
        return ListSet(els)

    def __str__(self):
        return "".format(self.elements)

def test():
    for SetClass in [DictSet, ListSet]:
        all = SetClass([1,2,3,4,5,6,7,8,9,10])
        odds = SetClass([1,3,5,7,9])
        evens = SetClass([2,4,6,8,10])
        print("all", odds.union(evens))
        print("all", evens.union(odds))
        print("empty", odds.intersection(evens))
        print("odds", all.intersection(odds))
        print("odds", all.subtract(evens))

if __name__ == '__main__':
    test()
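For comparison, here is a short sketch of the same checks written with Python's built-in set type, which the note at the top of c11ex19.py mentions. The variable names are chosen here for illustration; the operators |, & and - play the roles of union, intersection and subtract in the classes above.

```python
# Built-in sets, using the same numbers as test() above
all_nums = set(range(1, 11))
odds = {1, 3, 5, 7, 9}
evens = {2, 4, 6, 8, 10}

print("all", odds | evens)        # union
print("empty", odds & evens)      # intersection
print("odds", all_nums & odds)    # intersection
print("odds", all_nums - evens)   # difference (subtract)
print("member?", 3 in odds)       # membership test
```

Seeing the built-in version next to DictSet and ListSet can make it easier to check that each hand-written method does what it should.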
def fib(n):
    print("Computing fib(%d)"%n)
    if n<3:
        return_value = 1
    else:
        return_value = fib(n-1)+fib(n-2)
    print("Leaving fib({0}) returning {1}".format(n,return_value))
    return return_value

def main():
    print()
    print("Let's trace the function that computes Fibonacci numbers")
    print()
    n = int(input("Computing the n-th Fibonacci number. Enter n. "))
    f = fib(n)
    print()
    print("Fib({0}) is {1}".format(n,f))
    print()

main()
How do I understand the graphics.py library?
How does Python's syntax work, and how does the python.org site work, with its various tabs and links?
I also need to fix this program for my Mod 4 review part 2 assignment so that it works correctly:
- [sales-invoice-finder.py] You are working for a computer parts manufacturer that needs a new program to find sales information based on one of two pieces of information
- an invoice identifier (id) or
- a customer's last name (lname)
Your company logs each parts sale in an Excel file saved as a CSV file called `sales_data.csv` (NOTE: if you think no company would ever do this, you are incorrect: it has been done). The first line in that file contains the column headers: textual descriptions of what data is in each column. The columns are invoice id, a customer's first name, last name, part number, quantity, and total.
Your goal is to make this file searchable. Your program should prompt the user for the following inputs, in this order:
- Whether they want to search using an invoice id (input of `id`) or by a customer's last name (input of `lname`). The program should reject any input that is not `id` or `lname`, forcing the user to choose one of those two options.
- The search term, either an `id` value or an `lname` value
After the user enters these inputs, the program should open the data file (note that the user does not input the data file), then search within the chosen column (i.e., either `id` or `lname`, but not both) for the input value. If the program finds a match in any invoice records, then it should print (not save) all recorded invoices that match. Finally, it should print the total number of records found that match the search term.
Here are three examples:
Search by invoice id (id) or customer last name (lname)? firstname
ERROR: You must enter either 'id' for invoice id search or 'lname' for customer last name search
Search by invoice id (id) or customer last name (lname)? lname
Enter your search term: Hutz
87681,Lionel,Hutz,218,1,50.83
34018,Lionel,Hutz,112,3,88.88
34018,Lionel,Hutz,386,3,86.04
34018,Lionel,Hutz,216,1,53.54
4 records found.

Search by invoice id (id) or customer last name (lname)? id
Enter your search term: 93303
93303,Frank,Grimes,392,2,90.74
93303,Frank,Grimes,142,3,73.2
93303,Frank,Grimes,353,1,45.87
3 records found.

Search by invoice id (id) or customer last name (lname)? lname
Enter your search term: Maryville
0 records found.
This is what I have so far:
f = open('sales_data.csv', 'r')
for line in f:
    print(line)
search_type = input('Search by invoice id (id) or customer last name (lname)? ')
search_term = input('Enter your search term: ')
f.close()
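One way the assignment above could be approached, as a rough sketch rather than the expected solution: it uses the standard csv module and assumes the columns appear in the order the assignment lists them (invoice id, first name, last name, part number, quantity, total). The function names, the COLUMN_INDEX table and the header text in the demo data are made up for illustration.

```python
import csv
import io

# Column positions, assuming the order given in the assignment:
# invoice id, first name, last name, part number, quantity, total
COLUMN_INDEX = {"id": 0, "lname": 2}

def ask_search_type(ask=input):
    """Keep prompting until the user enters 'id' or 'lname'."""
    while True:
        choice = ask("Search by invoice id (id) or customer last name (lname)? ")
        if choice in COLUMN_INDEX:
            return choice
        print("ERROR: You must enter either 'id' for invoice id search "
              "or 'lname' for customer last name search")

def find_records(csv_file, search_type, term):
    """Return every data row whose chosen column equals the search term."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header line
    col = COLUMN_INDEX[search_type]
    return [row for row in reader if len(row) > col and row[col] == term]

def main(path="sales_data.csv"):
    """Interactive version: prompt, search the file, print the matches."""
    search_type = ask_search_type()
    term = input("Enter your search term: ")
    with open(path, newline="") as f:
        matches = find_records(f, search_type, term)
    for row in matches:
        print(",".join(row))
    print(len(matches), "records found.")

# Non-interactive demo on a few rows copied from the examples above.
# The header line here is a guess; the real file's header text may differ.
sample = io.StringIO(
    "id,fname,lname,part,quantity,total\n"
    "87681,Lionel,Hutz,218,1,50.83\n"
    "93303,Frank,Grimes,392,2,90.74\n"
    "34018,Lionel,Hutz,112,3,88.88\n"
)
demo = find_records(sample, "lname", "Hutz")
for row in demo:
    print(",".join(row))
print(len(demo), "records found.")
```

Calling main() runs the interactive version against sales_data.csv. Keeping the search in find_records, separate from the prompts, makes it possible to try the matching logic on a small in-memory sample like the one in the demo.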
I'm reviewing chapters 1-13 tonight. If you know of any other sites, or have tips on what I should learn, do, or practice, let me know in the comments below.