

Date: 2024-04-02



COMP 330 Assignment #5
1 Description
In this assignment, you will be implementing regularized logistic regression to classify text documents. The implementation will be in Python, on top of Spark. To handle the large data set that we will be
giving you, it is necessary to use Amazon AWS.
You will be asked to perform three subtasks: (1) data preparation, (2) learning (which will be done via
gradient descent) and (3) evaluation of the learned model.
Note: It is important to complete HW 5 and Lab 5 before you really get going on this assignment. HW
5 will give you an opportunity to try out gradient descent for learning a model, and Lab 5 will give you
some experience with writing efficient NumPy code, both of which will be important for making your A5
experience less challenging!
2 Data
You will be dealing with a data set that consists of around 170,000 text documents and a test/evaluation
data set that consists of 18,700 text documents. All but around 6,000 of these text documents are Wikipedia
pages; the remaining documents are descriptions of Australian court cases and rulings. At the highest level,
your task is to build a classifier that can automatically figure out whether a text document is an Australian
court case.
We have prepared three data sets for your use.
1. The Training Data Set (1.9 GB of text). This is the set you will use to train your logistic regression
model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
2. The Testing Data Set (200 MB of text). This is the set you will use to evaluate your model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
3. The Small Data Set (37.5 MB of text). This is for you to use for training and testing of your model on
a smaller data set:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingData.txt
file before you begin. You’ll see that the contents are sort of a pseudo-XML, where each text document
begins with a <doc id = ... > tag, and ends with </doc>. Each document is contained on a single
line of text.
Note that all of the Australian legal cases begin with something like <doc id = "AU1222" ...>;
that is, the doc id for an Australian legal case always starts with AU. You will be trying to figure out if the
document is an Australian legal case by looking only at the contents of the document.
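As a starting point for data preparation, here is a minimal sketch of how one line of the input might be split into a document ID, a label, and the text. The regular expression and the helper name parse_doc are our assumptions for illustration; check them against the actual contents of the small data file, since the real tags may carry attributes that this pattern does not anticipate.

```python
import re

# Assumed layout: <doc id = "..." ...>text</doc>, one document per line.
DOC_RE = re.compile(r'<doc\s+id\s*=\s*"([^"]+)"[^>]*>(.*?)</doc>', re.DOTALL)


def parse_doc(line):
    """Return (doc_id, label, text), or None if the line does not match."""
    match = DOC_RE.search(line)
    if match is None:
        return None
    doc_id, text = match.group(1), match.group(2)
    # The label comes from the ID only; the classifier itself must not see it.
    label = 1 if doc_id.startswith("AU") else 0
    return doc_id, label, text.strip()


sample = '<doc id = "AU1222" url="x" title="y">The applicant appealed.</doc>'
print(parse_doc(sample))  # ('AU1222', 1, 'The applicant appealed.')
```

In a Spark job, a function like this would typically be applied with a map over the lines of the input file, followed by a filter that drops the None results.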
3 The Tasks
There are three separate tasks that you need to complete to finish the assignment. As usual, it makes
sense to implement these and run them on the small data set before moving to the larger one.
3.1 Task 1
First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words
in the training corpus. This dictionary is essentially an RDD that has the word as the key, and the relative
frequency position of the word as the value. For example, the value is zero for the most frequent word, and
19,999 for the least frequent word in the dictionary.
To get credit for this task, give us the frequency position of the words “applicant”, “and”, “attack”,
“protein”, and “car”. These should be values from 0 to 19,999, or -1 if the word is not in the dictionary,
because it is not in the top 20,000.
Note that accomplishing this will require you to use a variant of your A4 solution. If you do not trust
your A4 solution and would like mine, you can post a private request on Piazza.
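To make the expected output concrete, here is a small, non-distributed sketch of the dictionary logic in plain Python. The tokenization rule (lowercase runs of letters), the alphabetical tie-breaking, and the helper names are assumptions made for illustration; your A4 solution may tokenize differently. In Spark, the same shape is usually expressed with RDD operations such as flatMap, reduceByKey, and top.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a "word" is a lowercase run of letters.
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(docs, size=20000):
    """Map each of the `size` most frequent words to its frequency position."""
    counts = Counter()
    for text in docs:
        counts.update(tokenize(text))
    # Sort by descending count; break ties alphabetically so runs are repeatable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: pos for pos, (word, _) in enumerate(ranked[:size])}


def position(dictionary, word):
    """Frequency position of `word`, or -1 if it is not in the dictionary."""
    return dictionary.get(word, -1)


docs = ["the court and the applicant", "the car and the court", "the protein"]
small_dict = build_dictionary(docs, size=3)
print(position(small_dict, "the"), position(small_dict, "and"), position(small_dict, "car"))
# 0 1 -1
```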
3.2 Task 2
Next, you will convert each of the documents in the training set to a TF-IDF vector. You will then use
a gradient descent algorithm to learn a logistic regression model that can decide whether a document is
describing an Australian court case or not. Your model should use l2 regularization; you can play with
things a bit to determine the parameter controlling the extent of the regularization. We will have enough
data that you might find that the regularization may not be too important (that is, it may be that you get good
results with a very small weight given to the regularization constant).
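For the TF-IDF step, the following NumPy sketch shows one common formulation on a tiny in-memory example. The definitions used here (TF as a word's count divided by the document's total count of dictionary words, and IDF as the log of the number of documents over the number of documents containing the word) are assumptions; use the definitions from A4 if they differ. The function names are hypothetical.

```python
import numpy as np


def term_frequencies(word_positions, dict_size):
    """TF vector for one document, given the dictionary positions of its words."""
    positions = np.asarray(word_positions, dtype=np.int64)
    counts = np.bincount(positions, minlength=dict_size).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts


def tf_idf(tf_matrix):
    """Scale each column of a (documents x words) TF matrix by its IDF."""
    num_docs = tf_matrix.shape[0]
    doc_freq = np.count_nonzero(tf_matrix, axis=0)
    # Guard against division by zero for words that appear in no document.
    idf = np.log(num_docs / np.maximum(doc_freq, 1))
    return tf_matrix * idf


# Two toy documents over a three-word dictionary.
tf = np.vstack([term_frequencies([0, 0, 1], 3), term_frequencies([0, 2], 3)])
print(tf_idf(tf))
```

Note that under this formulation a word that appears in every document gets an IDF of zero, so it contributes nothing to any vector.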
I am going to ask that you not just look up the gradient descent algorithm on the Internet and implement
it. Start with the LLH function from class, and then derive your own gradient descent algorithm. We can
help with this if you get stuck.
At the end of each iteration, compute the LLH of your model. You should run your gradient descent
until the change in LLH across iterations is very small.
Once you have completed this task, you will get credit by (a) writing up your gradient update formula,
and (b) giving us the fifty words with the largest regression coefficients. These are the fifty words that are
most strongly associated with an Australian court case.
3.3 Task 3
Now that you have trained your model, it is time to evaluate it. Here, you will use your model to predict
whether or not each of the testing points corresponds to an Australian court case. To get credit for this task,
you need to compute for us the F1 score obtained by your classifier—we will use the F1 score obtained as
one of the ways in which we grade your Task 3 submission.
Also, I am going to ask you to actually look at the text for three of the false positives that your model
produced (that is, Wikipedia articles that your model thought were Australian court cases). Write a paragraph
describing why you think it is that your model was fooled. Were the bad documents about Australia? The
legal system?
If you don’t have three false positives, just use the ones that you had (if any).
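As a reminder of what is being measured, here is a short plain-Python sketch that computes the F1 score from true and predicted labels, and pulls out the IDs of a few false positives for inspection. Labels are assumed to be 1 for an Australian court case and 0 otherwise; the function names and the toy IDs are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # No true positives: precision and recall are both zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_positive_ids(doc_ids, y_true, y_pred, limit=3):
    """IDs of up to `limit` documents predicted positive but actually negative."""
    hits = [d for d, t, p in zip(doc_ids, y_true, y_pred) if t == 0 and p == 1]
    return hits[:limit]


doc_ids = ["AU1", "AU2", "10", "11", "AU3"]
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
print(f1_score(y_true, y_pred))                     # 2 TP, 1 FP, 1 FN
print(false_positive_ids(doc_ids, y_true, y_pred))  # ['10']
```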
4 Important Considerations
Some notes regarding training and implementation. As you implement and evaluate your gradient descent algorithm, here are a few things to keep in mind.
1. To get good accuracy, you will need to center and normalize your data. That is, transform your data so
that the mean of each dimension is zero, and the standard deviation is one. That is, subtract the mean
vector from each data point, and then divide the result by the vector of standard deviations computed
over the data set.
2. When classifying new data, a data point whose dot product with the set of regression coefs is positive
is a “yes”, a negative is a “no” (see slide 15 in the GLM lecture). You will be trying to maximize the
F1 of your classifier and you can often increase the F1 by choosing a different cutoff between “yes”
and “no” other than zero. Another thing that you can do is to add another dimension whose value is
one in each data point (we discussed this in class). The learning process will then choose a regression
coef for this special dimension that tends to balance the “yes” and “no” nicely at a cutoff of zero.
However, some students in the past have reported that this can increase the training time.
3. Students sometimes face overflow problems, both when computing the LLH and when computing the
gradient update. Some things that you can do to avoid this are: (1) use np.exp(), which seems to
be quite robust, and (2) transform your data so that the standard deviation is smaller than one. If you
have problems with a standard deviation of one, you might try 10^-2 or even 10^-5. You may need to
experiment a bit. Such are the wonderful aspects of implementing data science algorithms in the real
world!
4. If you find that your training takes more than a few hours to run to convergence on the largest data set,
it likely means that you are doing something that is inherently slow that you can speed up by looking
at your code carefully. One thing: there is no problem with first training your model on a small sample
of the large data set (say, 10% of the documents) then using the result as an initialization, and continue
training on the full data set. This can speed up the process of reaching convergence.
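The first three notes above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the required implementation: the function names are hypothetical, the log-likelihood shown is the standard Bernoulli form with an l2 penalty (the form used in class may differ in scaling or sign convention), and the gradient update is deliberately left out since you are asked to derive it yourself. Using np.logaddexp(0, theta) to compute log(1 + exp(theta)) is one additional way to avoid overflow.

```python
import numpy as np


def standardize(x, target_std=1.0):
    """Center each column and scale it to `target_std` (e.g. 1.0 or 1e-2)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # Leave constant columns at zero instead of dividing by 0.
    return (x - mean) / std * target_std, mean, std


def add_bias_column(x):
    """Append a dimension whose value is one in every data point."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


def log_likelihood(x, y, coefs, reg=0.0):
    """Assumed LLH: sum_i [y_i * theta_i - log(1 + exp(theta_i))] - reg * ||coefs||^2."""
    theta = x @ coefs
    # logaddexp(0, theta) equals log(1 + exp(theta)) but does not overflow.
    llh = np.sum(y * theta - np.logaddexp(0.0, theta))
    return llh - reg * np.sum(coefs ** 2)


x = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
y = np.array([0.0, 0.0, 1.0])
z, train_mean, train_std = standardize(x)
zb = add_bias_column(z)
print(log_likelihood(zb, y, np.zeros(3)))  # equals -3 * log(2) at all-zero coefs
```

When you evaluate on the test set, reuse the mean and standard deviation computed on the training set rather than recomputing them on the test data.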
Big data, small data, and grading. The first two tasks are worth three points each, and the last is worth four points. Since it
can be challenging to run everything on a large data set, we’ll offer you a small data option. If you train your
model on TestingDataOneLinePerDoc.txt, and then test it on SmallTrainingDataOneLinePerDoc.txt,
we’ll take off 0.5 points on Task 2 and 0.5 points on Task 3. This means you can still get an A, and
you don’t have to deal with the big data set. For the possibility of getting full credit, you can train
your model on the quite large TrainingDataOneLinePerDoc.txt data set, and then test it
on TestingDataOneLinePerDoc.txt.
4.1 Machines to Use
If you decide to try for full credit on the big data set, you will need to run your Spark jobs on three to five
machines as workers, each having around 8 cores. If you are not trying for the full credit, you can likely
get away with running on a smaller cluster. Remember, the costs WILL ADD UP QUICKLY IF YOU
FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and shut down your cluster as soon as
you are done working. You can always create a new one easily when you begin your work again.
4.2 Turnin
Create a single document that has results for all three tasks. Make sure to be very clear whether you
tried the big data or small data option. Turn in this document as well as all of your code. Please zip up all
of your code and your document (use .gz or .zip only, please!), or else attach each piece of code as well as
your document to your submission individually. Do NOT turn in anything other than your Python code and