Complete guide on how to use AWS Lambda & DynamoDB for web scraping

SangGyu An
Published in CodeX · 15 min read · Apr 7, 2022

Data collection is a necessary step in data science. You can either download pre-collected data or collect it on your own, and one way to collect it yourself is web scraping. Oftentimes, this requires you to run your code regularly, and you have two options for that. First, you can sit down every hour and run your code yourself. Second, you can upload your code to a cloud platform and run it there. The first choice is quicker and easier to implement, but the second is more reliable. So I suggest you try Amazon Web Services (AWS), one of those cloud platforms.

Photo by Christian Wiediger on Unsplash

What is AWS?

AWS is a cloud computing platform supported by Amazon that includes a mixture of infrastructure as a service (IaaS), platform as a service (PaaS), and packaged software as a service (SaaS) offerings [5]. In simple terms, AWS lets you run complex programs or services on Amazon’s servers so that you don’t need a cutting-edge computer. There are lots of services to choose from. Among them, today we need Lambda, EventBridge, DynamoDB, and Identity & Access Management (IAM) to scrape the web and store the data.

What are AWS Lambda, EventBridge, DynamoDB, and IAM?

Lambda

In short, AWS Lambda lets you write or upload code and its dependencies so that the code runs on Amazon’s infrastructure instead of on your computer. The benefit is that you can run any code regardless of your computer’s specifications, and you don’t need to keep your computer turned on. You can also integrate it with other AWS services to do more advanced things.

By default, a new function’s timeout is only a few seconds, which usually isn’t enough for web scraping. Fortunately, Lambda lets you extend it to a maximum of 15 minutes, so later, when you write your code, be careful not to let it run over this limit.

AWS Lambda homepage

EventBridge

The official documentation says that AWS EventBridge lets you connect your applications with data from a variety of sources. In other words, it detects events from other services in real time and triggers an action when an event occurs. With this service, you can easily set up connections between services or monitor your process.

AWS EventBridge homepage

DynamoDB

DynamoDB is a NoSQL database service offered by AWS. Because AWS takes care of configuring the database, you can easily create tables, store data, and query rows. One of the benefits is scalability, since the data is stored in the cloud.

AWS DynamoDB homepage

IAM

However, these services can’t access each other by default for security reasons. You first need to create roles in IAM and assign them to each service so that it can work with the others.

AWS IAM homepage

How to use these services for web scraping

Before we get into the steps, note that this post was written in April 2022, and the AWS website can change at any time. So the process below might not be accurate if you read this post much later. Also, I will use Python 3.9 for the web scraping code in the steps below.

1. Build web scraping code

Before you start using AWS, it is a good idea to build a working web scraping script elsewhere. This can be Google Colab or a local IDE, depending on your situation.

The following code is my version of web scraping, which collects data from Daangn, a Korean website where people sell used items. It uses sqlite3 to store the data, but that part will later be replaced with DynamoDB.

Web scraping code of Daangn website
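The full script isn’t reproduced here, but a stripped-down sketch of its shape looks roughly like this; the URL, tag names, and table layout are placeholders rather than the real Daangn markup:

```python
# A minimal sketch of the step-1 script: fetch a page, parse it with bs4,
# and store the results locally with sqlite3 (replaced by DynamoDB later).
import sqlite3
from urllib.request import urlopen

from bs4 import BeautifulSoup


def scrape():
    # Placeholder URL and tag names -- adjust to the pages you actually scrape.
    html = urlopen("https://www.daangn.com/hot_articles").read()
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for card in soup.find_all("article"):  # hypothetical tag
        rows.append((card.get_text(strip=True),))

    conn = sqlite3.connect("articles.db")
    conn.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT)")
    conn.executemany("INSERT INTO articles VALUES (?)", rows)
    conn.commit()
    conn.close()


if __name__ == "__main__":
    scrape()
```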

2. Create a function in Lambda

To upload your code to the AWS server, you first need to create a function in AWS Lambda.

  1. Click “Create function” from Lambda Dashboard

2. Write your function name and choose the runtime of your web scraping code. Then, click “Create function”.

3. Modify your code

Depending on how you code, the structure of your script will differ from mine. But AWS Lambda requires you to follow certain rules, so you should adjust your code accordingly. One of those rules is that Lambda runs a single entry-point function, named lambda_handler() by default. So you should either call all the necessary functions inside lambda_handler()

Screenshot after adding the scrape() inside the lambda_handler()

or change its name by editing the Handler in Runtime settings. The “lambda_function” in front of the function name indicates the Python file name.

How it looks after changing the handler.
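Either way, the handler ends up being a thin wrapper around your scraping logic. A minimal sketch, assuming the scrape() function from step 1 lives in lambda_function.py:

```python
# lambda_function.py -- Lambda calls lambda_handler(event, context) by default.

def scrape():
    # ... the web scraping logic from step 1 goes here ...
    pass


def lambda_handler(event, context):
    scrape()
    return {"statusCode": 200, "body": "scraping finished"}
```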

4. Add dependencies

Another thing to note is that you can’t import packages that aren’t included in the Lambda runtime by default. So you should upload the package files along with your code to make those import statements work. In my case, to use the bs4 package,

  1. Download the bs4 zip file from the Python Package Index website
PyPI website

2. Unzip the file & Find the folder that contains the __init__.py file. Generally, the folder named after the package is the one you need.

3. Zip dependency folders together with your code file

4. Upload it to Lambda through “Upload from” → “.zip file”

When everything is completed, you can find the dependency files on the left side of your code.

Note that I removed the sqlite3 import statement and the associated code since I’m planning to use DynamoDB.
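If you prefer to build the zip file with a script instead of your file manager, here is a small sketch. It assumes your handler file and the unzipped bs4 folder (plus any other dependency folders your code imports) sit together in a build/ folder:

```python
# package.py -- run locally, not on Lambda.
# Expected layout:
#   build/
#     lambda_function.py   <- your handler code
#     bs4/                 <- the unzipped bs4 package (add other dependencies the same way)
import shutil

# Creates deployment.zip next to this script; upload it via "Upload from" -> ".zip file".
shutil.make_archive("deployment", "zip", root_dir="build")
```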

5. Test your function

Even though you confirmed that your code works in step 1, things have changed a lot since you last checked. So it’s best to check again, and you can do that with the following steps.

  1. Select your function

2. Click the Test tab

3. Write event name & Choose hello-world template. Then, save it.

4. Test your event by clicking “Test” from the Test tab

5. Images below indicate how your event went

6. You can also test your code from the Code tab → Test button

6. Create table in DynamoDB

Now that I have removed the lines that stored the collected data, I need a new place to put it. The first thing to do is to create a table in DynamoDB.

  1. From DynamoDB Dashboard, click Create table

2. Write table name, partition key, sort key & Choose data type for the partition and sort key

If you are new to DynamoDB, you may wonder what the partition and sort keys are. Briefly, the partition key is fed into a hash function to determine which partition stores the item, while the sort key orders items within each partition. You can use these keys in two ways.

First, you can use only the partition key. In this case, the partition key must be unique across the table so that each item is distinguishable. It then works like a primary key in a SQL table, which is used to access rows.

Second, you can use both. In this case, the combination of the partition and sort key values must be unique. Thus, several items can share the same partition key value as long as their sort key values keep each combination unique.

For instance, if you store a list of books in DynamoDB with author and book title as keys, you can set the author as the partition key and the title as the sort key. This allows several books by the same author, but the titles under one author must be unique, so in the end the author-title combination is unique for every row.

In my case, my code finds the most recent ArticleNum value from rows that contain 1 in the Time column. Therefore, I set Time as the partition key and ArticleNum as the sort key.

3. Select Default settings under Settings & Create table
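The console steps above are all you need, but for reference, the same key schema could be defined programmatically with boto3, roughly like the sketch below. The table name is a placeholder, I assume both keys are stored as numbers, and I use on-demand billing, which may differ from the console’s Default settings:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="ap-northeast-2")

# Time is the partition (HASH) key, ArticleNum the sort (RANGE) key, both numeric.
table = dynamodb.create_table(
    TableName="UsedArticles",  # placeholder name
    KeySchema=[
        {"AttributeName": "Time", "KeyType": "HASH"},
        {"AttributeName": "ArticleNum", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "Time", "AttributeType": "N"},
        {"AttributeName": "ArticleNum", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()
```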

6-1. Add an item to the table

This isn’t a necessary step, but I need it because of the way my code works. As briefly mentioned above, my code finds the most recent post by looking for the highest ArticleNum among rows whose Time is “1”. So my code can’t start unless there is at least one row to serve as a starting point, which is why I needed to add a dummy row.

  1. Click Explore table items from the table you created

2. Click Create item

3. Enter attribute name, value, and type by clicking Add new attribute

381392502 was one of the most recent post numbers when I wrote this article

Then, you can see that the item is added to the table.
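If you would rather add the dummy row from code instead of the console form, here is a minimal boto3 sketch (the table name is a placeholder):

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="ap-northeast-2")
table = dynamodb.Table("UsedArticles")  # placeholder table name

# The dummy starting row described above: Time = 1, ArticleNum = a recent post number.
table.put_item(Item={"Time": 1, "ArticleNum": 381392502})
```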

7. Connecting DynamoDB table with Lambda function

Currently, you have a function that scrapes the web and a table that is ready to store data. What you don’t have yet is a connection between the two, and you can create it with boto3, which lets you manage AWS services from Python. If you want to take a deep look into its documentation, follow the link [1]. But if you just want to know how to connect the table with Lambda, see the code below.
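A minimal sketch of that connection, matching the description below (‘table_name’ is a placeholder for your table’s name):

```python
import boto3

# Connect to DynamoDB in the region where the table lives, then open the table by name.
dynamodb = boto3.resource("dynamodb", region_name="ap-northeast-2")
table = dynamodb.Table("table_name")
```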

In the code above, the resource() call connects to the DynamoDB service in the ‘ap-northeast-2’ region, and the Table() call opens the ‘table_name’ table through that connection.

In my case, I created my table on the Seoul servers, which is why I pass ‘ap-northeast-2’ as the region_name argument. When you apply this in your own code, you should find your region in your table’s general information and change it accordingly.

Where to find the region of your table

7-1. Using query on DynamoDB table from Lambda function

Two operations you can perform on a DynamoDB table are Query and Scan. While they seem to perform similar tasks, the difference between them is vital.

While Query looks up values directly in the partition named in the condition, Scan reads every single row of the table to find rows that match a condition. Therefore, Query is generally much faster than Scan, so it’s recommended to use Query unless you don’t know the table’s partition and sort key. You can find a more detailed explanation in the Dynobase article written by Rafal Wilinski [6]. To see briefly how to query with Python, see the code below.

Need to import Key from boto3.dynamodb.conditions to query
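A sketch along those lines, assuming table is the object created in step 7:

```python
from boto3.dynamodb.conditions import Key

# Return every row whose partition key Time equals 1, highest ArticleNum first.
response = table.query(
    KeyConditionExpression=Key("Time").eq(1),
    ScanIndexForward=False,
)
latest_article_num = response["Items"][0]["ArticleNum"]
```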

As you can see, the code is quite straightforward. The query returns every row that has the value 1 in the Time column, which is the partition key. I use eq in my case, but if you want the condition to cover a range of values, you can use lt, lte, gt, gte, begins_with, or between instead of eq; lt, lte, gt, and gte stand for less than, less than or equal to, greater than, and greater than or equal to. Then, I set ScanIndexForward to False to sort the returned rows in descending order. The default value is True, which sorts them in ascending order.

In the last line, [“Items”] gives you the list of rows returned by the query. Then, from each row, you can use column names to access its values. In my case, I only retrieve the value from the first item because I only need the highest ArticleNum.

For a more detailed explanation of the functions you can use, look into the Boto3 documentation.

7-2. Granting access

Even though you wrote every line of code correctly, it won’t work unless you give your Lambda function access to the table.

  1. From IAM’s Policies tab, click Create Policy

2. From the first page, select JSON, copy and paste the following code, and modify the Resource parameter based on the General information from your table

Code to paste into the policy. 123456789012 stands for the account ID; you should replace it and REGION_NAME with your own values. A sketch of such a policy appears after these steps.

3. Skip Add tags

4. Write policy name and description & Click Create policy

5. From Roles tab under IAM, click Create role

6. Select AWS service and Lambda

7. Search and click the checkbox of the policy you created in step 4

8. Type role name and create the role

9. From the function you created for web scraping, click Configuration → Permissions → Edit

10. Extend the Timeout to 15 minutes & Choose the role you created in step 8

11. Then, when you run your code, you can confirm that the connection is established by checking the value retrieved from the table
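For reference, the policy document mentioned in step 2 might look roughly like the sketch below. The action list depends on what your code actually does, and TABLE_NAME, REGION_NAME, and the account ID are placeholders you must replace:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowLambdaDynamoDBAccess",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:REGION_NAME:123456789012:table/TABLE_NAME"
    }
  ]
}
```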

7-3. Confirm your web scraping works

At last, now is the time to check the whole process: from scraping data from web pages to storing it in DynamoDB.

When it works as intended, you can see rows are added to the table.
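As a rough sketch of how scraped rows end up in the table (Title and Price are hypothetical attributes added for illustration; only Time and ArticleNum are the actual keys):

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="ap-northeast-2")
table = dynamodb.Table("UsedArticles")  # placeholder table name

# Hypothetical scraped posts; in practice this list comes from the scraping step.
scraped_posts = [
    {"article_num": 381392503, "title": "Used chair", "price": 10000},
]

for post in scraped_posts:
    table.put_item(
        Item={
            "Time": 1,                          # partition key
            "ArticleNum": post["article_num"],  # sort key
            "Title": post["title"],
            "Price": post["price"],
        }
    )
```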

8. Scheduling Lambda function from EventBridge

Lastly, you need to schedule your function so that it runs automatically.

  1. From EventBridge, click Create rule

2. Type rule name and description & Choose Schedule under Rule type

3. Define your schedule under Schedule pattern

4. Select AWS service under Target types & Choose Lambda function and your web scraping function under Select a target

5. Skip Step 4 and create your rule
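The console is enough here, but for reference, the same schedule could also be created with boto3, roughly as sketched below; the rule name and function ARN are placeholders. Note that the console grants EventBridge permission to invoke your function automatically, while with the API you would also have to add that permission yourself:

```python
import boto3

events = boto3.client("events", region_name="ap-northeast-2")

# "rate(1 hour)" runs the rule hourly; a cron expression such as
# "cron(0 * * * ? *)" (top of every hour, UTC) also works.
events.put_rule(
    Name="hourly-scraping",  # placeholder rule name
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

# Point the rule at the Lambda function (replace the ARN with your function's ARN).
events.put_targets(
    Rule="hourly-scraping",
    Targets=[
        {
            "Id": "scraper",
            "Arn": "arn:aws:lambda:ap-northeast-2:123456789012:function:my-scraper",
        }
    ],
)
```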

9. Check whether scheduling works or not

And of course, after you add a feature, you need to check whether it works or not. You can check this from CloudWatch.

  1. Go to Log groups from CloudWatch

2. Select the function you made in Lambda

3. Find the log stream associated with the EventBridge trigger. If you followed the steps, it will be the latest one.

4. View Timestamp and Message in Log events to confirm whether EventBridge works successfully or not

To check whether EventBridge works, I scheduled the function to run every minute

Exporting data from DynamoDB

Once you have collected tons of data, it’s time to download it for examination. One way to do this is through S3.

  1. From S3 Buckets, Click Create bucket

2. Write your bucket name and select the region you are in

3. Choose ACLs disabled and block all public access so that no one else can use the bucket. Then, click the Create bucket button at the bottom.

4. From Data Pipeline service, click Create new pipeline

5. Type the pipeline’s name, choose Export DynamoDB table to S3 as the Source, and fill in the Parameters. For the output folder, select the bucket you created in step 3 and type ‘/exports’ at the end to create a folder.

6. Under Pipeline Configuration, select the same bucket for logs and type ‘/logs’ at the end. Then, under Tags, type ‘dynamodbdatapipeline’ under Key and ‘true’ under Value. Once you are done with the settings, click Activate. This will take a while.

7. Now, all you need to do is export. Go to the table you want to extract, open the Exports and streams tab, and click Export to S3.

8. For the Destination S3 bucket, select the bucket you created

9. Once the Status turns to Completed, click the link under Destination S3 bucket

10. Click AWSDynamoDB/ → the most recent folder → data/ folder

11. Select the file and click Download

12. When you unzip the file, you can check that the data is in JSON format.

How the output data might look
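Each line of the exported file is one item written in DynamoDB’s typed JSON. A small sketch of how you might read it with Python, assuming my Time/ArticleNum schema (the file name is a placeholder):

```python
import json

# Each line looks roughly like:
# {"Item": {"Time": {"N": "1"}, "ArticleNum": {"N": "381392502"}}}
with open("exported-data.json") as f:  # placeholder file name
    for line in f:
        item = json.loads(line)["Item"]
        print(item["Time"]["N"], item["ArticleNum"]["N"])
```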

Once you have followed all the steps, you now know how to web scrape with a cloud service and download the collected data. But remember that you are paying Amazon to use these services. The more resources you use from AWS, the more you will pay at the end of the month. So if you have a limited budget, I highly recommend checking the AWS billing page regularly.

Reference

[1] Boto3 Documentation. https://boto3.amazonaws.com/v1/documentation/api/latest/index.html.

[2] Crosett, Lex. “How to Determine If Amazon DynamoDB Is Appropriate for Your Needs, and Then Plan Your Migration.” AWS Database Blog, 23 Jan. 2019, https://aws.amazon.com/ko/blogs/database/how-to-determine-if-amazon-dynamodb-is-appropriate-for-your-needs-and-then-plan-your-migration/.

[3] MongoDB. “Advantages of NoSQL.” MongoDB, https://www.mongodb.com/nosql-explained/advantages.

[4] Robinson, Andrew. “How to Create an AWS IAM Policy to Grant AWS Lambda Access to an Amazon DynamoDB Table.” AWS Security Blog, 23 Jan. 2018, https://aws.amazon.com/ko/blogs/security/how-to-create-an-aws-iam-policy-to-grant-aws-lambda-access-to-an-amazon-dynamodb-table/.

[5] Taylor, David. “What Is AWS? Amazon Cloud (Web) Services Tutorial.” Guru99, 12 Feb. 2022, https://www.guru99.com/what-is-aws.html.

[6] Wilinski, Rafal. “DynamoDB Scan vs Query — Everything You Need to Know.” Dynobase, 15 May 2020, https://dynobase.dev/dynamodb-scan-vs-query/.
