Leaking information when serving static files

Post by Lưu Đại on 07-12-2023
Recently a friend of mine searched for his phone number on Google, and the results included a PDF file. It was a student list containing the information of everyone who graduated in 2020: about 1,600 records, each with a phone number, an email address, and a social ID number.

Reason:
Static files are usually stored on a separate server and served without any authorization check.
Google's crawler is very good: it indexes even XLS and PDF files.
The files are not password-protected.

How to fix:
For important files like this, I think we should serve them through an API instead. The server pulls the file from storage and streams it to the client, so clients can never access the files directly on the file server.
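A minimal sketch of that idea, using only the standard library. The names here (`is_authorized`, `serve_protected_file`, the `alumni-admin` role, the `students_2020.pdf` file) are all hypothetical stand-ins for whatever auth and storage the real app has:

```python
import os
import tempfile

STORAGE_DIR = tempfile.mkdtemp()  # stands in for the private file storage

def is_authorized(user: str) -> bool:
    # Placeholder permission check; a real app would consult its
    # session / role system here.
    return user == "alumni-admin"

def serve_protected_file(user: str, filename: str) -> bytes:
    """Check permissions first, then pull the file from storage."""
    if not is_authorized(user):
        raise PermissionError("not allowed to download this file")
    # Resolve the path and make sure it stays inside the storage
    # directory, so "../../etc/passwd"-style names are rejected.
    path = os.path.realpath(os.path.join(STORAGE_DIR, filename))
    if not path.startswith(os.path.realpath(STORAGE_DIR) + os.sep):
        raise PermissionError("invalid path")
    with open(path, "rb") as f:
        return f.read()

# Demo: put a sensitive file in storage, then try both users.
with open(os.path.join(STORAGE_DIR, "students_2020.pdf"), "wb") as f:
    f.write(b"%PDF- sensitive student list")

print(serve_protected_file("alumni-admin", "students_2020.pdf")[:5])
try:
    serve_protected_file("anonymous", "students_2020.pdf")
except PermissionError:
    print("denied")
```

The point is that the authorization check and the file read live in the same code path, so there is no URL that bypasses the check.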
Add a robots.txt rule for pages that link to important files so Google doesn't crawl them (note that robots.txt only asks well-behaved crawlers to stay away; it does not block access). Reference
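For example, a robots.txt at the site root could look like this (the `/static/documents/` path is just an assumed location for the files):

```
# robots.txt: ask all crawlers to skip the documents directory
User-agent: *
Disallow: /static/documents/
```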
Create a temporary URL for the static file (like an AWS S3 presigned URL) with a time to live; once the TTL is exceeded, the old link refuses to serve the file.  Reference
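The signature scheme behind such links can be sketched with an HMAC over the path plus an expiry timestamp. This is a simplified illustration of the technique, not S3's actual algorithm; the secret key and paths are made up:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # hypothetical signing key, never sent to clients

def make_temporary_url(path: str, ttl_seconds: int) -> str:
    """Build a link that carries its own expiry time and a signature over it."""
    expires = int(time.time()) + ttl_seconds
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'signature': sig})}"

def check_temporary_url(path: str, expires: int, signature: str, now=None) -> bool:
    """Refuse expired links, then verify the signature was made with our key."""
    if (now if now is not None else time.time()) > expires:
        return False  # TTL exceeded: the old link no longer serves the file
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# Demo: a 60-second link is accepted now but refused after it expires.
expires = int(time.time()) + 60
sig = hmac.new(SECRET, f"/files/students_2020.pdf:{expires}".encode(),
               hashlib.sha256).hexdigest()
print(check_temporary_url("/files/students_2020.pdf", expires, sig))
print(check_temporary_url("/files/students_2020.pdf", expires, sig, now=expires + 1))
```

Because the expiry is inside the signed message, a client cannot extend a link's lifetime by editing the `expires` parameter.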