Listing Files Within A Bucket Folder – Python

Here’s a short Python example that iterates through the contents of a folder (thisisafolder) within Google Cloud Storage (GCS). Each filename can be accessed through blobi.name; in the code sample below, we print it out and test whether it ends with .json.

Remember that folders don’t actually exist on GCS. A folder-like structure is created by prefixing each filename with the folder name and a forward slash (/), so a file "inside" the folder is simply an object whose name starts with thisisafolder/.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("example-bucket-name")
    # the trailing slash keeps the prefix from also matching e.g. "thisisafolder2/"
    blob_iterator = bucket.list_blobs(prefix="thisisafolder/", client=client)
    # iterate through and print out blob filenames
    for blobi in blob_iterator:
        print(blobi.name)
        if blobi.name.endswith(".json"):
            # do something with blob that ends with ".json"
            pass
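
Because a GCS "folder" is only a name prefix, the same filtering can be tried out locally on a plain list of strings, with no bucket or credentials. The sketch below is an illustration only: the helper list_folder and the object names are hypothetical, not part of the GCS client library.

```python
def list_folder(object_names, folder):
    """Return the names that sit under a simulated folder (a name prefix)."""
    # Add the trailing slash so "thisisafolder" does not also match
    # a sibling such as "thisisafolder2/...".
    prefix = folder.rstrip("/") + "/"
    return [name for name in object_names if name.startswith(prefix)]


# Hypothetical object names, stored flat in a single bucket.
object_names = [
    "thisisafolder/a.json",
    "thisisafolder/b.txt",
    "thisisafolder2/c.json",
    "top-level.json",
]

in_folder = list_folder(object_names, "thisisafolder")
json_files = [name for name in in_folder if name.endswith(".json")]
print(in_folder)
print(json_files)
```

Only the two names that start with thisisafolder/ are kept, and of those only the one ending in .json passes the second filter.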

Finding Interesting Files – The Filetype: Operator

Sometimes, a researcher needs to find something other than a web page. News releases and raw data are often published as PDF files. Microsoft PowerPoint files (.PPTX) are often used to outline new company initiatives. Microsoft Word files (.DOCX) are shared while text is being edited, approved, or discussed.

To find these files, the filetype: operator (or its alias, the ext: operator) can be used. For example, if I need to find official releases of employment data, either of the following searches would work:

employment data filetype:pdf
employment data ext:pdf
Searching for employment data.

As the red boxes in the screenshot show, all the results are PDF files, as the search query asked for.
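
If you assemble such searches in a script, the pattern can be sketched as a small helper. This is a minimal illustration: the function build_filetype_query and its parameters are hypothetical, and it only builds the query string without contacting any search engine.

```python
def build_filetype_query(terms, extension, operator="filetype"):
    """Compose a search query restricted to one file extension."""
    if operator not in ("filetype", "ext"):
        raise ValueError("operator must be 'filetype' or 'ext'")
    # Drop a leading dot and lowercase, so ".PDF" and "pdf" give the same query.
    extension = extension.lstrip(".").lower()
    return f"{terms.strip()} {operator}:{extension}"


query_pdf = build_filetype_query("employment data", "pdf")
query_ext = build_filetype_query("employment data", ".PDF", operator="ext")
print(query_pdf)
print(query_ext)
```

The two calls reproduce the example searches shown earlier, one with filetype: and one with its ext: alias.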