A representation is an alternative asset for a file stored in Box. These assets can be PDFs, thumbnails, or text extractions.
Representations are automatically generated for the supported file types, either when uploading to Box or when requesting the asset.
Think of file representations as document avatars. Representations go well beyond thumbnails: they let you access the content of a file without downloading it, or get a PDF version of a document even if the document is not a PDF.
This feature has become more relevant with the rise of AI and LLMs, as it allows you to extract the content of a file and use it for other purposes, for example sending it to OpenAI.
Not all representations are available for all file types. For example, you can't get a text representation of an image file.
Create a files_representations_init.py file on the root of the project and execute the following code:
"""create sample content to box"""
import logging
from utils.box_client_oauth import ConfigOAuth, get_client_oauth
from workshops.file_representations.create_samples import create_samples
logging.basicConfig(level=logging.INFO)
logging.getLogger("box_sdk_gen").setLevel(logging.CRITICAL)
conf = ConfigOAuth()
def main():
    client = get_client_oauth(conf)
    create_samples(client)

if __name__ == "__main__":
    main()
Result:
INFO:root:Folder workshops with id: 223095001439
INFO:root:Folder file_representations with id: 223939315135
INFO:root: Uploaded Single Page.docx (1294096878155) 11723 bytes
INFO:root: Uploaded JS-Small.js (1294098434302) 3249 bytes
INFO:root: Uploaded HTML.html (1294094879490) 2087 bytes
INFO:root: Uploaded Document (PDF).pdf (1294102659923) 792687 bytes
INFO:root: Uploaded Audio.mp3 (1294103505129) 2772151 bytes
INFO:root: Uploaded Preview SDK Sample Excel.xlsx (1294097951585) 83418 bytes
INFO:root: Uploaded JSON.json (1294102660561) 583 bytes
INFO:root: Uploaded ZIP.zip (1294105019347) 41687 bytes
INFO:root: Uploaded Document (Powerpoint).pptx (1294096083753) 57947 bytes
Next, create a files_representations.py file on the root of the project that you will use to write your code.
Create a global constant named DEMO_FOLDER and make it equal to the id of the file_representations folder, in my case 223939315135.
Create a global constant for each file, using the file ids that you got in the previous step. In my case:
DEMO_FOLDER = 223939315135
FILE_DOCX = 1294096878155
FILE_JS = 1294098434302
FILE_HTML = 1294094879490
FILE_PDF = 1294102659923
FILE_MP3 = 1294103505129
FILE_XLSX = 1294097951585
FILE_JSON = 1294102660561
FILE_ZIP = 1294105019347
FILE_PPTX = 1294096083753
"""Box File representations"""
import logging
import json
import requests
import shutil
from typing import List
from box_sdk_gen.client import BoxClient as Client
from box_sdk_gen.schemas import (
    File,
    FileMini,
    Folder,
    FileFullRepresentationsEntriesStatusStateField,
    FileFullRepresentationsEntriesField,
)
from box_sdk_gen.managers.files import GetFileThumbnailByIdExtension
from utils.box_client_oauth import ConfigOAuth, get_client_oauth
logging.basicConfig(level=logging.INFO)
logging.getLogger("box_sdk_gen").setLevel(logging.CRITICAL)
DEMO_FOLDER = 223939315135
FILE_DOCX = 1294096878155
FILE_JS = 1294098434302
FILE_HTML = 1294094879490
FILE_PDF = 1294102659923
FILE_MP3 = 1294103505129
FILE_XLSX = 1294097951585
FILE_JSON = 1294102660561
FILE_ZIP = 1294105019347
FILE_PPTX = 1294096083753
def main():
    """Simple script to demonstrate how to use the Box SDK"""
    conf = ConfigOAuth()
    client = get_client_oauth(conf)
    user = client.users.get_user_me()
    print(f"\nHello, I'm {user.name} ({user.login}) [{user.id}]")

if __name__ == "__main__":
    main()
Let's start by creating a couple of methods that list and print all representations for a file object:
def obj_dict(obj):
    return obj.__dict__

def file_representations_print(
    file_name: str, representations: List[FileFullRepresentationsEntriesField]
):
    json_str = json.dumps(representations, indent=4, default=obj_dict)
    print(f"\nFile {file_name} has {len(representations)} representations:\n")
    print(json_str)

def file_representations(
    client: Client, file: FileMini, rep_hints: str = None
) -> List[FileFullRepresentationsEntriesField]:
    """Get file representations"""
    file = client.files.get_file_by_id(
        file.id, fields=["name", "representations"], x_rep_hints=rep_hints
    )
    return file.representations.entries
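The default=obj_dict argument is what lets json.dumps print objects it does not know how to serialize: whenever it meets one, it calls obj_dict and encodes the returned dictionary instead, nested objects included. A minimal sketch of that mechanism, using two stand-in classes (FakeStatus and FakeRepresentation are hypothetical and only mimic the shape of the SDK objects):

```python
import json


class FakeStatus:
    """Hypothetical stand-in for the SDK status object."""

    def __init__(self, state: str):
        self.state = state


class FakeRepresentation:
    """Hypothetical stand-in for one representation entry."""

    def __init__(self, representation: str, state: str):
        self.representation = representation
        self.status = FakeStatus(state)


def obj_dict(obj):
    # json.dumps calls this for every object it cannot serialize itself
    return obj.__dict__


entries = [FakeRepresentation("jpg", "success")]
json_str = json.dumps(entries, indent=4, default=obj_dict)
print(json_str)
```

The nested status object goes through obj_dict as well, which is why the printed JSON contains a status object with its own state field.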
Then use it in your main method with the FILE_DOCX:
def main():
    """Simple script to demonstrate how to use the Box SDK"""
    ...
    # make sure the file exists
    file_docx = client.files.get_file_by_id(FILE_DOCX)
    # List all representations for a file
    file_docx_representations = file_representations(client, file_docx)
    file_representations_print(file_docx.name, file_docx_representations)
Resulting in:
Hello, I'm Free Dev 001 [25428698627]
File Single Page.docx has 9 representations:
...
That is quite a lot of information. Let's look at this entry, which represents a file thumbnail:
{
    "representation": "jpg",
    "properties": {
        "dimensions": "32x32",
        "paged": "false",
        "thumb": "true"
    },
    "info": {
        "url": "https://api.box.com/2.0/internal_files/1294096878155/versions/1415005971755/representations/jpg_thumb_32x32"
    }
},
In order to get a specific representation, you need to use the representation hints parameter on the method.
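A hint is a bracketed string such as [jpg?dimensions=320x320] or [pdf]. As a rough sketch, assuming that several hints are simply concatenated and that several parameters within one hint are joined with &, a small helper could assemble the string. The name build_rep_hints and its input shape are hypothetical and not part of the SDK:

```python
from typing import Mapping, Optional, Sequence, Tuple

# (representation type, optional parameters): a hypothetical input shape
Hint = Tuple[str, Optional[Mapping[str, str]]]


def build_rep_hints(hints: Sequence[Hint]) -> str:
    """Join hints into a string such as "[jpg?dimensions=320x320][pdf]"."""
    parts = []
    for rep_type, params in hints:
        if params:
            # Assumption: multiple parameters are separated by "&"
            query = "&".join(f"{key}={value}" for key, value in params.items())
            parts.append(f"[{rep_type}?{query}]")
        else:
            parts.append(f"[{rep_type}]")
    return "".join(parts)


single = build_rep_hints([("jpg", {"dimensions": "320x320"})])
several = build_rep_hints([("pdf", None), ("extracted_text", None)])
print(single)
print(several)
```

The resulting string is what would be passed as the rep_hints argument of file_representations.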
For example, to get the jpg 320x320 representation of the FILE_DOCX:
def main():
    ...
    # Get a specific representation
    file_docx_representations_png = file_representations(client, file_docx, "[jpg?dimensions=320x320]")
    file_representations_print(file_docx.name, file_docx_representations_png)
Resulting in:
[
    {
        "content": {
            "url_template": "https://public.boxcloud.com/api/2.0/internal_files/1294096878155/versions/1478711934034/representations/jpg_320x320/content/{+asset_path}"
        },
        "info": {
            "url": "https://api.box.com/2.0/internal_files/1294096878155/versions/1478711934034/representations/jpg_320x320"
        },
        "properties": {
            "dimensions": "320x320",
            "paged": "false",
            "thumb": "false"
        },
        "representation": "jpg",
        "status": {
            "state": "success"
        }
    }
]
Notice that the state is success, which means that the representation has been generated. If the representation is not available yet, the state will be none, pending, etc.
Now that we have the url_template we can download the representation.
First let's create the simplest method to download a file from a url:
def do_request(url: str, access_token: str):
    resp = requests.get(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    resp.raise_for_status()
    return resp.content
Next let's create a representation download method:
def representation_download(
    access_token: str,
    file_representation: FileFullRepresentationsEntriesField,
    file_name: str,
):
    if (
        file_representation.status.state
        != FileFullRepresentationsEntriesStatusStateField.SUCCESS
    ):
        print(
            f"Representation {file_representation.representation} is not ready"
        )
        return
    url_template = file_representation.content.url_template
    url = url_template.replace("{+asset_path}", "")
    file_name = (
        file_name.replace(".", "_").replace(" ", "_")
        + "."
        + file_representation.representation
    )
    content = do_request(url, access_token)
    with open(file_name, "wb") as file:
        file.write(content)
    print(
        f"Representation {file_representation.representation}",
        f" saved to {file_name}",
    )
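The download method above replaces {+asset_path} with an empty string, which suits representations made of a single asset. As a hedged sketch, and assuming that paged representations expect a per-page asset name such as 1.png in that placeholder, the substitution could be isolated in a helper. The name expand_url_template and the second URL below are hypothetical, for illustration only:

```python
def expand_url_template(url_template: str, asset_path: str = "") -> str:
    """Fill the {+asset_path} placeholder of a representation url_template."""
    placeholder = "{+asset_path}"
    if placeholder not in url_template:
        raise ValueError("url_template has no {+asset_path} placeholder")
    return url_template.replace(placeholder, asset_path)


# Template taken from the output shown earlier in this workshop
jpg_template = (
    "https://public.boxcloud.com/api/2.0/internal_files/1294096878155"
    "/versions/1478711934034/representations/jpg_320x320/content/{+asset_path}"
)
single_asset_url = expand_url_template(jpg_template)

# Hypothetical template for a paged representation, for illustration only
paged_template = "https://example.com/representations/png_paged/content/{+asset_path}"
first_page_url = expand_url_template(paged_template, "1.png")

print(single_asset_url)
print(first_page_url)
```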
And finally use it in your main method:
def main():
    ...
    # Download the representation
    access_token = client.auth.retrieve_token().access_token
    representation_download(access_token, file_docx_representations_png[0], file_docx.name)
My end result:
[
    {
        "representation": "jpg",
        "properties": {
            "dimensions": "320x320",
            "paged": "false",
            "thumb": "false"
        },
        "info": {
            "url": "https://api.box.com/2.0/internal_files/1294096878155/versions/1415005971755/representations/jpg_320x320"
        },
        "status": {
            "state": "success"
        },
        "content": {
            "url_template": "https://public.boxcloud.com/api/2.0/internal_files/1294096878155/versions/1415005971755/representations/jpg_320x320/content/{+asset_path}"
        }
    }
]
Representation jpg saved to Single_Page_docx.jpg
And a new file has been downloaded to my local folder:
The Python SDK has a helper method to get the thumbnail representation of a file.
Let's create a specific method for it:
def file_thumbnail(
    client: Client,
    file: File,
    extension: GetFileThumbnailByIdExtension,
    min_h: int,
    min_w: int,
) -> bytes:
    """Get file thumbnail"""
    thumbnail = client.files.get_file_thumbnail_by_id(
        file_id=file.id,
        extension=extension,
        min_height=min_h,
        min_width=min_w,
    )
    if not thumbnail:
        raise Exception(f"Thumbnail for {file.name} not available")
    return thumbnail
Notice that the requested thumbnail is not always available. If it is not, we raise an exception, in which case you should select another representation.
Let's use it in our main method:
def main():
    ...
    # Get thumbnail representation
    file_docx_thumbnail = file_thumbnail(
        client,
        file_docx,
        GetFileThumbnailByIdExtension.JPG,
        min_h=94,
        min_w=94,
    )
    with open(
        file_docx.name.replace(".", "_").replace(" ", "_") + "_thumbnail.jpg",
        "wb",
    ) as file:
        shutil.copyfileobj(file_docx_thumbnail, file)
    print(
        f"\nThumbnail for {file_docx.name} ",
        f"saved to {file_docx.name.replace('.', '_')}_thumbnail.jpg",
    )
Resulting in:
Thumbnail for Single Page.docx saved to Single Page_docx_thumbnail.jpg
And I have a new file on my local folder:
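Note that the print call above builds the file name without replacing spaces, while the open call does replace them, so the message shows Single Page_docx_thumbnail.jpg but the file is written as Single_Page_docx_thumbnail.jpg. One possible way to keep the two in sync is to compute the name once. A minimal sketch, where local_file_name is a hypothetical helper and not part of the workshop code:

```python
def local_file_name(box_file_name: str, suffix: str) -> str:
    """Build a local file name: dots and spaces become underscores."""
    stem = box_file_name.replace(".", "_").replace(" ", "_")
    return stem + suffix


thumbnail_name = local_file_name("Single Page.docx", "_thumbnail.jpg")
pdf_name = local_file_name("Document (Powerpoint).pptx", ".pdf")
print(thumbnail_name)
print(pdf_name)
```

The same value can then be used both for opening the file and for the log message.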
Some documents can be converted to PDF. Let's try it with the FILE_PPTX:
def main():
    ...
    # Make sure the file exists
    file_ppt = client.files.get_file_by_id(FILE_PPTX)
    print(f"\nFile {file_ppt.name} ({file_ppt.id})")
    # Get PDF representation
    file_ppt_repr_pdf = file_representations(client, file_ppt, "[pdf]")
    file_representations_print(file_ppt.name, file_ppt_repr_pdf)
    access_token = client.auth.retrieve_token().access_token
    representation_download(access_token, file_ppt_repr_pdf[0], file_ppt.name)
Resulting in:
Representation pdf saved to Document_(Powerpoint)_pptx.pdf
And a new file on my local folder:
Document_(Powerpoint)_pptx.pdf
Representations may not always be available.
Let's create a method that lists the status for a certain representation for all files in a folder:
def folder_list_representation_status(
    client: Client, folder: Folder, representation: str
):
    items = client.folders.get_folder_items(folder.id).entries
    print(
        f"\nChecking for {representation} ",
        f"status in folder [{folder.name}] ({folder.id})",
    )
    for item in items:
        if isinstance(item, FileMini):
            file_repr = file_representations(
                client, item, "[" + representation + "]"
            )
            if file_repr:
                state = file_repr[0].status.state.value
            else:
                state = "not available"
            print(f"File {item.name} ({item.id}) state: {state}")
And look for the extracted_text representation on the DEMO_FOLDER:
def main():
    ...
    # Generate representations
    folder = client.folders.get_folder_by_id(DEMO_FOLDER)
    folder_list_representation_status(client, folder, "extracted_text")
Which results in:
Checking for extracted_text status in folder [file_representations] (223939315135)
File Audio.mp3 (1294103505129) state: not available
File Document (PDF).pdf (1294102659923) state: none
File Document (Powerpoint).pptx (1294096083753) state: none
File HTML.html (1294094879490) state: none
File JS-Small.js (1294098434302) state: none
File JSON.json (1294102660561) state: none
File Preview SDK Sample Excel.xlsx (1294097951585) state: none
File Single Page.docx (1294096878155) state: none
File ZIP.zip (1294105019347) state: not available
No luck there: in my case I don't have a single text representation available.
However, for the ones where the status is none, we can request them to be generated. We do this by executing an HTTP GET on the info URL.
Let's start by specifically requesting all the details for the [extracted_text] representation of the FILE_PPTX:
def main():
    ...
    file_ppt_repr = file_representations(client, file_ppt, "[extracted_text]")
    file_representations_print(file_ppt.name, file_ppt_repr)
Resulting in:
File Document (Powerpoint).pptx has 1 representations:
[
    {
        "content": {
            "url_template": "https://public.boxcloud.com/api/2.0/internal_files/1294096083753/versions/1478709361496/representations/extracted_text/content/{+asset_path}"
        },
        "info": {
            "url": "https://api.box.com/2.0/internal_files/1294096083753/versions/1478709361496/representations/extracted_text"
        },
        "properties": {
            "dimensions": null,
            "paged": null,
            "thumb": null
        },
        "representation": "extracted_text",
        "status": {
            "state": "none"
        }
    }
]
Now we get the info url to trigger the text generation, and list the representation again:
def main():
    ...
    access_token = client.auth.retrieve_token().access_token
    # A GET on the info url asks Box to start generating the representation
    if file_ppt_repr[0].status.state == "none":
        info_url = file_ppt_repr[0].info.url
        do_request(info_url, access_token)
    file_ppt_repr = file_representations(client, file_ppt, "[extracted_text]")
    file_representations_print(file_ppt.name, file_ppt_repr)
We can see that the state changed to pending:
[
    {
        "representation": "extracted_text",
        "properties": {},
        "info": {
            "url": "https://api.box.com/2.0/internal_files/1294096083753/versions/1415005153353/representations/extracted_text"
        },
        "status": {
            "state": "pending"
        },
        "content": {
            "url_template": "https://public.boxcloud.com/api/2.0/internal_files/1294096083753/versions/1415005153353/representations/extracted_text/content/{+asset_path}"
        }
    }
]
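Since generation is asynchronous, the state has to be checked again after a while. One possible approach is a small polling loop. This is only a sketch: wait_for_state is a hypothetical helper, the attempt count and delay are arbitrary assumptions, and fetch_state stands for any callable that returns the current state as a string:

```python
import time
from typing import Callable


def wait_for_state(
    fetch_state: Callable[[], str],
    attempts: int = 10,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the state is no longer none/pending, or attempts run out."""
    state = fetch_state()
    for _ in range(attempts - 1):
        if state not in ("none", "pending"):
            break
        sleep(delay_seconds)
        state = fetch_state()
    return state


# Simulated sequence of states, so the sketch runs without calling Box
fake_states = iter(["pending", "pending", "success"])
final_state = wait_for_state(lambda: next(fake_states), sleep=lambda _: None)
print(final_state)
```

In this workshop, fetch_state could be a lambda that calls file_representations(client, file_ppt, "[extracted_text]") and returns the status.state.value of the first entry.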
Once it changes to success, all we need to do is download the representation:
def main():
    ...
    representation_download(access_token, file_ppt_repr[0], file_ppt.name)
And a new file showed up on my local folder:
Document_(Powerpoint)_pptx.extracted_text
There are more image representations available:
- Check out a few more representations for each file in the DEMO_FOLDER
Although the Python SDK does provide a specific method to get thumbnails for a document, most of the time you'll be using the generic methods:
- client.files.get_file_by_id(file.id, fields=["representations"]) to get the list of all the representations available for a file
- client.files.get_file_by_id(file.id, fields=["representations"], x_rep_hints=rep_hints) to get a specific representation
- Download the representation using the url_template provided by the previous method, if it is available.
- If the representations are showing a state of none, then you can trigger them by doing an HTTP GET using the info url.
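The steps above can be summed up as a small decision function over one representation entry. This is a sketch under assumptions: next_action is a hypothetical helper, the entry is a plain dictionary shaped like the JSON printed in this workshop, and any state other than success, none, or pending is left for manual inspection:

```python
from typing import Optional


def next_action(entry: Optional[dict]) -> str:
    """Suggest what to do with one representation entry."""
    if entry is None:
        # The file type does not offer this representation at all
        return "unavailable"
    state = entry.get("status", {}).get("state")
    if state == "success":
        return "download"  # use content.url_template
    if state == "none":
        return "trigger"  # HTTP GET on info.url
    if state == "pending":
        return "wait"  # list the representation again later
    return "inspect"


print(next_action({"status": {"state": "none"}}))
print(next_action({"status": {"state": "success"}}))
print(next_action(None))
```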