Better way to manage imported media when importing documents

We're migrating hundreds of documents from our existing site. Each document has at least one image, usually two or three. Having the docs go into a release works great for reviewing them before we ok them for the new site. However, it took a couple attempts to get the import right, and at that point I realized that each time we imported a zip of jsons, a new copy of the media files was loaded into the media area. We now have someone manually removing the duplicates. There are several different features that would either avoid this problem or make it easier for us to deal with it, including:

  • Use the image filename as a unique id and therefore treat subsequent imports as an update rather than a new media asset.
  • Treat the media assets as part of the release - they must be published/approved before they are added to the media library.
  • Add a tool for finding and eliminating duplicate media files through the UI.
  • Add API methods to query for media assets and to delete them programmatically.
  • Improve the documentation to make it clear that duplicate media assets will be created in this scenario.
  • Improve the documentation to clarify what fields are used to establish identity for documents and media assets and how that plays out in linking, exporting, and importing, including as it relates to updating content vs creating new content items.

We'd find all these really useful, but any one of them would help us solve the problem we have right now or help the next person to avoid our mistake. If we had to pick one, we'd choose the API methods, because we'll end up wanting to continue to enrich our media assets with more metadata over time and that would make it a lot easier.
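
In the meantime, one stopgap is to work out locally which media filenames went into more than one import zip, so the person cleaning up the media library has a list to work from. Here is a minimal Python sketch of that idea. It assumes the import zips are still on disk and that a filename appearing in several zips means it was uploaded several times. The function name and the extension list are my own choices, nothing here comes from Prismic.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Assumption: these are the media types used in the zips; adjust as needed.
MEDIA_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}


def repeated_media(zip_paths):
    """Return {filename: number of zips containing it} for media in 2+ zips."""
    counts = Counter()
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as zf:
            # Use a set so a file is counted once per zip, whatever its folder.
            names = {
                PurePosixPath(name).name
                for name in zf.namelist()
                if PurePosixPath(name).suffix.lower() in MEDIA_EXTENSIONS
            }
        counts.update(names)
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `repeated_media(["part1.zip", "part2.zip"])` (hypothetical file names) returns only the filenames that appear in more than one zip, which is roughly the list of duplicates to look for in the media area.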

Thanks!
Lisa

Hi @lisa,

Thank you for your feedback.

For the elimination of duplicate assets... - I will check with my dev team whether that can be considered for a future release.
For improving the documentation to make it clear that duplicate media assets... - I will add it to my task list.
For improving the documentation to clarify what fields are used to establish identity for documents and media assets... - This is already mentioned in the limitations of import/export, which state that we only use images as part of an import job. For the fields required to import images, you can find the necessary information here.

Let me know if you have any questions.

Thanks,

Priyanka

Thank you for your answers.

Regarding documentation, here are notes that I shared with my colleagues. Unedited, but might be useful to you - things we struggled with and some other things that you might want to address in your documentation. I might have missed some of the information in your docs, but that in itself could point to a need to adjust how it's presented. No need for feedback unless you have something specific - I just wanted to share.

  1. Import process associates the imported documents with a release, which is autonamed if you don't assign it
  2. Import process provides very limited debug info if JSON has errors. You can't copy the error popup so it's inconvenient if you need to search or otherwise use the data.
  3. Can import media by referencing it from the document JSON and including it in the zip
  4. Should be able to update documents and media through import if the slugs/filenames match, but so far this isn't working for me - duplicates are being created
    • Follow up: File identity is determined by filename. When looking for the ID of an exported document, use the filename - it's not in the JSON. Haven't tested yet if this is also the way to avoid duplication on import, but seems likely.
  5. Can use REST to query the docs if you don't want to export the entire repository
  6. Fine to omit most data (eg document language) - will default when you import
  7. Authors must be identified by ID if you include them in the JSON, which means exporting, grepping unique values, and looking up the actual names by opening docs. Or maybe can get the mappings via REST. (Follow up, didn't find a way to do that.)
  8. The import is case-sensitive. Image filenames referenced by the JSON must match the case of the actual filename.
  9. When you upload a zip that contains JSON for documents and the related media files, the documents are loaded into a release folder. This means that if you are unhappy with the documents, you can easily delete them before they are ever visible in the normal content working area. However, the media files are immediately uploaded into the main media area - they aren't part of the release. If you include a media file that was loaded previously, a duplicate will be created.
  10. Prismic image files cannot be larger than 10MB. This is documented on the Prismic site. If there's an image that's too large in a zip, the import UI just hangs with "uploading" displayed. No useful messages.
  11. JSON zips appear to also have a file size limit of 100MB (not documented that I could tell), which is obviously about the images in the file rather than the JSON. I split the zips into multiple parts to address this. No UI feedback. Because some images are reused in multiple documents and the images must be present in the zip even if they were already loaded into Prismic, this approach creates duplicate images in the Media folder which must be manually removed. (Note: There's also a limit of 200 documents per zip, but that wasn't the issue - and at any rate doesn't appear to apply to the number of images, just the JSON.)
  12. There is no means in Prismic to query against media files via API, nor a way to update them via API outside documents. An API would make it easier to deal with some of the negative side effects of the loading process. Would also help with ongoing review of media for alt text etc.
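
Since the import UI gives so little feedback, several of the items above (8, 10 and 11) can be checked locally before uploading. Below is a rough Python sketch of such a pre-flight check, not an official tool. The limits are the ones from my notes (the 100MB zip limit is what I observed, not something I found documented). The way image references are found is a heuristic of mine: it treats any string in the JSON that ends in an image extension as a filename reference, so it may flag ordinary text by mistake. All function and constant names are made up for illustration.

```python
import json
import os
import zipfile
from pathlib import PurePosixPath

# Limits from the notes above; the zip limit is observed, not documented.
MAX_IMAGE_BYTES = 10 * 1024 * 1024
MAX_ZIP_BYTES = 100 * 1024 * 1024
MAX_DOCUMENTS = 200
MEDIA_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}


def _strings(value):
    """Yield every string found anywhere inside a parsed JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)


def preflight(zip_path, max_image=MAX_IMAGE_BYTES, max_zip=MAX_ZIP_BYTES,
              max_docs=MAX_DOCUMENTS):
    """Return a list of human-readable problems found in an import zip."""
    problems = []
    size = os.path.getsize(zip_path)
    if size > max_zip:
        problems.append(f"zip is {size} bytes, over the {max_zip} byte limit")
    with zipfile.ZipFile(zip_path) as zf:
        media, docs = {}, []
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = PurePosixPath(info.filename)
            if path.suffix.lower() in MEDIA_EXTENSIONS:
                media[path.name] = info
            elif path.suffix.lower() == ".json":
                docs.append(info)
        if len(docs) > max_docs:
            problems.append(
                f"{len(docs)} JSON documents, over the limit of {max_docs}")
        for name, info in media.items():
            if info.file_size > max_image:
                problems.append(f"{name} is {info.file_size} bytes, "
                                f"over the {max_image} byte image limit")
        # Map lower-cased names to real names to detect case mismatches.
        by_lower = {name.lower(): name for name in media}
        for doc in docs:
            for text in _strings(json.loads(zf.read(doc))):
                ref = PurePosixPath(text).name
                if PurePosixPath(ref).suffix.lower() not in MEDIA_EXTENSIONS:
                    continue  # heuristic: not an image filename
                if ref in media:
                    continue  # exact, case-sensitive match
                actual = by_lower.get(ref.lower())
                if actual:
                    problems.append(f"{doc.filename} references {ref!r}, "
                                    f"but the file in the zip is {actual!r}")
                else:
                    problems.append(f"{doc.filename} references {ref!r}, "
                                    f"which is not in the zip")
    return problems
```

Running `preflight("import.zip")` (hypothetical file name) and printing the result gives a copyable list of problems, which the error popup doesn't. An empty list only means none of these particular checks failed, not that the import will succeed.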