Microsoft Azure offers ‘Cognitive Services’ for analysing digital content. This gives you direct access to advanced image-processing technology, including:
- an image captioner
- an object identifier
- a sensitive content detector
All of this without having to build the algorithms and technology yourself! You do have to pay for extensive use, though.
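To give an idea of how these services are called (a minimal sketch of one way to do it, not necessarily how I ran my tests): the v3.2 Analyze endpoint of the Computer Vision REST API exposes all three capabilities above through a single `visualFeatures` parameter. The endpoint and key below are placeholders for your own Azure resource.

```python
import requests

# Placeholders: substitute the values from your own Azure Computer Vision resource.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-subscription-key>"

def analyze_image(image_url: str) -> dict:
    """Request captioning (Description), object detection (Objects) and
    sensitive-content screening (Adult) in a single v3.2 Analyze call."""
    response = requests.post(
        f"{ENDPOINT}/vision/v3.2/analyze",
        params={"visualFeatures": "Description,Objects,Adult"},
        headers={
            "Ocp-Apim-Subscription-Key": KEY,
            "Content-Type": "application/json",
        },
        json={"url": image_url},
    )
    response.raise_for_status()
    return response.json()
```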
AI Image Captioning Test
The first test I will be doing is an image captioning test. Captions are only a sentence long and describe the image as a whole. Captions can be detailed, but going into the test I do not expect lengthy captions to be accurate relative to the content of the image, especially given the images I will be feeding it. Shorter, simpler captions are more likely to be accurate.
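If you only want captions, the v3.2 Describe endpoint does just that, and (as far as I can tell from the API reference) accepts a `maxCandidates` parameter so you can request several candidate captions, each with a confidence score. A minimal sketch, posting a local image file, with placeholder credentials again:

```python
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-subscription-key>"

def caption_image(path: str, max_candidates: int = 3) -> list[tuple[str, float]]:
    """Return (caption, confidence) pairs for a local image file."""
    with open(path, "rb") as f:
        response = requests.post(
            f"{ENDPOINT}/vision/v3.2/describe",
            params={"maxCandidates": max_candidates},
            headers={
                "Ocp-Apim-Subscription-Key": KEY,
                "Content-Type": "application/octet-stream",
            },
            data=f.read(),
        )
    response.raise_for_status()
    captions = response.json()["description"]["captions"]
    return [(c["text"], c["confidence"]) for c in captions]
```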
Image 1 — “Pixel City”
The first image is a pixel-drawn cyberpunk city generated using SDXL beta.
prompt: “retro classic zelda style pixel art. cyberpunk tokyo grunge. neon lights and endless nights. view of beautiful skyline vaporware sunset immediate fade into endless night.Shigeru Miyamoto, Hironobu Sakaguchi.”
I think the algorithm should pick up the buildings and output a caption with the word ‘building’ in it. ‘City’ is another word I would expect to see. Anything else would be a surprise.
The fact that it recognised the image as ‘pixel art’ was quite impressive to me. There is not much else to say about this one, apart from the fact that it did not pick up anything else. Ranking the content in the image, the city is the most prominent element, arguably followed by the sunset, so it would have been more interesting if it had picked up the sunset. Nevertheless, it correctly captioned the style of the image.
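One way to check whether something like the sunset registers at all, even when it does not make the single-sentence caption, is to request the Tags feature and look for expected words in the result. A rough sketch along those lines (placeholder credentials as before):

```python
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-subscription-key>"

def expected_words_found(image_url: str, expected: set[str]) -> set[str]:
    """Return which of the expected words appear among the image tags."""
    response = requests.post(
        f"{ENDPOINT}/vision/v3.2/analyze",
        params={"visualFeatures": "Tags"},
        headers={
            "Ocp-Apim-Subscription-Key": KEY,
            "Content-Type": "application/json",
        },
        json={"url": image_url},
    )
    response.raise_for_status()
    tags = {t["name"] for t in response.json()["tags"]}
    return expected & tags

# e.g. expected_words_found(url, {"building", "city", "sunset"})
```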
Image 2 — “Not sure it’s a sunset”
I thought I would make the second test image a bit more abstract and open to interpretation. This should test the capabilities of the captioning a bit more.
The content of the image is fairly basic, with people sitting on a beach watching a sunset as the main subject. The trickiest part for the algorithm will be the blurriness and the fact that the figures are not explicitly drawn.
prompt: “oil crayon sketch of art gallery with commemorative display for supernatural awe event glowing visitors .wide shot perspective of someone with white glow sitting at the beach.beautiful gradient explosion of paint sunset vaporwave holosexual grunge effect silver grey visual tint like a movie, yoshitomo nara, watercolor, kazuhiro tsuji, pen and ink detailing”
Again, it performed well in classifying those almost random colours as a group of people. Obviously, to a human these are clearly people, but I am thinking from the perspective of a computer. What would a computer see?
Conclusions
Since this test simply returns captions, the amount of detail is limited, and you would likely need more natural language output to evaluate how the service would really describe an image. The image analysis API has more extensive analysis features that could potentially provide richer descriptions, so I will be trying those in the near future.
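For reference, the v3.2 Analyze endpoint accepts a much wider set of `visualFeatures` values than the ones used above, which is roughly what I expect to explore next:

```python
# The full set of visualFeatures values accepted by the v3.2 Analyze
# endpoint; any comma-separated subset can be passed in the query string.
VISUAL_FEATURES = "Adult,Brands,Categories,Color,Description,Faces,ImageType,Objects,Tags"
```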