Guidelines for preservable websites

Please follow these guidelines to ensure that your website is well indexed by search engines and can be preserved for posterity. Created by special collections and Williams web development, these guidelines are heavily adapted from the sources listed below. Follow those links for more detailed information.

Last edit: Dec 18 2020

Provide a standard link to all website content (including pages, images, and videos)

To be visible to crawlers, links should be in HTML/XHTML format rather than embedded in JavaScript or Flash. (Flash has reached its “end of life” and should no longer be used. See this announcement from Adobe for more information.)
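As a rough illustration of why standard links matter, the sketch below collects only the URLs that appear in `href` and `src` attributes, roughly as a simple crawler would. The class name, function name, and sample markup are hypothetical and are not part of any archiving tool; real crawlers are more sophisticated, but a link that exists only inside script code is still much harder for them to find.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect URLs from href/src attributes (hypothetical, simplified crawler view)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only standard link attributes are considered; script code is ignored.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def discoverable_links(html):
    """Return the URLs a simple HTML-only crawler could discover in the page."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup (made up for this example).
sample_page = """
<a href="/about/">About</a>
<img src="/images/logo.png" alt="Logo">
<span onclick="window.location='/hidden/'">Hidden</span>
"""

print(discoverable_links(sample_page))  # ['/about/', '/images/logo.png']
```

The `/hidden/` page is reachable only through the `onclick` handler, so it never appears in the list.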

Avoid proprietary formats for important content, especially the homepage 

Do not create home pages that rely heavily on images or animations such as Flash. If you do create such pages, also provide alternative text-only HTML versions.

Include a user and/or XML sitemap

Sitemaps that link to all content in a website ensure that crawlers will find that content.
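A minimal sketch of generating an XML sitemap follows, assuming the list of page URLs is already known. The function name and the `example.edu` addresses are placeholders; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs, for illustration only.
sitemap = build_sitemap([
    "https://example.edu/",
    "https://example.edu/about/",
])
print(sitemap)
```

The resulting text would typically be saved as `sitemap.xml` at the root of the site.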

Omit robots.txt exclusions or limit them to areas not needed for archiving

Search engines need to index only text, but successful archiving requires access to all files needed to render the website (including stylesheets, images, etc.). Check your robots.txt file to be sure that directories containing stylesheets and images are not restricted. By contrast, some content (such as calendar functions, databases, and shopping baskets) can slow down or trap the crawler and is not needed in archived copies; optionally preventing access to these areas via robots.txt can improve preservability. To provide full access to our crawler specifically, add the following two lines to your robots.txt file:

     User-agent: archive.org_bot
     Disallow:
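One way to check that a robots.txt file gives the archive crawler the access it needs is Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the `can_archive` helper, the sample rules, and the `example.edu` URLs are assumptions made for this example, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_archive(robots_txt, url, agent="archive.org_bot"):
    """Return True if the given robots.txt text lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Sample rules (made up): the archive crawler gets full access,
# while all other crawlers are kept out of two directories.
robots_txt = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /css/
Disallow: /calendar/
"""

print(can_archive(robots_txt, "https://example.edu/css/site.css"))
print(can_archive(robots_txt, "https://example.edu/css/site.css", agent="SomeOtherBot"))
```

With these sample rules the first check reports that the stylesheet is available to the archive crawler, and the second shows that the same file is blocked for other crawlers.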

Maintain stable URLs and redirect when necessary

Keeping the URLs for particular content consistent over time minimizes “link rot,” both within your site and on external sites that link to your content. It also allows the archives to show the evolution of a page over time. If the URL structure of content on your website must change, be sure to redirect visitors from each old URL to the corresponding new URL.
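The redirect advice can be sketched as a simple lookup table that maps each old path to its new location. Everything here is hypothetical (the paths, the `REDIRECTS` table, and the `resolve` and `respond` helpers); in practice redirects are usually configured in the web server or content management system. The example assumes permanent (301) redirects and follows chains so that a visitor is sent directly to the final URL.

```python
# Hypothetical map of old paths to their replacements.
REDIRECTS = {
    "/dept/history.html": "/history/",
    "/history/": "/academics/history/",
}


def resolve(path, redirects=REDIRECTS, max_hops=10):
    """Follow the redirect map until reaching a path with no further redirect."""
    seen = {path}
    for _ in range(max_hops):
        target = redirects.get(path)
        if target is None:
            return path
        if target in seen:
            raise ValueError(f"redirect loop detected at {target}")
        seen.add(target)
        path = target
    raise ValueError("too many redirect hops")


def respond(path):
    """Return (status, headers): 301 with a Location header if the path moved, else 200."""
    final = resolve(path)
    if final != path:
        return 301, {"Location": final}
    return 200, {}


print(respond("/dept/history.html"))
print(respond("/academics/history/"))
```

Resolving the whole chain in one step avoids sending visitors and crawlers through several intermediate redirects.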

Correctly identify character set encoding

Your web server’s Content-Type field in the HTTP header must correctly identify the character set encoding so that the archived copy can be captured and rendered successfully. The Content-Type meta tag in the source code of a page can also identify the character set, and it must be consistent with the character set cited in the HTTP header.
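A consistency check between the HTTP header and the page's meta tag might look like the sketch below. The helper names are made up for this example, and the header value and HTML are passed in as plain strings rather than fetched from a live server. The sketch treats a page with no meta charset as consistent, on the assumption that the header alone is then authoritative.

```python
from email.message import Message
from html.parser import HTMLParser


def header_charset(content_type):
    """Extract the lowercased charset parameter from a Content-Type value, or None."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: <meta http-equiv="Content-Type" content="...; charset=...">
            self.charset = header_charset(attributes.get("content"))


def charsets_consistent(content_type_header, html):
    """Return True if the header names a charset and the meta tag does not contradict it."""
    from_header = header_charset(content_type_header)
    finder = MetaCharsetFinder()
    finder.feed(html)
    finder.close()
    if from_header is None:
        return False
    return finder.charset is None or finder.charset == from_header


print(charsets_consistent("text/html; charset=UTF-8", '<meta charset="utf-8">'))
```

Comparison is case-insensitive, so `UTF-8` in the header and `utf-8` in the meta tag count as the same character set.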

Use Williams webpage templates

The College content management system and corresponding HTML templates are built to be archivable out of the box. When possible, use a standard template to ensure your content can be successfully archived. The templates handle details like character encoding, URLs and robots.txt files.

Sources

Columbia University. Guidelines For Preservable Websites.

Library of Congress. Library of Congress Guide to Creating Preservable Websites.

National Archives (UK). The UK Government Web Archive: Guidance for Digital and Records Management Teams.

Stanford University Libraries. Archivability

UK Web Archive. Technical Information FAQ #2. “Making Your Website Crawler-Friendly.”