Please follow these guidelines to ensure your website is well-indexed by search engines and that it can be preserved for posterity. Created by special collections and Williams web development, these guidelines are heavily adapted from the sources listed below; follow those links for more detailed information.
Last edit: Dec 18 2020
Provide a standard link to all website content (including pages, images, videos):
Avoid proprietary formats for important content, especially the homepage
Avoid home pages that rely heavily on images or animations such as Flash. If you do create such pages, also provide alternative text-only HTML versions.
Include a user and/or XML sitemap
A sitemap that links to all content in a website helps ensure that crawlers will find that content.
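As a rough illustration of what an XML sitemap contains, the following Python sketch builds one from a list of URLs using only the standard library. The function name `build_sitemap` and the `example.edu` URLs are hypothetical and chosen for illustration; a content management system or sitemap plugin would normally generate this file for you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return an XML sitemap (as a string) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs, for illustration only.
sitemap_xml = build_sitemap([
    "https://www.example.edu/",
    "https://www.example.edu/about/",
    "https://www.example.edu/exhibits/",
])
print(sitemap_xml)
```

The resulting text would typically be saved as `sitemap.xml` at the root of the site so that crawlers can discover it.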
Omit robots.txt exclusions or limit them to areas not needed for archiving
Unlike search engines that need to index text only, successful archiving requires access to all files needed to render the website (including stylesheets, images, etc.). Check your robots.txt file to be sure directories containing stylesheets and images are not restricted. By contrast, some content (like calendar functions, databases, shopping baskets) can slow down or trap the crawler and is not needed in archived copies; optionally preventing access to these areas via robots.txt can improve preservability. To provide full access to our crawler specifically, add the following two lines to your robots.txt file.
User-agent: archive.org_bot
Disallow:
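One way to check whether a robots.txt file blocks the files needed to render your pages is to test it with Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the helper `blocked_urls`, the two sample robots.txt files, and the `example.edu` URLs are assumptions made for this example, not part of any site's actual configuration.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs in `urls` that `user_agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical file that hides stylesheets from every crawler.
RESTRICTIVE = """\
User-agent: *
Disallow: /css/
Disallow: /calendar/
"""

# Hypothetical file that gives the archive crawler full access
# while still keeping other crawlers out of the calendar.
PERMISSIVE = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /calendar/
"""

# Hypothetical asset URLs needed to render a page.
ASSET_URLS = [
    "https://www.example.edu/css/site.css",
    "https://www.example.edu/images/logo.png",
]

print(blocked_urls(RESTRICTIVE, "archive.org_bot", ASSET_URLS))
print(blocked_urls(PERMISSIVE, "archive.org_bot", ASSET_URLS))
```

An empty list means the crawler can reach every asset that was checked; any URL that is reported points to a rule worth revisiting.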
Maintain stable URLs and redirect when necessary
Keeping the URLs for particular content consistent over time minimizes “link rot”, both within your site and for external sites linking to your content, and allows the archives to show the evolution of a page over time. If the URL structure of content on your website must change, be sure to redirect visitors from each changed old URL to the corresponding new URL.
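The idea of mapping each changed old URL to its new location can be sketched as a simple lookup table. The Python example below is a minimal illustration, not a description of how any particular web server is configured: the `REDIRECTS` table, its paths, and the function `resolve_redirect` are hypothetical, and the use of HTTP status 301 (permanent redirect) is an assumption about what is appropriate for content that has moved for good.

```python
# Hypothetical mapping of old paths to new paths, for illustration only.
REDIRECTS = {
    "/library/exhibits-2019/": "/exhibits/2019/",
    "/exhibits/2019/": "/special-collections/exhibits/2019/",
}


def resolve_redirect(path, redirects):
    """Return (status, location) for a requested path.

    Chains of redirects are followed to the end so that a visitor
    is sent to the final URL in a single hop.
    """
    seen = {path}
    target = path
    while target in redirects:
        target = redirects[target]
        if target in seen:
            raise ValueError(f"redirect loop involving {target!r}")
        seen.add(target)
    if target == path:
        return (200, path)   # nothing has moved; serve the page as usual
    return (301, target)     # moved permanently to the final location


print(resolve_redirect("/library/exhibits-2019/", REDIRECTS))
print(resolve_redirect("/about/", REDIRECTS))
```

In practice the same table would usually be expressed in the web server's or content management system's own redirect configuration.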
Correctly identify character set encoding
The Content-Type field in your web server’s HTTP header must correctly identify the character set encoding so that the archived copy can be captured and rendered successfully. The Content-Type meta tag in a page’s source code can also identify the character set; if present, it must be consistent with the character set cited in the HTTP header.
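A consistency check between the two declarations can be sketched in Python with the standard library alone. This is an illustrative sketch under stated assumptions: the names `header_charset`, `MetaCharsetParser`, and `charsets_agree` are hypothetical, the header value and HTML are passed in as strings rather than fetched from a live server, and charset names are compared literally after lowercasing, so aliases such as `utf8` and `utf-8` would be reported as a mismatch.

```python
from email.message import Message
from html.parser import HTMLParser


def header_charset(content_type):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...">
            self.charset = header_charset(attributes.get("content") or "")


def charsets_agree(content_type_header, html_text):
    """True if the HTTP header names a charset and no meta tag contradicts it."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    from_header = header_charset(content_type_header)
    if from_header is None:
        return False
    return parser.charset is None or parser.charset == from_header


# Hypothetical page source, for illustration only.
PAGE = '<html><head><meta charset="UTF-8"><title>Home</title></head></html>'
print(charsets_agree("text/html; charset=utf-8", PAGE))
print(charsets_agree("text/html; charset=iso-8859-1", PAGE))
```

A `False` result suggests that either the server configuration or the page template should be corrected so that both declare the same encoding.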
Use Williams webpage templates
The College content management system and corresponding HTML templates are built to be archivable out of the box. When possible, use a standard template to ensure your content can be successfully archived. The templates handle details like character encoding, URLs and robots.txt files.
Columbia University. Guidelines For Preservable Websites.
Library of Congress. Library of Congress Guide to Creating Preservable Websites.
National Archives (UK). “The UK Government Web Archive: guidance for digital and records management teams.”
Stanford University Libraries. Archivability.
UK Web Archive. Technical Information FAQ #2. “Making Your Website Crawler-Friendly.”