Friday, April 29, 2011

Cleaning and Validating user input with htmLawed

htmLawed is a highly customizable single-file PHP script that makes text secure, standards-compliant and admin policy-compliant for use in the body of HTML 4, XHTML 1 or 1.1, or generic XML documents. It is thus a configurable input (X)HTML filter, processor, purifier, sanitizer, beautifier, etc., and an alternative to the HTMLTidy application.

The lawing in of input text is needed to ensure that HTML code in the text is standard-compliant, does not introduce security vulnerabilities, and does not break the aesthetics, design or layout of web-pages. htmLawed tries to do this by, for example, making HTML well-formed with balanced and properly nested tags, neutralizing code that may be used for cross-site scripting (XSS) attacks, and allowing only specified HTML elements/tags and attributes.

Features
  • Make HTML markup in text secure and standard-compliant
  • Process text for use in HTML, XHTML or XML documents
  • Restrict HTML elements, attributes or URL protocols using black- or white-lists
  • Balance tags, check element nesting, transform deprecated attributes and tags, make relative URLs absolute, etc.
  • Fast, highly customizable, well-documented
  • Single, 47 kb file
  • Simple HTML Tidy alternative
  • Use to filter, secure & sanitize HTML in blog comments or forum posts, generate XML-compatible feed items from web-page excerpts, convert HTML to XHTML, pretty-print HTML, scrape web-pages, reduce spam, remove XSS code, etc.
Using htmLawed is as simple as it gets. You can either include() the htmLawed.php file or copy-paste the entire code. htmLawed should work with PHP 4.3 and higher.
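
As a rough illustration of the include() route, a complete script might look like the sketch below. It assumes htmLawed.php sits in the same directory as the calling script; the sample input string is made up for the example, and no particular output is claimed here.

```php
<?php
// Assumption: htmLawed.php was downloaded into the same directory as this script.
require_once dirname(__FILE__) . '/htmLawed.php';

// Hypothetical untrusted input, e.g. a submitted blog comment with
// unbalanced tags and a javascript: link.
$in = '<p>Hello <b>world</i> <a href="javascript:alert(1)">click</a>';

// 'safe' => 1 asks htmLawed to permit only markup it considers safe.
$config = array('safe' => 1);

// htmLawed() returns the filtered text; the input variable is left untouched.
$out = htmLawed($in, $config);

echo $out;
```

Keeping the configuration in a named $config array, as the examples further down do, makes it easy to reuse the same policy wherever input is accepted.
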
htmLawed is free and open-source software licensed under GPL version 3, and copyrighted by Santosh Patnaik. You can find further information, a demo and the download on the htmLawed website.

Some example usages:

  Safest, allowing only safe HTML markup --

    $config = array('safe'=>1);
    $out = htmLawed($in, $config);

  Simplest, allowing all valid HTML markup except javascript: --

    $out = htmLawed($in);

  Allowing all valid HTML markup including javascript: --

    $config = array('schemes'=>'*:*');
    $out = htmLawed($in, $config);

  Allowing only safe HTML and the elements a, em, and strong --

    $config = array('safe'=>1, 'elements'=>'a, em, strong');
    $out = htmLawed($in, $config);

  Not allowing elements script and object --

    $config = array('elements'=>'* -script -object');
    $out = htmLawed($in, $config);

  Not allowing attributes id and style --

    $config = array('deny_attribute'=>'id, style');
    $out = htmLawed($in, $config);

  Permitting only attributes title and href --

    $config = array('deny_attribute'=>'* -title -href');
    $out = htmLawed($in, $config);

  Removing bad/disallowed tags altogether instead of converting them to entities --

    $config = array('keep_bad'=>0);
    $out = htmLawed($in, $config);

  Allowing attribute title only in a and not allowing attributes id, style, or scriptable on* attributes like onclick --

    $config = array('deny_attribute'=>'title, id, style, on*');
    $spec = 'a=title';
    $out = htmLawed($in, $config, $spec);

  Some case studies:

  1. A blog administrator wants to allow only a, em, strike, strong and u in comments, but needs strike and u transformed to span for better XHTML 1-strict compliance, and wants the a links to point only to http or https resources:

    $processed = htmLawed($in, array('elements'=>'a, em, strike, strong, u', 'make_tag_strict'=>1, 'safe'=>1, 'schemes'=>'*:http, https'), 'a=href');

  2. An author uses a custom-made web application to load content on his web-site. He is the only one using that application and the content he generates has all types of HTML, including scripts. The web application uses htmLawed primarily as a tool to correct errors that creep in while writing HTML and to take care of the occasional bad characters in copy-paste text introduced by Microsoft Office. The web application provides a preview before submitted input is added to the content. For the previewing process, htmLawed is set up as follows:

    $processed = htmLawed($in, array('css_expression'=>1, 'keep_bad'=>1, 'make_tag_strict'=>1, 'schemes'=>'*:*', 'valid_xhtml'=>1));

  For the final submission process, keep_bad is set to 6. A value of 1 for the preview process allows the author to note and correct any HTML mistake without losing any of the typed text.
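
  The two stages in case study 2 differ only in keep_bad, so one helper can serve both. A minimal sketch, assuming htmLawed.php has already been included; the function name process_author_input and its $is_preview flag are hypothetical and not part of htmLawed:

```php
<?php
// Hypothetical helper: builds the case-study-2 configuration and switches
// keep_bad between the preview value (1) and the final-submission value (6).
function process_author_input($in, $is_preview)
{
    $config = array(
        'css_expression'  => 1,
        'keep_bad'        => $is_preview ? 1 : 6,
        'make_tag_strict' => 1,
        'schemes'         => '*:*',
        'valid_xhtml'     => 1,
    );

    // Assumption: htmLawed() is available because htmLawed.php was included earlier.
    return htmLawed($in, $config);
}

// Usage: preview first, then run the same input through the final pass.
// $preview = process_author_input($in, true);
// $final   = process_author_input($in, false);
```
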

  3. A data-miner is scraping information from a specific table in a set of similar web-pages, collating the data rows, and uses htmLawed to reduce unnecessary markup and white space:

    $processed = htmLawed($in, array('elements'=>'tr, td', 'tidy'=>-1), 'tr, td =');

47 comments:

  2. htmLawed does not remove the content of the SCRIPT tag.
    It removes the tags, but not the script inside them.

    Does anyone know why?

    Cheers